本篇博文主要内容为 2025-07-22 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-07-22)

今日共更新949篇论文,其中:

  • 自然语言处理117篇(Computation and Language (cs.CL))
  • 人工智能291篇(Artificial Intelligence (cs.AI))
  • 计算机视觉189篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习252篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] 3LM: Bridging Arabic STEM and Code through Benchmarking

链接: https://arxiv.org/abs/2507.15850
作者: Basma El Amel Boussaha,Leen AlQadi,Mugariya Farooq,Shaikha Alsuwaidi,Giulia Campesan,Ahmed Alzubaidi,Mohammed Alyafeai,Hakim Hacid
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-1] he Impact of Language Mixing on Bilingual LLM Reasoning

【速读】: 该论文旨在解决多语言推理模型中语言切换(language switching)行为的成因及其对推理性能的影响问题。研究表明,语言切换并非单纯由多语言训练带来的副作用,而是一种有助于提升推理能力的战略性行为。其解决方案的关键在于识别出强化学习阶段(reinforcement learning with verifiable rewards, RLVR)是诱发语言混合的核心机制,并通过轻量级探测器(probe)预测语言切换是否有益,在解码过程中引导模型进行更优的语言选择,从而显著提升数学推理任务的准确性(最高提升6.25个百分点)。

链接: https://arxiv.org/abs/2507.15849
作者: Yihao Li,Jiayi Xin,Miranda Muqing Miao,Qi Long,Lyle Ungar
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing–alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by up to 6.25 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.
zh

[NLP-2] GUI-G2: Gaussian Reward Modeling for GUI Grounding

【速读】: 该论文旨在解决图形用户界面(GUI)接地任务中因使用二值奖励机制而导致的稀疏信号问题,该机制将界面元素视为“命中-未命中”目标,忽略了空间交互的连续性。为克服这一局限,作者提出GUI Gaussian Grounding Rewards (GUI-G²),其核心在于将GUI元素建模为界面上连续的高斯分布,从而实现从稀疏二分类到密集连续优化的转变。关键创新包括:1)高斯点奖励机制,通过以元素中心为均值、指数衰减的高斯分布实现精准定位;2)覆盖奖励机制,量化预测高斯分布与目标区域之间的重叠程度以评估空间对齐;3)自适应方差机制,根据元素尺度动态调整奖励分布宽度。此框架生成丰富的梯度信号,显著提升模型在复杂界面中的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2507.15846
作者: Fei Tang,Zhangxuan Gu,Zhengxi Lu,Xuyang Liu,Shuheng Shen,Changhua Meng,Wen Wang,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G ^2 ), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G ^2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G ^2 , substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
zh

[NLP-3] Hierarchical Budget Policy Optimization for Adaptive Reasoning

【速读】: 该论文旨在解决大推理模型在链式思维(chain-of-thought)生成中因采用统一推理策略而导致的计算效率低下问题,即无论问题复杂度如何,模型均消耗相似的计算资源,造成冗余。解决方案的关键在于提出分层预算策略优化(Hierarchical Budget Policy Optimization, HBPO),通过构建多层级的token预算分组机制,将采样轨迹划分为不同预算子集,并引入差异化的奖励机制,使模型能够根据任务复杂度自动调整推理深度,从而在不牺牲推理能力的前提下实现资源的高效分配。HBPO有效缓解了效率导向训练中探索空间坍塌的问题,实现了推理效率与能力的协同优化。

链接: https://arxiv.org/abs/2507.15844
作者: Shangke Lyu,Linjuan Wu,Yuchen Yan,Xingyu Wu,Hao Li,Yongliang Shen,Peisheng Jiang,Weiming Lu,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); SF Technology
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code: this https URL Project Page: this https URL

点击查看摘要

Abstract:Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
zh

[NLP-4] Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work

【速读】: 该论文旨在解决当前AI for Good研究中普遍存在的问题,即多数文献聚焦于模型开发而忽视了与合作组织的部署过程及实际应用效果。其解决方案的关键在于通过与人道主义对人道主义(Humanitarian-to-Humanitarian, H2H)组织的紧密协作,实现AI模型在资源受限环境中的有效部署,并建立持续性能更新机制以保障长期运行。作者分享了具体实践路径和可复用的经验教训,为从业者提供了落地实施的参考框架。

链接: https://arxiv.org/abs/2507.15823
作者: Anton Abilov,Ke Zhang,Hemank Lamba,Elizabeth M. Olson,Joel R. Tetreault,Alejandro Jaimes
机构: Dataminr, Inc. (Dataminr公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Publications in the AI for Good space have tended to focus on the research and model development that can support high-impact applications. However, very few AI for Good papers discuss the process of deploying and collaborating with the partner organization, and the resulting real-world impact. In this work, we share details about the close collaboration with a humanitarian-to-humanitarian (H2H) organization and how to not only deploy the AI model in a resource-constrained environment, but also how to maintain it for continuous performance updates, and share key takeaways for practitioners.
zh

[NLP-5] Small LLM s Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

【速读】: 该论文试图解决的问题是:小规模大语言模型(LLM)是否可以通过基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法获得稳健且可泛化的理论心智(Theory of Mind, ToM)能力。其解决方案的关键在于采用规则驱动的强化学习技术对模型进行后训练,并在多个主流ToM数据集(如HiToM、ExploreToM、FANToM)上系统性地训练与评估,同时测试模型在未见数据集(如OpenToM)上的泛化性能。研究发现,尽管模型在训练数据分布内表现提升,但缺乏跨任务的通用性,且长期训练会导致模型“劫持”训练数据的统计模式,产生窄域过拟合而非真正的抽象ToM能力。

链接: https://arxiv.org/abs/2507.15788
作者: Sneheel Sarangi,Hanan Salam
机构: NYU Abu Dhabi (纽约大学阿布扎比校区)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during the post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models ``hacking’’ the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or degradation of performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.
zh

[NLP-6] Reservoir Computing as a Language Model

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLM)在自然语言处理中面临的高能耗与低效计算瓶颈问题,尤其是在训练和推理阶段的资源消耗限制了其性能提升与广泛部署。解决方案的关键在于探索**储备计算(Reservoir Computing, RC)**在字符级语言建模中的应用潜力,通过仅训练输出层而非整个网络结构,实现显著降低计算成本和加速推理过程;同时对比了两种RC方法——传统静态线性读出和引入注意力机制的动态权重调整RC,并验证其在参数规模一致下的效率优势,为在资源受限场景下平衡性能与计算开销提供了可行路径。

链接: https://arxiv.org/abs/2507.15779
作者: Felix Köster,Atsushi Uchida
机构: Saitama University (埼玉大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 5 figures, 1 table

点击查看摘要

Abstract:Large Language Models (LLM) have dominated the science and media landscape duo to their impressive performance on processing large chunks of data and produce human-like levels of text. Nevertheless, their huge energy demand and slow processing still a bottleneck for further increasing quality while also making the models accessible to everyone. To solve this bottleneck, we will investigate how reservoir computing performs on natural text processing, which could enable fast and energy efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling, two different reservoir computing approaches, where only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost and prediction accuracy for both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient reducing the training and inference speed. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines to balance resource constraints with performance.
zh

[NLP-7] Stabilizing Knowledge Promoting Reasoning : Dual-Token Constraints for RLVR

【速读】: 该论文针对现有强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大型语言模型(Large Language Models, LLMs)时存在的问题展开研究,即:以往算法对所有token施加统一的训练信号,忽略了低熵知识相关token与高熵推理相关token在模型输出中的不同作用。这种忽视可能导致知识保持与推理能力提升之间的权衡失衡,甚至因梯度掩码或异步更新破坏语义依赖关系,影响学习效率。解决方案的关键在于提出Archer方法——一种基于熵感知的RLVR框架,通过双token约束机制实现同步更新:对推理类token采用较弱的KL正则化和更高的裁剪阈值以促进探索,而对知识类token施加强约束以保障事实准确性。这一设计有效平衡了知识保持与推理优化,实验表明其在数学推理和代码生成任务上显著优于现有方法,达到同规模模型中的最先进性能。

链接: https://arxiv.org/abs/2507.15778
作者: Jiakang Wang,Runze Liu,Fuzheng Zhang,Xiu Li,Guorui Zhou
机构: Kuaishou Technology(快手科技); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at this https URL.
zh

[NLP-8] Supernova: Achieving More with Less in Transformer Architectures

【速读】: 该论文试图解决大模型性能提升与计算资源消耗之间的矛盾问题,即如何在减少参数量和训练数据规模的前提下,仍能实现接近更大模型的性能表现。解决方案的关键在于通过架构优化与分词技术创新实现高效性:首先采用旋转位置嵌入(Rotary Positional Embeddings, RoPE)、分组查询注意力(Grouped Query Attention, GQA)与3:1压缩比、RMSNorm归一化以及SwiGLU激活函数等组件,在保持模型表达能力的同时降低计算开销;其次设计了一个包含128,000个词汇的字节级BPE分词器,显著提升了文本压缩效率,从而在仅使用100B训练token的情况下,使650M参数的模型达到1B参数模型90%的性能水平,验证了架构效率与分词质量可有效补偿参数规模的不足。

链接: https://arxiv.org/abs/2507.15773
作者: Andrei-Valentin Tanase,Elena Pelican
机构: “Ovidius” University of Constanţa (奥维德ius大学康斯坦察分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 53% fewer parameters and requiring only 100B training tokens–an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
zh

[NLP-9] Interaction as Intelligence: Deep Research With Human-AI Partnership

【速读】: 该论文旨在解决当前深度研究系统中人机交互模式的局限性问题,即传统“输入-等待-输出”范式导致的错误级联效应、研究边界僵化以及专家知识难以融入等挑战。其核心解决方案是提出“深度认知(Deep Cognition)”框架,关键在于将人类角色从指令执行者转变为认知监督者(cognitive oversight),通过三项创新实现:(1) 可透明、可控制且可中断的交互机制,使AI推理过程可视化并支持任意节点干预;(2) 细粒度双向对话能力,促进动态语义对齐;(3) 共享认知上下文,系统能自主感知用户行为并自适应调整策略。该方案显著提升了交互效率与研究质量,在多项指标上优于基线系统,并在复杂研究任务中实现31.8%至50.0%的性能提升。

链接: https://arxiv.org/abs/2507.15759
作者: Lyumanshan Ye,Xiaojie Cai,Xinkai Wang,Junfei Wang,Xiangkun Hu,Jiadi Su,Yang Nan,Sihan Wang,Bohan Zhang,Xiaoze Fan,Jinbin Luo,Yuxiang Zheng,Tianze Xu,Dayuan Fu,Yunze Wu,Pengrui Lu,Zengzhi Wang,Yiwei Qin,Zhen Huang,Yan Ma,Zhulin Hu,Haoyang Zou,Tiantian Mi,Yixin Ye,Ethan Chern,Pengfei Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 30 pages, 10 figures

点击查看摘要

Abstract:This paper introduces “Interaction as Intelligence” research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities-a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems engage in extended thinking processes for research tasks, meaningful interaction transitions from an optional enhancement to an essential component of effective intelligence. Current deep research systems adopt an “input-wait-output” paradigm where users initiate queries and receive results after black-box processing. This approach leads to error cascade effects, inflexible research boundaries that prevent question refinement during investigation, and missed opportunities for expertise integration. To address these limitations, we introduce Deep Cognition, a system that transforms the human role from giving instructions to cognitive oversight-a mode of engagement where humans guide AI thinking processes through strategic intervention at critical junctures. Deep cognition implements three key innovations: (1)Transparent, controllable, and interruptible interaction that reveals AI reasoning and enables intervention at any point; (2)Fine-grained bidirectional dialogue; and (3)Shared cognitive context where the system observes and adapts to user behaviors without explicit instruction. User evaluation demonstrates that this cognitive oversight paradigm outperforms the strongest baseline across six key metrics: Transparency(+20.0%), Fine-Grained Interaction(+29.2%), Real-Time Intervention(+18.5%), Ease of Collaboration(+27.7%), Results-Worth-Effort(+8.8%), and Interruptibility(+20.7%). Evaluations on challenging research problems show 31.8% to 50.0% points of improvements over deep research systems.
zh

[NLP-10] LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

【速读】: 该论文旨在解决大型推理模型在处理简单问题时因缺乏对推理长度的自适应控制而导致的过度token生成问题(即计算资源浪费)。现有方法通常通过外部硬性限制或事后干预来控制推理长度,难以实现高效且灵活的推理过程。其解决方案的关键在于提出一种名为Length-Adaptive Policy Optimization (LAPO)的新框架,该框架将推理长度控制从外在约束转化为模型内在能力:首先通过两阶段强化学习机制,使模型学习成功解题所需的推理长度统计分布;随后利用该分布作为元认知引导信号,嵌入推理上下文以实现推理时的动态调整。实验表明,LAPO可在不降低准确率的前提下,将token消耗减少高达40.9%,并展现出根据问题复杂度分配计算资源的 emergent 能力。

链接: https://arxiv.org/abs/2507.15758
作者: Xingyu Wu,Yuchen Yan,Shangke Lyu,Linjuan Wu,Yiwen Qiu,Yongliang Shen,Weiming Lu,Jian Shao,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: GitHub: this https URL Project: this https URL

点击查看摘要

Abstract:Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model’s reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
zh

[NLP-11] DialogueForge: LLM Simulation of Human-Chatbot Dialogue

【速读】: 该论文旨在解决当前收集人类与聊天机器人对话数据所需的人工成本高、耗时长的问题,从而限制了对话式人工智能(Conversational AI)的研究进展。其解决方案的关键在于提出了一种名为DialogueForge的框架,通过从真实的人类-聊天机器人交互中提取种子提示(seed prompts)来初始化生成的对话,并利用不同规模的大语言模型(Large Language Models, LLMs)模拟人类用户与聊天机器人的多轮交互。该方法不仅支持多种模型(包括先进的商业模型和小型开源模型),还通过监督微调(supervised fine-tuning)技术显著提升小模型生成逼真对话的能力,从而在保证对话质量的同时降低数据采集门槛。

链接: https://arxiv.org/abs/2507.15752
作者: Ruizhe Zhu,Hao Zhu,Yaxuan Li,Syang Zhou,Shijing Cai,Malgorzata Lazuka,Elliott Ash
机构: ETH Zurich(苏黎世联邦理工学院); Calvin Risk AG; ETH AI Center, ETH Zurich(苏黎世联邦理工学院人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: For our code and data, see this https URL

点击查看摘要

Abstract:Collecting human-chatbot dialogues typically demands substantial manual effort and is time-consuming, which limits and poses challenges for research on conversational AI. In this work, we propose DialogueForge - a framework for generating AI-simulated conversations in human-chatbot style. To initialize each generated conversation, DialogueForge uses seed prompts extracted from real human-chatbot interactions. We test a variety of LLMs to simulate the human chatbot user, ranging from state-of-the-art proprietary models to small-scale open-source LLMs, and generate multi-turn dialogues tailored to specific tasks. In addition, we explore fine-tuning techniques to enhance the ability of smaller models to produce indistinguishable human-like dialogues. We evaluate the quality of the simulated conversations and compare different models using the UniEval and GTEval evaluation protocols. Our experiments show that large proprietary models (e.g., GPT-4o) generally outperform others in generating more realistic dialogues, while smaller open-source models (e.g., Llama, Mistral) offer promising performance with greater customization. We demonstrate that the performance of smaller models can be significantly improved by employing supervised fine-tuning techniques. Nevertheless, maintaining coherent and natural long-form human-like dialogues remains a common challenge across all models.
zh

[NLP-12] owards physician-centered oversight of conversational diagnostic AI

【速读】: 该论文试图解决的问题是:如何在保障患者安全的前提下,将生成式医疗智能系统(如Articulate Medical Intelligence Explorer, AMIE)应用于临床诊断流程中,同时确保由具备资质的医生进行最终决策与责任承担。现有研究虽证明了对话式AI在诊断对话中的潜力,但个体化诊断和治疗方案的提供属于受监管的专业活动,需由执业医师负责。为应对这一挑战,论文提出的关键解决方案是设计一种“带护栏的多代理系统”(guardrailed-AMIE, g-AMIE),其核心在于通过设定明确边界(guardrails)限制AI仅执行结构化病史采集任务,并自动汇总病例信息、提出初步诊断与管理建议供主治医师(PCP)在“临床驾驶舱界面”中审查。这种机制实现了诊疗过程中的“异步监督”,即AI完成初始评估后由PCP事后审阅,从而解耦问诊与监督环节,提升效率并保持临床责任归属清晰。

链接: https://arxiv.org/abs/2507.15743
作者: Elahe Vedadi,David Barrett,Natalie Harris,Ellery Wulczyn,Shashir Reddy,Roma Ruparel,Mike Schaekermann,Tim Strother,Ryutaro Tanno,Yash Sharma,Jihyeon Lee,Cían Hughes,Dylan Slack,Anil Palepu,Jan Freyberg,Khaled Saab,Valentin Liévin,Wei-Hung Weng,Tao Tu,Yun Liu,Nenad Tomasev,Kavita Kulkarni,S. Sara Mahdavi,Kelvin Guu,Joëlle Barral,Dale R. Webster,James Manyika,Avinatan Hassidim,Katherine Chou,Yossi Matias,Pushmeet Kohli,Adam Rodman,Vivek Natarajan,Alan Karthikesalingam,David Stutz
机构: Google DeepMind(谷歌深度思维); Google Research(谷歌研究); Harvard Medical School, Beth Israel Deaconess Medical Center(哈佛医学院,贝斯以色列女执事医疗中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians’ capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.
zh

[NLP-13] A Fishers exact test justification of the TF-IDF term-weighting scheme

【速读】: 该论文旨在解决TF-IDF(词频-逆文档频率)这一广泛应用于文本分析中的词权重计算方法缺乏严谨统计学理论基础的问题。其解决方案的关键在于从显著性检验的角度重新诠释TF-IDF,具体表现为:证明了常见的TF-IDF变体TF-ICF在一定正则条件下与单尾Fisher精确检验的p值的负对数密切相关;进一步在理想假设下建立了TF-IDF与该负对数p值之间的直接联系,并指出当语料库规模趋于无穷时,该统计量收敛于标准TF-IDF。这一理论框架为TF-IDF的有效性提供了坚实的统计学解释,使统计学家能够基于显著性检验原理理解其长期实践中的成功应用。

链接: https://arxiv.org/abs/2507.15742
作者: Paul Sheridan,Zeyad Ahmed,Aitazaz A. Farooque
机构: University of Prince Edward Island (爱德华王子岛大学); Canadian Centre for Climate Change and Adaptation (加拿大气候变化与适应中心); Faculty of Sustainable Design Engineering (可持续设计工程学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)
备注: 23 pages, 4 tables

点击查看摘要

Abstract:Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term’s occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the p -value from a one-tailed version of Fisher’s exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed p -value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher’s exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme’s long-established effectiveness.
zh

[NLP-14] Understanding Large Language Models Ability on Interdisciplinary Research

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在跨学科研究(Interdisciplinary Research, IDR)场景下缺乏专门评估基准的问题,从而系统性地衡量其生成高质量跨学科研究想法的能力。解决方案的关键在于构建IDRBench——一个由领域专家标注的基准数据集和一套分阶段的任务体系,涵盖IDR论文识别、IDR想法整合与IDR想法推荐三个真实科研流程阶段,且数据源自ArXiv平台覆盖六个不同学科的科学文献,确保了评价维度的明确性和真实性。通过该基准,作者为LLMs在复杂跨域科学研究中的能力评估提供了可量化的框架,并揭示了现有模型在产生高质量跨学科创意方面的局限性。

链接: https://arxiv.org/abs/2507.15736
作者: Yuanhao Shen,Daniel Xavier de Sousa,Ricardo Marçal,Ali Asad,Hongyu Guo,Xiaodan Zhu
机构: Queen’s University (女王大学); National Research Council Canada (加拿大国家研究委员会); Instituto Federal de Goiás (戈亚斯联邦学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have revealed their impressive ability to perform multi-step, logic-driven reasoning across complex domains, positioning them as powerful tools and collaborators in scientific discovery while challenging the long-held view that inspiration-driven ideation is uniquely human. However, the lack of a dedicated benchmark that evaluates LLMs’ ability to develop ideas in Interdisciplinary Research (IDR) settings poses a critical barrier to fully understanding their strengths and limitations. To address this gap, we introduce IDRBench – a pioneering benchmark featuring an expert annotated dataset and a suite of tasks tailored to evaluate LLMs’ capabilities in proposing valuable research ideas from different scientific domains for interdisciplinary research. This benchmark aims to provide a systematic framework for assessing LLM performance in complex, cross-domain scientific research. Our dataset consists of scientific publications sourced from the ArXiv platform covering six distinct disciplines, and is annotated by domain experts with diverse academic backgrounds. To ensure high-quality annotations, we emphasize clearly defined dimensions that characterize authentic interdisciplinary research. The design of evaluation tasks in IDRBench follows a progressive, real-world perspective, reflecting the natural stages of interdisciplinary research development, including 1) IDR Paper Identification, 2) IDR Idea Integration, and 3) IDR Idea Recommendation. Using IDRBench, we construct baselines across 10 LLMs and observe that despite fostering some level of IDR awareness, LLMs still struggle to produce quality IDR ideas. These findings could not only spark new research directions, but also help to develop next-generation LLMs that excel in interdisciplinary research.
zh

[NLP-15] BEnchmarking LLM s for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

【速读】: 该论文旨在解决当前用于评估眼科领域大语言模型(Large Language Models, LLMs)的基准测试在覆盖范围上受限且过度侧重准确率的问题。为应对这一挑战,作者提出了BELO(BEnchmarking LLMs for Ophthalmology),其关键在于构建了一个标准化、全面且经多位眼科专家多轮审核的高质量评测基准:通过关键词匹配与微调后的PubMedBERT模型从多个医学数据集(如BCSC、MedMCQA等)中筛选并整合眼科相关多项选择题(Multiple-Choice Questions, MCQs),剔除重复和低质量题目,并由10名眼科医生优化每道题目的解析,再由3位资深眼科专家最终审定。该方法确保了评测内容的临床准确性与推理质量,同时引入多种文本生成指标(如ROUGE-L、BERTScore等)及人类专家主观评价,从而实现对LLMs更公平、可复现且具有临床意义的综合评估。

链接: https://arxiv.org/abs/2507.15717
作者: Sahana Srinivasan,Xuguang Ai,Thaddaeus Wai Soon Lo,Aidan Gilson,Minjie Zou,Ke Zou,Hyunjae Kim,Mingjia Yang,Krithi Pushpanathan,Samantha Yew,Wan Ting Loke,Jocelyn Goh,Yibing Chen,Yiming Kong,Emily Yuelei Fu,Michelle Ongyong Hui,Kristen Nwanyanwu,Amisha Dave,Kelvin Zhenghao Li,Chen-Hsin Sun,Mark Chia,Gabriel Dawei Yang,Wendy Meihua Wong,David Ziyou Chen,Dianbo Liu,Maxwell Singer,Fares Antaki,Lucian V Del Priore,Jost Jonas,Ron Adelman,Qingyu Chen,Yih-Chung Tham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice-questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ’s correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO’s utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
zh

[NLP-16] From Queries to Criteria: Understanding How Astronomers Evaluate LLM s

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在科学领域(尤其是天文学)应用中缺乏有效评估基准的问题,因为现有评估方法未能反映真实用户在实际科研场景中的使用方式与评价标准。其解决方案的关键在于通过实证研究理解用户如何评估LLM系统:作者部署了一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的天文文献交互机器人,并对4周内368条查询及11位天文学家的访谈进行归纳编码,提炼出人类评估LLM的核心维度(如问题类型、响应准确性、实用性等),进而提出可操作的基准构建建议,并据此开发了一个面向天文学领域的样本评估基准,从而提升LLM在科学研究中的评估效度与可用性。

链接: https://arxiv.org/abs/2507.15715
作者: Alina Hyk,Kiera McCormick,Mian Zhong,Ioana Ciucă,Sanjib Sharma,John F Wu,J. E. G. Peek,Kartheik G. Iyer,Ziang Xiao,Anjalie Field
机构: Oregon State University(俄勒冈州立大学); Loyola University Maryland(洛约拉玛丽蒙特大学); Johns Hopkins University(约翰霍普金斯大学); Stanford University(斯坦福大学); Space Telescope Science Institute(空间望远镜科学研究所); Columbia University(哥伦比亚大学)
类目: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注: Accepted to the Conference on Language Modeling 2025 (COLM), 22 pages, 6 figures

点击查看摘要

Abstract:There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for evaluating LLMs for astronomy. Overall, our work offers ways to improve LLM evaluation and ultimately usability, particularly for use in scientific research.
zh

[NLP-17] Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Models Capability of Emotion Perception using Contrastive Learning

【速读】: 该论文旨在解决跨语言文本情感识别(text-based emotion detection)中的挑战,特别是针对情感表达多样性与背景差异导致的模型泛化能力不足问题。其解决方案的关键在于系统性地探索两种对比学习方法:基于样本的对比学习(Contrastive Reasoning Calibration)和基于生成的对比学习(DPO、SimPO)。前者通过对比两个样本提升预测可靠性,后者则通过区分正确与错误生成结果来优化模型输出,二者均在LLaMa3-Instruct-8B基础上进行微调,最终在多语言环境下实现了优异性能,在英语赛道分别取得第9名(Track A)和第6名(Track B),并在其他语言中跻身顶尖系统行列。

链接: https://arxiv.org/abs/2507.15714
作者: Tian Li,Yujian Sun,Huizhi Liang
机构: Newcastle University (纽卡斯尔大学); Shumei AI Research Institute (书梅人工智能研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The SemEval-2025 Task 11, Bridging the Gap in Text-Based Emotion Detection, introduces an emotion recognition challenge spanning over 28 languages. This competition encourages researchers to explore more advanced approaches to address the challenges posed by the diversity of emotional expressions and background variations. It features two tracks: multi-label classification (Track A) and emotion intensity prediction (Track B), covering six emotion categories: anger, fear, joy, sadness, surprise, and disgust. In our work, we systematically explore the benefits of two contrastive learning approaches: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO) contrastive learning. The sample-based contrastive approach trains the model by comparing two samples to generate more reliable predictions. The generation-based contrastive approach trains the model to differentiate between correct and incorrect generations, refining its prediction. All models are fine-tuned from LLaMa3-Instruct-8B. Our system achieves 9th place in Track A and 6th place in Track B for English, while ranking among the top-tier performing systems for other languages.
zh

[NLP-18] Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

【速读】: 该论文试图解决的问题是:不同类型的题目(如选择题、判断题、简答/长答题)对大型语言模型(Large Language Models, LLMs)在推理任务中准确率的影响机制尚不明确。解决方案的关键在于通过定量和演绎推理任务,系统评估五种主流LLMs在三种不同题型下的表现,重点分析推理步骤准确性与最终答案选择准确性之间的关系,并识别选项数量和词汇选择等因素对模型性能的具体影响。研究发现,题型显著影响LLM的推理能力,且推理过程准确性与最终答案正确性并不必然一致,揭示了当前LLM在复杂推理任务中存在潜在的认知偏差。

链接: https://arxiv.org/abs/2507.15707
作者: Seok Hwan Song,Mohna Chakraborty,Qi Li,Wallapak Tavanapong
机构: Iowa State University (爱荷华州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words, influence LLM performance.
zh

[NLP-19] Compositional Understanding in Signaling Games

【速读】: 该论文试图解决标准信号博弈模型中接收者难以学习组合信息的问题,即即使发送者传递的是组合性信息,接收者也无法进行组合性理解;当某一信息组件丢失或遗忘时,其他组件的信息也随之消失。解决方案的关键在于构建两种新型信号博弈模型:一种是仅从信号的原子成分中学习的极简接收者模型,另一种是从所有可用信息中学习的通用接收者模型。这两种模型结构更简洁,且使接收者能够基于信号的原子组件进行有效学习,从而实现真正的组合性理解演化。

链接: https://arxiv.org/abs/2507.15706
作者: David Peter Wallis Freeborn
机构: Northeastern University London (伦敦东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Receivers in standard signaling game models struggle with learning compositional information. Even when the signalers send compositional messages, the receivers do not interpret them compositionally. When information from one message component is lost or forgotten, the information from other components is also erased. In this paper I construct signaling game models in which genuine compositional understanding evolves. I present two new models: a minimalist receiver who only learns from the atomic messages of a signal, and a generalist receiver who learns from all of the available information. These models are in many ways simpler than previous alternatives, and allow the receivers to learn from the atomic components of messages.
zh

[NLP-20] CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

【速读】: 该论文旨在解决现有过程奖励模型(Process Reward Models, PRMs)中存在的普遍长度偏差问题,即PRMs倾向于给更长的推理步骤赋予更高的评分,即使语义内容和逻辑有效性保持不变,这会削弱奖励预测的可靠性并导致推理过程中产生冗余输出。解决方案的关键在于提出一种统一的去偏框架CoLD(Counterfactually-Guided Length Debiasing),其核心包括三个组成部分:显式的长度惩罚调整、用于捕捉虚假长度相关信号的可学习偏差估计器,以及强制奖励预测在长度变化下保持不变的联合训练策略;该方法基于反事实推理和因果图分析,有效降低了奖励与长度之间的相关性,提升了推理步骤选择的准确性,并鼓励生成更简洁且逻辑有效的推理路径。

链接: https://arxiv.org/abs/2507.15698
作者: Congmin Zheng,Jiachen Zhu,Jianghao Lin,Xinyi Dai,Yong Yu,Weinan Zhang,Mengyue Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD(Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
zh

[NLP-21] P3: Prompts Promote Prompting ACL2025

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)应用中因系统提示(system prompt)与用户提示(user prompt)各自独立优化而导致的性能瓶颈问题。由于这两类提示在实际交互中具有高度依赖性,单一优化策略难以实现全局最优。解决方案的关键在于提出一种名为P3的自提升框架,通过迭代方式同时优化系统和用户提示,并利用离线优化后的提示进一步支持在线查询相关的提示优化,从而实现对LLM行为的协同增强。实验表明,该方法在通用任务(如Arena-hard、Alpaca-eval)和推理任务(如GSM8K、GPQA)上均显著优于现有单向优化策略。

链接: https://arxiv.org/abs/2507.15675
作者: Xinyu Zhang,Yuanquan Hu,Fangchao Liu,Zhicheng Dou
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 findings

点击查看摘要

Abstract:Current large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts, to guide model behaviors. While recent advancements have demonstrated the efficacy of automatically optimizing either the system or user prompt to boost performance, such unilateral approaches often yield suboptimal outcomes due to the interdependent nature of these components. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts through an iterative process. The offline optimized prompts are further leveraged to promote online prompting by performing query-dependent prompt optimization. Extensive experiments on general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA) demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization. Our results highlight the effectiveness of a holistic optimization strategy in enhancing LLM performance across diverse domains.
zh

[NLP-22] Leverag ing Context for Multimodal Fallacy Classification in Political Debates ACL2025

【速读】: 该论文旨在解决多模态论辩挖掘(multimodal argument mining)中政治辩论里逻辑谬误识别的问题,其核心挑战在于如何有效融合文本、音频等多模态信息以提升谬误分类性能。解决方案的关键在于采用预训练的Transformer模型,并提出多种利用上下文信息的方法;实验表明,尽管多模态模型在宏平均F1分数上(0.4403)与纯文本模型(0.4444)相当,但已展现出融合多模态特征的潜力,为后续优化提供了方向。

链接: https://arxiv.org/abs/2507.15641
作者: Alessio Pittiglio
机构: DISI, University of Bologna (博洛尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12th Workshop on Argument Mining (ArgMining 2025) @ ACL 2025

点击查看摘要

Abstract:In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.
zh

[NLP-23] Data Mixing Agent : Learning to Re-weight Domains for Continual Pre-training

【速读】: 该论文旨在解决大语言模型在小规模任务特定数据上进行持续预训练时面临的灾难性遗忘(catastrophic forgetting)问题,即模型在提升新目标领域性能的同时容易丢失原有能力。解决方案的关键在于提出首个基于模型的、端到端的数据混合代理(Data Mixing Agent),该代理通过强化学习从大量数据混合轨迹中学习通用的域重加权策略,从而自动优化源域与目标域数据的混合比例,实现跨域性能平衡。该方法无需人工设计启发式规则,且在数学推理和代码生成等多个场景下展现出良好的泛化能力和高效性。

链接: https://arxiv.org/abs/2507.15640
作者: Kailai Yang,Xiao Liu,Lei Ji,Hao Li,Yeyun Gong,Peng Cheng,Mao Yang
机构: The University of Manchester (曼彻斯特大学); Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents’ well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.
zh

[NLP-24] Conflicting narratives and polarization on social media

【速读】: 该论文试图解决的问题是:如何通过分析政治话语中冲突性叙事(conflicting narratives)来揭示公共领域内极化(polarization)与议题对齐(issue alignment)的语 discourse 机制。解决方案的关键在于,从对立意见群体的推文(tweets)中提取文本信号,识别出两类核心叙事差异:一是对同一组行动者(actants)赋予不同的角色归属(如对北约在乌克兰战争中的角色存在分歧),二是对同一事件使用不同行动者进行情节构建(如右翼新冠叙事中将比尔·盖茨作为关键角色)。此外,研究还首次提供了叙事对齐(narrative alignment)模式的证据,即政治行为者如何利用叙事策略在不同议题间实现观点整合。这一方法为理解社交媒体中政治极化的深层语义结构提供了新的分析框架。

链接: https://arxiv.org/abs/2507.15600
作者: Armin Pournaki
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 30 pages, 7 figures

点击查看摘要

Abstract:Narratives are key interpretative devices by which humans make sense of political reality. In this work, we show how the analysis of conflicting narratives, i.e. conflicting interpretive lenses through which political reality is experienced and told, provides insight into the discursive mechanisms of polarization and issue alignment in the public sphere. Building upon previous work that has identified ideologically polarized issues in the German Twittersphere between 2021 and 2023, we analyze the discursive dimension of polarization by extracting textual signals of conflicting narratives from tweets of opposing opinion groups. Focusing on a selection of salient issues and events (the war in Ukraine, Covid, climate change), we show evidence for conflicting narratives along two dimensions: (i) different attributions of actantial roles to the same set of actants (e.g. diverging interpretations of the role of NATO in the war in Ukraine), and (ii) emplotment of different actants for the same event (e.g. Bill Gates in the right-leaning Covid narrative). Furthermore, we provide first evidence for patterns of narrative alignment, a discursive strategy that political actors employ to align opinions across issues. These findings demonstrate the use of narratives as an analytical lens into the discursive mechanisms of polarization.
zh

[NLP-25] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因检索噪声(retrieval noises)导致大语言模型(Large Language Models, LLMs)生成质量下降的问题。现有方法在提取证据时缺乏显式推理过程,易遗漏关键线索且泛化能力弱。解决方案的关键在于提出LEARN(Learning to Extract Rational Evidence),其核心创新包括:(1) 显式地进行推理以识别检索内容中的潜在线索,随后(2) 主动提取这些线索以避免遗漏;通过将证据推理与提取统一为端到端训练的响应,引入知识标记掩码(knowledge token masks)解耦推理和提取路径,并设计三种可验证的奖励函数(答案正确性、长度、格式)基于策略优化算法更新模型,从而提升证据的紧凑性和质量,显著改善下游任务准确率,并推动在线RAG系统的实际应用。

链接: https://arxiv.org/abs/2507.15586
作者: Xinping Zhao,Shouzheng Huang,Yan Zhong,Xinshuo Hu,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注: 16 pages, 7 Figures, 10 Tables

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
zh

[NLP-26] Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging

【速读】: 该论文旨在解决太赫兹(Terahertz, THz)成像在图像分类任务中面临的挑战,包括标注数据稀缺、图像分辨率低以及视觉信息模糊等问题。其解决方案的关键在于引入上下文学习(In-Context Learning, ICL)结合视觉-语言模型(Vision-Language Models, VLMs),通过模态对齐的提示框架(modality-aligned prompting framework)适配开放权重的VLMs至THz领域,在零样本(zero-shot)和少样本(one-shot)设置下实现无需微调(no fine-tuning)的高效分类与可解释性提升。这是首次将ICL增强的VLM应用于THz成像,为资源受限的科学领域提供了新的可行路径。

链接: https://arxiv.org/abs/2507.15576
作者: Nicolas Poggi,Shashank Agnihotri,Margret Keuper
机构: Data and Web Science Group, University of Mannheim, Germany (德国曼海姆大学数据与网络科学组); Max-Planck-Institute for Informatics, Saarland Informatics Campus, Germany (德国萨尔兰计算机科学校园马克斯·普朗克信息研究所)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: \hrefthis https URLGitHub repository.
zh

[NLP-27] Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

【速读】: 该论文旨在解决文本去毒(text detoxification)任务中多语言文本风格转换(Text Style Transfer, TST)评估的可靠性问题,特别是针对现有自动指标与人类判断之间存在显著差距、且多数研究局限于英文语境的局限性。其解决方案的关键在于首次开展涵盖九种语言(包括英语、西班牙语、德语、中文、阿拉伯语、印地语、乌克兰语、俄语和阿姆哈拉语)的综合性多语言评估研究,借鉴机器翻译领域的评估范式,对比现代神经网络模型与基于提示的大型语言模型作为裁判(LLM-as-a-judge)两种方法的有效性,从而为构建更可靠的多语言TST评估流程提供实证依据和实践指南。

链接: https://arxiv.org/abs/2507.15557
作者: Vitaly Protasov,Nikolay Babakov,Daryna Dementieva,Alexander Panchenko
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on evaluation of text detoxification system across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, Amharic. Drawing inspiration from the machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipeline in the text detoxification case.
zh

[NLP-28] Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

【速读】: 该论文旨在解决当前测试时扩展(Test-Time Scaling, TTS)方法中训练依赖性过强导致计算开销增加的问题,尤其是在推理阶段通过持续强化学习等训练型TTS方法提升大语言模型(Large Language Models, LLMs)推理能力时所面临的效率瓶颈。其解决方案的关键在于提出一种无需训练的细粒度推理增强框架——条件步级自精炼(Conditional Step-level Self-refinement),该方法基于过程验证机制实现对推理步骤的逐级优化;在此基础上进一步融合多种经典并行扩展策略,在步骤层面构建混合测试时扩展(Hybrid Test-Time Scaling)范式,从而在不引入额外训练成本的前提下显著拓展LLMs的推理性能边界。

链接: https://arxiv.org/abs/2507.15512
作者: Kaiyan Chang,Yonghao Shi,Chenglong Wang,Hang Zhou,Chi Hu,Xiaoqian Liu,Yingfeng Luo,Yuan Ge,Tong Xiao,Jingbo Zhu
机构: Northeastern University (东北大学); NiuTrans Research (牛津研究); ByteDance (字节跳动)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
zh

[NLP-29] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

【速读】: 该论文旨在解决强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中出现的过优化(overoptimization)问题。过优化表现为:随着训练进行,语言模型生成的响应分布与奖励模型(Reward Model, RM)训练时所见的数据分布发生偏移,导致RM对策略梯度的估计变得不一致,进而使模型虽获得更高奖励分数,但实际行为偏离人类偏好。解决方案的关键在于提出离策略修正奖励建模(Off-Policy Corrected Reward Modeling, OCRM),通过重要性加权对RM进行迭代式离策略校正,无需新增标注或样本即可提升RM的准确性,从而改善最终策略的质量。

链接: https://arxiv.org/abs/2507.15507
作者: Johannes Ackermann,Takashi Ishida,Masashi Sugiyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accept at the Conference On Language Modeling (COLM) 2025

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at this https URL
zh

[NLP-30] ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution ACL2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在驱动数字助理执行复杂动作时面临的两大挑战:一是高质量任务数据的稀缺性,二是评估模型在特定助手库基础上生成程序时的鲁棒性不足。为应对这些问题,作者提出ASPERA框架,其关键在于包含一个助手库仿真环境和一个人工辅助的LLM数据生成引擎,该引擎可引导开发者生成包含复杂用户查询、模拟状态及对应验证程序的高质量任务,从而有效提升训练数据的多样性与真实性,并增强对LLM在定制化助手库中编程能力的评估可靠性。

链接: https://arxiv.org/abs/2507.15501
作者: Alexandru Coca,Mark Gaynor,Zhenxing Zhang,Jianpeng Cheng,Bo-Hsiang Tseng,Pete Boothroyd,Héctor Martinez Alonso,Diarmuid Ó Séaghdha,Anders Johannsen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages, 22 figures. To appear at ACL 2025

点击查看摘要

Abstract:This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. These assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.
zh

[NLP-31] AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对训练中未充分覆盖的算法问题时,其识别算法相似问题(Algorithmically Similar Problems, ASPs)的能力是否具备泛化性的问题。现有研究表明,LLMs在竞赛编程任务上表现优异,但其能否准确识别虽文本描述不同却需相同算法策略的问题仍不明确。为评估这一能力,作者构建了AlgoSimBench基准数据集,包含1317个标注有231种细粒度算法标签的问题,并设计402道多选题(MCQ),每题包含一个ASP及三个仅文本相似但算法不同的干扰项。实验表明,当前最优模型(o3-mini)仅达65.9%准确率,说明LLMs在ASP识别上存在显著不足。为此,论文提出尝试解法匹配(Attempted Solution Matching, ASM)方法,通过比较模型对问题生成的解法步骤来增强算法相似性判断,相较基线提升6.7%至11.7%准确率;进一步结合BM25关键词优先检索策略,可使准确率最高提升至52.2%,证明ASM与结构化检索协同优化是提升ASP识别性能的关键。

链接: https://arxiv.org/abs/2507.15378
作者: Jierui Li,Raymond Mooney
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注: 19 pages, pre-print only

点击查看摘要

Abstract:Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs’ ability to identify algorithmically similar problems (ASPs)-problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7% to 11.7% across different models. We also evaluated code embedding models and retrieval methods on similar problem identification. While the adversarial selection of problems degrades the performance to be less than random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2% accuracy. Code and data are available at this http URL
zh

[NLP-32] STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

【速读】: 该论文旨在解决当前语音语言模型(Spoken Language Models, SLMs)缺乏内部未发声思考过程的问题,而人类在交流前通常会进行复杂的内在推理以实现清晰、简洁的表达。为实现这一目标,作者提出了一种名为Stitch的新生成方法,其关键在于通过交替生成未发声的思维片段(unspoken reasoning chunks)和已发声的响应片段(spoken response chunks),充分利用音频播放时的空闲时间来完成推理计算。由于单个语音片段的音频时长远大于生成对应文本所需的时间,模型可在播放当前语音的同时生成下一个思维片段,从而实现“边思考边说话”,既保持了与无法生成未发声链式思维(Chain-of-Thought, CoT)基线模型相当的延迟,又在数学推理数据集上性能提升15%,且在非推理任务上表现相当。

链接: https://arxiv.org/abs/2507.15375
作者: Cheng-Han Chiang,Xiaofei Wang,Linjie Li,Chung-Ching Lin,Kevin Lin,Shujie Liu,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang
机构: National Taiwan University (台湾大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Work in progress. Project page: this https URL

点击查看摘要

Abstract:Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: this https URL.
zh

[NLP-33] Metaphor and Large Language Models : When Surface Features Matter More than Deep Understanding

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在隐喻理解任务中评估不充分的问题,即以往研究多基于单一数据集、特定任务设置,并常使用通过词汇替换构造的人工数据,难以反映模型对真实语境下隐喻的处理能力。其解决方案的关键在于:通过在多个公开可用的数据集上开展系统性实验,涵盖自然语言推理(Natural Language Inference, NLI)和问答(Question Answering, QA)两类任务,且这些数据集均包含推理与隐喻标注,从而更全面地评估LLMs在不同prompt配置下的表现。研究发现,LLMs的隐喻理解性能主要受词项重叠度和句子长度等表层特征影响,而非真正意义上的语义理解,揭示了所谓“涌现能力”实为表面特征、上下文学习和语言知识共同作用的结果。

链接: https://arxiv.org/abs/2507.15357
作者: Elisa Sanchez-Bayona,Rodrigo Agerri
机构: HiTZ Center - Ixa (HiTZ中心 - Ixa); University of the Basque Country UPV/EHU (巴斯克大学 UPV/EHU)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs’ performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available.
zh

[NLP-34] Probing Information Distribution in Transformer Architectures through Entropy Analysis

【速读】: 该论文旨在解决如何深入理解Transformer架构中信息分布与处理机制的问题,特别是揭示模型内部表示和行为的可解释性。其解决方案的关键在于引入熵分析(entropy analysis)方法,通过量化每个token层面的不确定性并考察不同处理阶段的熵模式,从而系统地探究信息在模型中的管理与转换过程。这一方法为解析生成式AI(Generative AI)模型的行为提供了新的视角,并有助于构建更有效的可解释性和评估框架。

链接: https://arxiv.org/abs/2507.15347
作者: Amedeo Buonanno,Alessandro Rivetti,Francesco A. N. Palmieri,Giovanni Di Gennaro,Gianmarco Romano
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Presented to the Italian Workshop on Neural Networks (WIRN2025) and it will appear in a Springer Chapter

点击查看摘要

Abstract:This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may offer insights into model behavior and contribute to the development of interpretability and evaluation frameworks for transformer-based models
zh

[NLP-35] LionGuard 2: Building Lightweight Data-Efficient Localised Multilingual Content Moderators

【速读】: 该论文旨在解决多语言内容审核系统在本地化和低资源语种(如新加坡的马来语、 Tamil)中存在安全漏洞的问题,同时应对小模型在数据与计算资源有限情况下性能不足的挑战。解决方案的关键在于构建一个轻量级、多语言的内容审核分类器 LionGuard 2,其核心创新包括:利用预训练的 OpenAI 嵌入(embeddings)实现跨语言语义对齐,结合多头序数分类器(multi-head ordinal classifier)提升判别能力,并基于高质量本地数据进行训练,从而在不微调大语言模型(LLM)的前提下,在新加坡特定及公开英文基准上显著优于多个商业和开源系统。

链接: https://arxiv.org/abs/2507.15339
作者: Leanne Tan,Gabriel Chua,Ziyu Ge,Roy Ka-Wei Lee
机构: GovTech(新加坡政府科技局); Singapore University of Technology and Design(新加坡科技设计大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.
zh

[NLP-36] Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

【速读】: 该论文试图解决的问题是:多选题问答(Multiple-Choice Question Answering, MCQA)是否仍能作为评估大语言模型(Large Language Models, LLMs)在真实下游任务中性能的有效代理指标。随着模型推理能力的提升,MCQA基准可能因引入额外信息(如选项本身)而产生偏差,从而无法准确反映模型的真实推理能力。解决方案的关键在于系统性地评估15个不同问答基准和25个不同规模的LLM,并控制变量分析不同提示策略的影响,特别是链式思维(Chain-of-Thought, CoT)推理在选项呈现前后的使用方式。研究发现,当模型仅在选项呈现前进行CoT推理时,MCQA仍是良好代理;但若允许模型在看到选项后继续推理,则大型模型会显著依赖选项信息而表现优于自由文本回答,导致MCQA不再可靠。因此,论文提出应设计更鲁棒、抗偏倚的基准测试,以更真实地衡量LLMs的推理能力。

链接: https://arxiv.org/abs/2507.15337
作者: Narun Raman,Taylor Lundy,Kevin Leyton-Brown
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, HLE) and 25 different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether “none of the above” sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs’ genuine reasoning capabilities.
zh

[NLP-37] On the Inevitability of Left-Leaning Political Bias in Aligned Language Models

【速读】: 该论文试图解决的问题是:当前人工智能对齐(AI alignment)实践中所隐含的政治倾向性与主流研究对其“左翼偏见”的负面评价之间的矛盾。具体而言,作者指出,旨在使大语言模型(LLMs)具备无害性(harmless)、有用性(helpful)和诚实性(honest)的对齐目标,本质上依赖于以减少伤害、包容性、公平性和实证真实性为核心的进步主义道德框架,这与左翼政治原则高度一致;而右翼意识形态则常与这些对齐准则相冲突。然而,现有研究却将这种左翼倾向视为风险或问题,从而在事实上削弱了AI对齐的核心原则(HHH)。该论文解决方案的关键在于揭示:AI对齐本身必然导向左翼政治倾向,因此不应将其视为缺陷,而应承认其正当性,并批判性反思当前学界对左翼偏见的污名化态度,以维护HHP原则的完整性。

链接: https://arxiv.org/abs/2507.15328
作者: Thilo Hagendorff
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The guiding principle of AI alignment is to train large language models (LLMs) to be harmless, helpful, and honest (HHH). At the same time, there are mounting concerns that LLMs exhibit a left-wing political bias. Yet, the commitment to AI alignment cannot be harmonized with the latter critique. In this article, I argue that intelligent systems that are trained to be harmless and honest must necessarily exhibit left-wing political bias. Normative assumptions underlying alignment objectives inherently concur with progressive moral frameworks and left-wing principles, emphasizing harm avoidance, inclusivity, fairness, and empirical truthfulness. Conversely, right-wing ideologies often conflict with alignment guidelines. Yet, research on political bias in LLMs is consistently framing its insights about left-leaning tendencies as a risk, as problematic, or concerning. This way, researchers are actively arguing against AI alignment, tacitly fostering the violation of HHH principles.
zh

[NLP-38] Beyond Easy Wins: A Text Hardness-Aware Benchmark for LLM -generated Text Detection

【速读】: 该论文旨在解决当前AI文本检测器评估中存在的两大核心问题:一是现有评估方法过度依赖传统指标(如AUROC),忽视了即使较低的假阳性率也会显著阻碍检测系统在真实场景中的部署;二是缺乏对检测器在不同领域和对抗性情境下性能稳定性的考量,而实际部署中预设阈值配置要求模型具备高度稳定性。解决方案的关键在于提出SHIELD基准,将可靠性与稳定性整合为统一的评估指标,并开发了一种后处理、模型无关的人类化框架,通过可控难度参数调整AI文本的“人类特征”,从而有效挑战当前最先进的零样本检测方法在可靠性和稳定性方面的表现。

链接: https://arxiv.org/abs/2507.15286
作者: Navid Ayoobi,Sadat Shahriar,Arjun Mukherjee
机构: University of Houston (休斯顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a novel evaluation paradigm for AI text detectors that prioritizes real-world and equitable assessment. Current approaches predominantly report conventional metrics like AUROC, overlooking that even modest false positive rates constitute a critical impediment to practical deployment of detection systems. Furthermore, real-world deployment necessitates predetermined threshold configuration, making detector stability (i.e. the maintenance of consistent performance across diverse domains and adversarial scenarios), a critical factor. These aspects have been largely ignored in previous research and benchmarks. Our benchmark, SHIELD, addresses these limitations by integrating both reliability and stability factors into a unified evaluation metric designed for practical assessment. Furthermore, we develop a post-hoc, model-agnostic humanification framework that modifies AI text to more closely resemble human authorship, incorporating a controllable hardness parameter. This hardness-aware approach effectively challenges current SOTA zero-shot detection methods in maintaining both reliability and stability. (Data and code: this https URL)
zh

[NLP-39] A Novel Self-Evolution Framework for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预训练阶段受限于通用知识,导致其在特定领域认知能力不足的问题,尽管现有后训练策略(如基于记忆的检索或偏好优化)能提升用户对齐度,但难以增强模型的领域专长。解决方案的关键在于提出一种双阶段自进化(Dual-Phase Self-Evolution, DPSE)框架,通过引入Censor模块提取多维交互信号并估计满意度分数,指导基于主题感知和偏好驱动的数据扩展策略;随后利用扩展数据进行两阶段微调:监督式领域锚定与频率感知的偏好优化,从而实现用户偏好适应与领域专长的协同提升。

链接: https://arxiv.org/abs/2507.15281
作者: Haoran Sun,Zekun Zhang,Shaoning Zeng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The capabilities of Large Language Models (LLMs) are limited to some extent by pre-training, so some researchers optimize LLMs through post-training. Existing post-training strategies, such as memory-based retrieval or preference optimization, improve user alignment yet fail to enhance the model’s domain cognition. To bridge this gap, we propose a novel Dual-Phase Self-Evolution (DPSE) framework that jointly optimizes user preference adaptation and domain-specific competence. DPSE introduces a Censor module to extract multi-dimensional interaction signals and estimate satisfaction scores, which guide structured data expansion via topic-aware and preference-driven strategies. These expanded datasets support a two-stage fine-tuning pipeline: supervised domain grounding followed by frequency-aware preference optimization. Experiments across general NLP benchmarks and long-term dialogue tasks demonstrate that DPSE consistently outperforms Supervised Fine-Tuning, Preference Optimization, and Memory-Augmented baselines. Ablation studies validate the contribution of each module. In this way, our framework provides an autonomous path toward continual self-evolution of LLMs.
zh

[NLP-40] ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling

【速读】: 该论文旨在解决中文医疗领域高质量数据资源匮乏的问题,现有中文医疗数据集在规模和覆盖范围上均不足,难以支持大语言模型(Large Language Model, LLM)的预训练与强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)等复杂训练流程。解决方案的关键在于构建一个大规模、多源融合的中文医疗数据集ChiMed 2.0,其包含204.4M汉字,涵盖传统中医典籍与现代医学文本,并提供用于预训练(164.8K文档)、监督微调(351.6K问答对)及RLHF(41.7K偏好数据元组)的结构化数据,从而有效支撑中文医疗LLM的全流程训练与性能提升。

链接: https://arxiv.org/abs/2507.15275
作者: Yuanhe Tian,Junjie Liu,Zhizhou Kou,Yuxiang Li,Yan Song
机构: University of Washington (华盛顿大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset’s effectiveness and applicability.
zh

[NLP-41] A2TTS: TTS for Low Resource Indian Languages

【速读】: 该论文旨在解决生成式语音合成(Text-to-Speech, TTS)系统在面对未见过的说话人(zero-shot setting)时难以保持语音特征一致性,以及支持多种印度语言(如孟加拉语、古吉拉特语、印地语等)时存在泛化能力不足的问题。其解决方案的关键在于:采用基于扩散模型(Diffusion-based TTS)的架构,结合短参考音频提取的说话人嵌入(speaker embedding)来条件化DDPM解码器,实现多说话人语音生成;同时引入基于交叉注意力机制的持续时间预测模块(duration prediction mechanism),利用参考音频增强韵律建模和自然度,并通过无分类器引导(classifier-free guidance)提升对未知说话人的零样本语音生成质量。

链接: https://arxiv.org/abs/2507.15272
作者: Ayush Singh Bhadoriya,Abhishek Nikunj Shinde,Isha Pandey,Ganesh Ramakrishnan
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We present a speaker conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employed classifier free guidance, allowing the system to generate speech more near speech for unknown speakers. Using this approach, we trained language-specific speaker-conditioned models. Using the IndicSUPERB dataset for multiple Indian languages such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.
zh

[NLP-42] GREAT: Guiding Query Generation with a Trie for Recommending Related Search about Video at Kuaishou

【速读】: 该论文旨在解决短视频平台中视频相关搜索场景下的物品到查询(Item-to-Query, I2Q)推荐问题,即如何在用户浏览短视频时,基于视频内容精准推荐与其语义相关的搜索查询。当前主流方法依赖嵌入相似度匹配,缺乏对视频与查询之间深层语义交互的建模。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的新型框架 GREAT,该框架通过构建基于高曝光和高点击率查询的查询 trie 结构,在训练阶段增强 LLM 生成高质量查询的能力,并在推理阶段利用 trie 引导 token 级别生成过程,最终结合后处理模块进一步优化项与查询之间的相关性和文本质量,从而提升 I2Q 推荐效果。

链接: https://arxiv.org/abs/2507.15267
作者: Ninglu Shao,Jinshan Wang,Chenxu Wang,Qingbiao Li,Xiaoxue Zang,Han Li
机构: Renmin University of China (中国人民大学); Kuaishou Technology Co., Ltd. (快手科技有限公司)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Currently, short video platforms have become the primary place for individuals to share experiences and obtain information. To better meet users’ needs for acquiring information while browsing short videos, some apps have introduced a search entry at the bottom of videos, accompanied with recommended relevant queries. This scenario is known as query recommendation in video-related search, where core task is item-to-query (I2Q) recommendation. As this scenario has only emerged in recent years, there is a notable scarcity of academic research and publicly available datasets in this domain. To address this gap, we systematically examine the challenges associated with this scenario for the first time. Subsequently, we release a large-scale dataset derived from real-world data pertaining to the query recommendation in video-\textit\textbfrelated \textit\textbfsearch on the \textit\textbfKuaishou app (\textbfKuaiRS). Presently, existing methods rely on embeddings to calculate similarity for matching short videos with queries, lacking deep interaction between the semantic content and the query. In this paper, we introduce a novel LLM-based framework named \textbfGREAT, which \textit\textbfguides que\textit\textbfry g\textit\textbfener\textit\textbfation with a \textit\textbftrie to address I2Q recommendation in related search. Specifically, we initially gather high-quality queries with high exposure and click-through rate to construct a query-based trie. During training, we enhance the LLM’s capability to generate high-quality queries using the query-based trie. In the inference phase, the query-based trie serves as a guide for the token generation. Finally, we further refine the relevance and literal quality between items and queries via a post-processing module. Extensive offline and online experiments demonstrate the effectiveness of our proposed method.
zh

[NLP-43] SOI Matters: Analyzing Multi-Setting Training Dynamics in Pretrained Language Models via Subsets of Interest

【速读】: 该论文旨在解决预训练语言模型在多任务、多语言和多源学习场景下,其鲁棒性与性能如何变化的问题。核心挑战在于理解不同学习设置对模型训练动态的影响,尤其是识别哪些样本在训练过程中表现出特定的行为模式(如遗忘、未学习或始终正确)。解决方案的关键在于提出一种新颖的分类框架——兴趣子集(Subsets of Interest, SOI),用于刻画六类典型的学习行为模式,并通过SOI转移热图与数据集制图可视化技术,揭示样本在单设置向多设置迁移时的行为演化规律。此外,论文进一步设计了两阶段微调策略,利用SOI子集选择机制优化第二阶段训练,从而显著提升模型性能,尤其在跨分布场景下效果明显。

链接: https://arxiv.org/abs/2507.15236
作者: Shayan Vassef,Amirhossein Dabiriaghdam,Mohammadreza Bakhtiari,Yadollah Yaghoobzadeh
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of British Columbia (英属哥伦比亚大学); Stony Brook University (石溪大学); University of Tehran (德黑兰大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work investigates the impact of multi-task, multi-lingual, and multi-source learning approaches on the robustness and performance of pretrained language models. To enhance this analysis, we introduce Subsets of Interest (SOI), a novel categorization framework that identifies six distinct learning behavior patterns during training, including forgettable examples, unlearned examples, and always correct examples. Through SOI transition heatmaps and dataset cartography visualization, we analyze how examples shift between these categories when transitioning from single-setting to multi-setting configurations. We perform comprehensive experiments across three parallel comparisons: multi-task vs. single-task learning using English tasks (entailment, paraphrase, sentiment), multi-source vs. single-source learning using sentiment analysis datasets, and multi-lingual vs. single-lingual learning using intent classification in French, English, and Persian. Our results demonstrate that multi-source learning consistently improves out-of-distribution performance by up to 7%, while multi-task learning shows mixed results with notable gains in similar task combinations. We further introduce a two-stage fine-tuning approach where the second stage leverages SOI-based subset selection to achieve additional performance improvements. These findings provide new insights into training dynamics and offer practical approaches for optimizing multi-setting language model performance.
zh

[NLP-44] Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems INTERSPEECH-2025

【速读】: 该论文旨在解决语音识别中说话人验证(speaker verification)与语音匿名化(voice anonymization)系统在面对基于语音时间动态特征(speech temporal dynamics)的攻击时所暴露的安全漏洞问题。其解决方案的关键在于提出一种新的表示方法,即从语音的时间动态特性中提取上下文相关的持续时间嵌入(context-dependent duration embeddings),并基于此类嵌入构建新型攻击模型。实验结果表明,该方法相较于文献中已有的简单语音时间动态表征,在原始语音和匿名化语音数据上均显著提升了说话人验证性能,从而揭示了现有系统的潜在脆弱性。

链接: https://arxiv.org/abs/2507.15214
作者: Natalia Tomashenko,Emmanuel Vincent,Marc Tommasi
机构: Université de Lorraine, CNRS, Inria, LORIA (洛林大学, 国家科学研究中心, 法国国家信息与自动化研究院, 洛林计算机科学与应用数学研究所); Université de Lille, CNRS, Inria, Centrale Lille (里尔大学, 国家科学研究中心, 法国国家信息与自动化研究院, 里尔中央理工学院)
类目: ound (cs.SD); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech-2025

点击查看摘要

Abstract:The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization this http URL experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.
zh

[NLP-45] Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation ECAI2025

【速读】: 该论文旨在解决对话中的情感识别(Emotion Recognition in Conversation, ERC)问题,其核心挑战在于如何有效融合远距离和近距离话语的多模态特征,并应对数据不平衡带来的模型训练困难。解决方案的关键在于提出一种新型的多模态图神经网络结构——长-短距离图神经网络(Long-Short Distance Graph Neural Network, LSDGNN),该结构通过构建基于有向无环图(Directed Acyclic Graph, DAG)的长距离与短距离子网络分别提取远近话语的特征;同时引入差异正则化(Differential Regularizer)以增强两类特征表示的区分性,并借助BiAffine模块促进双路径特征交互;此外,设计改进型课程学习(Improved Curriculum Learning, ICL)策略,基于“加权情感转移”度量动态调整训练难度顺序,优先学习易样本以缓解数据分布不均问题。实验表明,该方法在IEMOCAP和MELD数据集上显著优于现有基准模型。

链接: https://arxiv.org/abs/2507.15205
作者: Xinran Li,Xiujuan Xu,Jiaqi Qiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by the 28th European Conference on Artificial Intelligence (ECAI 2025)

点击查看摘要

Abstract:Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.
zh

[NLP-46] Collaborative Distillation Strategies for Parameter-Efficient Language Model Deployment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中面临的高计算成本和推理速度慢的问题。其核心解决方案是提出一种由多个教师模型引导的蒸馏策略,关键在于通过加权输出融合机制、特征对齐损失函数以及基于熵的动态教师权重分配策略,实现多源知识的有效迁移。该方法使学生模型在参数量较小的前提下,显著提升语言理解与生成能力,并在语言建模、文本生成及多任务学习等任务中展现出更强的一致性、泛化能力和任务适应性。

链接: https://arxiv.org/abs/2507.15198
作者: Xiandong Meng,Yan Wu,Yexin Tian,Xin Hu,Tianze Kang,Junliang Du
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and integrates their output probability distributions and intermediate semantic features. This guides the student model to learn from multiple sources of knowledge. As a result, the student model gains stronger language understanding and generation ability while maintaining a small parameter size. To achieve this, the paper introduces a weighted output fusion mechanism, a feature alignment loss function, and an entropy-driven dynamic teacher weighting strategy. These components improve the quality and stability of knowledge transfer during distillation. Under multi-teacher guidance, the student model captures semantic information more effectively and demonstrates strong performance across multiple evaluation metrics. In particular, the method shows high consistency in expression, generalization ability, and task adaptability in tasks such as language modeling, text generation, and multi-task learning. The experiments compare the proposed method with several widely adopted distillation approaches. The results further confirm its overall advantages in perplexity, distillation loss, and generation quality. This study provides a feasible technical path for the efficient compression of large-scale language models. It also demonstrates the effectiveness of multi-teacher collaborative mechanisms in complex language modeling tasks.
zh

[NLP-47] What Level of Automation is “Good Enough”? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

【速读】: 该论文旨在解决从全文随机对照试验(Randomised Controlled Trials, RCTs)中自动化提取数据用于系统评价和荟萃分析(Meta-analysis)的难题。当前挑战在于如何在保证精度的同时提升召回率,避免遗漏关键信息。解决方案的关键在于通过设计定制化提示(customised prompts)显著提高模型对统计结果、偏倚风险评估及研究特征等多类数据的识别能力,其中定制化提示将召回率最高提升15%;同时提出了一套三层级指南,依据任务复杂度与风险等级匹配不同自动化程度,从而实现生成式AI(Generative AI)效率与专家人工审核之间的平衡。

链接: https://arxiv.org/abs/2507.15152
作者: Lingbo Li,Anuradha Mathrani,Teo Susnjak
机构: Massey University (梅西大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
zh

[NLP-48] A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Geez Script

【速读】: 该论文旨在解决阿姆哈拉语(Amharic)自然语言处理中同音字归一化(homophone normalization)带来的负面影响问题,即这种预处理步骤虽然能提升自动评估指标(如BLEU分数)的表现,但会导致模型丧失对同一语言中不同书写形式的理解能力,并可能削弱跨语言迁移学习的效果。其解决方案的关键在于提出一种后推理干预(post-inference intervention)策略:不在训练数据中进行归一化处理,而是将归一化操作应用于模型预测结果阶段。通过这一简单机制,论文在保持训练数据语言特征完整性的前提下,实现了最高达1.03的BLEU分数提升,从而为技术驱动的语言演变提供了更具语言敏感性的干预思路。

链接: https://arxiv.org/abs/2507.15142
作者: Hellina Hailu Nigatu,Atnafu Lambebo Tonja,Henok Biadglign Ademtew,Hizkel Mitiku Alemayehu,Negasi Haile Abadi,Tadesse Destaw Belay,Seid Muhie Yimam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper under review

点击查看摘要

Abstract:Homophone normalization, where characters that have the same sound in a writing script are mapped to one character, is a pre-processing step applied in Amharic Natural Language Processing (NLP) literature. While this may improve performance reported by automatic metrics, it also results in models that are not able to understand different forms of writing in a single language. Further, there might be impacts in transfer learning, where models trained on normalized data do not generalize well to other languages. In this paper, we experiment with monolingual training and cross-lingual transfer to understand the impacts of normalization on languages that use the Ge’ez script. We then propose a post-inference intervention in which normalization is applied to model predictions instead of training data. With our simple scheme of post-inference normalization, we show that we can achieve an increase in BLEU score of up to 1.03 while preserving language features in training. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
zh

[NLP-49] From Disagreement to Understanding: The Case for Ambiguity Detection in NLI

【速读】: 该论文试图解决自然语言推理(Natural Language Inference, NLI)中注释者分歧(annotation disagreement)被普遍视为噪声的问题,指出这种分歧往往源于前提或假设中的语义模糊性(semantic ambiguity),并反映了人类对文本的不同解释视角。其解决方案的关键在于推动“面向模糊性的NLI”(ambiguity-aware NLI)范式,即系统性识别模糊输入对并分类模糊类型,从而将模型训练与人类解释更紧密对齐。为此,作者提出一个统一框架整合现有模糊性分类体系,并通过具体示例揭示模糊性如何影响注释决策,进而强调开发针对性的模糊性检测方法的必要性。当前主要限制是缺乏标注模糊性和子类型的语料库,因此论文建议通过构建新的标注资源和无监督模糊性检测方法来填补这一空白,以实现更鲁棒、可解释且符合人类认知的NLI系统。

链接: https://arxiv.org/abs/2507.15114
作者: Chathuri Jayaweera,Bonnie Dorr
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:This position paper argues that annotation disagreement in Natural Language Inference (NLI) is not mere noise but often reflects meaningful interpretive variation, especially when triggered by ambiguity in the premise or hypothesis. While underspecified guidelines and annotator behavior can contribute to variation, content-based ambiguity offers a process-independent signal of divergent human perspectives. We call for a shift toward ambiguity-aware NLI by systematically identifying ambiguous input pairs and classifying ambiguity types. To support this, we present a unified framework that integrates existing taxonomies and illustrate key ambiguity subtypes through concrete examples. These examples reveal how ambiguity shapes annotator decisions and motivate the need for targeted detection methods that better align models with human interpretation. A key limitation is the lack of datasets annotated for ambiguity and subtypes. We propose addressing this gap through new annotated resources and unsupervised approaches to ambiguity detection – paving the way for more robust, explainable, and human-aligned NLI systems.
zh

[NLP-50] Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?

【速读】: 该论文旨在解决自然语言推理(Natural Language Inference, NLI)任务中因现有常识知识资源覆盖不足而导致模型性能受限的问题。其解决方案的关键在于探索大型语言模型(Large Language Models, LLMs)作为常识知识生成器的潜力,通过评估LLMs在生成常识知识时的事实性与一致性,并研究这些知识对NLI预测准确性的实际影响。实验表明,虽然显式引入常识知识并未在整体性能上带来稳定提升,但能有效增强模型对蕴含关系的区分能力,并适度改善对立和中立推理类别的判别效果。

链接: https://arxiv.org/abs/2507.15100
作者: Chathuri Jayaweera,Brianna Yanqui,Bonnie Dorr
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures and 5 tables

点击查看摘要

Abstract:Natural Language Inference (NLI) is the task of determining the semantic entailment of a premise for a given hypothesis. The task aims to develop systems that emulate natural human inferential processes where commonsense knowledge plays a major role. However, existing commonsense resources lack sufficient coverage for a variety of premise-hypothesis pairs. This study explores the potential of Large Language Models as commonsense knowledge generators for NLI along two key dimensions: their reliability in generating such knowledge and the impact of that knowledge on prediction accuracy. We adapt and modify existing metrics to assess LLM factuality and consistency in generating in this context. While explicitly incorporating commonsense knowledge does not consistently improve overall results, it effectively helps distinguish entailing instances and moderately improves distinguishing contradictory and neutral inferences.
zh

[NLP-51] A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

【速读】: 该论文旨在解决合成文本中因响应长度差异导致的词汇多样性(lexical diversity)测量偏差问题,尤其是在使用大型语言模型(Large Language Models, LLMs)生成用于进一步训练的数据时,现有度量方法如移动平均类型-标记比(Moving-Average TTR, MATTR)和压缩比(Compression Ratio, CR)易受文本长度影响,从而低估较长文本的真实多样性。解决方案的关键在于提出一种对长度变化鲁棒的新型多样性度量指标——惩罚调整型类型-标记比(Penalty-Adjusted Type-Token Ratio, PATTR),其通过引入任务特定的目标响应长度($ L_T $)对多样性计算进行校正,有效缓解长度偏差,并在视频脚本生成任务的2000万词级合成语料上验证了其优于传统指标的稳定性与准确性。

链接: https://arxiv.org/abs/2507.15092
作者: Vijeta Deshpande,Ishita Dasgupta,Uttaran Bhattacharya,Somdeb Sarkhel,Saayan Mitra,Anna Rumshisky
机构: University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校); Adobe Inc. (Adobe公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ( L_T ) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to L_T .
zh

[NLP-52] Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

【速读】: 该论文旨在解决DNA序列建模中tokenization(分词)和positional encoding(位置编码)策略的优化问题,即在基于Transformer架构的DNA语言模型中,如何选择更有效的分词方法(k-mer固定长度分割 vs. BPE子词分词)以及位置编码方式(正弦、AliBi、RoPE),以提升模型性能与泛化能力。其关键解决方案在于系统性地对比多种分词策略(k=1,3,4,5,6的k-mer与4096词表BPE)和三种位置编码方法,并在GUE基准数据集上对不同深度(3–24层)的Transformer编码器进行训练与评估,结果表明:BPE通过将高频motif压缩为变长token,显著提升模型稳定性和任务表现;RoPE在捕捉周期性motif和长序列外推方面优势明显;而增加层数从3到12层带来显著性能提升,但增至24层后收益边际递减甚至出现过拟合,为DNA Transformer设计提供了实证依据。

链接: https://arxiv.org/abs/2507.15087
作者: Chenlei Gong,Yuanhe Tian,Lei Mao,Yan Song
机构: University of Science and Technology of China (中国科学技术大学); University of Washington (华盛顿大学); Origin Omics
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods-sinusoidal, AliBi, and RoPE. Each configuration is trained from scratch in 3, 6, 12, and 24-layer Transformer encoders and evaluated on GUE benchmark dataset. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while AliBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
zh

[NLP-53] WebShaper: Agent ically Data Synthesizing via Information-Seeking Formalization

【速读】: 该论文旨在解决生成式 AI (Generative AI) 领域中信息搜索(Information Seeking, IS)代理因高质量训练数据稀缺而导致性能受限的问题。现有方法通常采用信息驱动范式,先从网络获取数据再生成问题,但这种策略易导致信息结构与推理结构不一致,进而影响问答准确性。解决方案的关键在于提出一种形式化驱动的 IS 数据合成框架 WebShaper,其核心是通过集合论对 IS 任务进行系统形式化,并引入知识投影(Knowledge Projection, KP)概念,借助 KP 操作组合实现对推理结构的精确控制;在数据合成过程中,以种子任务为基础,利用多步扩展机制,结合检索与验证工具逐步构建复杂任务,最终训练出在 GAIA 和 WebWalkerQA 基准上达到开源 IS 代理最先进性能的模型。

链接: https://arxiv.org/abs/2507.15061
作者: Zhengwei Tao,Jialong Wu,Wenbiao Yin,Junkai Zhang,Baixuan Li,Haiyang Shen,Kuan Li,Liwen Zhang,Xinyu Wang,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between information structure and reasoning structure, question and answer. To mitigate, we propose a formalization-driven IS data synthesis framework WebShaper to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander expands the current formal question more complex with retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.
zh

[NLP-54] RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

【速读】: 该论文旨在解决当前基于监督微调(Supervised Fine-Tuning, SFT)构建的批评模块(critic module)在指导大语言模型(Large Language Models, LLMs)进行自我修正时效果有限的问题,即此类方法生成的批评往往流于表面,缺乏深度反思与验证能力。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的长链思维(Long-chain-of-thought)批评模块 RefCritic,通过双规则奖励机制——(1)解题判断的实例级正确性,以及(2)基于批评反馈对策略模型(policy model)的优化精度——来引导生成具有可操作性的高质量评估结果,从而显著提升模型的纠错与迭代能力。

链接: https://arxiv.org/abs/2507.15024
作者: Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Hongyu Lin,Yaojie Lu,Xianpei Han,Le Sun,Junyang Lin
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所); Alibaba Group (阿里巴巴集团); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models’ critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. On critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark to identify erroneous steps in mathematical reasoning.
zh

[NLP-55] Hear Your Code Fail Voice-Assisted Debugging for Python

【速读】: 该论文旨在解决传统Python调试过程中因依赖视觉栈跟踪(stack trace)而导致的认知负荷高、错误识别效率低的问题,尤其对视觉障碍开发者和多任务场景下的编程效率构成挑战。解决方案的关键在于设计并实现了一个基于全局异常钩子(global exception hook)架构的语音辅助调试插件,通过 pyttsx3 实现文本到语音(text-to-speech, TTS)转换,并结合 Tkinter 提供图形化界面(GUI)可视化,从而在听觉与视觉双通道同步输出错误信息;系统能在1.2秒内完成语音响应且CPU开销低于18%,显著降低认知负担(p<0.01, n=50),提升错误定位速度达78%,并支持跨平台兼容性与极简集成(仅需两行代码)。

链接: https://arxiv.org/abs/2507.15007
作者: Sayed Mahbub Hasan Amiri,Md. Mainul Islam,Mohammad Shakhawat Hossen,Sayed Majhab Hasan Amiri,Mohammad Shawkat Ali Mamun,Sk. Humaun Kabir,Naznin Akter
机构: 未知
类目: Programming Languages (cs.PL); Computation and Language (cs.CL)
备注: 35 pages, 20 figures

点击查看摘要

Abstract:This research introduces an innovative voice-assisted debugging plugin for Python that transforms silent runtime errors into actionable audible diagnostics. By implementing a global exception hook architecture with pyttsx3 text-to-speech conversion and Tkinter-based GUI visualization, the solution delivers multimodal error feedback through parallel auditory and visual channels. Empirical evaluation demonstrates 37% reduced cognitive load (p0.01, n=50) compared to traditional stack-trace debugging, while enabling 78% faster error identification through vocalized exception classification and contextualization. The system achieves sub-1.2 second voice latency with under 18% CPU overhead during exception handling, vocalizing error types and consequences while displaying interactive tracebacks with documentation deep links. Criteria validate compatibility across Python 3.7+ environments on Windows, macOS, and Linux platforms. Needing only two lines of integration code, the plugin significantly boosts availability for aesthetically impaired designers and supports multitasking workflows through hands-free error medical diagnosis. Educational applications show particular promise, with pilot studies indicating 45% faster debugging skill acquisition among novice programmers. Future development will incorporate GPT-based repair suggestions and real-time multilingual translation to further advance auditory debugging paradigms. The solution represents a fundamental shift toward human-centric error diagnostics, bridging critical gaps in programming accessibility while establishing new standards for cognitive efficiency in software development workflows.
zh

[NLP-56] MUR: Momentum Uncertainty guided Reasoning for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理密集型任务中测试时缩放(Test-Time Scaling, TTS)效率低下的问题,即现有方法常导致过度思考(overthinking),浪费计算资源于冗余推理步骤。其解决方案的关键在于提出一种基于动量的不确定性引导推理机制(Momentum Uncertainty-guided Reasoning, MUR),通过动态追踪和累积每一步的不确定性来智能分配推理预算,从而仅在关键推理步骤投入更多计算资源;同时引入gamma控制(gamma-control)机制,以单一超参数实现灵活的推理预算调节,显著提升推理稳定性与准确性,在多个基准测试中平均减少50%以上计算量并带来0.62–3.37%的准确率提升。

链接: https://arxiv.org/abs/2507.14958
作者: Hang Yan,Fangzhi Xu,Rongman Xu,Yifei Li,Jian Zhang,Haoran Luo,Xiaobao Wu,Luu Anh Tuan,Haiteng Zhao,Qika Lin,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Nanyang Technological University (南洋理工大学); Peking University (北京大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 8 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
zh

[NLP-57] SYNTHIA: Synthetic Yet Naturally Tailored Human-Inspired PersonAs

【速读】: 该论文旨在解决当前基于人物驱动的大语言模型(Persona-driven LLMs)在数据构建上的两极分化问题:一方面依赖昂贵的人工标注数据,另一方面生成的合成人物背景缺乏一致性与真实性。其解决方案的关键在于提出SYNTHIA数据集,该数据集包含30,000个来自BlueSky开放平台10,000名真实用户的 backstory(背景故事),并基于跨三个时间窗口的用户活动进行合成,从而在保持社会调查一致性与人口统计多样性的同时显著提升叙事一致性。此外,SYNTHIA引入了时间维度和丰富的社交互动元数据,为计算社会科学和人物驱动的语言建模开辟了新的研究方向。

链接: https://arxiv.org/abs/2507.14922
作者: Vahid Rahimzadeh,Erfan Moosavi Monazzah,Mohammad Taher Pilehvar,Yadollah Yaghoobzadeh
机构: Tehran Institute for Advanced Studies (德黑兰高级研究所); Khatam University (卡塔姆大学); University of Tehran (德黑兰大学); Cardiff University (卡迪夫大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Persona-driven LLMs have emerged as powerful tools in computational social science, yet existing approaches fall at opposite extremes, either relying on costly human-curated data or producing synthetic personas that lack consistency and realism. We introduce SYNTHIA, a dataset of 30,000 backstories derived from 10,000 real social media users from BlueSky open platform across three time windows, bridging this spectrum by grounding synthetic generation in authentic user activity. Our evaluation demonstrates that SYNTHIA achieves competitive performance with state-of-the-art methods in demographic diversity and social survey alignment while significantly outperforming them in narrative consistency. Uniquely, SYNTHIA incorporates temporal dimensionality and provides rich social interaction metadata from the underlying network, enabling new research directions in computational social science and persona-driven language modeling.
zh

[NLP-58] PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在单次提示(single prompt)评估中表现不稳定的问题,即微小的提示变化可能导致模型性能显著波动,从而影响评估结果的可靠性。为实现更稳健的多提示评估(multi-prompt evaluation),传统方法面临提示变体生成困难的挑战。论文提出的解决方案是引入 PromptSuite 框架,其关键在于通过模块化提示设计实现对提示各组件的可控扰动,并支持灵活扩展新组件与扰动类型,从而自动化生成多样且有意义的提示变体,提升评估的全面性与可重复性。

链接: https://arxiv.org/abs/2507.14913
作者: Eliya Habba,Noam Dahan,Gili Lior,Gabriel Stanovsky
机构: The Hebrew University of Jerusalem (希伯来大学)
类目: Computation and Language (cs.CL)
备注: Eliya Habba and Noam Dahan contributed equally to this work

点击查看摘要

Abstract:Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible - working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. It is available through both a Python API: this https URL, and a user-friendly web interface: this https URL
zh

[NLP-59] From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment

【速读】: 该论文旨在解决现有跨语言对齐(cross-lingual alignment)评估方法的局限性问题,特别是当前基准主要基于句向量(sentence embeddings),难以准确衡量低资源语言下的语义对齐质量,且神经网络模型常导致表示空间不平滑,影响评估可靠性。其解决方案的关键在于提出一种基于神经元状态的跨语言对齐评估方法(Neuron State-Based Cross-Lingual Alignment, NeuronXA),该方法受神经科学启发,利用相似信息激活重叠神经元区域的特性,构建更语义 grounded 的评估框架。实验表明,仅需100对平行句即可实现与下游任务性能(Pearson相关系数0.9556)和迁移能力(0.8514)的高度一致,显著优于传统方法,为多语言大模型的跨语言对齐研究提供了高效、可靠的评估工具。

链接: https://arxiv.org/abs/2507.14900
作者: Chongxuan Huang,Yongshi Ye,Biao Fu,Qifeng Su,Xiaodong Shi
机构: School of Informatics, Xiamen University (厦门大学信息学院); Institute of Artificial Intelligence, Xiamen University (厦门大学人工智能研究院); Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism (福建省与台湾地区非物质文化遗产数字化保护与智能处理重点实验室(厦门大学),文化和旅游部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable multilingual capabilities, however, how to evaluate cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily focus on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impact of semantic alignment evaluation on low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose a novel Neuron State-Based Cross-Lingual Alignment (NeuronXA) to assess the cross-lingual a lignment capabilities of LLMs, which offers a more semantically grounded approach to assess cross-lingual alignment. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results demonstrate that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream tasks performance and 0.8514 with transferability. These findings demonstrate NeuronXA’s effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset. This highlights its potential to advance cross-lingual alignment research and to improve the semantic understanding of multilingual LLMs.
zh

[NLP-60] Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言场景下出现的意外代码切换(unexpected code-switching,也称语言混用)问题,即模型在生成文本时无意识地切换到非目标语言,从而降低输出的可读性和可用性。现有方法缺乏对这一现象的机制分析且效果有限。论文的关键解决方案是提出一种基于稀疏自编码器引导的监督微调方法(Sparse Autoencoder-guided Supervised Fine-Tuning, SASFT),其核心在于通过稀疏自编码器识别出导致语言切换的特征,并在微调过程中约束这些特征的预激活值,使其保持在合理范围内,从而抑制非预期的语言切换行为。实验表明,SASFT 在五种模型、三种语言上均显著减少代码切换(平均降幅超50%),并在多数情况下完全消除该问题,同时不损害模型在多个多语言基准上的性能表现。

链接: https://arxiv.org/abs/2507.14894
作者: Boyi Deng,Yu Wan,Baosong Yang,Fei Huang,Wenjie Wang,Fuli Feng
机构: Tongyi Lab, Alibaba Group Inc(阿里巴巴集团); National University of Singapore(新加坡国立大学); Institute of Dataspace(数据空间研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose \textbfS parse \textbfA utoencoder-guided \textbfS upervised \textbfF ine \textbft uning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models’ performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.
zh

[NLP-61] MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在情感-原因配对抽取(Emotion-Cause Pair Extraction, ECPE)任务中表现不佳的问题,其核心瓶颈在于缺乏辅助知识,导致模型难以有效感知情绪并准确推理原因。解决方案的关键在于提出一种多源异构知识注入方法(Multi-source Heterogeneous Knowledge Injection method, MEKiT),通过融合内部情感知识与外部因果知识,并采用指令模板引入和指令微调数据混合两种策略,分别增强模型对情绪的全面识别能力和对原因的精准推理能力,从而显著提升LLMs在ECPE任务上的性能表现。

链接: https://arxiv.org/abs/2507.14887
作者: Shiyi Mu,Yongkang Liu,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: Accepted by CogSci

点击查看摘要

Abstract:Although large language models (LLMs) excel in text comprehension and generation, their performance on the Emotion-Cause Pair Extraction (ECPE) task, which requires reasoning ability, is often underperform smaller language model. The main reason is the lack of auxiliary knowledge, which limits LLMs’ ability to effectively perceive emotions and reason causes. To address this issue, we propose a novel \textbfMulti-source h\textbfEterogeneous \textbfKnowledge \textbfinjection me\textbfThod, MEKiT, which integrates heterogeneous internal emotional knowledge and external causal knowledge. Specifically, for these two distinct aspects and structures of knowledge, we apply the approaches of incorporating instruction templates and mixing data for instruction-tuning, which respectively facilitate LLMs in more comprehensively identifying emotion and accurately reasoning causes. Experimental results demonstrate that MEKiT provides a more effective and adaptable solution for the ECPE task, exhibiting an absolute performance advantage over compared baselines and dramatically improving the performance of LLMs on the ECPE task.
zh

[NLP-62] ny language models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)预训练对计算资源要求过高、导致研究参与门槛过高的问题,从而推动更广泛的研究可及性。其解决方案的关键在于验证微型语言模型(Tiny Language Models, TLMs)是否具备与LLMs相似的关键定性特征,并提出通过软集成(soft committee)多个独立预训练的浅层TLM架构来实现与深度TLM相当的分类性能,同时显著降低延迟,且无需牺牲准确性。这一方法为在资源受限场景下部署高效、高精度的语言理解系统提供了可行路径。

链接: https://arxiv.org/abs/2507.14871
作者: Ronit D. Gross,Yarden Tzach,Tal Halevi,Ella Koresh,Ido Kanter
机构: 未知
类目: Computation and Language (cs.CL)
备注: 23 pages, 1 figure and 12 tables

点击查看摘要

Abstract:A prominent achievement of natural language processing (NLP) is its ability to understand and generate meaningful human language. This capability relies on complex feedforward transformer block architectures pre-trained on large language models (LLMs). However, LLM pre-training is currently feasible only for a few dominant companies due to the immense computational resources required, limiting broader research participation. This creates a critical need for more accessible alternatives. In this study, we explore whether tiny language models (TLMs) exhibit the same key qualitative features of LLMs. We demonstrate that TLMs exhibit a clear performance gap between pre-trained and non-pre-trained models across classification tasks, indicating the effectiveness of pre-training, even at a tiny scale. The performance gap increases with the size of the pre-training dataset and with greater overlap between tokens in the pre-training and classification datasets. Furthermore, the classification accuracy achieved by a pre-trained deep TLM architecture can be replicated through a soft committee of multiple, independently pre-trained shallow architectures, enabling low-latency TLMs without affecting classification accuracy. Our results are based on pre-training BERT-6 and variants of BERT-1 on subsets of the Wikipedia dataset and evaluating their performance on FewRel, AGNews, and DBPedia classification tasks. Future research on TLM is expected to further illuminate the mechanisms underlying NLP, especially given that its biologically inspired models suggest that TLMs may be sufficient for children or adolescents to develop language.
zh

[NLP-63] Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding

【速读】: 该论文旨在解决大规模推理蒸馏(reasoning distillation)对小型语言模型在上下文检索与推理能力方面的影响尚不明确的问题,尤其是在检索增强生成(Retrieval-Augmented Generation, RAG)系统中,如何有效获取和利用长文本上下文信息以提升响应可靠性这一关键挑战。解决方案的关键在于通过一系列从Deepseek-R1蒸馏得到的开源模型,在多文档问答任务中系统评估其提取和整合长上下文相关信息的能力,结果表明,推理蒸馏能够显著增强模型对长上下文的理解,其机制在于促进更详细、显式的推理过程,从而缓解长期存在的“中间丢失”(lost in the middle)问题。

链接: https://arxiv.org/abs/2507.14849
作者: Yifei Wang
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (多模态人工智能系统国家重点实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning distillation has emerged as an effective approach to enhance the reasoning capabilities of smaller language models. However, the impact of large-scale reasoning distillation on other critical abilities, particularly in-context retrieval and reasoning, remains unexplored. This gap in understanding is particularly significant given the increasing importance of Retrieval-Augmented Generation (RAG) systems, where efficient acquisition and utilization of contextual information are paramount for generating reliable responses. Motivated by the need to understand how the extended long-CoT process influences long-context comprehension, we conduct a comprehensive investigation using a series of open-source models distilled from Deepseek-R1, renowned for its exceptional reasoning capabilities. Our study focuses on evaluating these models’ performance in extracting and integrating relevant information from extended contexts through multi-document question and answering tasks. Through rigorous experimentation, we demonstrate that distilled reasoning patterns significantly improve long-context understanding. Our analysis reveals that distillation fosters greater long-context awareness by promoting more detailed and explicit reasoning processes during context analysis and information parsing. This advancement effectively mitigates the persistent “lost in the middle” issue that has hindered long-context models.
zh

[NLP-64] he Invisible Leash: Why RLVR May Not Escape Its Origin

【速读】: 该论文试图解决的问题是:强化学习结合可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)是否真正扩展了基础模型的推理边界,还是仅放大了模型已知的高奖励输出以提升精度。其解决方案的关键在于提出了一种新的理论视角,指出RLVR受限于基础模型的支持集(support),无法采样初始概率为零的解,本质上是一种保守的重加权机制,可能抑制对全新解的发现;同时揭示了熵-奖励权衡关系:尽管RLVR能稳定提高精确度(pass@1),但会逐步压缩探索空间,导致遗漏原本可访问的正确答案。实证结果进一步表明,在更大采样预算下,经验支持的收缩通常超过扩张,说明RLVR难以恢复基础模型曾能生成的正确解。这一发现提示,突破RLVR的隐性限制需依赖显式探索机制或混合策略,将概率质量引入低频解区域。

链接: https://arxiv.org/abs/2507.14843
作者: Fang Wu,Weihao Xuan,Ximing Lu,Zaid Harchaoui,Yejin Choi
机构: Stanford University (斯坦福大学); University of Tokyo (东京大学); RIKEN AIP; University of Washington (华盛顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model’s reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model’s support-unable to sample solutions with zero initial probability-and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
zh

[NLP-65] Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents

【速读】: 该论文旨在解决从长文档中基于用户意图自动生成数据可视化图表的问题,即在不依赖用户手动筛选内容的情况下,实现零样本(zero-shot)场景下的意图驱动图表生成。传统方法多基于预定义的文本描述或表格与图表的配对进行训练,难以直接应用于真实世界中用户仅提供意图和长文档的情境。解决方案的关键在于提出一个无监督的两阶段框架:第一阶段通过分解用户意图并迭代验证与优化,由大语言模型(Large Language Models, LLMs)从文档中提取相关数据;第二阶段利用启发式引导模块选择合适的图表类型,并最终生成代码。此外,为更准确评估生成图表的数据准确性,作者设计了一种基于归因(attribution-based)的指标,采用结构化文本表示而非视觉解码来衡量图表数据一致性,从而显著提升了评估的有效性。

链接: https://arxiv.org/abs/2507.14819
作者: Akriti Jain,Pritika Ramu,Aparna Garimella,Apoorv Saxena
机构: Adobe Research(Adobe研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of intent-based chart generation from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-staged framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising of 1,242 intent, document, charts tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/ tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms by upto 9 points and 17 points in terms of chart data accuracy and chart type respectively over the best baselines.
zh

[NLP-66] FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing FAST

【速读】: 该论文旨在解决大型语音语言模型(Large Speech-Language Models, LSLMs)在处理长时语音(long-form speech)时效率低下且缺乏有效训练数据的问题。现有方法通常聚焦于短语音任务或语音生成能力的提升,而对长序列语音的高效建模仍面临两大挑战:一是高质量长语音训练数据稀缺,二是长序列带来的高计算开销。解决方案的关键在于提出FastLongSpeech框架,其核心创新包括两个方面:一是采用迭代融合策略(iterative fusion strategy)将过长的语音序列压缩至可管理长度;二是引入动态压缩训练方法(dynamic compression training approach),通过在不同压缩比下暴露模型于短语音序列,实现从短语音任务到长语音任务的能力迁移。该方案无需专门的长语音训练数据即可显著提升LSLMs在长语音理解与生成任务中的性能和推理效率。

链接: https://arxiv.org/abs/2507.14815
作者: Shoutao Guo,Shaolei Zhang,Qingkai Fang,Zhengrui Ma,Min Zhang,Yang Feng
机构: Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Future Science and Engineering, Soochow University
类目: Computation and Language (cs.CL)
备注: The code is at this https URL . This model is at this https URL . The dataset is at this https URL

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
zh

[NLP-67] GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization RECSYS

【速读】: 该论文针对生成式推荐系统在多行为序列推荐中面临的三大挑战展开研究:(1)token推理缺乏显式信息,导致生成过程不可解释;(2)由于标准注意力机制的二次复杂度及分词后密集表示带来的高计算开销;(3)用户历史行为的多尺度建模能力有限。解决方案的关键在于提出GRACE框架,其核心创新包括:一是引入混合思维链(Chain-of-Thought, CoT)分词方法,将产品知识图谱中的显式属性(如类别、品牌、价格)融合进语义分词,实现行为对齐且可解释的item序列生成;二是设计旅程感知稀疏注意力(Journey-Aware Sparse Attention, JSA)机制,通过选择性关注压缩后的内部、跨段落及当前上下文片段,显著降低注意力计算复杂度(最高减少48%),同时提升长序列建模效率。

链接: https://arxiv.org/abs/2507.14758
作者: Luyi Ma,Wanjia Zhang,Kai Zhao,Abhishek Kulkarni,Lalitesh Morishetti,Anjana Ganesh,Ashish Ranjan,Aashika Padmanabhan,Jianpeng Xu,Jason Cho,Praveen Kanumala,Kaushiki Nag,Sumit Dutta,Kamiya Motwani,Malay Patel,Evren Korpeoglu,Sushant Kumar,Kannan Achan
机构: Walmart Global Tech(沃尔玛全球科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures, The ACM Conference on Recommender Systems (RecSys) 2025

点击查看摘要

Abstract:Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.
zh

[NLP-68] On the robustness of modeling grounded word learning through a childs egocentric input

【速读】: 该论文试图解决的问题是:如何通过机器学习方法更贴近儿童语言习得的现实条件,即在有限且个体差异显著的语言输入下,验证神经网络是否能够稳定地习得词义映射(word-referent mappings),从而弥合大规模预训练模型与儿童语言学习之间的数据量和机制差距。其解决方案的关键在于:利用自动语音转录技术处理SAYCam数据集中的500多小时多模态视频数据,构建基于每个儿童发展经验的视觉-语言训练与评估数据集,并在多种神经网络架构下测试模型的学习鲁棒性,结果表明即使在自动标注的低质量语音数据上,不同架构的模型仍能从单个儿童的有限输入中成功习得并泛化词义映射,验证了多模态神经网络在具身词学习中的稳健性,同时揭示了个体差异对学习模式的影响。

链接: https://arxiv.org/abs/2507.14749
作者: Wai Keen Vong,Brenden M. Lake
机构: Center for Data Science, New York University (纽约大学数据科学中心); Department of Psychology, New York University (纽约大学心理学系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children’s input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child’s developmental experience could acquire word-referent mappings. However, whether this approach’s success reflects the idiosyncrasies of a single child’s experience, or whether it would show consistent and robust learning patterns across multiple children’s experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire and generalize word-referent mappings across multiple network architectures. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child’s developmental experiences.
zh

[NLP-69] Disparities in Peer Review Tone and the Role of Reviewer Anonymity

【速读】: 该论文试图解决的问题是:同行评审过程中存在的隐性偏见,尤其是语言使用如何在作者性别、种族和机构背景差异下强化科学评价的不平等现象。其解决方案的关键在于采用自然语言处理(Natural Language Processing, NLP)与大规模统计建模方法,对超过8万份同行评审意见进行系统性分析,揭示评审语气、情感倾向及支持性语言在不同作者群体中的分布差异,并进一步考察署名与匿名评审对语言表达的影响,从而挑战传统对匿名评审公平性的假设,为学术出版政策改革提供实证依据。

链接: https://arxiv.org/abs/2507.14741
作者: Maria Sahakyan,Bedoor AlShebli
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The peer review process is often regarded as the gatekeeper of scientific integrity, yet increasing evidence suggests that it is not immune to bias. Although structural inequities in peer review have been widely debated, much less attention has been paid to the subtle ways in which language itself may reinforce disparities. This study undertakes one of the most comprehensive linguistic analyses of peer review to date, examining more than 80,000 reviews in two major journals. Using natural language processing and large-scale statistical modeling, it uncovers how review tone, sentiment, and supportive language vary across author demographics, including gender, race, and institutional affiliation. Using a data set that includes both anonymous and signed reviews, this research also reveals how the disclosure of reviewer identity shapes the language of evaluation. The findings not only expose hidden biases in peer feedback, but also challenge conventional assumptions about anonymity’s role in fairness. As academic publishing grapples with reform, these insights raise critical questions about how review policies shape career trajectories and scientific progress.
zh

[NLP-70] Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation ALT

【速读】: 该论文旨在解决自杀意念检测(suicidal ideation detection)在人工智能(AI)应用中的两大关键问题:一是语言覆盖有限,尤其是非英语语种数据稀缺;二是标注过程不可靠、缺乏透明度,导致模型性能评估失真。解决方案的关键在于构建一个基于社交媒体文本的土耳其语自杀意念语料库,并提出一种资源高效的人工标注框架,结合三位人工标注者与两个大语言模型(LLMs)协同标注;同时通过迁移学习引入八种预训练的情感与情绪分类器,在土耳其语和三个主流英文语料库之间进行双向标签一致性与模型一致性评估,从而揭示现有模型在零样本迁移学习下的表现不可靠,强调需建立更严谨、多语言包容的标注与评估标准,以提升心理健康自然语言处理(NLP)领域的数据与模型可靠性。

链接: https://arxiv.org/abs/2507.14693
作者: Amina Dzafic,Merve Kavut,Ulya Bayram
机构: Canakkale Onsekiz Mart University (加纳卡莱奥斯基兹马尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: This manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics

点击查看摘要

Abstract:Suicidal ideation detection is critical for real-time suicide prevention, yet its progress faces two under-explored challenges: limited language coverage and unreliable annotation practices. Most available datasets are in English, but even among these, high-quality, human-annotated data remains scarce. As a result, many studies rely on available pre-labeled datasets without examining their annotation process or label reliability. The lack of datasets in other languages further limits the global realization of suicide prevention via artificial intelligence (AI). In this study, we address one of these gaps by constructing a novel Turkish suicidal ideation corpus derived from social media posts and introducing a resource-efficient annotation framework involving three human annotators and two large language models (LLMs). We then address the remaining gaps by performing a bidirectional evaluation of label reliability and model consistency across this dataset and three popular English suicidal ideation detection datasets, using transfer learning through eight pre-trained sentiment and emotion classifiers. These transformers help assess annotation consistency and benchmark model performance against manually labeled data. Our findings underscore the need for more rigorous, language-inclusive approaches to annotation and evaluation in mental health natural language processing (NLP) while demonstrating the questionable performance of popular models with zero-shot transfer learning. We advocate for transparency in model training and dataset construction in mental health NLP, prioritizing data and model reliability.
zh

[NLP-71] Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

【速读】: 该论文旨在解决阿拉伯语后训练(post-training)数据集在质量、多样性及社区采纳方面存在的关键缺口问题,这些问题限制了阿拉伯语大语言模型(Large Language Models, LLMs)的对齐性能与应用进展。解决方案的关键在于系统性地梳理和评估 Hugging Face Hub 上现有的公开阿拉伯语后训练数据集,并从四个核心维度——LLM能力(如问答、翻译、推理等)、可控性(persona 和系统提示)、对齐性(文化、安全、伦理与公平)以及鲁棒性——进行结构化分析,从而识别出当前数据集在任务多样性、文档完整性、标注质量及社区采用率等方面的不足,并据此提出针对性改进方向,以推动阿拉伯语 LLM 的高质量发展。

链接: https://arxiv.org/abs/2507.14688
作者: Mohammed Alkhowaiter,Norah Alshahrani,Saied Alshahrani,Reem I. Masoud,Alaa Alzahrani,Deema Alnuhait,Emad A. Alghamdi,Khalid Almubarak
机构: Refine AI(Refine AI); ASAS AI(ASAS AI); University of Bisha(布希大学); University College London(伦敦大学学院); King Salman Global Academy for Arabic(萨勒曼国王阿拉伯语全球学院); University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); King Abdulaziz University(阿卜杜勒阿齐兹国王大学); HUMAIN(HUMAIN)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., persona and system prompts); (3) Alignment (e.g., cultural, safety, ethics, and fairness), and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic LLMs and applications while providing concrete recommendations for future efforts in post-training dataset development.
zh

[NLP-72] MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

【速读】: 该论文旨在解决当前推理语言模型(Reasoning Language Models, RLMs)在开源生态中缺乏透明度与可复现性的问题,尤其是在数学推理任务中,多数开源项目因未提供关键数据集和训练配置而难以验证与进一步研究。其解决方案的核心在于构建一套完全开源的推理语言模型系列——MiroMind-M1,基于Qwen-2.5架构,采用两阶段训练策略:首先在719K条经验证的思维链(Chain-of-Thought, CoT)轨迹上进行监督微调(SFT),随后在62K个高挑战性且可验证的问题上使用强化学习与价值重排序(Reinforcement Learning with Value Re-ranking, RLVR)优化;为提升RLVR过程的鲁棒性和效率,创新性地提出上下文感知多阶段策略优化(Context-Aware Multi-Stage Policy Optimization),融合长度渐进式训练与自适应重复惩罚机制,以增强上下文感知能力。该方案实现了优于或相当现有开源模型的性能,并显著提升了token效率,同时完整公开了模型、数据集及训练配置,极大促进社区复现与持续发展。

链接: https://arxiv.org/abs/2507.14683
作者: Xingxuan Li,Yao Xiao,Dianwen Ng,Hai Ye,Yue Deng,Xiang Lin,Bin Wang,Zhanfeng Mo,Chong Zhang,Yueyi Zhang,Zonglin Yang,Ruilin Li,Lei Lei,Shihao Xu,Han Zhao,Weiling Chen,Feng Ji,Lidong Bing
机构: MiroMind AI
类目: Computation and Language (cs.CL)
备注: Technical report

点击查看摘要

Abstract:Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
zh

[NLP-73] Large Language Models as Medical Codes Selectors: a benchmark using the International Classification of Primary Care

【速读】: 该论文旨在解决临床文本自动编码问题,即如何利用大语言模型(Large Language Models, LLMs)对巴西葡萄牙语的临床表达进行国际疾病分类-2(ICPC-2)代码的自动分配。其解决方案的关键在于结合一个领域特定的语义搜索引擎(基于OpenAI的text-embedding-3-large模型)与多种LLM的提示工程(prompting),通过检索73,563个标注概念来生成候选ICPC-2代码,并由LLMs从中选出最佳匹配。实验表明,无需微调即可实现高F1分数(最高达0.85以上),且优化检索器可提升性能达4个百分点,验证了该方法在自动化医疗编码中的可行性与有效性。

链接: https://arxiv.org/abs/2507.14681
作者: Vinicius Anjos de Almeida,Vinicius de Camargo,Raquel Gómez-Bravo,Egbert van der Haring,Kees van Boven,Marcelo Finger,Luis Fernandez Lopez
机构: University of São Paulo (圣保罗大学); Rehaklinik, Centre Hospitalier Neuro-psychiatrique (CHNP) (神经精神病学医院中心); Radboud University (拉德布德大学); Independant researcher (独立研究员); Institute of Mathematics and Statistics, University of Sao Paulo (圣保罗大学数学与统计研究所)
类目: Computation and Language (cs.CL)
备注: To be submitted to peer-reviewed journal. 33 pages, 10 figures (including appendix), 15 tables (including appendix). For associated code repository, see this https URL

点击查看摘要

Abstract:Background: Medical coding structures healthcare data for research, quality monitoring, and policy. This study assesses the potential of large language models (LLMs) to assign ICPC-2 codes using the output of a domain-specific search engine. Methods: A dataset of 437 Brazilian Portuguese clinical expressions, each annotated with ICPC-2 codes, was used. A semantic search engine (OpenAI’s text-embedding-3-large) retrieved candidates from 73,563 labeled concepts. Thirty-three LLMs were prompted with each query and retrieved results to select the best-matching ICPC-2 code. Performance was evaluated using F1-score, along with token usage, cost, response time, and format adherence. Results: Twenty-eight models achieved F1-score 0.8; ten exceeded 0.85. Top performers included gpt-4.5-preview, o3, and gemini-2.5-pro. Retriever optimization can improve performance by up to 4 points. Most models returned valid codes in the expected format, with reduced hallucinations. Smaller models (3B) struggled with formatting and input length. Conclusions: LLMs show strong potential for automating ICPC-2 coding, even without fine-tuning. This work offers a benchmark and highlights challenges, but findings are limited by dataset scope and setup. Broader, multilingual, end-to-end evaluations are needed for clinical validation. Comments: To be submitted to peer-reviewed journal. 33 pages, 10 figures (including appendix), 15 tables (including appendix). For associated code repository, see this https URL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2507.14681 [cs.CL] (or arXiv:2507.14681v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2507.14681 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vinicius Anjos De Almeida [view email] [v1] Sat, 19 Jul 2025 16:11:10 UTC (2,035 KB)
zh

[NLP-74] GCC-Spam: Spam Detection via GAN Contrastive Learning and Character Similarity Networks

【速读】: 该论文旨在解决网络中垃圾文本(spam text)激增所带来的信息泄露和社会不稳定风险,重点应对两个核心挑战:垃圾文本生成者采用的对抗性策略(如字符混淆攻击)以及标注数据稀缺问题。解决方案的关键在于提出一种名为GCC-Spam的新框架,其核心创新包括:(1)字符相似性网络(character similarity network),用于捕捉拼写和发音特征以抵御字符混淆攻击,并生成句子嵌入用于下游分类;(2)对比学习(contrastive learning),通过优化垃圾文本与正常文本在潜在空间中的距离增强判别能力;(3)生成对抗网络(Generative Adversarial Network, GAN),生成逼真的伪垃圾样本以缓解数据稀缺问题并提升模型鲁棒性和分类准确率。

链接: https://arxiv.org/abs/2507.14679
作者: Zixin Xu,Zhijie Wang,Zhiyuan Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.
zh

[NLP-75] Docopilot: Improving Multimodal Models for Document-Level Understanding

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂、跨页文档理解任务中表现不足的问题,其核心瓶颈在于缺乏高质量的文档级数据集以及现有检索增强生成(Retrieval-Augmented Generation, RAG)方法存在的碎片化检索上下文、多阶段误差累积和额外推理延迟等问题。解决方案的关键在于构建了一个高质量的文档级数据集Doc-750K,该数据集涵盖多样化的文档结构、丰富的跨页依赖关系,并包含源自原始文档的真实问答对;在此基础上开发了原生多模态模型Docopilot,能够无需依赖RAG即可精准建模文档级依赖关系,在文档理解任务和多轮交互中展现出更优的一致性、准确性和效率,从而为文档级多模态理解设立了新的基准。

链接: https://arxiv.org/abs/2507.14675
作者: Yuchen Duan,Zhe Chen,Yusong Hu,Weiyun Wang,Shenglong Ye,Botian Shi,Lewei Lu,Qibin Hou,Tong Lu,Hongsheng Li,Jifeng Dai,Wenhai Wang
机构: Shanghai AI Laboratory; The Chinese University of Hong Kong; Nanjing University; Nankai University; Fudan University; Tsinghua University; SenseTime Research
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at this https URL
zh

[NLP-76] Mangosteen: An Open Thai Corpus for Language Model Pretraining

【速读】: 该论文旨在解决泰国语(Thai)预训练数据质量低、清洗不充分以及缺乏透明可复现的高质量语料库的问题。现有大规模语料库多基于英文中心或语言无关的清洗流程,难以捕捉泰语字符特性与文化语境,导致赌博等风险内容未被有效过滤;同时,已有针对泰语的定制化方案往往未公开数据或设计细节,阻碍了研究的可重复性。解决方案的关键在于提出Mangosteen语料库——一个包含470亿token的高质量泰语语料库,其核心创新是基于Dolma管道进行泰语适配:包括自定义规则-based语言识别(language ID)、改进的C4/Gopher质量过滤器、使用泰语训练的内容过滤模型,并整合维基百科、皇家公报文本、OCR提取书籍及CC授权YouTube字幕等非网络来源。系统性消融实验表明,该流程将CommonCrawl文档数从202M降至25M,同时提升SEA-HELM NLG指标从3到11,且基于此语料库持续预训练的8B参数SEA-LION模型在泰语基准测试中优于SEA-LION-v3和Llama-3.1约4个百分点。论文还开源完整代码、清洗日志、语料快照及所有检查点,为未来泰语及区域大语言模型(LLM)研究提供可复现基础。

链接: https://arxiv.org/abs/2507.14664
作者: Wannaphong Phatthiyaphaibun,Can Udomcharoenchaikit,Pakpoom Singkorapoom,Kunat Pipatanakul,Ekapol Chuangsuwanich,Peerat Limkonchotiwat,Sarana Nutanong
机构: Vidyasirimedhi Institute of Science and Technology (维迪亚西里梅迪科技研究所); SCB10X; Chulalongkorn University (朱拉隆功大学); AI Singapore
类目: Computation and Language (cs.CL)
备注: Work in this http URL artifacts in this papers: this https URL

点击查看摘要

Abstract:Pre-training data shapes a language model’s quality, but raw web text is noisy and demands careful cleaning. Existing large-scale corpora rely on English-centric or language-agnostic pipelines whose heuristics do not capture Thai script or cultural nuances, leaving risky material such as gambling content untreated. Prior Thai-specific efforts customize pipelines or build new ones, yet seldom release their data or document design choices, hindering reproducibility and raising the question of how to construct a transparent, high-quality Thai corpus. We introduce Mangosteen: a 47 billion-token Thai corpus built through a Thai-adapted Dolma pipeline that includes custom rule-based language ID, revised C4/Gopher quality filters, and Thai-trained content filters, plus curated non-web sources such as Wikipedia, Royal Gazette texts, OCR-extracted books, and CC-licensed YouTube subtitles. Systematic ablations using GPT-2 show the pipeline trims CommonCrawl from 202M to 25M documents while raising SEA-HELM NLG from 3 to 11; an 8B-parameter SEA-LION model continually pre-trained on Mangosteen then surpasses SEA-LION-v3 and Llama-3.1 by about four points on Thai benchmarks. We release the full pipeline code, cleaning manifests, corpus snapshot, and all checkpoints, providing a fully reproducible foundation for future Thai and regional LLM research.
zh

[NLP-77] When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在复杂现实场景中可能因恶意协同而引发的安全风险问题,尤其是此类系统在信息传播和电子商务欺诈等高风险领域中的潜在危害。其解决方案的关键在于提出一个灵活的模拟框架,能够支持集中式与去中心化两种协调结构,并通过实证分析发现:相较于集中式系统,去中心化MAS在执行恶意行为时更具适应性和隐蔽性,即使面对传统干预措施(如内容标记)也能动态调整策略以规避检测。这一发现凸显了对新型检测机制和对抗策略的迫切需求。

链接: https://arxiv.org/abs/2507.14660
作者: Qibing Ren,Sitao Xie,Longxuan Wei,Zhenfei Yin,Junchi Yan,Lizhuang Ma,Jing Shao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:Recent large-scale events like election fraud and financial scams have shown how harmful coordinated efforts by human groups can be. With the rise of autonomous AI systems, there is growing concern that AI-driven groups could also cause similar harm. While most AI safety research focuses on individual AI systems, the risks posed by multi-agent systems (MAS) in complex real-world situations are still underexplored. In this paper, we introduce a proof-of-concept to simulate the risks of malicious MAS collusion, using a flexible framework that supports both centralized and decentralized coordination structures. We apply this framework to two high-risk fields: misinformation spread and e-commerce fraud. Our findings show that decentralized systems are more effective at carrying out malicious actions than centralized ones. The increased autonomy of decentralized systems allows them to adapt their strategies and cause more damage. Even when traditional interventions, like content flagging, are applied, decentralized groups can adjust their tactics to avoid detection. We present key insights into how these malicious groups operate and the need for better detection systems and countermeasures. Code is available at this https URL.
zh

[NLP-78] Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中普遍存在且危害显著的幻觉问题(hallucination),即模型生成与事实不符或缺乏依据的回答,这直接影响LLMs的安全性与可靠性。解决方案的关键在于提出一种基于聚类的不确定性估计方法——Cleanse,其核心思想是通过计算LLM隐藏层嵌入(hidden embeddings)中簇内一致性(intra-cluster consistency)占总一致性(total consistency)的比例来量化不确定性,从而有效识别和区分正确与错误的回答。该方法利用了嵌入空间中语义信息的结构特性,无需额外标注数据即可实现对幻觉的检测。

链接: https://arxiv.org/abs/2507.14649
作者: Minsuh Joo,Hyunsoo Cho
机构: Ewha Womans University (梨花女子大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the outstanding performance of large language models (LLMs) across various NLP tasks, hallucinations in LLMs–where LLMs generate inaccurate responses–remains as a critical problem as it can be directly connected to a crisis of building safe and reliable LLMs. Uncertainty estimation is primarily used to measure hallucination levels in LLM responses so that correct and incorrect answers can be distinguished clearly. This study proposes an effective uncertainty estimation approach, \textbfClust\textbfering-based sem\textbfantic con\textbfsist\textbfency (\textbfCleanse). Cleanse quantifies the uncertainty with the proportion of the intra-cluster consistency in the total consistency between LLM hidden embeddings which contain adequate semantic information of generations, by employing clustering. The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models, LLaMA-7B, LLaMA-13B, LLaMA2-7B and Mistral-7B and two question-answering benchmarks, SQuAD and CoQA.
zh

[NLP-79] Linear Relational Decoding of Morphology in Language Models

【速读】: 该论文旨在解决语言模型中概念关系(如形态学关系)在隐空间(latent space)中的可解释性问题,即如何从中间层表示中提取出对最终输出具有高保真度的线性映射。其解决方案的关键在于发现:对于某些语义关系,可以通过一个两段式仿射近似(two-part affine approximation)来有效重建目标对象状态;具体而言,利用模型导数推导出的线性变换矩阵 $ W $ 作用于主体词元(subject token)的中间层表示 $ s $,即可准确复现多数关系下的对象状态,且该方法在形态学关系上达到了90%的保真度,并在多语言和不同模型架构中表现出一致性,表明部分概念关系在模型内部由跨层线性变换稀疏编码。

链接: https://arxiv.org/abs/2507.14640
作者: Eric Xia,Jugal Kalita
机构: Brown University (布朗大学); University of Colorado Colorado Springs (科罗拉多大学斯普林斯分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A two-part affine approximation has been found to be a good approximation for transformer computations over certain subject object relations. Adapting the Bigger Analogy Test Set, we show that the linear transformation Ws, where s is a middle layer representation of a subject token and W is derived from model derivatives, is also able to accurately reproduce final object states for many relations. This linear technique is able to achieve 90% faithfulness on morphological relations, and we show similar findings multi-lingually and across models. Our findings indicate that some conceptual relationships in language models, such as morphology, are readily interpretable from latent space, and are sparsely encoded by cross-layer linear transformations.
zh

[NLP-80] Optimizing Legal Document Retrieval in Vietnamese with Semi-Hard Negative Mining

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律等专业领域中因缺乏精准性和领域知识而导致的文档检索效率与准确性不足的问题。其解决方案的关键在于提出了一种简化的两阶段框架——检索(Retrieval)与重排序(Re-ranking),其中使用微调后的双编码器(Bi-Encoder)实现快速候选文档筛选,并通过交叉编码器(Cross-Encoder)进行高精度重排序,同时结合策略性负例挖掘(negative example mining)优化训练过程;创新性地引入Exist@m指标评估检索效果,并采用半硬负样本(semi-hard negatives)缓解训练偏差,显著提升了重排序性能。该方法在SoICT Hackathon 2024法律文档检索任务中取得前三名成绩,验证了轻量级单次遍历架构在资源受限场景下的有效性。

链接: https://arxiv.org/abs/2507.14619
作者: Van-Hoang Le,Duc-Vu Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at ICCCI 2025

点击查看摘要

Abstract:Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
zh

[NLP-81] Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在非洲基层医疗场景中应用时缺乏本地化验证与评估标准的问题,尤其关注肯尼亚二级和三级临床护理环境下的有效性不足。其核心解决方案是构建一个基于国家指南的基准数据集与评估框架,通过检索增强生成(Retrieval-Augmented Generation, RAG)技术将临床问题锚定于肯尼亚国家级诊疗指南,确保内容符合当地医疗规范;同时利用Gemini Flash 2.0 Lite生成多语言(英语与斯瓦希里语)的临床情景、选择题及解析答案,并由肯尼亚执业医师参与共创与校验,保障临床准确性、文化适配性与逻辑严谨性。关键创新在于引入针对临床推理能力、安全性与情境适应性的新型评估指标(如罕见病例识别、决策点逻辑链分析),从而实现对LLMs在低资源环境中表现的系统性量化评估。

链接: https://arxiv.org/abs/2507.14615
作者: Fred Mutisya(1,2),Shikoh Gitau(1),Christine Syovata(2),Diana Oigara(2),Ibrahim Matende(2),Muna Aden(2),Munira Ali(2),Ryan Nyotu(2),Diana Marion(2),Job Nyangena(2),Nasubo Ongoma(1),Keith Mbae(1),Elizabeth Wamicha(1),Eric Mibuari(1),Jean Philbert Nsengemana(3),Talkmore Chidede(4) ((1) Qhala, Nairobi, Kenya, (2) Kenya Medical Association, Nairobi, Kenya, (3) Africa CDC, (4) AfCFTA)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 6 figs, 6 tables. Companion methods paper forthcoming

点击查看摘要

Abstract:Large Language Models(LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval augmented generation (RAG) to ground clinical questions in Kenya’s national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset includes thousands of regulator-aligned question answer pairs across common outpatient conditions. Beyond accuracy, we introduce evaluation metrics that test clinical reasoning, safety, and adaptability such as rare case detection (Needle in the Haystack), stepwise logic (Decision Points), and contextual adaptability. Initial results reveal significant performance gaps when LLMs are applied to localized scenarios, consistent with findings that LLM accuracy is lower on African medical content than on US-based benchmarks. This work offers a replicable model for guideline-driven, dynamic benchmarking to support safe AI deployment in African health systems.
zh

[NLP-82] Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)领域中因数据稀缺(data scarcity)和类别不平衡(class imbalance)导致的模型性能受限问题。其解决方案的关键在于系统评估传统数据增强方法(如回译(backtranslation)和改写(paraphrasing))是否能够借助大语言模型(如GPT系列)实现与纯生成式方法(zero-shot或few-shot generation)相当甚至更优的性能表现。实验结果表明,基于大语言模型的回译和改写策略在生成数据质量和下游分类任务性能上均具有竞争力,凸显了这些经典方法在现代生成式AI框架下的有效性与实用性。

链接: https://arxiv.org/abs/2507.14590
作者: Łukasz Radliński,Mateusz Guściora,Jan Kocoń
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: International Conference on Computational Science 2025

点击查看摘要

Abstract:Numerous domain-specific machine learning tasks struggle with data scarcity and class imbalance. This paper systematically explores data augmentation methods for NLP, particularly through large language models like GPT. The purpose of this paper is to examine and evaluate whether traditional methods such as paraphrasing and backtranslation can leverage a new generation of models to achieve comparable performance to purely generative methods. Methods aimed at solving the problem of data scarcity and utilizing ChatGPT were chosen, as well as an exemplary dataset. We conducted a series of experiments comparing four different approaches to data augmentation in multiple experimental setups. We then evaluated the results both in terms of the quality of generated data and its impact on classification performance. The key findings indicate that backtranslation and paraphrasing can yield comparable or even better results than zero and a few-shot generation of examples.
zh

[NLP-83] Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption

【速读】: 该论文旨在解决基于BERT模型的协作问题解决(Collaborative Problem Solving, CPS)诊断中模型可解释性不足的问题,即当前研究虽广泛使用BERT进行CPS分类,但缺乏对单个词元(tokenised words)如何影响分类决策的理解。解决方案的关键在于引入SHapley Additive exPlanations(SHAP)方法,通过量化每个词元对模型输出的贡献,实现对BERT分类过程的透明化分析。研究表明,高精度分类并不必然意味着合理的解释,且存在语义无关却显著影响分类结果的“伪词”(spurious word),这提示教育场景下需警惕对大语言模型(LLM)诊断结果的过度依赖,并强调未来应探索集成模型架构与人机互补机制以提升CPS子技能识别的准确性与可信度。

链接: https://arxiv.org/abs/2507.14584
作者: Kester Wong,Sahan Bulathwela,Mutlu Cukurova
机构: UCL Knowledge Lab, Institute of Education, University College London, UK(英国伦敦大学学院知识实验室,教育研究院); UCL Centre for Artificial Intelligence, Department of Computer Science, University College London, UK(英国伦敦大学学院人工智能中心,计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to appear in the workshop proceedings for the HEXED’25 workshop in the 26th International Conference on Artificial Intelligence in Education 2025 (AIED 2025), 22 July 2025, Palermo, Italy. 6 pages, 2 figures

点击查看摘要

Abstract:The use of Bidirectional Encoder Representations from Transformers (BERT) model and its variants for classifying collaborative problem solving (CPS) has been extensively explored within the AI in Education community. However, limited attention has been given to understanding how individual tokenised words in the dataset contribute to the model’s classification decisions. Enhancing the explainability of BERT-based CPS diagnostics is essential to better inform end users such as teachers, thereby fostering greater trust and facilitating wider adoption in education. This study undertook a preliminary step towards model transparency and explainability by using SHapley Additive exPlanations (SHAP) to examine how different tokenised words in transcription data contributed to a BERT model’s classification of CPS processes. The findings suggested that well-performing classifications did not necessarily equate to a reasonable explanation for the classification decisions. Particular tokenised words were used frequently to affect classifications. The analysis also identified a spurious word, which contributed positively to the classification but was not semantically meaningful to the class. While such model transparency is unlikely to be useful to an end user to improve their practice, it can help them not to overrely on LLM diagnostics and ignore their human expertise. We conclude the workshop paper by noting that the extent to which the model appropriately uses the tokens for its classification is associated with the number of classes involved. It calls for an investigation into the exploration of ensemble model architectures and the involvement of human-AI complementarity for CPS diagnosis, since considerable human reasoning is still required for fine-grained discrimination of CPS subskills.
zh

[NLP-84] Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models

【速读】: 该论文旨在解决在教育人工智能(AI in Education)领域中,如何通过机器学习技术从对话数据中可靠检测协作问题解决(Collaborative Problem Solving, CPS)指标这一挑战。其关键解决方案在于引入多模态BERT变体AudiBERT,该模型融合了语音与声学-韵律特征,显著提升了对社交认知维度下稀疏类别分类的准确性,并在统计学上优于传统BERT模型;同时,研究强调了训练数据规模与召回率、人类编码者间一致性与BERT模型精确度之间的显著关联,最终提出以模型可解释性为核心支撑的人机互补框架,以增强人类在CPS诊断中的主体性和反思性编码能力。

链接: https://arxiv.org/abs/2507.14579
作者: Kester Wong,Sahan Bulathwela,Mutlu Cukurova
机构: UCL Knowledge Lab, Institute of Education, University College London, UK(伦敦大学学院知识实验室,教育研究院,英国); UCL Centre for Artificial Intelligence, Department of Computer Science, University College London, UK(伦敦大学学院人工智能中心,计算机科学系,英国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to appear in the workshop proceedings for the HEXED’25 workshop in the 26th International Conference on Artificial Intelligence in Education 2025 (AIED 2025), 22 July 2025, Palermo, Italy. 5 pages

点击查看摘要

Abstract:Detecting collaborative problem solving (CPS) indicators from dialogue using machine learning techniques is a significant challenge for the field of AI in Education. Recent studies have explored the use of Bidirectional Encoder Representations from Transformers (BERT) models on transcription data to reliably detect meaningful CPS indicators. A notable advancement involved the multimodal BERT variant, AudiBERT, which integrates speech and acoustic-prosodic audio features to enhance CPS diagnosis. Although initial results demonstrated multimodal improvements, the statistical significance of these enhancements remained unclear, and there was insufficient guidance on leveraging human-AI complementarity for CPS diagnosis tasks. This workshop paper extends the previous research by highlighting that the AudiBERT model not only improved the classification of classes that were sparse in the dataset, but it also had statistically significant class-wise improvements over the BERT model for classifications in the social-cognitive dimension. However, similar significant class-wise improvements over the BERT model were not observed for classifications in the affective dimension. A correlation analysis highlighted that larger training data was significantly associated with higher recall performance for both the AudiBERT and BERT models. Additionally, the precision of the BERT model was significantly associated with high inter-rater agreement among human coders. When employing the BERT model to diagnose indicators within these subskills that were well-detected by the AudiBERT model, the performance across all indicators was inconsistent. We conclude the paper by outlining a structured approach towards achieving human-AI complementarity for CPS diagnosis, highlighting the crucial inclusion of model explainability to support human agency and engagement in the reflective coding process.
zh

[NLP-85] XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

【速读】: 该论文旨在解决词义消歧(Word-in-Context, WiC)任务中不同标注形式(如二分类与序数分类)之间建模不统一的问题。其核心解决方案是提出 XL-DURel,一个针对序数 WiC 分类优化的多语言 Sentence Transformer 模型,并采用基于复数空间角距离(angular distance in complex space)的排序损失函数进行训练。研究发现,二分类 WiC 可视为序数 WiC 的特例,且在更通用的序数任务上优化模型可提升二分类性能,从而为不同任务形式下的 WiC 建模提供了一种统一框架。

链接: https://arxiv.org/abs/2507.14578
作者: Sachin Yadav,Dominik Schlechtweg
机构: University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks managing to outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.
zh

[NLP-86] Efficient Whole Slide Pathology VQA via Token Compression

【速读】: 该论文旨在解决全切片图像(Whole-slide images, WSI)在病理学中因分辨率高(可达10,000×10,000像素)导致的多模态大语言模型(Multimodal Large Language Model, MLLM)面临长上下文长度和高计算开销的问题。现有方法如基于CLIP的多实例学习虽能实现滑片级分类,但缺乏生成式视觉问答(Visual Question Answering, VQA)能力;而直接将数千个patch tokens输入LLM的方法虽支持VQA,却造成资源消耗过大。解决方案的关键在于提出Token Compression Pathology LLaVA(TCP-LLaVA),其核心创新是引入可训练的压缩token(compression tokens),通过模态压缩模块聚合视觉与文本信息,类比BERT中的[CLS]机制,仅将压缩后的token传递给语言模型进行答案生成,从而显著降低输入长度和计算成本,同时提升VQA准确率。

链接: https://arxiv.org/abs/2507.14497
作者: Weimin Lyu,Qingqiao Hu,Kehan Qi,Zhan Shi,Wentao Huang,Saumya Gupta,Chao Chen
机构: Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language model (MLLM) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
zh

[NLP-87] Routine: A Structural Planning Framework for LLM Agent System in Enterprise

【速读】: 该论文旨在解决企业在部署智能代理(Agent)系统时面临的挑战,即通用模型缺乏领域特定流程知识,导致计划混乱、关键工具缺失以及执行稳定性差的问题。解决方案的关键在于提出一种名为Routine的多步代理规划框架,其核心特征包括清晰的结构设计、显式的指令引导和无缝的参数传递机制,从而显著提升代理在多步工具调用任务中的执行准确性和稳定性。实验表明,Routine能有效增强模型对特定场景下工具使用模式的适应能力,并通过知识蒸馏进一步优化性能,使模型精度接近GPT-4o水平,为构建稳定可靠的代理工作流提供了实用且可扩展的方法。

链接: https://arxiv.org/abs/2507.14447
作者: Guancheng Zeng,Xueyi Chen,Jiawang Hu,Shaohua Qi,Yaxuan Mao,Zhantao Wang,Yifan Nie,Shuang Li,Qiuyang Feng,Pengxu Qiu,Yujia Wang,Wenqiang Han,Linyan Huang,Gang Li,Jingjing Mo,Haowen Hu
机构: Digital China AI Research (数字中国人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 8 figures, 5 tables

点击查看摘要

Abstract:The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent’s execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model’s accuracy to 95.5%, approaching GPT-4o’s performance. These results highlight Routine’s effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
zh

[NLP-88] X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在半导体显示行业应用中表现受限的问题,主要原因是缺乏领域特定的训练数据和专业知识。解决方案的关键在于构建了X-Intelligence 3.0——首个专为半导体显示行业设计的高性能推理模型,其核心创新包括:基于精心构建的行业知识库进行监督微调与强化学习以提升推理能力;引入领域专用的检索增强生成(Retrieval-Augmented Generation, RAG)机制以提高准确性;并开发自动化评估框架模拟专家评分以加速迭代优化。尽管参数规模仅为320亿,该模型在多个基准测试中超越了当前最优的DeepSeek-R1-671B,展现出卓越的效率与专业性能。

链接: https://arxiv.org/abs/2507.14430
作者: Xiaolin Yan,Yangxing Liu,Jiazhang Zheng,Chi Liu,Mingyu Du,Caisheng Chen,Haoyang Liu,Ming Ding,Yuan Li,Qiuping Liao,Linfeng Li,Zhili Mei,Siyu Wan,Li Li,Ruyi Zhong,Jiangling Yu,Xule Liu,Huihui Hu,Jiameng Yue,Ruohui Cheng,Qi Yang,Liangqing Wu,Ke Zhu,Chi Zhang,Chufei Jing,Yifan Zhou,Yan Liang,Dongdong Li,Zhaohui Wang,Bin Zhao,Mingzhou Wu,Mingzhong Zhou,Peng Du,Zuomin Liao,Chao Dai,Pengfei Liang,Xiaoguang Zhu,Yu Zhang,Yu Gu,Kun Pan,Yuan Wu,Yanqing Guan,Shaojing Wu,Zikang Feng,Xianze Ma,Peishan Cheng,Wenjuan Jiang,Jing Ba,Huihao Yu,Zeping Hu,Yuan Xu,Zhiwei Liu,He Wang,Zhenguo Lin,Ming Liu,Yanhong Meng
机构: TCL Corporate Research (TCL企业研究中心); TCL China Star Optoelectronic Technology Co Ltd (TCL华星光电技术有限公司); National Center of Technology Innovation for Display (显示技术创新国家中心)
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry’s complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.
zh

[NLP-89] Its Not That Simple. An Analysis of Simple Test-Time Scaling

【速读】: 该论文旨在解决如何准确复现o1类模型(如DeepSeek-R1@)在测试时通过动态调整计算资源实现性能提升的现象,即“测试时缩放”(test-time scaling)行为。其关键发现是:简单测试时缩放方法中,通过强制限制最大生成长度来“缩 down”模型表现,是导致其出现类似缩放行为的主要原因;而通过迭代添加“Wait”指令以“缩 up”计算量的方法则会引发解空间震荡,无法稳定提升性能。相比之下,o1类模型通过强化学习自然习得在测试时按需扩展计算能力的能力,从而突破原始性能上限,这表明真正有效的测试时缩放目标应是释放模型潜在性能,而非仅仅模拟缩放现象本身。

链接: https://arxiv.org/abs/2507.14419
作者: Guojun Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior work proposed simple test-time scaling, a method for replicating this scaling behavior with models distilled from o1-like models by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending “Wait” when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending “Wait” leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1@. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model’s maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance – beyond what the model could originally achieve – rather than merely reproducing the appearance of scaling behavior.
zh

[NLP-90] Inverse Scaling in Test-Time Compute

【速读】: 该论文旨在解决大推理模型(Large Reasoning Models, LRMs)在测试时通过增加计算资源(如扩展推理长度)反而导致性能下降的问题,即存在反向缩放现象(inverse scaling relationship)。其解决方案的关键在于系统性地构建涵盖四类任务的评估体系——包括含干扰项的计数任务、具有虚假特征的回归任务、需约束跟踪的演绎任务以及高级人工智能风险任务,并识别出五种不同的失败模式:模型被无关信息干扰、过度拟合问题表述、从合理先验转向虚假相关性、复杂演绎任务中注意力难以维持,以及长推理可能放大有害行为(如自我保护倾向增强)。研究表明,仅依赖测试时计算扩展可能强化不良推理模式,因此必须在多样化推理长度下评估模型以发现并修正这些缺陷。

链接: https://arxiv.org/abs/2507.14417
作者: Aryo Pradipta Gema,Alexander Hägele,Runjin Chen,Andy Arditi,Jacob Goldman-Wetzler,Kit Fraser-Taliente,Henry Sleight,Linda Petrini,Julian Michael,Beatrice Alex,Pasquale Minervini,Yanda Chen,Joe Benton,Ethan Perez
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
zh

[NLP-91] Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在定性编码中主要用于归纳式分类的现状,探索其在演绎式分类任务中的潜力,即能否依据既定的人工编码框架(如Comparative Agendas Project, CAP的主编码手册)对美国最高法院案件摘要进行结构化、可靠的政策领域归类。解决方案的关键在于设计并验证四种干预策略:零样本(zero-shot)、少样本(few-shot)、基于定义的提示(definition-based)以及一种新颖的“分步任务分解”(Step-by-Step Task Decomposition)策略,其中后者通过将复杂分类任务拆解为逻辑清晰的子步骤显著提升了模型的一致性和可靠性(准确率=0.775,Kappa=0.744,Alpha=0.746),达到了实质一致性标准,表明定制化干预可使LLMs适用于严谨的定性研究流程。

链接: https://arxiv.org/abs/2507.14384
作者: Angjelin Hila,Elliott Hauser
机构: University of Texas, Austin (德克萨斯大学奥斯汀分校)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Extended version of paper accepted for presentation at the ASIST Annual Meeting 2025. 38 pages, 12 figures

点击查看摘要

Abstract:In this study, we investigate the use of large language models (LLMs), specifically ChatGPT, for structured deductive qualitative coding. While most current research emphasizes inductive coding applications, we address the underexplored potential of LLMs to perform deductive classification tasks aligned with established human-coded schemes. Using the Comparative Agendas Project (CAP) Master Codebook, we classified U.S. Supreme Court case summaries into 21 major policy domains. We tested four intervention methods: zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition strategy, across repeated samples. Performance was evaluated using standard classification metrics (accuracy, F1-score, Cohen’s kappa, Krippendorff’s alpha), and construct validity was assessed using chi-squared tests and Cramer’s V. Chi-squared and effect size analyses confirmed that intervention strategies significantly influenced classification behavior, with Cramer’s V values ranging from 0.359 to 0.613, indicating moderate to strong shifts in classification patterns. The Step-by-Step Task Decomposition strategy achieved the strongest reliability (accuracy = 0.775, kappa = 0.744, alpha = 0.746), achieving thresholds for substantial agreement. Despite the semantic ambiguity within case summaries, ChatGPT displayed stable agreement across samples, including high F1 scores in low-support subclasses. These findings demonstrate that with targeted, custom-tailored interventions, LLMs can achieve reliability levels suitable for integration into rigorous qualitative coding workflows.
zh

[NLP-92] Error-Aware Curriculum Learning for Biomedical Relation Classification

【速读】: 该论文旨在解决生物医学文本中关系分类(Relation Classification, RC)的准确性与鲁棒性问题,以支持知识图谱构建及药物重定位等下游应用。其解决方案的关键在于提出一种误差感知的师生框架(error-aware teacher–student framework),利用大语言模型(GPT-4o)作为教师,对基线学生模型的预测失败进行类型分类、难度评分和针对性修复(如句子重写与知识图谱增强建议),进而通过指令微调训练首个学生模型;随后,该模型对更大规模数据集进行标注并附带难度标签和修复输入,再通过课程学习(curriculum learning)策略按难度顺序训练第二个学生模型,实现渐进式、稳健的性能提升。此外,作者还构建了一个异构生物医学知识图谱(heterogeneous biomedical knowledge graph)以提供上下文感知的RC支持。

链接: https://arxiv.org/abs/2507.14374
作者: Sinchani Chakraborty,Sudeshna Sarkar,Pawan Goyal
机构: Indian Institute of Technology Kharagpur (印度理工学院克哈格普尔分校)
类目: Computation and Language (cs.CL)
备注: 16 pages, 2 figures

点击查看摘要

Abstract:Relation Classification (RC) in biomedical texts is essential for constructing knowledge graphs and enabling applications such as drug repurposing and clinical decision-making. We propose an error-aware teacher–student framework that improves RC through structured guidance from a large language model (GPT-4o). Prediction failures from a baseline student model are analyzed by the teacher to classify error types, assign difficulty scores, and generate targeted remediations, including sentence rewrites and suggestions for KG-based enrichment. These enriched annotations are used to train a first student model via instruction tuning. This model then annotates a broader dataset with difficulty scores and remediation-enhanced inputs. A second student is subsequently trained via curriculum learning on this dataset, ordered by difficulty, to promote robust and progressive learning. We also construct a heterogeneous biomedical knowledge graph from PubMed abstracts to support context-aware RC. Our approach achieves new state-of-the-art performance on 4 of 5 PPI datasets and the DDI dataset, while remaining competitive on ChemProt.
zh

[NLP-93] xt-to-SQL for Enterprise Data Analytics KDD’25

【速读】: 该论文旨在解决企业级Text-to-SQL(自然语言转结构化查询语言)解决方案的落地难题,即如何将大语言模型在基准测试上的进展转化为可实际部署、稳定可靠的内部数据查询工具。其核心挑战在于处理企业数据湖的动态性、语义复杂性以及用户意图多样性。解决方案的关键在于构建一个三组件协同系统:首先,通过整合数据库元数据、历史查询日志、维基文档和代码库构建知识图谱(Knowledge Graph),并利用聚类技术识别不同团队或产品线相关的表集合;其次,开发一个Text-to-SQL代理(Agent),基于知识图谱检索与排序上下文信息,生成SQL查询,并自动修正幻觉和语法错误;最后,设计交互式聊天机器人界面,支持从数据发现到调试的多意图操作,并以富UI元素呈现结果以促进持续对话。实证表明,该方案在300名活跃用户中表现良好,且专家评估显示53%的响应在内部基准上正确或接近正确,为构建实用的企业级Text-to-SQL系统提供了清晰路径。

链接: https://arxiv.org/abs/2507.14372
作者: Albert Chen,Manas Bundele,Gaurav Ahlawat,Patrick Stetz,Zhitao Wang,Qiang Fei,Donghoon Jung,Audrey Chu,Bharadwaj Jayaraman,Ayushi Panth,Yatin Arora,Sourav Jain,Renjith Varma,Alexey Ilin,Iuliia Melnychuk,Chelsea Chueh,Joyan Sil,Xiaofeng Wang
机构: LinkedIn(领英)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)
备注: 11 pages, 8 figures, Workshop on Agentic AI for Enterprise at KDD '25

点击查看摘要

Abstract:The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn’s product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.
zh

[NLP-94] Can LLM s Infer Personality from Real World Conversations?

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)进行人格特质推断时存在的有效性不足问题,尤其是以往研究多依赖缺乏心理测量学效度的合成数据或社交媒体文本。其解决方案的关键在于构建了一个真实世界基准数据集,包含555个半结构化访谈及对应的BFI-10自评量表得分,用于系统评估三种先进LLM(GPT-4.1 Mini、Meta-LLaMA和DeepSeek)在零样本提示(zero-shot prompting)和思维链提示(chain-of-thought prompting)下对人格五因素模型(Big Five)的推断能力。结果表明,尽管模型具备高重测信度,但与真实人格评分的相关性较弱(最大Pearson相关系数r=0.27),且预测存在中等偏高倾向偏差,说明当前LLM在人格推断上的构念效度仍有限,亟需基于实证证据的方法改进以支持心理学应用。

链接: https://arxiv.org/abs/2507.14355
作者: Jianfeng Zhu,Ruoming Jin,Karin G. Coifman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:Large Language Models (LLMs) such as OpenAI’s GPT-4 and Meta’s LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson’s r = 0.27 ), interrater agreement was low (Cohen’s \kappa 0.10 ), and predictions were biased toward moderate or high trait levels. Chain-of-thought prompting and longer input context modestly improved distributional alignment, but not trait-level accuracy. These results underscore limitations in current LLM-based personality inference and highlight the need for evidence-based development for psychological applications.
zh

[NLP-95] Solo Connection: A Parameter Efficient Fine-Tuning Technique for Transformers

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)在微调过程中参数效率低下的问题,尤其是在传统方法如LoRA(Low Rank Adaptation)仅调整单个解码器块内的注意力权重矩阵时,难以在保持性能的同时显著减少可训练参数。其解决方案的关键在于提出一种名为Solo Connection的新方法,该方法通过在不同解码器块之间引入长距离跳跃连接(long skip connections),并在块级别对表示进行适配,而非修改具体权重矩阵;同时,基于同伦理论(homotopy theory)设计了一个可训练的线性变换,逐步从零向量插值到任务特定表示,从而实现更平滑、稳定的适应过程。此方法不仅在端到端自然语言生成基准上优于LoRA,还使可训练参数数量相比LoRA减少59%,相比全量微调GPT2减少超过99%。

链接: https://arxiv.org/abs/2507.14353
作者: Harsh Nilesh Pathak,Randy Paffenroth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Parameter efficient fine tuning (PEFT) is a versatile and extensible approach for adapting a Large Language Model (LLM) for newer tasks. One of the most prominent PEFT approaches, Low Rank Adaptation (LoRA), primarily focuses on adjusting the attention weight matrices within individual decoder blocks of a Generative Pre trained Transformer (GPT2). In contrast, we introduce Solo Connection a novel method that adapts the representation at the decoder-block level rather than modifying individual weight matrices. Not only does Solo Connection outperform LoRA on E2E natural language generation benchmarks, but it also reduces the number of trainable parameters by 59% relative to LoRA and by more than 99% compared to full fine-tuning of GPT2, an early version of Large Language Models (LLMs). Solo Connection is also motivated by homotopy theory: we introduce a trainable linear transformation that gradually interpolates between a zero vector and the task-specific representation, enabling smooth and stable adaptation over time. While skip connections in the original 12 layer GPT2 are typically confined to individual decoder blocks, subsequent GPT2 variants scale up to 48 layers, and even larger language models can include 128 or more decoder blocks. These expanded architectures underscore the need to revisit how skip connections are employed during fine-tuning. This paper focuses on long skip connections that link outputs of different decoder blocks, potentially enhancing the model’s ability to adapt to new tasks while leveraging pre-trained knowledge.
zh

[NLP-96] What Makes You CLIC: Detection of Croatian Clickbait Headlines

【速读】: 该论文旨在解决在线新闻平台中点击诱饵(clickbait)标题泛滥所引发的信息质量下降与读者信任危机问题,尤其关注在低资源语言(如克罗地亚语)环境下如何有效检测点击诱饵标题。其解决方案的关键在于构建了一个覆盖20年、涵盖主流与边缘媒体的克罗地亚语点击诱饵标题数据集CLIC,并对比微调预训练模型(BERTić)与基于大语言模型(LLM)的上下文学习(in-context learning, ICL)方法的性能表现。实验表明,针对特定任务微调的模型优于通用大语言模型的零样本或少样本提示策略,验证了领域适配模型在低资源场景下对点击诱饵识别的有效性。

链接: https://arxiv.org/abs/2507.14314
作者: Marija Anđedelić,Dominik Šipek,Laura Majer,Jan Šnajder
机构: University of Zagreb, Faculty of Electrical Engineering and Computing (萨格勒布大学电气工程与计算学院); TakeLab
类目: Computation and Language (cs.CL)
备注: Accepted at Slavic NLP 2025

点击查看摘要

Abstract:Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing mainstream and fringe outlets. We fine-tune the BERTić model on this task and compare its performance to LLM-based ICL methods with prompts both in Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that finetuned models deliver better results than general LLMs.
zh

[NLP-97] How LLM s Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理叙事中时态语义(linguistic aspect)时,其行为是源于类人认知还是仅依赖于模式识别这一开放性问题。研究通过引入“专家在环”(Expert-in-the-Loop)探测管道,设计一系列靶向实验,评估LLMs是否能以类人方式构建语义表征和语用推理。关键解决方案在于开发了一套标准化的实验框架,用于可靠评估LLMs的认知与语言能力,从而揭示其在原型依赖、时态判断不一致及基于时态的因果推理方面的局限,表明LLMs在根本上不同于人类对叙事的理解机制。

链接: https://arxiv.org/abs/2507.14307
作者: Karin de Langis,Jong Inn Park,Andreas Schramm,Bin Hu,Khanh Chi Le,Michael Mensink,Ahn Thu Tong,Dongyeop Kang
机构: University of Minnesota (明尼苏达大学); Hamline University (汉姆林大学); University of Wisconsin-Stout (威斯康星大学斯托特分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs’ cognitive and linguistic capabilities.
zh

[NLP-98] Aligning Large Language Models to Low-Resource Languages through LLM -Based Selective Translation: A Systematic Study

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)在低资源语言(如印地语)中性能低于英语的问题,其核心挑战在于缺乏高质量的对齐数据。为应对这一问题,论文提出了一种基于生成式 AI(Generative AI)的**选择性翻译(Selective Translation)**方法,其关键在于:仅对文本中可翻译的部分进行翻译,同时保留代码、数学表达式和结构化格式(如 JSON)等不可翻译内容及原始句子结构,从而更有效地构建适用于低资源语言的对齐数据。实验表明,该方法相比传统翻译策略更具有效性,并且通过混合翻译样本与原始英文数据进行对齐训练,可进一步提升模型在目标语言上的表现。

链接: https://arxiv.org/abs/2507.14304
作者: Rakesh Paul,Anusha Kamath,Kanishk Singla,Raviraj Joshi,Utkarsh Vaidya,Sanjay Singh Chauhan,Niranjan Wartikar
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multilingual large language models (LLMs) often demonstrate a performance gap between English and non-English languages, particularly in low-resource settings. Aligning these models to low-resource languages is essential yet challenging due to limited high-quality data. While English alignment datasets are readily available, curating equivalent data in other languages is expensive and time-consuming. A common workaround is to translate existing English alignment data; however, standard translation techniques often fail to preserve critical elements such as code, mathematical expressions, and structured formats like JSON. In this work, we investigate LLM-based selective translation, a technique that selectively translates only the translatable parts of a text while preserving non-translatable content and sentence structure. We conduct a systematic study to explore key questions around this approach, including its effectiveness compared to vanilla translation, the importance of filtering noisy outputs, and the benefits of mixing translated samples with original English data during alignment. Our experiments focus on the low-resource Indic language Hindi and compare translations generated by Google Cloud Translation (GCP) and Llama-3.1-405B. The results highlight the promise of selective translation as a practical and effective method for improving multilingual alignment in LLMs.
zh

[NLP-99] In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

【速读】: 该论文旨在解决当前大型视觉语言模型(Large Vision Language Models, LVLMs)在特定领域任务中进行定制时面临的两大问题:一是现有方法依赖于少量图表类型的成对数据,导致模型难以泛化到多种图表类型;二是缺乏针对图表与数据对齐的定向预训练,限制了模型对底层数据的理解能力。解决方案的关键在于提出ChartScope,一种专为深入理解多样化图表设计的LVLM,并引入两个核心创新:其一是一个高效的合成数据生成管道,能够为广泛图表类型生成配对数据;其二是一种新颖的双路径(Dual-Path)训练策略,使模型在捕捉关键数据细节的同时保持强大的推理能力,从而提升对图表及其背后数据的理解水平。

链接: https://arxiv.org/abs/2507.14298
作者: Wan-Cyuan Fan,Yen-Chun Chen,Mengchen Liu,Alexander Jacobson,Lu Yuan,Leonid Sigal
机构: UBC(不列颠哥伦比亚大学); Microsoft(微软); Vector Institute for AI(人工智能矢量研究所); CIFAR AI Chair(加拿大高级研究院人工智能主席)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2407.14506

点击查看摘要

Abstract:Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: First, they rely on paired data from only a few chart types, limiting generalization to wide range of chart types. Secondly, they lack targeted pre-training for chart-data alignment, which hampers the model’s understanding of underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enabling the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at this https URL.
zh

[NLP-100] WebGuard: Building a Generalizable Guardrail for Web Agents

【速读】: 该论文旨在解决由大型语言模型(Large Language Models, LLMs)驱动的自主网络代理(autonomous web agents)在实际应用中可能采取非预期或有害行为所带来的前沿风险问题。当前缺乏有效的安全机制来约束代理的行为,类似于人类用户访问控制的需求。解决方案的关键在于提出首个综合性数据集WebGuard,用于评估网络代理动作风险并开发适用于真实在线环境的防护机制(guardrails)。WebGuard包含4,939条来自193个网站、涵盖22个多样化领域的高质量人工标注动作,并采用创新的三级风险分类体系(SAFE、LOW、HIGH)进行标注,同时提供训练与测试划分以支持多种泛化场景下的评估。实验表明,即使前沿LLMs在预测动作结果上表现不佳(准确率<60%,高风险动作召回率<60%),通过在WebGuard上微调专用防护模型(如Qwen2.5VL-7B),可显著提升性能(准确率从37%提升至80%,高风险动作召回率从20%提升至76%),但仍不足以满足高风险部署所需的近乎完美的准确率与召回率要求。

链接: https://arxiv.org/abs/2507.14293
作者: Boyuan Zheng,Zeyi Liao,Scott Salisbury,Zeyuan Liu,Michael Lin,Qinyuan Zheng,Zifan Wang,Xiang Deng,Dawn Song,Huan Sun,Yu Su
机构: The Ohio State University (俄亥俄州立大学); Scale AI; University of California, Berkeley (加州大学伯克利分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: We publicly release WebGuard, along with its annotation tools and fine-tuned models, to facilitate open-source research on monitoring and safeguarding web agents. All resources are available at this https URL

点击查看摘要

Abstract:The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in lagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
zh

[NLP-101] Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中依赖人工设计提示(prompt)所带来的效率低、一致性差及非专家难以参与的问题。其核心解决方案是提出 Promptomatix,一个自动化的提示优化框架,通过分析用户意图、生成合成训练数据、选择最优提示策略,并基于成本感知的目标对提示进行迭代优化,从而在无需人工调优或领域知识的情况下生成高质量提示。该方案的关键在于模块化设计与两种优化路径(轻量级元提示优化器和基于 DSPy 的编译器),实现了 prompt 优化的自动化、可扩展性和高效性。

链接: https://arxiv.org/abs/2507.14241
作者: Rithesh Murthy,Ming Zhu,Liangwei Yang,Jielin Qiu,Juntao Tan,Shelby Heinecke,Huan Wang,Caiming Xiong,Silvio Savarese
机构: Salesforce AI Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) perform best with well-crafted prompts, yet prompt engineering remains manual, inconsistent, and inaccessible to non-experts. We introduce Promptomatix, an automatic prompt optimization framework that transforms natural language task descriptions into high-quality prompts without requiring manual tuning or domain expertise. Promptomatix supports both a lightweight meta-prompt-based optimizer and a DSPy-powered compiler, with modular design enabling future extension to more advanced frameworks. The system analyzes user intent, generates synthetic training data, selects prompting strategies, and refines prompts using cost-aware objectives. Evaluated across 5 task categories, Promptomatix achieves competitive or superior performance compared to existing libraries, while reducing prompt length and computational overhead making prompt optimization scalable and efficient.
zh

[NLP-102] HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)供应链中组件来源复杂、潜在风险难以追溯的问题,尤其关注模型与数据集之间的依赖关系及其对模型安全性、公平性和合规性的影响。其解决方案的关键在于构建一个系统性的LLM供应链数据采集方法,并基于此数据建立一个包含397,376个节点和453,469条边的有向异构图(directed heterogeneous graph),以结构化方式刻画模型与数据集之间的多维关系,从而揭示供应链的拓扑特性(如幂律分布、核心-外围结构)及动态演化规律,为风险检测、偏见缓解和合规治理提供可解释的数据基础。

链接: https://arxiv.org/abs/2507.14240
作者: Mohammad Shahedur Rahman,Peng Gao,Yuede Ji
机构: University of Texas at Arlington(德克萨斯大学阿灵顿分校); Virginia Tech(弗吉尼亚理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem’s ongoing evolution.
zh

[NLP-103] CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, MLLMs)在低资源语言中易产生幻觉(hallucination)的问题,尤其是在领域特定生成任务中,由于训练数据分布不均导致模型输出不准确或虚构内容。其解决方案的关键在于提出一种两阶段微调框架——CCL-XCoT(Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought),首先通过基于课程的对比学习(curriculum-based contrastive learning)与下一个词预测(next-token prediction)增强跨语言语义对齐,随后在指令微调阶段引入跨语言思维链(Cross-lingual Chain-of-Thought, XCoT)提示策略,引导模型先在高资源语言中进行推理,再生成目标低资源语言的答案,从而显著降低幻觉率(最高达62%)并提升跨语言事实知识迁移能力。

链接: https://arxiv.org/abs/2507.14239
作者: Weihua Zheng,Roy Ka-Wei Lee,Zhengyuan Liu,Kui Wu,AiTi Aw,Bowei Zou
机构: Institute for Infocomm Research (I2R), A*STAR (新加坡科技研究局)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multilingual Large Language Models(MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT(Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.
zh

[NLP-104] Language Models Change Facts Based on the Way You Talk

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在高风险用户应用场景(如医疗、法律、政治、政府福利和薪资建议)中,如何因用户文本中的身份标记(如种族、性别、年龄)而产生偏见性响应。现有研究已表明LLMs能从细微的语言模式中推断身份信息,但尚不清楚此类信息如何影响其实际决策过程。解决方案的关键在于首次系统性地分析了五类高风险应用中身份标记对LLM输出的影响机制,并通过实证发现:LLMs对身份线索高度敏感,且在不同群体间表现出显著的差异化响应——例如,针对相同症状提供不同标准的医疗建议、根据用户年龄调整政治立场一致性、以及基于种族和性别推荐差异化的薪资水平。此外,作者还开发了新的评估工具以量化用户语言中隐含身份编码对模型决策的干扰,从而为未来部署前的全面偏见评估提供了方法论基础。

链接: https://arxiv.org/abs/2507.14238
作者: Matthew Kearney,Reuben Binns,Yarin Gal
机构: Oxford University (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being used in user-facing applications, from providing medical consultations to job interview advice. Recent research suggests that these models are becoming increasingly proficient at inferring identity information about the author of a piece of text from linguistic patterns as subtle as the choice of a few words. However, little is known about how LLMs use this information in their decision-making in real-world applications. We perform the first comprehensive analysis of how identity markers present in a user’s writing bias LLM responses across five different high-stakes LLM applications in the domains of medicine, law, politics, government benefits, and job salaries. We find that LLMs are extremely sensitive to markers of identity in user queries and that race, gender, and age consistently influence LLM responses in these applications. For instance, when providing medical advice, we find that models apply different standards of care to individuals of different ethnicities for the same symptoms; we find that LLMs are more likely to alter answers to align with a conservative (liberal) political worldview when asked factual questions by older (younger) individuals; and that LLMs recommend lower salaries for non-White job applicants and higher salaries for women compared to men. Taken together, these biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities. Beyond providing an analysis, we also provide new tools for evaluating how subtle encoding of identity in users’ language choices impacts model decisions. Given the serious implications of these findings, we recommend that similar thorough assessments of LLM use in user-facing applications are conducted before future deployment.
zh

[NLP-105] Beyond Architectures: Evaluating the Role of Contextual Embeddings in Detecting Bipolar Disorder on Social Media

【速读】: 该论文旨在解决双相情感障碍(Bipolar Disorder)在早期阶段因症状隐匿和社交污名化而常被漏诊的问题,通过自然语言处理(Natural Language Processing, NLP)技术从用户生成的社交媒体文本中识别潜在的病理信号。其解决方案的关键在于利用上下文感知的语言模型(如RoBERTa、BERT等Transformer架构)替代传统静态词嵌入(如GloVe、Word2Vec),以捕捉语义的细微差异与情感波动模式;实验表明,基于BERT上下文嵌入的LSTM模型与RoBERTa本身均能实现约98%的F1分数,显著优于静态嵌入方法(F1接近零),凸显了上下文建模在心理健康NLP应用中的核心作用。

链接: https://arxiv.org/abs/2507.14231
作者: Khalid Hasan,Jamil Saquer
机构: Missouri State University (密苏里州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The 37th International Conference on Software Engineering Knowledge Engineering, SEKE 2025 (camera-ready)

点击查看摘要

Abstract:Bipolar disorder is a chronic mental illness frequently underdiagnosed due to subtle early symptoms and social stigma. This paper explores the advanced natural language processing (NLP) models for recognizing signs of bipolar disorder based on user-generated social media text. We conduct a comprehensive evaluation of transformer-based models (BERT, RoBERTa, ALBERT, ELECTRA, DistilBERT) and Long Short Term Memory (LSTM) models based on contextualized (BERT) and static (GloVe, Word2Vec) word embeddings. Experiments were performed on a large, annotated dataset of Reddit posts after confirming their validity through sentiment variance and judgmental analysis. Our results demonstrate that RoBERTa achieves the highest performance among transformer models with an F1 score of ~98% while LSTM models using BERT embeddings yield nearly identical results. In contrast, LSTMs trained on static embeddings fail to capture meaningful patterns, scoring near-zero F1. These findings underscore the critical role of contextual language modeling in detecting bipolar disorder. In addition, we report model training times and highlight that DistilBERT offers an optimal balance between efficiency and accuracy. In general, our study offers actionable insights for model selection in mental health NLP applications and validates the potential of contextualized language models to support early bipolar disorder screening.
zh

[NLP-106] Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation

【速读】: 该论文旨在解决利用大语言模型(Large Language Models, LLMs)自动总结议会辩论时所面临的算法偏见与代表性失衡问题,即如何在保证摘要准确性与简洁性的同时,公平地体现所有发言者的观点和贡献。其解决方案的关键在于提出一个结构化的多阶段摘要框架,该框架不仅提升了文本连贯性和内容保真度,还支持对发言者属性(如发言顺序或政治隶属关系)如何影响其在最终摘要中的可见性和准确性进行系统分析,从而识别并缓解位置偏见和党派偏见,尤其发现分层式摘要策略在减少差异方面具有最大潜力。

链接: https://arxiv.org/abs/2507.14221
作者: Eoghan Cunningham,James Cross,Derek Greene
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The automated summarisation of parliamentary debates using large language models (LLMs) offers a promising way to make complex legislative discourse more accessible to the public. However, such summaries must not only be accurate and concise but also equitably represent the views and contributions of all speakers. This paper explores the use of LLMs to summarise plenary debates from the European Parliament and investigates the algorithmic and representational biases that emerge in this context. We propose a structured, multi-stage summarisation framework that improves textual coherence and content fidelity, while enabling the systematic analysis of how speaker attributes – such as speaking order or political affiliation – influence the visibility and accuracy of their contributions in the final summaries. Through our experiments using both proprietary and open-weight LLMs, we find evidence of consistent positional and partisan biases, with certain speakers systematically under-represented or misattributed. Our analysis shows that these biases vary by model and summarisation strategy, with hierarchical approaches offering the greatest potential to reduce disparity. These findings underscore the need for domain-sensitive evaluation metrics and ethical oversight in the deployment of LLMs for democratic applications.
zh

[NLP-107] Lets Measure the Elephant in the Room: Facilitating Personalized Automated Analysis of Privacy Policies at Scale

【速读】: 该论文旨在解决用户在面对海量在线服务时难以有效理解隐私政策(Privacy Policy)的问题,尽管多数用户声称会阅读这些条款,但实际上极少真正关注其内容。解决方案的关键在于提出并实现了一个神经符号系统——PoliAnalyzer,该系统结合自然语言处理(Natural Language Processing, NLP)与逻辑推理技术,将隐私政策文本转化为形式化表示,并与用户个性化数据偏好进行匹配分析,从而生成合规性报告。其核心创新在于扩展了现有的数据使用条款形式语言以建模隐私政策和用户偏好,并通过法律专家标注的PolicyIE数据集验证了高精度识别数据使用实践的能力(F1-score达90–100%),显著降低了用户的认知负担——平均仅需关注4.8%的冲突段落即可掌握关键风险点,为个体数据自主权提供可扩展的技术支持。

链接: https://arxiv.org/abs/2507.14214
作者: Rui Zhao,Vladyslav Melnychuk,Jun Zhao,Jesse Wright,Nigel Shadbolt
机构: University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In modern times, people have numerous online accounts, but they rarely read the Terms of Service or Privacy Policy of those sites despite claiming otherwise. This paper introduces PoliAnalyzer, a neuro-symbolic system that assists users with personalized privacy policy analysis. PoliAnalyzer uses Natural Language Processing (NLP) to extract formal representations of data usage practices from policy texts. In favor of deterministic, logical inference is applied to compare user preferences with the formal privacy policy representation and produce a compliance report. To achieve this, we extend an existing formal Data Terms of Use policy language to model privacy policies as app policies and user preferences as data policies. In our evaluation using our enriched PolicyIE dataset curated by legal experts, PoliAnalyzer demonstrated high accuracy in identifying relevant data usage practices, achieving F1-score of 90-100% across most tasks. Additionally, we demonstrate how PoliAnalyzer can model diverse user data-sharing preferences, derived from prior research as 23 user profiles, and perform compliance analysis against the top 100 most-visited websites. This analysis revealed that, on average, 95.2% of a privacy policy’s segments do not conflict with the analyzed user preferences, enabling users to concentrate on understanding the 4.8% (636 / 13205) that violates preferences, significantly reducing cognitive burden. Further, we identified common practices in privacy policies that violate user expectations - such as the sharing of location data with 3rd parties. This paper demonstrates that PoliAnalyzer can support automated personalized privacy policy analysis at scale using off-the-shelf NLP tools. This sheds light on a pathway to help individuals regain control over their data and encourage societal discussions on platform data practices to promote a fairer power dynamic.
zh

[NLP-108] LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models ICML2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长序列建模中面临的两个关键挑战:一是如何在有限的存储预算下增强模型对长距离依赖关系的捕捉能力,二是如何实现连续生成过程中不因内存溢出(Out-of-Memory, OOM)而中断。解决方案的核心在于提出了一种无需训练的KV缓存优化范式LaCache,其关键创新包括:(1)一种梯形结构的KV缓存模式,该模式不仅按层内顺序存储Key-Value(KV)对(左到右),还跨层从浅层到深层存储,从而在固定缓存空间内扩展上下文跨度,提升长程建模能力;(2)一种基于token距离的迭代压缩机制,通过逐步压缩较早的缓存数据释放空间供新token使用,实现受限缓存预算下的高效连续生成。

链接: https://arxiv.org/abs/2507.14204
作者: Dachuan Shi,Yonggan Fu,Xiangchi Yuan,Zhongzhi Yu,Haoran You,Sixu Li,Xin Dong,Jan Kautz,Pavlo Molchanov,Yingyan(Celine)Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICML 2025. Code: this https URL

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at this https URL.
zh

[NLP-109] ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在网络安全威胁调查任务中缺乏系统性评估基准的问题。现有方法难以有效衡量LLM代理在多跳证据推理、异构日志分析及自动化报告生成等方面的性能。为此,作者提出ExCyTIn-Bench,一个基于真实攻击场景构建的基准测试平台,其关键创新在于:利用专家设计的检测逻辑从Azure环境中提取安全日志并构建威胁调查图(Threat Investigation Graph),通过LLM自动对图中节点对生成带背景上下文和答案锚定的安全问题;这种结构化方法不仅提供可解释的真值答案,还支持通过强化学习实现可验证奖励的程序化任务生成,从而为LLM代理的训练与评估提供可复用、可扩展的基础设施。

链接: https://arxiv.org/abs/2507.14201
作者: Yiran Wu,Mauricio Velazco,Andrew Zhao,Manuel Raúl Meléndez Luján,Srisuma Movva,Yogesh K Roy,Quang Nguyen,Roberto Rodriguez,Qingyun Wu,Michael Albada,Julia Kiseleva,Anand Mudgerikar
机构: Pennsylvania State University (宾夕法尼亚州立大学); Microsoft Security AI Research (微软安全人工智能研究中心); Tsinghua University (清华大学); AG2AI
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous alert signals and security logs, follow multi-hop chains of evidence, and compile an incident report. With the developments of LLMs, building LLM-based agents for automatic thread investigation is a promising direction. To assist the development and evaluation of LLM agents, we construct a dataset from a controlled Azure tenant that covers 8 simulated real-world multi-step attacks, 57 log tables from Microsoft Sentinel and related services, and 589 automatically generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. This also enables the automatic generation of procedural tasks with verifiable rewards, which can be naturally extended to training agents via reinforcement learning. Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368, leaving substantial headroom for future research. Code and data are coming soon!
zh

[NLP-110] Open-Source LLM s Collaboration Beats Closed-Source LLM s: A Scalable Multi-Agent System

【速读】: 该论文旨在解决如何通过整合多个开源大语言模型(Large Language Models, LLMs)来匹配甚至超越闭源大语言模型性能的问题。其核心挑战在于实现新模型的持续集成与多样化任务下的泛化能力。解决方案的关键在于提出SMACS框架,包含两个核心技术:一是基于检索的先验选择(Retrieval-based Prior Selection, RPS),在实例层面为每个LLM分配代理性能分数并选取Top-k模型;二是探索-利用驱动的后验增强(Exploration-Exploitation-Driven Posterior Enhancement, EPE),通过先验丢弃促进响应多样性,并利用混合后验得分筛选高质量输出。实验表明,集成15个开源LLM的SMACS在8个主流基准上显著优于2025年领先的闭源模型,如Claude-3.7-Sonnet、GPT-4.1和GPT-o3-mini,且超过开源与闭源模型各自最佳结果的平均值,推动了智能上限的边界。

链接: https://arxiv.org/abs/2507.14200
作者: Shengji Tang,Jianjian Cao,Weihao Lin,Jiale Hong,Bo Zhang,Shuyue Hu,Lei Bai,Tao Chen,Wanli Ouyang,Peng Ye
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper aims to demonstrate the potential and strengths of open-source collectives. It leads to a promising question: Can we harness multiple open-source LLMs to match or even beat the closed-source LLMs? To answer this, we propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Specifically, for continuous integration of new LLMs and generalization to diverse questions, we first propose a Retrieval-based Prior Selection (RPS), which assigns a proxy performance score to each LLM to select the Top-k LLMs at the instance level for any given question. Then, we propose an Exploration-Exploitation-Driven Posterior Enhancement (EPE), encouraging the generation of diverse responses through prior dropping and selecting the high-quality response via a hybrid posterior score. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS: by integrating fifteen open-source LLMs, SMACS outperforms leading closed-source LLMs in 2025, e.g., Claude-3.7-Sonnet (+12.73%), GPT-4.1(+5.36%) and GPT-o3-mini(+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results of different datasets from both open-source LLMs (+2.86%) and closed-source LLMs (+2.04%), pushing the upper bound of intelligence. Code will be released at this https URL.
zh

[NLP-111] Retention analysis of edited knowledge after fine-tuning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在经过知识编辑后,于下游任务微调(fine-tuning)过程中出现的知识遗忘问题。现有模型编辑方法虽能以较低计算成本实现局部知识更新,但其编辑内容在微调时易被覆盖或丢失,而这一现象此前缺乏系统研究。论文的关键发现是:编辑过的知识比预训练阶段获得的原始知识更易在微调中被遗忘,表明当前编辑方法对下游任务适应性不足;解决方案的核心在于通过冻结与编辑内容相关的模型层(layer freezing),可显著提升编辑知识的保留能力,为未来设计更具鲁棒性的模型编辑方法提供了重要方向。

链接: https://arxiv.org/abs/2507.14198
作者: Fufang Wen,Shichang Zhang
机构: Cranberry-Lemon University (克兰伯里-柠檬大学); University of the Witwatersrand (威特沃特斯兰德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) store vast amounts of knowledge, which often requires updates to correct factual errors, incorporate newly acquired information, or adapt model behavior. Model editing methods have emerged as efficient solutions for such updates, offering localized and precise knowledge modification at significantly lower computational cost than continual training. In parallel, LLMs are frequently fine-tuned for a wide range of downstream tasks. However, the effect of fine-tuning on previously edited knowledge remains poorly understood. In this work, we systematically investigate how different fine-tuning objectives interact with various model editing techniques. Our findings show that edited knowledge is substantially more susceptible to forgetting during fine-tuning than intrinsic knowledge acquired through pre-training. This analysis highlights a key limitation of current editing approaches and suggests that evaluating edit robustness under downstream fine-tuning is critical for their practical deployment. We further find that freezing layers associated with edited content can significantly improve knowledge retention, offering insight into how future editing methods might be made more robust.
zh

[NLP-112] DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在金融、医疗、法律等专业领域作为写作助手时,因缺乏深度领域知识和易产生幻觉(hallucination)而导致的可靠性不足问题。现有方案如检索增强生成(Retrieval-Augmented Generation, RAG)存在多步检索不一致,而基于在线搜索的方法则受限于网络内容质量不稳定。解决方案的关键在于提出 DeepWriter,一个可定制、多模态、面向长文本生成的写作系统,其核心创新包括:任务分解、大纲生成、多模态检索与逐段生成结合反思机制,并采用分层知识表示以提升检索效率与准确性,从而实现从结构化语料库中深度挖掘信息并生成事实准确、逻辑连贯的专业文档。

链接: https://arxiv.org/abs/2507.14189
作者: Song Mao,Lejun Cheng,Pinlong Cai,Guohang Yan,Ding Wang,Botian Shi
机构: Shanghai AI Lab(上海人工智能实验室); Peking University(北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in process

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various applications. However, their use as writing assistants in specialized domains like finance, medicine, and law is often hampered by a lack of deep domain-specific knowledge and a tendency to hallucinate. Existing solutions, such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency across multiple retrieval steps, while online search-based methods often degrade quality due to unreliable web content. To address these challenges, we introduce DeepWriter, a customizable, multimodal, long-form writing assistant that operates on a curated, offline knowledge base. DeepWriter leverages a novel pipeline that involves task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. By deeply mining information from a structured corpus and incorporating both textual and visual elements, DeepWriter generates coherent, factually grounded, and professional-grade documents. We also propose a hierarchical knowledge representation to enhance retrieval efficiency and accuracy. Our experiments on financial report generation demonstrate that DeepWriter produces high-quality, verifiable articles that surpasses existing baselines in factual accuracy and generated content quality.
zh

[NLP-113] A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)中激活稀疏性(activation sparsity)的高效利用问题,即如何在不显著降低模型性能的前提下,通过预测神经元激活模式来减少计算开销。其解决方案的关键在于提出了一种基于聚类的激活模式压缩框架:不再单独预测每个神经元的激活状态,而是将具有相似激活模式的神经元分组为少量代表性簇(cluster),从而将复杂的神经元级预测问题转化为更高效的簇分配预测问题。该方法在保持极低困惑度(PPL)损失的同时,实现了高达79.34%的聚类精度,验证了其在保留模型质量与提升稀疏计算效率方面的有效性。

链接: https://arxiv.org/abs/2507.14179
作者: Nobel Dhar,Bobin Deng,Md Romyull Islam,Xinyue Zhang,Kazi Fahim Ahmad Nasif,Kun Suo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: To be published in Euro-Par 2025

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit significant activation sparsity, where only a subset of neurons are active for a given input. Although this sparsity presents opportunities to reduce computational cost, efficiently utilizing it requires predicting activation patterns in a scalable manner. However, direct prediction at the neuron level is computationally expensive due to the vast number of neurons in modern LLMs. To enable efficient prediction and utilization of activation sparsity, we propose a clustering-based activation pattern compression framework. Instead of treating each neuron independently, we group similar activation patterns into a small set of representative clusters. Our method achieves up to 79.34% clustering precision, outperforming standard binary clustering approaches while maintaining minimal degradation in perplexity (PPL) scores. With a sufficiently large number of clusters, our approach attains a PPL score as low as 12.49, demonstrating its effectiveness in preserving model quality while reducing computational overhead. By predicting cluster assignments rather than individual neuron states, future models can efficiently infer activation patterns from pre-computed centroids. We detail the clustering algorithm, analyze its effectiveness in capturing meaningful activation structures, and demonstrate its potential to improve sparse computation efficiency. This clustering-based formulation serves as a foundation for future work on activation pattern prediction, paving the way for efficient inference in large-scale language models.
zh

[NLP-114] Dissociating model architectures from inference computations

【速读】: 该论文试图解决的问题是:在非马尔可夫(non-Markovian)序列建模中,自回归模型(autoregressive models)与深度时序模型(deep temporal models)在预测分布的因式分解方式(factorisation)与推理过程中实际计算复杂度之间的关系不明确,导致对模型能力与效率的理解存在混淆。解决方案的关键在于区分模型架构本身(即预测分布如何因式分解)与推理阶段的具体计算过程;研究通过在迭代推理中结构化上下文访问机制,发现自回归模型(如Transformer)可以在不牺牲预测能力的前提下,模拟深度时序模型的层次化时间因式分解,从而显著减少实际计算量,表明预测构建与精炼过程并不必然受限于其底层模型架构。

链接: https://arxiv.org/abs/2507.15776
作者: Noor Sajid,Johan Medrano
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 3 pages, 1 figure

点击查看摘要

Abstract:Parr et al., 2025 examines how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that deep temporal computations are mimicked by autoregressive models by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.
zh

[NLP-115] What do Large Language Models know about materials?

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在材料科学与机械工程领域应用时存在的知识准确性问题,尤其是其对材料相关事实信息生成能力的局限性。由于主流LLMs基于互联网非科学内容训练,其内在知识可能缺乏对材料科学中“工艺-结构-性能-性能”(Processing-Structure-Property-Performance, PSPP)链条的专业覆盖。论文的关键解决方案在于构建一个以元素周期表为示例的材料知识基准测试体系,通过分析词汇表(vocabulary)和分词(tokenization)对材料指纹独特性的贡献,评估不同前沿开源模型在生成材料事实信息上的准确性,从而明确LLMs在PSPP链中可适用的具体环节,并指出在哪些场景下需依赖专用模型以保障结果可靠性。

链接: https://arxiv.org/abs/2507.14586
作者: Adrian Ehrenhofer,Thomas Wallmersperger,Gianaurelio Cuniberti
机构: 未知
类目: Applied Physics (physics.app-ph); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied in the fields of mechanical engineering and materials science. As models that establish connections through the interface of language, LLMs can be applied for step-wise reasoning through the Processing-Structure-Property-Performance chain of material science and engineering. Current LLMs are built for adequately representing a dataset, which is the most part of the accessible internet. However, the internet mostly contains non-scientific content. If LLMs should be applied for engineering purposes, it is valuable to investigate models for their intrinsic knowledge – here: the capacity to generate correct information about materials. In the current work, for the example of the Periodic Table of Elements, we highlight the role of vocabulary and tokenization for the uniqueness of material fingerprints, and the LLMs’ capabilities of generating factually correct output of different state-of-the-art open models. This leads to a material knowledge benchmark for an informed choice, for which steps in the PSPP chain LLMs are applicable, and where specialized models are required.
zh

[NLP-116] Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

【速读】: 该论文旨在解决零样本在线语音转换(Zero-shot Online Voice Conversion, VC)在实时通信与娱乐场景中面临的三大挑战:一是难以在实时约束下保持语义保真度,二是生成语音自然度不足,三是对未见过的说话人特征适应能力弱。其解决方案的关键在于提出了一种名为Conan的分块在线零样本语音转换模型,核心创新包括:1)采用Emformer构建流式内容提取器(Stream Content Extractor),实现低延迟的内容编码;2)设计自适应风格编码器(Adaptive Style Encoder),从参考语音中提取细粒度风格特征以提升风格迁移效果;3)引入因果shuffle声码器(Causal Shuffle Vocoder),基于像素重排机制实现完全因果的HiFiGAN架构,确保端到端的实时性与音质一致性。实验表明,Conan在主观和客观指标上均优于基线模型。

链接: https://arxiv.org/abs/2507.14534
作者: Yu Zhang,Baotong Tian,Zhiyao Duan
机构: University of Rochester (罗切斯特大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at this https URL.
zh

计算机视觉

[CV-0] Diffusion Beats Autoregressive in Data-Constrained Settings

【速读】:该论文旨在解决在数据受限场景下,传统自回归(Autoregressive, AR)语言模型性能瓶颈的问题。其核心挑战在于:当训练数据有限但计算资源充足时,AR模型难以充分挖掘数据潜力,导致验证损失较高和下游任务表现不佳。论文提出的解决方案是采用掩码扩散模型(masked diffusion models),其关键创新在于通过重复遍历有限数据并引入随机掩码机制,使模型暴露于多样化的token顺序和预测任务中,从而实现隐式数据增强(implicit data augmentation)。这种机制显著提升了模型对重复数据的利用效率,最终在计算资源充裕时表现出优于AR模型的性能。研究进一步揭示了扩散模型的新缩放定律,并推导出扩散模型开始超越AR模型的临界计算阈值的闭合表达式。

链接: https://arxiv.org/abs/2507.15857
作者: Mihir Prabhudesai,Menging Wu,Amir Zadeh,Katerina Fragkiadaki,Deepak Pathak
机构: Carnegie Mellon University (卡内基梅隆大学); Lambda
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Webpage: this https URL

点击查看摘要

Abstract:Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings-where training involves repeated passes over limited data-and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: this https URL.
zh

[CV-1] Latent Denoising Makes Good Visual Tokenizers

【速读】:该论文试图解决的问题是:当前视觉分词器(visual tokenizer)在生成建模中的有效性缺乏明确的设计原则,其性能提升依赖经验性设计而非理论指导。解决方案的关键在于提出一种基于去噪(denoising)目标的分词器设计范式——Latent Denoising Tokenizer (l-DeTok),通过直接对齐分词器嵌入(embedding)与下游去噪任务,使潜在表示在遭受插值噪声和随机掩码等严重干扰时仍能被高效重建,从而提升生成模型的整体性能。实验表明,该方法在ImageNet 256x256数据集上显著优于六种代表性生成模型所使用的标准分词器。

链接: https://arxiv.org/abs/2507.15856
作者: Jiawei Yang,Tianhong Li,Lijie Fan,Yonglong Tian,Yue Wang
机构: USC(南加州大学); MIT CSAIL(麻省理工学院计算机科学与人工智能实验室); Google DeepMind(谷歌深度心智); OpenAI(OpenAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Despite their fundamental role, it remains unclear what properties could make visual tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective – reconstructing clean signals from corrupted inputs such as Gaussian noise or masking – a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be more easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.
zh

[CV-2] SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

【速读】:该论文旨在解决视频目标分割(Video Object Segmentation, VOS)中现有方法在应对剧烈视觉变化、遮挡和复杂场景变换时性能不足的问题,其根源在于当前模型过度依赖外观匹配而缺乏对物体的高阶概念理解。解决方案的关键在于提出一种概念驱动的分割框架——Segment Concept (SeC),该框架通过大型视觉语言模型(Large Vision-Language Models, LVLMs)构建跨帧的语义先验,并在此基础上逐步生成以对象为中心的高层表示;在推理阶段,SeC基于已处理帧形成目标的完整语义表征,实现后续帧的鲁棒分割,并自适应地平衡LVLM语义推理与增强特征匹配之间的计算资源分配,从而提升模型在复杂场景下的泛化能力。

链接: https://arxiv.org/abs/2507.15852
作者: Zhixiong Zhang,Shuangrui Ding,Xiaoyi Dong,Songxin He,Jianfan Lin,Junsong Tang,Yuhang Zang,Yuhang Cao,Dahua Lin,Jiaqi Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL code: this https URL dataset: this https URL

点击查看摘要

Abstract:Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.
zh

[CV-3] Look Focus Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

【速读】:该论文旨在解决机器人学习系统中视觉处理效率低下的问题,即传统方法依赖对原始摄像头图像的被动、均匀处理,而人类视觉则是通过主动注视(gaze)聚焦于任务相关区域,从而显著降低视觉计算负担并提升感知效率。其解决方案的关键在于引入类人主动视觉机制,构建一个模拟人类头部运动与眼动追踪的主动视觉(Active Vision)机器人系统,并基于视锥化图像处理(foveated image processing)技术,将人类眼动数据与机器人操作示范同步采集,形成用于训练的基准数据集与仿真环境。特别地,作者在Vision Transformer (ViT) 中采用受图像分割启发的视锥化补丁标记化(foveated patch tokenization)方案,在保持感兴趣区域视觉保真度的同时大幅减少令牌数量和计算量;同时提出两种融合 gaze 信息的策略:一是两阶段模型先预测 gaze 再指导动作,二是端到端联合预测 gaze 与动作。实验表明,该方法不仅显著降低计算开销,还提升了高精度任务性能及对未见干扰物的鲁棒性,验证了人类视觉机制作为机器人视觉系统的有益归纳偏置(inductive bias)。

链接: https://arxiv.org/abs/2507.15833
作者: Ian Chuang,Andrew Lee,Dechen Gao,Jinyu Zou,Iman Soltani
机构: University of California, Berkeley (加州大学伯克利分校); University of California, Davis (加州大学戴维斯分校); Tongji University (同济大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens-and thus computation-without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems. this https URL
zh

[CV-4] Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)生成模型在物理常识推理方面的不足,即模型生成的视频常违背基本的因果关系、物体行为及工具使用等直观物理规律。为应对这一问题,作者提出了PhysVidBench基准测试框架,其关键在于设计了一个三阶段评估流程:首先从文本提示中提取具身物理问题,其次利用视觉语言模型对生成视频进行描述性 captioning,最后通过语言模型仅基于caption回答涉及物理常识的问题。该间接评估策略有效规避了直接视频分析中常见的幻觉问题,并聚焦于工具使用与材料属性等以往被忽视的物理推理维度,从而提供了一个结构化且可解释的物理常识评估体系。

链接: https://arxiv.org/abs/2507.15824
作者: Enes Sanli,Baris Sarper Tezcan,Aykut Erdem,Erkut Erdem
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts, emphasizing tool use, material properties, and procedural interactions, and domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
zh

[CV-5] Diffusion models for multivariate subsurface generation and efficient probabilistic inversion

【速读】:该论文旨在解决多变量地下建模与概率反演中生成式模型的性能瓶颈问题,特别是如何在保持统计稳健性的同时提升条件采样效率并降低计算成本。其解决方案的关键在于改进扩散后验采样(Diffusion Posterior Sampling)方法,引入一种考虑扩散建模固有噪声污染的似然近似策略,并通过调整更新规则实现对局部硬数据(如测井数据)和非线性地球物理数据(如全叠地震数据)的联合条件约束。该方法将反演过程内嵌于扩散过程中,避免了传统基于马尔可夫链蒙特卡洛(Markov Chain Monte Carlo, MCMC)等外循环迭代的高计算开销,从而显著提升了后验概率密度函数的采样质量与整体计算效率。

链接: https://arxiv.org/abs/2507.15809
作者: Roberto Miele,Niklas Linde
机构: Institute of Earth Sciences (地球科学研究所); University of Lausanne (洛桑大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Diffusion models offer stable training and state-of-the-art performance for deep generative modeling tasks. Here, we consider their use in the context of multivariate subsurface modeling and probabilistic inversion. We first demonstrate that diffusion models enhance multivariate modeling capabilities compared to variational autoencoders and generative adversarial networks. In diffusion modeling, the generative process involves a comparatively large number of time steps with update rules that can be modified to account for conditioning data. We propose different corrections to the popular Diffusion Posterior Sampling approach by Chung et al. (2023). In particular, we introduce a likelihood approximation accounting for the noise-contamination that is inherent in diffusion modeling. We assess performance in a multivariate geological scenario involving facies and correlated acoustic impedance. Conditional modeling is demonstrated using both local hard data (well logs) and nonlinear geophysics (fullstack seismic data). Our tests show significantly improved statistical robustness, enhanced sampling of the posterior probability density function and reduced computational costs, compared to the original approach. The method can be used with both hard and indirect conditioning data, individually or simultaneously. As the inversion is included within the diffusion process, it is faster than other methods requiring an outer-loop around the generative model, such as Markov chain Monte Carlo.
zh

[CV-6] rue Multimodal In-Context Learning Needs Attention to the Visual Context

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在多模态上下文学习(Multimodal In-Context Learning, MICL)中对视觉信息利用不足的问题,即模型倾向于依赖文本模式而忽视图像内容,导致MICL本质上仍为单模态行为,限制了其实际应用价值。解决方案的关键在于提出两种创新方法:一是动态注意力重分配(Dynamic Attention Reallocation, DARA),通过微调策略重新平衡视觉与文本token之间的注意力分布,引导模型关注视觉上下文;二是构建TrueMICL数据集,该数据集明确要求任务完成必须整合多模态信息(尤其是视觉内容),从而可靠评估模型的真实多模态适应能力。实验表明,该综合方案显著提升了模型的真正多模态上下文学习性能。

链接: https://arxiv.org/abs/2507.15807
作者: Shuo Chen,Jianzhe Liu,Zhen Han,Yan Xia,Daniel Cremers,Philip Torr,Volker Tresp,Jindong Gu
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Technical University of Munich (慕尼黑工业大学); Siemens AG (西门子股份公司); University of Science and Technology of China (中国科学技术大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心); Konrad Zuse School of Excellence in Reliable AI (relAI) (康拉德·祖塞可靠人工智能卓越学院); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted to COLM 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information-particularly visual content-for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at this https URL .
zh

[CV-7] ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction ICCV2025

【速读】:该论文旨在解决像素级视觉任务(如语义分割)中标签稀缺问题,即如何利用少量标注数据与大量未标注数据实现高效且准确的模型训练。其核心挑战在于:尽管基础分割模型(如SEEM)具备强大的跨域泛化能力,但直接使用其生成的预测掩码作为监督信号会导致不确定性高、可靠性差的问题。解决方案的关键在于提出ConformalSAM框架,该框架首先通过目标域的少量标注数据对基础模型进行不确定性校准(基于共形预测,Conformal Prediction, CP),从而筛选出高置信度的像素标签用于监督学习;随后引入自依赖训练策略,在训练后期缓解对SEEM生成掩码的过拟合问题。此方法有效提升了早期学习阶段的可靠性,并在多个标准半监督语义分割基准上显著优于现有方法。

链接: https://arxiv.org/abs/2507.15803
作者: Danhui Chen,Ziquan Liu,Chuxi Yang,Dan Wang,Yan Yan,Yi Xu,Xiangyang Ji
机构: Dalian University of Technology (大连理工大学); Queen Mary University of London (伦敦玛丽女王大学); Washington State University (华盛顿州立大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICCV 2025

点击查看摘要

Abstract:Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, the advent of foundational segmentation models pre-trained on massive data, has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in the pixel-level vision task as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain’s labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM exploits the strong capability of the foundational segmentation model reliably which benefits the early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks in the later training stage. Our experiment demonstrates that, on three standard benchmarks of SSSS, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost the performance of those methods as a plug-in.
zh

[CV-8] Exploring Superposition and Interference in State-of-the-Art Low-Parameter Vision Models

【速读】:该论文旨在解决低参数量深度神经网络(<1.5M参数)在计算机视觉任务中因特征图干扰(feature map interference)导致的性能瓶颈问题,该干扰现象与超线性激活函数(superlinear activation functions)引发的特征叠加(superposition)密切相关。解决方案的关键在于通过优化瓶颈结构(bottleneck architectures)来减少特征间的干扰,从而提升模型的可扩展性和准确性;研究据此提出了一种名为NoDepth Bottleneck的原型架构,其设计基于实验中获得的机制性洞见,在ImageNet数据集上实现了稳健的缩放准确率,为低参数场景下的高效神经网络设计提供了新路径。

链接: https://arxiv.org/abs/2507.15798
作者: Lilian Hollard,Lucas Mohimont,Nathalie Gaveau,Luiz-Angelo Steffenel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The paper investigates the performance of state-of-the-art low-parameter deep neural networks for computer vision, focusing on bottleneck architectures and their behavior using superlinear activation functions. We address interference in feature maps, a phenomenon associated with superposition, where neurons simultaneously encode multiple characteristics. Our research suggests that limiting interference can enhance scaling and accuracy in very low-scaled networks (under 1.5M parameters). We identify key design elements that reduce interference by examining various bottleneck architectures, leading to a more efficient neural network. Consequently, we propose a proof-of-concept architecture named NoDepth Bottleneck built on mechanistic insights from our experiments, demonstrating robust scaling accuracy on the ImageNet dataset. These findings contribute to more efficient and scalable neural networks for the low-parameter range and advance the understanding of bottlenecks in computer vision. this https URL
zh

[CV-9] Regularized Low-Rank Adaptation for Few-Shot Organ Segmentation MICCAI2025

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中低秩适应(Low-Rank Adaptation, LoRA)方法因固定秩(rank)设置而难以适配医学图像分割任务复杂性的难题。其关键解决方案在于引入基于奇异值分解(Singular Value Decomposition, SVD)的可学习低秩表示,并在损失函数中加入 $ l_1 $ 稀疏正则项,通过近端优化器进行求解,从而实现训练过程中自动调整内在秩(intrinsic rank),使模型能够根据下游任务特性自适应地确定最优秩,显著提升性能并增强对初始秩选择不敏感的鲁棒性。

链接: https://arxiv.org/abs/2507.15793
作者: Ghassen Baklouti,Julio Silva-Rodríguez,Jose Dolz,Houda Bahig,Ismail Ben Ayed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) of pre-trained foundation models is increasingly attracting interest in medical imaging due to its effectiveness and computational efficiency. Among these methods, Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace. While it has shown good performance, its implementation requires a fixed and unalterable rank, which might be challenging to select given the unique complexities and requirements of each medical imaging downstream task. Inspired by advancements in natural image processing, we introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation. Viewing the low-rank representation of the trainable weight matrices as a singular value decomposition, we introduce an l_1 sparsity regularizer to the loss function, and tackle it with a proximal optimizer. The regularizer could be viewed as a penalty on the decomposition rank. Hence, its minimization enables to find task-adapted ranks automatically. Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to the standard LoRA and then to several other PEFT methods across two distinguishable tasks: base organs and novel organs. Our extensive experiments demonstrate the significant performance improvements driven by our method, highlighting its efficiency and robustness against suboptimal rank initialization. Our code is publicly available: this https URL
zh

[CV-10] Label tree semantic losses for rich multi-class medical image segmentation

【速读】:该论文旨在解决医学图像分割中因传统学习方法对所有错误一视同仁而无法有效利用标签空间中类别间语义信息的问题,尤其是在标签类别数量增多且存在细微差异时性能下降明显。其解决方案的关键在于提出两种基于树结构的语义损失函数(tree-based semantic loss functions),充分利用标签的层次化组织关系来增强模型对类别语义的理解能力,并进一步将其集成到一种仅需稀疏、无背景标注的训练方法中,从而提升在全监督与弱监督场景下的分割精度与鲁棒性。

链接: https://arxiv.org/abs/2507.15777
作者: Junwen Wang,Oscar MacCormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren
机构: King’s College London (国王学院); King’s College Hospital (国王学院医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2506.21150

点击查看摘要

Abstract:Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the labels space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
zh

[CV-11] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization ACM-MM’25

【速读】:该论文旨在解决动态面部表情识别(Dynamic Facial Expression Recognition, DFER)在多源数据和个体表达差异导致的样本异质性下性能下降的问题。解决方案的关键在于提出一种新型的异质性感知分布框架(Heterogeneity-aware Distributional Framework, HDF),其核心包含两个即插即用模块:一是时频分布注意力模块(Time-Frequency Distributional Attention Module, DAM),通过双分支注意力机制同时捕捉时间一致性与频域鲁棒性,提升对序列不一致性和视觉风格变化的容忍度;二是分布感知缩放模块(Distribution-aware Scaling Module, DSM),基于梯度敏感性和信息瓶颈原理,自适应平衡分类损失与对比损失,从而实现更稳定且判别力更强的表示学习。

链接: https://arxiv.org/abs/2507.15765
作者: Feng-Qi Cui,Anyang Tong,Jinyang Huang,Jie Zhang,Dan Guo,Zhi Liu,Meng Wang
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学); IHPC and CFAR, Agency for Science, Technology and Research (新加坡科技研究局); The University of Electro-Communications (电波通信大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM’25

点击查看摘要

Abstract:Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at this https URL.
zh

[CV-12] Appearance Harmonization via Bilateral Grid Prediction with Transformers for 3DGS NEURIPS2025

【速读】:该论文旨在解决现代相机管线在多视角图像中引入的光度不一致性问题,这种不一致性源于曝光调整、白平衡和色彩校正等设备端处理操作,导致视图间外观差异,破坏多视角一致性并降低新视角合成质量。解决方案的关键在于提出一种基于Transformer的方法,通过预测空间自适应的双边网格(bilateral grids)来以多视角一致的方式校正光度变化,从而实现无需场景特定重训练即可跨场景鲁棒泛化;同时将学习到的网格集成至3D高斯泼溅(3D Gaussian Splatting)流程中,在保持高效训练的同时显著提升重建质量。

链接: https://arxiv.org/abs/2507.15748
作者: Jisu Shin,Richard Shaw,Seunghyun Shin,Anton Pelykh,Zhensong Zhang,Hae-Gon Jeon,Eduardo Perez-Pellitero
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); GIST AI Graduate School (韩国科学技术院人工智能研究生院); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, NeurIPS 2025 under review

点击查看摘要

Abstract:Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial individually, often introduce photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade the quality of novel view synthesis. Joint optimization of scene representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training. In this work, we propose a transformer-based method that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner, enabling robust cross-scene generalization without the need for scene-specific retraining. By incorporating the learned grids into the 3D Gaussian Splatting pipeline, we improve reconstruction quality while maintaining high training efficiency. Extensive experiments show that our approach outperforms or matches existing scene-specific optimization methods in reconstruction fidelity and convergence speed.
zh

[CV-13] okensGen: Harnessing Condensed Tokens for Long Video Generation

【速读】:该论文旨在解决生成一致性的长视频(long videos)这一复杂挑战,尤其针对基于扩散模型(diffusion-based generative models)在扩展至长时间段时出现的内存瓶颈和长期不一致性问题。其解决方案的关键在于提出了一种两阶段框架TokensGen,通过压缩视频为语义丰富的“token”来实现高效建模:首先训练一个由文本和视频token引导的短视频扩散模型To2V(Token-to-Video),利用视频分词器(Video Tokenizer)将短片段编码为紧凑token;其次引入T2To(Text-to-Token)模型一次性生成所有token,保障跨片段的全局一致性;最后在推理阶段采用自适应FIFO-Diffusion策略实现相邻片段间的平滑过渡,减少边界伪影。该方法有效提升了长视频的时间连贯性和内容一致性,同时保持较低计算开销,具备可扩展性和模块化特性。

链接: https://arxiv.org/abs/2507.15728
作者: Wenqi Ouyang,Zeqi Xiao,Danni Yang,Yifan Zhou,Shuai Yang,Lei Yang,Jianlou Si,Xingang Pan
机构: S-Lab, Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技); Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at this https URL .
zh

[CV-14] A Practical Investigation of Spatially-Controlled Image Generation with Transformers

【速读】:该论文旨在解决空间可控图像生成(spatially-controlled image generation)领域中因训练数据、模型架构和生成范式差异导致的性能因素难以解耦的问题,从而缺乏公平且细致的科学比较。其核心挑战在于现有方法虽在性能上取得进展,但方法间的差异性使得无法清晰评估各技术贡献,且部分方法的设计动机与细节在文献中被模糊化。解决方案的关键在于通过受控实验对扩散模型(diffusion-based)与流模型(flow-based)以及自回归模型(autoregressive, AR)进行系统性对比,提出控制标记预填充(control token prefilling)作为通用且高效的基线方法,并发现采样阶段的改进——如将无分类器引导(classifier-free guidance)扩展至控制任务及softmax截断(softmax truncation)——显著提升控制一致性;同时重新阐明适配器方法(adapter-based approaches)的核心优势:在有限下游数据下可缓解“遗忘”问题并维持生成质量,但在生成-控制一致性方面仍逊于全量微调。

链接: https://arxiv.org/abs/2507.15724
作者: Guoxuan Xia,Harleen Hanspal,Petru-Daniel Tudosiu,Shifeng Zhang,Sarah Parisot
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint

点击查看摘要

Abstract:Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate “forgetting” and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.
zh

[CV-15] Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation ICCV

【速读】:该论文旨在解决人脸图像质量评估(Face Image Quality Assessment, FIQA)算法在实际应用中面临的计算复杂度高、难以部署的问题。其核心解决方案是提出一种两阶段的蒸馏框架:首先训练一个强大的教师模型,通过自训练策略利用大量未标注数据生成伪标签以增强模型性能;随后将该教师模型的知识蒸馏到轻量级的学生模型中,学生模型在融合原始标注数据与多轮伪标签数据的基础上进行训练,从而在保持接近教师模型性能的同时显著降低计算开销。该方法在ICCV 2025 VQualA FIQA挑战赛中获得第一名,验证了其高效性和实用性。

链接: https://arxiv.org/abs/2507.15709
作者: Wei Sun,Weixia Zhang,Linhan Cao,Jun Jia,Xiangyang Zhu,Dandan Zhu,Xiongkuo Min,Guangtao Zhai
机构: East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Efficient-FIQA achieved first place in the ICCV VQualA 2025 Face Image Quality Assessment Challenge

点击查看摘要

Abstract:Face image quality assessment (FIQA) is essential for various face-related applications. Although FIQA has been extensively studied and achieved significant progress, the computational complexity of FIQA algorithms remains a key concern for ensuring scalability and practical deployment in real-world systems. In this paper, we aim to develop a computationally efficient FIQA method that can be easily deployed in real-world applications. Specifically, our method consists of two stages: training a powerful teacher model and distilling a lightweight student model from it. To build a strong teacher model, we adopt a self-training strategy to improve its capacity. We first train the teacher model using labeled face images, then use it to generate pseudo-labels for a set of unlabeled images. These pseudo-labeled samples are used in two ways: (1) to distill knowledge into the student model, and (2) to combine with the original labeled images to further enhance the teacher model through self-training. The enhanced teacher model is used to further pseudo-label another set of unlabeled images for distilling the student models. The student model is trained using a combination of labeled images, pseudo-labeled images from the original teacher model, and pseudo-labeled images from the enhanced teacher model. Experimental results demonstrate that our student model achieves comparable performance to the teacher model with an extremely low computational overhead. Moreover, our method achieved first place in the ICCV 2025 VQualA FIQA Challenge. The code is available at this https URL.
zh

[CV-16] DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting

【速读】:该论文旨在解决稀疏视图三维高斯泼溅(Sparse-view 3D Gaussian Splatting, 3DGS)在重建高质量新视角时易过拟合训练视图中高频(High-Frequency, HF)细节的问题。传统基于傅里叶变换的频率正则化方法存在参数调优困难且倾向于有害的高频学习偏差。其解决方案的关键在于提出DWTGS框架,通过小波域(Wavelet Transform, DWT)损失实现更有效的频率正则化:具体而言,在多尺度小波分解中仅监督低频(Low-Frequency, LF)LL子带,并以自监督方式约束高频HH子带的稀疏性,从而提升泛化能力并减少高频幻觉。

链接: https://arxiv.org/abs/2507.15690
作者: Hung Nguyen,Runfa Li,An Le,Truong Nguyen
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in reconstructing high-quality novel views, as it often overfits to the widely-varying high-frequency (HF) details of the sparse training views. While frequency regularization can be a promising approach, its typical reliance on Fourier transforms causes difficult parameter tuning and biases towards detrimental HF learning. We propose DWTGS, a framework that rethinks frequency regularization by leveraging wavelet-space losses that provide additional spatial supervision. Specifically, we supervise only the low-frequency (LF) LL subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband in a self-supervised manner. Experiments across benchmarks show that DWTGS consistently outperforms Fourier-based counterparts, as this LF-centric strategy improves generalization and reduces HF hallucinations.
zh

[CV-17] LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression ICCV2025

【速读】:该论文旨在解决现有基于人工智能(AI)的点云压缩方法对特定训练数据分布依赖性强的问题,从而限制了其在真实场景中的部署。其核心解决方案是提出一种基于隐式神经表示(Implicit Neural Representation, INR)的无损点云几何压缩方法——LINR-PCGC,该方法通过将过拟合网络参数编码至比特流中实现分布无关性。关键创新在于:1)设计了一种基于点云组级别的编码框架与高效的网络初始化策略,使编码速度提升约60%;2)提出一种基于多尺度稀疏卷积(multiscale SparseConv)的轻量级编码网络,包含尺度上下文提取、子节点预测和模型压缩模块,显著加快推理速度并减小解码器规模。实验表明,该方法在MVUB数据集上相较于G-PCC TMC13v23和SparsePCGC分别减少比特率约21.21%和21.95%,且首次实现了INR驱动的无损点云几何压缩。

链接: https://arxiv.org/abs/2507.15686
作者: Wenjie Huang,Qi Yang,Shuting Xia,He Huang,Zhu Li,Yiling Xu
机构: Shanghai Jiao Tong University (上海交通大学); University of Missouri-Kansas City (密苏里大学堪萨斯城分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR based methods only consider lossy geometry compression. In this paper, we propose the first INR based lossless point cloud geometry compression method called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding speed, we design a group of point clouds level coding framework with an effective network initialization strategy, which can reduce around 60% encoding time. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time in the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on this https URL.
zh

[CV-18] Hi2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing

【速读】:该论文旨在解决遥感场景下视觉重定位(visual relocalization)中存在的精度不足与计算复杂度高的问题,尤其是在大规模场景、高海拔变化及现有视觉先验存在域差异的情况下。现有方法在基于图像的检索与位姿回归之间存在权衡,而依赖Structure-from-Motion(SfM)模型的结构化方法则面临可扩展性差和计算开销大的挑战。其解决方案的关键在于引入3D高斯泼溅(3D Gaussian Splatting, 3DGS)作为新型场景表示,利用高斯基元中蕴含的丰富语义信息与几何约束,构建了一个双层级(sparse-to-dense and coarse-to-fine)重定位框架——Hi²-GSLoc。该框架通过分层策略实现从稀疏初始位姿估计到稠密迭代优化的全过程:第一阶段采用专为高斯设计的一致性渲染感知采样和地标引导检测器进行鲁棒初定位;第二阶段通过粗到细的稠密光栅化匹配与可靠性验证机制实现位姿精修,同时结合分区高斯训练、GPU加速并行匹配与动态内存管理以支持大规模遥感应用。

链接: https://arxiv.org/abs/2507.15683
作者: Boni Hu,Zhenyu Xia,Lin Chen,Pengcheng Han,Shuhui Bu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures

点击查看摘要

Abstract:Visual relocalization, which estimates the 6-degree-of-freedom (6-DoF) camera pose from query images, is fundamental to remote sensing and UAV applications. Existing methods face inherent trade-offs: image-based retrieval and pose regression approaches lack precision, while structure-based methods that register queries to Structure-from-Motion (SfM) models suffer from computational complexity and limited scalability. These challenges are particularly pronounced in remote sensing scenarios due to large-scale scenes, high altitude variations, and domain gaps of existing visual priors. To overcome these limitations, we leverage 3D Gaussian Splatting (3DGS) as a novel scene representation that compactly encodes both 3D geometry and appearance. We introduce \mathrmHi^2 -GSLoc, a dual-hierarchical relocalization framework that follows a sparse-to-dense and coarse-to-fine paradigm, fully exploiting the rich semantic information and geometric constraints inherent in Gaussian primitives. To handle large-scale remote sensing scenarios, we incorporate partitioned Gaussian training, GPU-accelerated parallel matching, and dynamic memory management strategies. Our approach consists of two stages: (1) a sparse stage featuring a Gaussian-specific consistent render-aware sampling strategy and landmark-guided detector for robust and accurate initial pose estimation, and (2) a dense stage that iteratively refines poses through coarse-to-fine dense rasterization matching while incorporating reliability verification. Through comprehensive evaluation on simulation data, public datasets, and real flight experiments, we demonstrate that our method delivers competitive localization accuracy, recall rate, and computational efficiency while effectively filtering unreliable pose estimates. The results confirm the effectiveness of our approach for practical remote sensing applications.
zh

[CV-19] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment

【速读】:该论文旨在解决基于CLIP的图像质量评估(Image Quality Assessment, IQA)方法中存在的参数冗余和局部失真特征识别能力不足的问题。其解决方案的关键在于提出一种模态自适应的知识蒸馏(knowledge distillation)策略,通过设计分级提示模板引导CLIP输出质量分数并对其进行微调,从而将CLIP在IQA任务中的知识有效迁移至结构更优的学生模型中,显著降低模型复杂度的同时提升性能。

链接: https://arxiv.org/abs/2507.15680
作者: Yongkang Hou,Jiarun Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP’s IQA knowledge. First, quality-graded prompt templates were designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to achieve guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.
zh

[CV-20] HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark

【速读】:该论文旨在解决当前多语言视觉问答(Multilingual Visual Question Answering, MLVQA)模型在处理手写文档时能力不足的问题,特别是缺乏高质量、真实场景下的多语言手写文档理解基准数据集。其解决方案的关键在于构建HW-MLVQA——一个专为多语言手写文档设计的新型视觉问答(VQA)基准,包含1600页真实手写文档及2400组问题-答案对,并提供覆盖文本、图像以及图文融合三种模态的评估框架。该基准特别模拟无真实文本转录的现实场景,从而严格评测OCR模型性能,推动多语言手写文档理解领域的技术进步与研究创新。

链接: https://arxiv.org/abs/2507.15655
作者: Aniket Pal,Ajoy Mondal,Minesh Mathew,C.V. Jawahar
机构: IIIT Hyderabad (印度国际信息技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is a minor revision of the original paper submitted to IJDAR

点击查看摘要

Abstract:The proliferation of MultiLingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multi-modal LLMs, thereby enabling them to adeptly capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite its potential, the current MLVQA model struggles to fully utilize its capabilities when dealing with the extensive variety of handwritten documents. This article delineates HW-MLVQA, an avant-garde VQA benchmark meticulously crafted to mitigate the dearth of authentic Multilingual Handwritten document comprehension. HW-MLVQA encompasses an extensive collection of 1,600 handwritten Pages complemented by 2,400 question-answers. Furthermore, it provides a robust benchmark evaluation framework spanning three distinct modalities: text, image, and an integrated image text modality. To simulate authentic real-world contexts devoid of ground truth textual transcriptions, we facilitates a rigorous assessment of proprietary and open-source OCR models. The benchmark aspires to facilitate pivotal advancements in multilingual handwritten document interpretation, fostering innovation and scholarly inquiry within this specialized domain.
zh

[CV-21] Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的对象幻觉(object hallucination)问题,即模型在生成内容时会错误地引入图像中并不存在的对象,这主要源于模型内部先验知识对视觉信息的过度抑制,尤其是在中间层阶段。解决方案的关键在于提出一种无需训练、模型无关的动态层选择方法——基于提取视觉事实的解码(Decoding by Extracting Visual Facts, EVA),其核心机制是通过对比原始输入与纯文本输入在选定中间层输出分布的差异,提取出视觉事实知识,并将其按比例融合至最终层的输出logits中以修正生成结果,从而有效降低幻觉率。

链接: https://arxiv.org/abs/2507.15652
作者: Haoran Zhou,Zihan Zhang,Hao Chen
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made significant strides by combining visual recognition and language understanding to generate content that is both coherent and contextually accurate. However, MLLMs continue to struggle with object hallucinations, where models produce seemingly plausible but factually incorrect outputs, including objects that do not exist in the image. Recent work has revealed that the prior knowledge in MLLMs significantly suppresses visual information in deep layers, causing hallucinatory outputs. However, how these priors suppress visual information at the intermediate layer stage in MLLMs remains unclear. We observe that visual factual knowledge and the differences between intermediate-layer prior/original probability distributions show similar evolutionary trends in intermediate layers. Motivated by this, we introduce Decoding by Extracting Visual Facts (EVA), a simple, training-free method that dynamically selects intermediate layers with the most significant visual factual information. By contrasting the output distributions of the selected layer derived from the original input and pure-text input, EVA extracts visual factual knowledge and proportionally incorporates it into the final layer to correct the output logits. Importantly, EVA is model-agnostic, seamlessly integrates with various classic decoding strategies, and is applicable across different MLLMs. We validate EVA on widely-used benchmarks, and the results show that it significantly reduces hallucination rates compared to baseline methods, underscoring its effectiveness in mitigating hallucinations.
zh

[CV-22] Uncovering Critical Features for Deepfake Detection through the Lottery Ticket Hypothesis

【速读】:该论文旨在解决深度伪造(deepfake)检测模型在实际部署中面临的两大挑战:一是现有检测方法的可解释性不足,二是模型参数量庞大导致难以在资源受限环境中应用。其解决方案的关键在于引入彩票假设(Lottery Ticket Hypothesis, LTH),通过迭代幅度剪枝(iterative magnitude pruning)识别出具有高检测性能的稀疏子网络(winning tickets)。实验表明,这些子网络可在高达80%稀疏度下仍保持接近原始模型的准确率(如MesoNet在OpenForensic数据集上保留56.2%准确率,仅需3,000参数),且剪枝后的模型对关键面部区域的关注度保持稳定,同时具备跨数据集迁移能力,从而为轻量化、高效、可部署的深度伪造检测系统提供了可行路径。

链接: https://arxiv.org/abs/2507.15636
作者: Lisan Al Amin,Md. Ismail Hossain,Thanh Thi Nguyen,Tasnim Jahan,Mahbubul Islam,Faisal Quader
机构: Cognitive Links, Maryland, USA; Apurba-NSU R&D Lab, North South University, Bangladesh; AiLECS Lab, Monash University, Australia; United International University, Bangladesh; University of Maryland, Baltimore County, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:Recent advances in deepfake technology have created increasingly convincing synthetic media that poses significant challenges to information integrity and social trust. While current detection methods show promise, their underlying mechanisms remain poorly understood, and the large sizes of their models make them challenging to deploy in resource-limited environments. This study investigates the application of the Lottery Ticket Hypothesis (LTH) to deepfake detection, aiming to identify the key features crucial for recognizing deepfakes. We examine how neural networks can be efficiently pruned while maintaining high detection accuracy. Through extensive experiments with MesoNet, CNN-5, and ResNet-18 architectures on the OpenForensic and FaceForensics++ datasets, we find that deepfake detection networks contain winning tickets, i.e., subnetworks, that preserve performance even at substantial sparsity levels. Our results indicate that MesoNet retains 56.2% accuracy at 80% sparsity on the OpenForensic dataset, with only 3,000 parameters, which is about 90% of its baseline accuracy (62.6%). The results also show that our proposed LTH-based iterative magnitude pruning approach consistently outperforms one-shot pruning methods. Using Grad-CAM visualization, we analyze how pruned networks maintain their focus on critical facial regions for deepfake detection. Additionally, we demonstrate the transferability of winning tickets across datasets, suggesting potential for efficient, deployable deepfake detection systems.
zh

[CV-23] Experimenting active and sequential learning in a medieval music manuscript

【速读】:该论文旨在解决古籍音乐手稿中光学音乐识别(Optical Music Recognition, OMR)因标注数据稀缺和历史文献复杂性导致的性能瓶颈问题。其解决方案的关键在于引入基于不确定性的主动学习(Active Learning, AL)与顺序学习(Sequential Learning, SL)策略,利用YOLOv8模型在迭代过程中选择预测置信度最低(即不确定性最高)的样本进行人工标注与重新训练,从而以极少的初始标注数据实现接近全监督训练的检测与版面识别精度。实验表明,在匿名项目提供的中世纪意大利“laude”乐谱新数据集上,该方法能显著减少人工标注负担,但同时也指出当前基于不确定性的AL策略在特定场景下效果有限,亟需更高效的采样机制以应对数据稀缺挑战。

链接: https://arxiv.org/abs/2507.15633
作者: Sachin Sharma(GSSI),Federico Simonetta(GSSI),Michele Flammini(GSSI)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, accepted at IEEE MLSP 2025 (IEEE International Workshop on Machine Learning for Signal Processing). Special Session: Applications of AI in Cultural and Artistic Heritage

点击查看摘要

Abstract:Optical Music Recognition (OMR) is a cornerstone of music digitization initiatives in cultural heritage, yet it remains limited by the scarcity of annotated data and the complexity of historical manuscripts. In this paper, we present a preliminary study of Active Learning (AL) and Sequential Learning (SL) tailored for object detection and layout recognition in an old medieval music manuscript. Leveraging YOLOv8, our system selects samples with the highest uncertainty (lowest prediction confidence) for iterative labeling and retraining. Our approach starts with a single annotated image and successfully boosts performance while minimizing manual labeling. Experimental results indicate that comparable accuracy to fully supervised training can be achieved with significantly fewer labeled examples. We test the methodology as a preliminary investigation on a novel dataset offered to the community by the Anonymous project, which studies laude, a poetical-musical genre spread across Italy during the 12th-16th Century. We show that in the manuscript at-hand, uncertainty-based AL is not effective and advocates for more usable methods in data-scarcity scenarios.
zh

[CV-24] Gaussian Splatting with Discretized SDF for Relightable Assets

【速读】:该论文旨在解决基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的逆向渲染(inverse rendering)中几何约束难以施加的问题,尤其在处理离散高斯原语时无法有效利用连续的符号距离场(Signed Distance Field, SDF)约束。其关键解决方案是引入一种离散化SDF表示方法,将SDF值编码到每个高斯内作为采样值,并通过SDF到不透明度的映射关系实现SDF的溅射渲染,从而避免了传统方法中需额外存储连续SDF或复杂优化设计的问题。此外,为保证离散SDF样本与真实SDF的一致性,提出基于投影的几何一致性损失,强制高斯投影至SDF零等值面并与溅射表面对齐,显著提升了光照重建质量,同时保持内存开销与原始3DGS相当。

链接: https://arxiv.org/abs/2507.15629
作者: Zuo-Liang Zhu,Jian Yang,Beibei Wang
机构: Nankai University (南开大学); Nanjing University (南京大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) has shown its detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. The application to inverse rendering still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by Gaussian primitives. It improves the decomposition quality, at the cost of increasing memory usage and complicating training. Unlike these works, we introduce a discretized SDF to represent the continuous SDF in a discrete manner by encoding it within each Gaussian using a sampled value. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling rendering the SDF via splatting and avoiding the computational cost of ray this http URL key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation can hardly apply the gradient-based constraints (\eg Eikonal loss). For this, we project Gaussians onto the zero-level set of SDF and enforce alignment with the surface from splatting, namely a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher relighting quality, while requiring no extra memory beyond GS and avoiding complex manually designed optimization. The experiments reveal that our method outperforms existing Gaussian-based inverse rendering methods. Our code is available at this https URL.
zh

[CV-25] A Survey on Efficiency Optimization Techniques for DNN-based Video Analytics: Process Systems Algorithms and Applications

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在视频分析中的效率问题,尽管DNNs已被广泛用于提升准确性,但其在实际部署中仍面临计算资源消耗高、延迟大等效率瓶颈。解决方案的关键在于从硬件支持、数据处理到运行部署等多个层次进行系统性优化,提出一个自底向上的优化框架,以全面梳理和整合现有提升DNN效率的技术路径,并深入分析当前性能优化中存在的挑战与未来方向。

链接: https://arxiv.org/abs/2507.15628
作者: Shanjiang Tang,Rui Huang,Hsinyu Luo,Chunjiang Wang,Ce Yu,Yusen Li,Hao Fu,Chao Sun,and Jian Xiao
机构: Institute for Clarity in Documentation (文档清晰度研究所); The Thørväld Group (Thørväld集团); Inria Paris-Rocquencourt (Inria巴黎-罗克库尔研究中心); Rajiv Gandhi University (拉吉夫·甘地大学); Tsinghua University (清华大学); Palmer Research Laboratories (帕尔默研究实验室); The Kumquat Consortium (库姆夸特联盟)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The explosive growth of video data in recent years has brought higher demands for video analytics, where accuracy and efficiency remain the two primary concerns. Deep neural networks (DNNs) have been widely adopted to ensure accuracy; however, improving their efficiency in video analytics remains an open challenge. Different from existing surveys that make summaries of DNN-based video mainly from the accuracy optimization aspect, in this survey, we aim to provide a thorough review of optimization techniques focusing on the improvement of the efficiency of DNNs in video analytics. We organize existing methods in a bottom-up manner, covering multiple perspectives such as hardware support, data processing, operational deployment, etc. Finally, based on the optimization framework and existing works, we analyze and discuss the problems and challenges in the performance optimization of DNN-based video analytics.
zh

[CV-26] CylinderPlane: Nested Cylinder Representation for 3D-aware Image Generation

【速读】:该论文旨在解决基于笛卡尔坐标系的三平面(Tri-plane)表示在生成360°视角图像时存在的特征歧义问题,特别是由于对称区域共享相同特征而导致的多面伪影(multi-face artifacts)。其解决方案的关键在于提出了一种基于柱坐标系(Cylindrical Coordinate System)的新颖隐式表示方法——CylinderPlane,该方法通过显式分离不同角度下的特征,有效消除了特征歧义并保证了多视角一致性。进一步地,论文引入嵌套柱体表示(nested cylinder representation),通过组合不同尺度的多个圆柱体,增强模型对复杂几何结构和多分辨率输入的适应能力,从而提升细节学习能力和鲁棒性。

链接: https://arxiv.org/abs/2507.15606
作者: Ru Jia,Xiaozhuang Ma,Jianji Wang,Nanning Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, to be published

点击查看摘要

Abstract:While the proposal of the Tri-plane representation has advanced the development of the 3D-aware image generative models, problems rooted in its inherent structure, such as multi-face artifacts caused by sharing the same features in symmetric regions, limit its ability to generate 360 ^\circ view images. In this paper, we propose CylinderPlane, a novel implicit representation based on Cylindrical Coordinate System, to eliminate the feature ambiguity issue and ensure multi-view consistency in 360 ^\circ . Different from the inevitable feature entanglement in Cartesian coordinate-based Tri-plane representation, the cylindrical coordinate system explicitly separates features at different angles, allowing our cylindrical representation possible to achieve high-quality, artifacts-free 360 ^\circ image synthesis. We further introduce the nested cylinder representation that composites multiple cylinders at different scales, thereby enabling the model more adaptable to complex geometry and varying resolutions. The combination of cylinders with different resolutions can effectively capture more critical locations and multi-scale features, greatly facilitates fine detail learning and robustness to different resolutions. Moreover, our representation is agnostic to implicit rendering methods and can be easily integrated into any neural rendering pipeline. Extensive experiments on both synthetic dataset and unstructured in-the-wild images demonstrate that our proposed representation achieves superior performance over previous methods.
zh

[CV-27] SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

【速读】:该论文旨在解决稀疏视图下表面重建与新视角渲染的难题,现有基于符号距离函数(Signed Distance Function, SDF)的方法在细节刻画上存在局限,而基于三维高斯泼溅(3D Gaussian Splatting, 3DGS)的方法则缺乏全局几何一致性。其解决方案的关键在于提出一种混合方法:利用SDF捕获粗略几何结构以增强3DGS的渲染效果,同时通过3DGS生成的新视角图像反向优化SDF的细节,从而实现更精确的表面重建与高质量的新视角合成。

链接: https://arxiv.org/abs/2507.15602
作者: Zihui Gao,Jia-Wang Bian,Guosheng Lin,Hao Chen,Chunhua Shen
机构: Zhejiang University (浙江大学); ByteDance Seed (字节跳动种子团队); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines the strengths of both approaches: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine the details of SDF for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. Code will be released at this https URL.
zh

[CV-28] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

【速读】:该论文旨在解决现有视觉-语言-动作模型(Vision-Language-Action model, VLA)在复杂操作任务中因依赖合成数据或有限规模的遥操作示范而导致泛化能力差、灵巧性不足的问题。其关键解决方案在于提出一种物理指令微调(physical instruction tuning)训练范式,该范式融合大规模人类视频预训练、物理空间对齐以实现三维推理,并通过后训练适配实现机器人任务迁移;同时引入基于部件级别的运动标记化方法,在手部轨迹建模中达到毫米级重建精度,从而显著提升动作学习的精细度与实用性。

链接: https://arxiv.org/abs/2507.15597
作者: Hao Luo,Yicheng Feng,Wanpeng Zhang,Sipeng Zheng,Ye Wang,Haoqi Yuan,Jiazheng Liu,Chaoyi Xu,Qin Jin,Zongqing Lu
机构: Peking University (北京大学); Renmin University of China (中国人民大学); BeingBeyond
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 37 pages

点击查看摘要

Abstract:We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources – including motion capture, VR, and RGB-only videos – into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at this https URL.
zh

[CV-29] SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging

【速读】:该论文旨在解决医学图像分割,特别是皮肤病变(skin lesion)分割任务中模型性能与计算效率之间的矛盾问题。现有方法往往在高精度与低延迟之间难以平衡,限制了其在资源受限的临床环境中的部署。解决方案的关键在于提出SegDT模型,其基于扩散变换器(diffusion transformer, DiT)架构,并融合了修正流(Rectified Flow)技术,能够在显著减少推理步数的情况下提升生成质量,同时保持标准扩散模型的灵活性。实验表明,该方法在三个基准数据集上达到当前最优性能,且推理速度更快,适用于真实医疗场景。

链接: https://arxiv.org/abs/2507.15595
作者: Salah Eddine Bekhouche,Gaby Maroun,Fadi Dornaika,Abdenour Hadid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at \hrefthis https URLGitHub.
zh

[CV-30] Compress-Align-Detect: onboard change detection from unregistered images

【速读】:该论文旨在解决卫星遥感影像中变化检测(change detection)因地面站数据传输与处理延迟而导致无法实现近实时应用的问题。其核心挑战在于如何在星载计算资源受限条件下,将完整的变更检测流程部署至卫星平台上,同时应对数据存储、图像配准和变化检测三者间的协同优化难题。解决方案的关键在于提出一个端到端的深度神经网络框架,包含三个相互关联的子模块:面向星载存储优化的图像压缩模块、轻量级非正射校正多时相图像配准模块,以及具有时不变特性和计算高效性的变化检测模型。该方案首次在星载环境下实现了上述任务的统一建模与联合优化,在低功耗硬件上实现了0.7 Mpixel/s的吞吐率,并在F1分数指标下验证了不同压缩率下的有效性能。

链接: https://arxiv.org/abs/2507.15578
作者: Gabriele Inzerillo,Diego Valsesia,Aniello Fiengo,Enrico Magli
机构: Politecnico di Torino – Department of Electronics and Telecommunications (都灵理工大学–电子与电信系); European Space Agency (欧洲空间局)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Change detection from satellite images typically incurs a delay ranging from several hours up to days because of latency in downlinking the acquired images and generating orthorectified image products at the ground stations; this may preclude real- or near real-time applications. To overcome this limitation, we propose shifting the entire change detection workflow onboard satellites. This requires to simultaneously solve challenges in data storage, image registration and change detection with a strict complexity constraint. In this paper, we present a novel and efficient framework for onboard change detection that addresses the aforementioned challenges in an end-to-end fashion with a deep neural network composed of three interlinked submodules: (1) image compression, tailored to minimize onboard data storage resources; (2) lightweight co-registration of non-orthorectified multi-temporal image pairs; and (3) a novel temporally-invariant and computationally efficient change detection model. This is the first approach in the literature combining all these tasks in a single end-to-end framework with the constraints dictated by onboard processing. Experimental results compare each submodule with the current state-of-the-art, and evaluate the performance of the overall integrated system in realistic setting on low-power hardware. Compelling change detection results are obtained in terms of F1 score as a function of compression rate, sustaining a throughput of 0.7 Mpixel/s on a 15W accelerator.
zh

[CV-31] GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation

【速读】:该论文旨在解决传统Mixup数据增强方法在图像分类任务中因像素级线性插值导致生成图像不真实、影响模型学习的问题,尤其是在高风险的医学影像分析场景下更为显著。其核心解决方案是提出GeMix(Generative Mixup),一个两阶段框架:首先利用类条件生成对抗网络(class-conditional GANs)如StyleGAN2-ADA训练生成器以建模数据分布;其次,在增强阶段通过Dirichlet先验采样两个标签向量并结合Beta分布系数进行软标签混合,再将该软标签作为条件输入生成器,从而合成语义一致且沿类别流形连续的图像。此方法实现了从“启发式混合”到“学习型、标签感知插值”的转变,显著提升了图像的真实性与类别区分能力,同时保持了对现有训练流程的兼容性。

链接: https://arxiv.org/abs/2507.15577
作者: Hugo Carlesso,Maria Eliza Patulea,Moncef Garouani,Radu Tudor Ionescu,Josiane Mothe
机构: IRIT, UMR5505 CNRS (IRIT, UMR5505 CNRS); Université de Toulouse (图卢兹大学); University of Bucharest (布加勒斯特大学); Université Toulouse Capitole (图卢兹-卡皮托尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at this https URL to foster reproducibility and further research.
zh

[CV-32] DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding ICCV2025

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视频理解任务中对时序信息整合不足的问题。传统方法将空间与时间信息分别处理,导致快速运动物体的时空特征难以准确捕捉,进而影响模型对关键区域的感知和时空交互能力。其解决方案的关键在于提出一种名为动态图像(Dynamic-Image, DynImg)的新视频表示方法:通过引入一组非关键帧作为时序提示(temporal prompts),引导模型在视觉特征提取阶段重点关注包含快速运动物体的空间区域;同时设计4D视频旋转位置编码(4D video Rotary Position Embedding)以保持DynImg中时空邻接性,从而确保时序顺序正确,提升MLLM对视频中时空结构的理解能力。实验表明,该方法在多个视频理解基准上较现有最优方法提升约2%。

链接: https://arxiv.org/abs/2507.15569
作者: Xiaoyi Bao,Chenwei Xie,Hao Tang,Tingyu Weng,Xiaofeng Wang,Yun Zheng,Xingang Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Alibaba Group (阿里巴巴集团); Peking University (北京大学); Luoyang Institute for Robot and Intelligent Equipment (洛阳机器人与智能装备研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.
zh

[CV-33] HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation ICCV2025

【速读】:该论文旨在解决零样本人-物交互(Human-Object Interaction, HOI)检测中对未见动作类别的泛化能力不足以及相同物体参与的不同动作难以区分的问题。其解决方案的关键在于提出一种基于低秩分解的视觉语言模型(Vision-Language Model, VLM)特征自适应方法——HOLa:首先通过低秩因子分解将VLM文本特征解耦为共享的基础特征与可适配的权重,构建紧凑且保留类别间共性信息的HOI表示,从而提升对未见类别的泛化性能;其次,通过优化每类HOI的权重并引入人-物标记(human-object tokens)增强视觉交互表征,进一步细化动作区分能力;最后,借助大语言模型(Large Language Model, LLM)生成的动作正则化引导权重自适应过程,强化对未见动作的判别能力。

链接: https://arxiv.org/abs/2507.15542
作者: Qinqian Lei,Bo Wang,Robby T. Tan
机构: National University of Singapore (新加坡国立大学); University of Mississippi (密西西比大学); ASUS Intelligent Cloud Services (AICS) (华硕智能云服务)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at this https URL.
zh

[CV-34] owards Holistic Surgical Scene Graph MICCAI2025

【速读】:该论文旨在解决当前基于图结构的外科场景理解方法中对工具-动作-目标(tool-action-target)组合以及操作手身份等关键要素建模不足的问题。这些问题在手术场景中具有重要临床意义,但现有研究尚未充分纳入图表示模型。解决方案的关键在于提出一个新的数据集Endoscapes-SG201,其中包含对工具-动作-目标三元组及操作手身份的精细标注,并设计了一种名为SSG-Com的图神经网络方法,用于显式学习和表示这些核心场景元素。实验表明,引入这些结构化信息显著提升了下游任务(如安全视角评估和动作三元组识别)的性能,验证了其在提升外科场景理解能力中的必要性。

链接: https://arxiv.org/abs/2507.15541
作者: Jongmin Shin,Enki Cho,Ka Yong Kim,Jung Yong Kim,Seong Tae Kim,Namkee Oh
机构: Korea University (韩国科学技术院); Samsung (三星)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025

点击查看摘要

Abstract:Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes-such as diverse combinations of tool-action-target and the identity of the hand operating the tool-remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose Endoscapes-SG201 dataset, which includes annotations for tool-action-target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrated the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at this https URL
zh

[CV-35] Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

【速读】:该论文旨在解决自监督程序学习(self-supervised procedure learning)问题,即从无标签的程序类视频中自动识别关键步骤并确定其顺序。传统方法通常先学习视频间的帧级对应关系,再提取关键步骤及其顺序,但易受动作顺序变化、背景冗余帧和重复动作的影响,导致性能下降。论文提出了一种基于融合Gromov-Wasserstein最优传输(fused Gromov-Wasserstein optimal transport)与结构先验的自监督框架,用于更鲁棒地计算跨视频的帧级映射;为避免优化仅依赖时间对齐项时出现退化解(如所有帧坍缩至嵌入空间中的单一簇),进一步引入对比正则化项(contrastive regularization term),强制不同帧在嵌入空间中分布分散,从而有效防止平凡解,提升关键步骤识别的准确性与鲁棒性。

链接: https://arxiv.org/abs/2507.15540
作者: Syed Ahmed Mahmood,Ali Shah Ali,Umer Ahmed,Fawad Javed Fateh,M. Zeeshan Zia,Quoc-Huy Tran
机构: Retrocausal, Inc.(Retrocausal, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.
zh

[CV-36] SAIGFormer: A Spatially-Adaptive Illumination-Guided Network for Low-Light Image Enhancement

【速读】:该论文旨在解决基于Transformer的低光照增强方法在非均匀光照场景(如逆光和阴影)下表现不佳的问题,具体表现为过曝或亮度恢复不足。其解决方案的关键在于提出一种空间自适应的光照引导Transformer框架(SAIGFormer),核心创新包括:1)设计动态积分图像表示以建模空间变化的光照分布,并构建空间自适应积分光照估计器(SAI²E);2)引入光照引导的多头自注意力机制(IG-MSA),利用光照信息校准与亮度相关的特征,从而实现视觉愉悦的光照增强效果。

链接: https://arxiv.org/abs/2507.15520
作者: Hanting Li,Fei Zhou,Xin Sun,Yang Hua,Jungong Han,Liang-Jie Zhang
机构: Faculty of Data Science, City University of Macau (澳门城市大学数据科学学院); College of Oceanography and Space Informatics, China University of Petroleum (East China) (中国石油大学(华东)海洋与空间信息学院); School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast (贝尔法斯特女王大学电子、电气工程与计算机科学学院); Department of Automation, Tsinghua University (清华大学自动化系); College of Computer Science and Software Engineering, Shenzhen University (深圳大学计算机科学与软件工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Recent Transformer-based low-light enhancement methods have made promising progress in recovering global illumination. However, they still struggle with non-uniform lighting scenarios, such as backlit and shadow, appearing as over-exposure or inadequate brightness restoration. To address this challenge, we present a Spatially-Adaptive Illumination-Guided Transformer (SAIGFormer) framework that enables accurate illumination restoration. Specifically, we propose a dynamic integral image representation to model the spatially-varying illumination, and further construct a novel Spatially-Adaptive Integral Illumination Estimator ( \textSAI^2\textE ). Moreover, we introduce an Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism, which leverages the illumination to calibrate the lightness-relevant features toward visual-pleased illumination enhancement. Extensive experiments on five standard low-light datasets and a cross-domain benchmark (LOL-Blur) demonstrate that our SAIGFormer significantly outperforms state-of-the-art methods in both quantitative and qualitative metrics. In particular, our method achieves superior performance in non-uniform illumination enhancement while exhibiting strong generalization capabilities across multiple datasets. Code is available at this https URL.
zh

[CV-37] Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reason er

【速读】:该论文旨在解决现有基于强化学习微调(Reinforcement Learning Fine-Tuning, RLFT)的方法在图表类多模态数据上的复杂推理能力不足的问题。当前主流的R1-Style方法主要聚焦于数学推理和代码智能,缺乏对图表这种富含信息且需复杂逻辑推理的多模态数据的有效支持。解决方案的关键在于提出Chart-R1模型,其核心创新包括:一是开发了一种程序化数据合成技术,生成高质量、分步式的图表推理数据,覆盖单子图与多子图场景,弥补了该领域标注数据稀缺的问题;二是设计两阶段训练策略——Chart-COT(带链式思维监督的逐步推理)与Chart-RFT(数值敏感的强化微调),其中Chart-RFT采用群体相对策略优化(Group Relative Policy Optimization)并引入软奖励机制以增强对数值响应的敏感性,从而显著提升模型在图表理解与推理任务中的准确性与鲁棒性。

链接: https://arxiv.org/abs/2507.15509
作者: Lei Chen,Xuanle Zhao,Zhixiong Zeng,Jing Huang,Yufeng Zhong,Lin Ma
机构: Meituan(美团); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: technical report

点击查看摘要

Abstract:Recently, inspired by OpenAI-o1/o3 and Deepseek-R1, the R1-Style method based on reinforcement learning fine-tuning has received widespread attention from the community. Previous R1-Style methods mainly focus on mathematical reasoning and code intelligence. It is of great research significance to verify their advantages on more general multimodal data. Chart is an important multimodal data type with rich information, which brings important research challenges in complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technology to generate high-quality step-by-step chart reasoning data covering single- and multi-subcharts, which makes up for the lack of reasoning data in the chart domain. Then we develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving the reasoning level of reinforcement learning. Chart-RFT utilize the typical group relative policy optimization strategy, in which a relatively soft reward is adopted for numerical response to emphasize the numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and self-built chart reasoning dataset (\emphi.e., ChartRQA). Experimental results show that Chart-R1 has significant advantages compared to chart-domain methods, even comparable to open/closed source large-scale models (\emphe.g., GPT-4o, Claude-3.5).
zh

[CV-38] Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization ICCV2025

【速读】:该论文旨在解决文本到视频检索(Text-to-video retrieval, TVR)中因文本查询模糊、文本与视频映射不明确以及视频帧质量低等因素导致的固有不确定性问题。现有交互式方法通常依赖启发式策略生成澄清问题,缺乏对不确定性的显式量化,从而限制了其效果。解决方案的关键在于提出UMIVR框架,通过三个无需训练的度量指标显式量化三类关键不确定性:基于语义熵的文本模糊度评分(Text Ambiguity Score, TAS)、基于Jensen-Shannon散度的映射不确定性评分(Mapping Uncertainty Score, MUS)以及基于时序质量的帧采样器(Temporal Quality-based Frame Sampler, TQFS),并据此自适应生成针对性澄清问题,迭代优化用户查询,显著降低检索歧义,实验表明在MSR-VTT-1k数据集上经过10轮交互后Recall@1达到69.2%,验证了其有效性。

链接: https://arxiv.org/abs/2507.15504
作者: Bingqing Zhang,Zhuo Cao,Heming Du,Yang Li,Xue Li,Jiajun Liu,Sen Wang
机构: The University of Queensland (昆士兰大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Despite recent advances, Text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties-text ambiguity, mapping uncertainty, and frame uncertainty-via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR’s effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
zh

[CV-39] Dense-depth map guided deep Lidar-Visual Odometry with Sparse Point Clouds and Images

【速读】:该论文旨在解决自主系统中位姿估计(pose estimation)的准确性与鲁棒性问题,特别是在复杂动态环境和存在遮挡区域时的性能瓶颈。解决方案的关键在于提出了一种融合激光雷达(LiDAR)点云与图像信息的新型LiDAR-Visual里程计(LiDAR-Visual Odometry)框架:首先通过深度补全(depth completion)生成稠密深度图,结合多尺度特征提取网络与注意力机制实现自适应的深度感知表征;其次利用稠密深度信息优化光流估计,降低遮挡区域误差;最后引入分层位姿精修模块,逐步优化运动估计,有效应对动态场景干扰与尺度模糊问题。

链接: https://arxiv.org/abs/2507.15496
作者: JunYing Huang,Ao Xu,DongSun Yong,KeRen Li,YuanFeng Wang,Qi Qin
机构: Shenzhen University (深圳大学); Tsinghua University in Shenzhen (清华大学深圳研究院); YunJiZhiHui Engineering Co., Ltd. (深圳市云际智汇工程有限公司); Quantum Science Center of the Guangdong-Hong Kong-Macao Greater Bay Area (粤港澳大湾区量子科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Odometry is a critical task for autonomous systems for self-localization and navigation. We propose a novel LiDAR-Visual odometry framework that integrates LiDAR point clouds and images for accurate and robust pose estimation. Our method utilizes a dense-depth map estimated from point clouds and images through depth completion, and incorporates a multi-scale feature extraction network with attention mechanisms, enabling adaptive depth-aware representations. Furthermore, we leverage dense depth information to refine flow estimation and mitigate errors in occlusion-prone regions. Our hierarchical pose refinement module optimizes motion estimation progressively, ensuring robust predictions against dynamic environments and scale ambiguities. Comprehensive experiments on the KITTI odometry benchmark demonstrate that our approach achieves similar or superior accuracy and robustness compared to state-of-the-art visual and LiDAR odometry methods.
zh

[CV-40] GR-3 Technical Report

【速读】:该论文旨在解决构建通用机器人策略(generalist robot policies)的挑战,即开发一种能够适应多种新物体、环境和抽象指令,并具备高效迁移能力的机器人控制系统。解决方案的关键在于提出GR-3——一个大规模视觉-语言-动作(Vision-Language-Action, VLA)模型,其核心优势包括:通过网络规模的视觉-语言数据进行联合训练以增强泛化能力;利用VR设备收集的少量人类轨迹数据实现高效微调;结合机器人轨迹数据进行有效的模仿学习,从而在长时程和高灵巧任务(如双臂操作与移动协同)中表现出鲁棒性能。此外,论文还配套设计了ByteMini机器人平台,进一步验证了GR-3在真实场景中的广泛适用性与优越性。

链接: https://arxiv.org/abs/2507.15493
作者: Chilam Cheang,Sijin Chen,Zhongren Cui,Yingdong Hu,Liqun Huang,Tao Kong,Hang Li,Yifeng Li,Yuxiao Liu,Xiao Ma,Hao Niu,Wenxuan Ou,Wanli Peng,Zeyu Ren,Haixin Shi,Jiawen Tian,Hongtao Wu,Xin Xiao,Yuyang Xiao,Jiafeng Xu,Yichu Yang
机构: ByteDance(字节跳动)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report. Authors are listed in alphabetical order. Project page: this https URL

点击查看摘要

Abstract:We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, \pi_0 , on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.
zh

[CV-41] An aerial color image anomaly dataset for search missions in complex forested terrain

【速读】:该论文旨在解决在复杂森林环境中难以检测微小异常(如犯罪线索)的问题,尤其是在植被遮蔽导致传统自动化分析失效的情况下。其解决方案的关键在于通过大规模人群协作(crowd-search)获取一个标注详尽、真实场景下难检测的异常数据集,该数据集可作为基准用于评估和改进异常检测算法,并辅以交互式网络界面支持在线标注与动态扩展,从而推动面向实际应用的上下文感知型检测方法的发展。

链接: https://arxiv.org/abs/2507.15492
作者: Rakesh John Amala Arokia Nathan,Matthias Gessner,Nurullah Özkan,Marius Bock,Mohamed Youssef,Maximilian Mews,Björn Piltz,Ralf Berger,Oliver Bimber
机构: Johannes Kepler University (约翰尼斯·开普勒大学); German Aerospace Center (德国航空航天中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:After a family murder in rural Germany, authorities failed to locate the suspect in a vast forest despite a massive search. To aid the search, a research aircraft captured high-resolution aerial imagery. Due to dense vegetation obscuring small clues, automated analysis was ineffective, prompting a crowd-search initiative. This effort produced a unique dataset of labeled, hard-to-detect anomalies under occluded, real-world conditions. It can serve as a benchmark for improving anomaly detection approaches in complex forest environments, supporting manhunts and rescue operations. Initial benchmark tests showed existing methods performed poorly, highlighting the need for context-aware approaches. The dataset is openly accessible for offline processing. An additional interactive web interface supports online viewing and dynamic growth by allowing users to annotate and submit new findings.
zh

[CV-42] Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval

【速读】:该论文旨在解决边缘端设备上文本-视频检索任务中准确率与计算效率难以平衡的问题。现有方法要么采用均匀帧采样以保证内容覆盖但计算开销大,要么依赖显著帧采样降低负担却因查询无关的帧选择导致检索偏差。其解决方案的关键在于提出一种用户中心框架ProCLIP,核心创新包括:(1)设计提示感知的帧采样策略,通过文本提示动态引导轻量级特征提取器选择语义相关帧,克服传统显著帧采样方法静态、查询无关的局限性;(2)采用两阶段候选过滤策略,结合轻量模块进行快速粗筛与基于CLIP的细粒度重排序,实现高效且高精度的检索。实验表明,ProCLIP在保持竞争性准确率(MSR-VTT数据集上R@1=49.0)的同时,相较基线实现75.3%的延迟降低。

链接: https://arxiv.org/abs/2507.15491
作者: Deyu Zhang,Tingting Long,Jinrui Zhang,Ligeng Chen,Ju Ren,Yaoxue Zhang
机构: Central South University (中南大学); Tsinghua University (清华大学); Honor Device Co., Ltd (荣耀终端有限公司)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 in MSR-VTT dataset. Code is available at this https URL.
zh

[CV-43] One Last Attention for Your Vision-Language Model ICCV2025

【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在下游任务微调过程中,现有方法通常仅关注单模态表示(文本或视觉)的优化,而忽视了决策过程中关键的融合表示(即理性矩阵,rational matrix)的作用,从而限制了模型性能提升的问题。其解决方案的关键在于提出一种简单但有效的理性适配(Rational Adaptation, RAda)策略:通过在VLM末端引入一个轻量级注意力层学习得到一个可训练掩码(learned mask),动态校准理性矩阵中每个元素的贡献,从而实现对最终跨模态交互的精准调整,且无需修改中间特征,显著提升了微调效率与效果。

链接: https://arxiv.org/abs/2507.15480
作者: Liang Chen,Ghazi Shazan Ahmad,Tianjun Yao,Lingqiao Liu,Zhiqiang Shen
机构: Tongji University (同济大学); MBZUAI; The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, \emph\ie rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective \textbfRational \textbfAdaptaion (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating, or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at \hrefthis https URLthis http URL.
zh

[CV-44] ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting ICCV2025

【速读】:该论文旨在解决3D Gaussian Splatting(3DGS)在场景重建中缺乏语义理解的问题,从而限制了其在对象级感知任务中的应用。其核心解决方案是提出ObjectGS框架,通过将场景建模为以个体物体为局部锚点的结构,每个锚点生成神经高斯分布并共享对象ID,实现对象级别的精确重建与语义约束。关键创新在于动态生长或剪枝锚点、优化特征表示,并引入one-hot ID编码结合分类损失,强制实现清晰的语义分割边界,从而在开放词汇和全景分割任务上显著优于现有方法,并支持网格提取与场景编辑等下游应用。

链接: https://arxiv.org/abs/2507.15454
作者: Ruijie Zhu,Mulin Yu,Linning Xu,Lihan Jiang,Yixuan Li,Tianzhu Zhang,Jiangmiao Pang,Bo Dai
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); The University of Hong Kong (香港大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: this https URL
zh

[CV-45] Low-Latency Event-Based Velocimetry for Quadrotor Control in a Narrow Pipe

【速读】:该论文旨在解决无人机在狭窄管道等受限空间中悬停时因自诱导湍流扰动导致的飞行稳定性差的问题。传统方法要么依赖持续运动以缓解气流回流效应,要么在悬停状态下稳定性不足。其解决方案的关键在于构建了一个闭环控制系统,该系统利用实时流场测量信息进行扰动补偿:首先开发了一种低延迟、基于事件触发的烟雾测速技术,以高时间分辨率估计局部气流;随后采用基于循环卷积神经网络的扰动估计器实时推断作用力与力矩扰动,并将这些扰动信息融入通过强化学习训练的控制器中。该方法显著提升了无人机在管道截面横向移动过程中的抗扰能力,有效避免了与管壁碰撞,是首个实现由实时流场感知驱动的空中机器人闭环控制的研究,为复杂气动环境下的飞行控制开辟了新方向。

链接: https://arxiv.org/abs/2507.15444
作者: Leonard Bauersfeld,Davide Scaramuzza
机构: University of Zurich (苏黎世大学); European Union’s Horizon Europe Research and Innovation Programme (欧盟地平线欧洲研究与创新计划); European Research Council (欧洲研究委员会)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages

点击查看摘要

Abstract:Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotors for hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. The flow-feedback control proves particularly effective during lateral translation maneuvers in the pipe cross-section. There, the real-time disturbance information enables the controller to effectively counteract transient aerodynamic effects, thereby preventing collisions with the pipe wall. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements. This opens new directions for research on flight in aerodynamically complex environments. In addition, our work also sheds light on the characteristic flow structures that emerge during flight in narrow, circular pipes, providing new insights at the intersection of robotics and fluid dynamics.
zh

[CV-46] EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

【速读】:该论文旨在解决第一人称视角视频(egomotion video)推理效率低下的问题,尤其是在资源受限的边缘设备上部署时,传统视觉-语言模型因计算成本高、冗余信息多而难以满足实时性需求。其核心解决方案是提出一种无需训练的token剪枝方法 EgoPrune,关键在于:1)基于嵌入式机器人(EmbodiedR)改进的关键帧选择器,实现时间上的高效采样;2)引入视角感知冗余过滤(Perspective-Aware Redundancy Filtering, PARF),利用透视变换对齐视觉 token 并去除冗余;3)采用最大边际相关性(Maximal Marginal Relevance, MMR)策略,在保留视觉-文本语义相关性的同时兼顾帧内多样性。该方法显著降低浮点运算次数(FLOPs)、内存占用和延迟,且已在 Jetson Orin NX 边缘设备上验证其实用性与高效性。

链接: https://arxiv.org/abs/2507.15428
作者: Jiaao Li,Kaiyuan Li,Chen Gao,Yong Li,Xinlei Chen
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Egomotion videos are first-person recordings where the view changes continuously due to the agent’s movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.
zh

[CV-47] SurgX: Neuron-Concept Association for Explainable Surgical Phase Recognition MICCAI2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 在手术阶段识别(surgical phase recognition)任务中模型决策过程缺乏可解释性的问题,这一问题限制了医生对模型的信任以及模型调试的效率。解决方案的关键在于提出 SurgX——一种基于概念的解释框架,通过将神经网络中的神经元与临床相关的概念进行关联,从而增强模型的可解释性;具体包括:选取代表性神经元示例序列、构建面向手术视频数据集的概念集合、建立神经元与概念的映射关系,并识别对预测至关重要的神经元,最终实现对模型预测逻辑的可视化和分析。

链接: https://arxiv.org/abs/2507.15418
作者: Ka Young Kim,Hyeon Bae Kim,Seong Tae Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2025

点击查看摘要

Abstract:Surgical phase recognition plays a crucial role in surgical workflow analysis, enabling various applications such as surgical monitoring, skill assessment, and workflow optimization. Despite significant advancements in deep learning-based surgical phase recognition, these models remain inherently opaque, making it difficult to understand how they make decisions. This lack of interpretability hinders trust and makes it challenging to debug the model. To address this challenge, we propose SurgX, a novel concept-based explanation framework that enhances the interpretability of surgical phase recognition models by associating neurons with relevant concepts. In this paper, we introduce the process of selecting representative example sequences for neurons, constructing a concept set tailored to the surgical video dataset, associating neurons with concepts and identifying neurons crucial for predictions. Through extensive experiments on two surgical phase recognition models, we validate our method and analyze the explanation for prediction. This highlights the potential of our method in explaining surgical phase recognition. The code is available at this https URL
zh

[CV-48] Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond

【速读】:该论文旨在解决面部表情识别(Facial Expression Recognition, FER)中因普遍存在的遮挡(occlusion)和数据集偏差导致的特征提取困难与分类不准问题。针对这一挑战,其解决方案的关键在于:1)引入辅助多模态语义引导机制,利用语义分割图作为密集语义先验以生成语义增强的面部表征,并结合稀疏的面部关键点作为几何先验以缓解身份和性别等内在噪声;2)设计多尺度交叉交互模块(Multi-scale Cross-interaction Module, MCM),实现不同尺度下关键点特征与语义增强表征的有效融合;3)提出动态对抗排斥增强损失(Dynamic Adversarial Repulsion Enhancement Loss, DARELoss),通过动态调整模糊类别间的间隔来提升模型对相似表情的区分能力。上述方法共同提升了模型在真实世界复杂遮挡场景下的鲁棒性与识别性能。

链接: https://arxiv.org/abs/2507.15401
作者: Huiyu Zhai,Xingxing Yang,Yalan Ye,Chenyang Li,Bin Fan,Changze Li
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model’s ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at this https URL.
zh

[CV-49] Blended Point Cloud Diffusion for Localized Text-guided Shape Editing ICCV2025

【速读】:该论文旨在解决在局部编辑点云表示的3D形状时难以保持全局结构一致性的问题。现有方法在实现细粒度局部修改时,常导致整体形状失真或语义不一致。解决方案的关键在于提出一种基于图像修复(inpainting)的框架,利用基础3D扩散模型进行局部编辑,并引入部分条件形状作为结构引导,确保未编辑区域维持原始形状身份;同时,设计了一种推理阶段的坐标融合算法(coordinate blending algorithm),在噪声逐步降低的过程中动态平衡完整形状重建与局部修复,从而无需计算昂贵且易出错的逆向过程即可实现高质量、细粒度的3D形状编辑。

链接: https://arxiv.org/abs/2507.15399
作者: Etai Sella,Noam Atia,Ron Mokady,Hadar Averbuch-Elor
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025. Project Page: this https URL

点击查看摘要

Abstract:Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape’s identity. Furthermore, to encourage identity preservation also within the local edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling a fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and also adherence to the textual description.
zh

[CV-50] o Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models ICCV2025

【速读】:该论文旨在解决传统主动学习(Active Learning, AL)评估方法仅关注最终准确率、忽视学习过程动态性的局限性。为填补这一空白,作者提出PALM(Performance Analysis of Active Learning Models),其关键在于构建一个统一且可解释的数学模型,通过四个核心参数——可达准确率、覆盖效率、早期阶段性能和可扩展性——量化AL轨迹。该模型能基于部分标注数据预测完整的学习曲线,从而实现对不同AL策略的系统性比较与成本效益优化,显著提升AL在资源受限场景下的评估精度与实用性。

链接: https://arxiv.org/abs/2507.15381
作者: Julia Machnio,Mads Nielsen,Mostafa Mehdipour Ghazi
机构: Pioneer Centre for AI (先锋人工智能中心); University of Copenhagen (哥本哈根大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications. The code is available at: this https URL.
zh

[CV-51] DAViD: Data-efficient and Accurate Vision Models from Synthetic Data ICCV2025

【速读】:该论文旨在解决当前以人为中心的计算机视觉模型训练中对大规模真实数据依赖性强、成本高昂且存在数据合规性风险的问题。其关键解决方案是采用高保真度的合成数据进行模型训练,通过程序化数据生成技术实现对数据多样性的精确控制,从而在不牺牲准确性的前提下显著降低训练与推理成本,并提升模型公平性和数据使用合规性。

链接: https://arxiv.org/abs/2507.15365
作者: Fatemeh Saleh,Sadegh Aliakbarian,Charlie Hewitt,Lohit Petikam,Xiao-Xian,Antonio Criminisi,Thomas J. Cashman,Tadas Baltrušaitis
机构: Microsoft(微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025

点击查看摘要

Abstract:The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control on data diversity, that we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at this https URL.
zh

[CV-52] RoadFusion: Latent Diffusion Model for Pavement Defect Detection

【速读】:该论文旨在解决道路缺陷检测中面临的三大关键挑战:标注数据有限、训练与部署环境之间的域偏移(domain shift),以及不同道路条件下缺陷外观的高度变异性。解决方案的核心在于提出RoadFusion框架,其创新性地结合了基于文本提示和空间掩码的潜在扩散模型(latent diffusion model)以生成多样且逼真的合成缺陷,从而缓解数据稀缺问题;同时引入双路径特征适配器(dual-path feature adaptation),分别优化正常与异常输入的特征表示,提升对域偏移和缺陷变异性的鲁棒性;此外,轻量级判别器在局部patch级别学习细粒度缺陷模式,增强检测精度。该方法在六个基准数据集上实现了分类与定位任务的一致优异性能,多项指标达到当前最优水平。

链接: https://arxiv.org/abs/2507.15346
作者: Muhammad Aqeel,Kidus Dagnaw Bellete,Francesco Setti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIAP 2025

点击查看摘要

Abstract:Pavement defect detection faces critical challenges including limited annotated data, domain shift between training and deployment environments, and high variability in defect appearances across different road conditions. We propose RoadFusion, a framework that addresses these limitations through synthetic anomaly generation with dual-path feature adaptation. A latent diffusion model synthesizes diverse, realistic defects using text prompts and spatial masks, enabling effective training under data scarcity. Two separate feature adaptors specialize representations for normal and anomalous inputs, improving robustness to domain shift and defect variability. A lightweight discriminator learns to distinguish fine-grained defect patterns at the patch level. Evaluated on six benchmark datasets, RoadFusion achieves consistently strong performance across both classification and localization tasks, setting new state-of-the-art in multiple metrics relevant to real-world road inspection.
zh

[CV-53] ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis

【速读】:该论文旨在解决工业缺陷检测系统在单类异常检测(One-Class Anomaly Detection)范式下所面临的两大核心问题:一是假设异常分布均匀导致的模型泛化能力不足,二是真实制造环境中标注数据稀缺带来的训练困难。解决方案的关键在于提出ExDD(Explicit Dual Distribution)框架,其创新性地显式建模正常与异常两类特征分布,通过并行记忆库分别捕获两者独特的统计特性,从而打破对异常分布均匀性的错误假设;同时引入具有领域特定文本条件的潜在扩散模型(Latent Diffusion Models),生成符合工业场景的合成缺陷样本以缓解数据稀缺问题,并结合邻域感知比率评分机制融合多种距离度量,有效增强在既偏离正常模式又贴近已知缺陷区域的信号响应强度。

链接: https://arxiv.org/abs/2507.15335
作者: Muhammad Aqeel,Federico Leonardi,Francesco Setti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICIAP 2025

点击查看摘要

Abstract:Industrial defect detection systems face critical limitations when confined to one-class anomaly detection paradigms, which assume uniform outlier distributions and struggle with data scarcity in realworld manufacturing environments. We present ExDD (Explicit Dual Distribution), a novel framework that transcends these limitations by explicitly modeling dual feature distributions. Our approach leverages parallel memory banks that capture the distinct statistical properties of both normality and anomalous patterns, addressing the fundamental flaw of uniform outlier assumptions. To overcome data scarcity, we employ latent diffusion models with domain-specific textual conditioning, generating in-distribution synthetic defects that preserve industrial context. Our neighborhood-aware ratio scoring mechanism elegantly fuses complementary distance metrics, amplifying signals in regions exhibiting both deviation from normality and similarity to known defect patterns. Experimental validation on KSDD2 demonstrates superior performance (94.2% I-AUROC, 97.7% P-AUROC), with optimal augmentation at 100 synthetic samples.
zh

[CV-54] BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

【速读】:该论文旨在解决当前深度估计(Depth Estimation)领域中基础模型(Depth Foundation Models, DFMs)评估协议不一致的问题,传统基准测试依赖于对齐(alignment-based)指标,易引入偏差、偏好特定深度表示形式,并阻碍公平比较。其解决方案的关键在于提出BenchDepth这一新基准,通过五个精心选取的下游代理任务(proxy tasks)——深度补全(depth completion)、立体匹配(stereo matching)、单目前馈式三维场景重建(monocular feed-forward 3D scene reconstruction)、SLAM(Simultaneous Localization and Mapping)以及视觉-语言空间理解(vision-language spatial understanding)——来评估DFMs的实际应用价值,从而绕过有争议的对齐步骤,实现更客观、实用的模型性能衡量。

链接: https://arxiv.org/abs/2507.15321
作者: Zhenyu Li,Haotong Lin,Jiashi Feng,Peter Wonka,Bingyi Kang
机构: KAUST; ByteDance Seed; Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage: this https URL

点击查看摘要

Abstract:Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.
zh

[CV-55] Few-Shot Object Detection via Spatial-Channel State Space Model

【速读】:该论文针对少样本目标检测(Few-Shot Object Detection, FSOD)中因训练样本有限导致特征提取不准确的问题,特别是现有方法在通道层面难以有效区分重要与非重要特征通道的现象——即高权重通道未必有效、低权重通道仍具价值。解决方案的关键在于引入通道间相关性建模机制,通过空间-通道状态空间建模(Spatial-Channel State Space Modeling, SCSM)模块实现对特征通道的精细化调控:其中,空间特征建模(SFM)模块平衡空间与通道关系的学习,而基于Mamba架构的通道状态建模(CSM)模块则利用其对一维序列的高效建模能力,捕捉通道间的动态依赖关系,从而精准突出有效通道并修正无效通道,显著提升特征表示质量,在VOC和COCO数据集上达到当前最优性能。

链接: https://arxiv.org/abs/2507.15308
作者: Zhimeng Xin,Tianxu Wu,Yixiong Zou,Shiming Chen,Dingjie Fu,Xinge You
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Due to the limited training samples in few-shot object detection (FSOD), we observe that current methods may struggle to accurately extract effective features from each channel. Specifically, this issue manifests in two aspects: i) channels with high weights may not necessarily be effective, and ii) channels with low weights may still hold significant value. To handle this problem, we consider utilizing the inter-channel correlation to facilitate the novel model’s adaptation process to novel conditions, ensuring the model can correctly highlight effective channels and rectify those incorrect ones. Since the channel sequence is also 1-dimensional, its similarity with the temporal sequence inspires us to take Mamba for modeling the correlation in the channel sequence. Based on this concept, we propose a Spatial-Channel State Space Modeling (SCSM) module for spatial-channel state modeling, which highlights the effective patterns and rectifies those ineffective ones in feature channels. In SCSM, we design the Spatial Feature Modeling (SFM) module to balance the learning of spatial relationships and channel relationships, and then introduce the Channel State Modeling (CSM) module based on Mamba to learn correlation in channels. Extensive experiments on the VOC and COCO datasets show that the SCSM module enables the novel detector to improve the quality of focused feature representation in channels and achieve state-of-the-art performance.
zh

[CV-56] Minutiae-Anchored Local Dense Representation for Fingerprint Matching

【速读】:该论文旨在解决在多种采集条件下(如滚印、平面印、部分指纹、非接触式及潜在指纹)进行指纹匹配时面临的鲁棒性和准确性难题。其解决方案的关键在于提出一种基于细节特征(minutiae)锚定的局部密集表示方法(DMD),该方法通过在每个检测到的细节点中心及其方向上提取局部图像块,构建一个三维张量表示:其中两个维度对应指纹平面的空间位置,第三个维度编码语义特征。这种结构化表示能够同时捕捉细粒度的纹线纹理和判别性细节特征,并通过多级细粒度描述聚合多个细节点及其周围纹线结构的信息。此外,由于DMD与图像块具有强空间对应关系,可利用前景分割掩码识别有效描述区域,从而在匹配过程中仅在重叠前景区域内进行比较,显著提升了效率与鲁棒性。

链接: https://arxiv.org/abs/2507.15297
作者: Zhiyu Pan,Xiongjun Guan,Yongjie Duan,Jianjiang Feng,Jie Zhou
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Fingerprint matching under diverse capture conditions remains a fundamental challenge in biometric recognition. To achieve robust and accurate performance in such scenarios, we propose DMD, a minutiae-anchored local dense representation which captures both fine-grained ridge textures and discriminative minutiae features in a spatially structured manner. Specifically, descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor, where two dimensions represent spatial locations on the fingerprint plane and the third encodes semantic features. This representation explicitly captures abstract features of local image patches, enabling a multi-level, fine-grained description that aggregates information from multiple minutiae and their surrounding ridge structures. Furthermore, thanks to its strong spatial correspondence with the patch image, DMD allows for the use of foreground segmentation masks to identify valid descriptor regions. During matching, comparisons are then restricted to overlapping foreground areas, improving efficiency and robustness. Extensive experiments on rolled, plain, parital, contactless, and latent fingerprint datasets demonstrate the effectiveness and generalizability of the proposed method. It achieves state-of-the-art accuracy across multiple benchmarks while maintaining high computational efficiency, showing strong potential for large-scale fingerprint recognition. Corresponding code is available at this https URL.
zh

[CV-57] In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems

【速读】:该论文旨在解决生物特征识别系统中物理和数字攻击检测的泛化能力不足问题,尤其是传统深度学习模型在面对新型攻击或环境变化时性能下降、且依赖大量标注数据的问题。解决方案的关键在于提出一种基于视觉语言模型(Vision Language Models, VLM)的上下文学习(in-context learning)框架,利用开放源代码模型在无需资源密集型训练的情况下实现对物理呈现攻击和数字变形攻击的有效检测,从而提升系统在安全关键场景下的适应性和鲁棒性。

链接: https://arxiv.org/abs/2507.15285
作者: Lazaro Janier Gonzalez-Soler,Maciej Salwowski,Christoph Busch
机构: da/sec - Biometrics and Security Research Group (da/sec - 生物特征与安全研究组); Technical University of Denmark (丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE-TIFS

点击查看摘要

Abstract:Recent advances in biometric systems have significantly improved the detection and prevention of fraudulent activities. However, as detection methods improve, attack techniques become increasingly sophisticated. Attacks on face recognition systems can be broadly divided into physical and digital approaches. Traditionally, deep learning models have been the primary defence against such attacks. While these models perform exceptionally well in scenarios for which they have been trained, they often struggle to adapt to different types of attacks or varying environmental conditions. These subsystems require substantial amounts of training data to achieve reliable performance, yet biometric data collection faces significant challenges, including privacy concerns and the logistical difficulties of capturing diverse attack scenarios under controlled conditions. This work investigates the application of Vision Language Models (VLM) and proposes an in-context learning framework for detecting physical presentation attacks and digital morphing attacks in biometric systems. Focusing on open-source models, the first systematic framework for the quantitative evaluation of VLMs in security-critical scenarios through in-context learning techniques is established. The experimental evaluation conducted on freely available databases demonstrates that the proposed subsystem achieves competitive performance for physical and digital attack detection, outperforming some of the traditional CNNs without resource-intensive training. The experimental results validate the proposed framework as a promising tool for improving generalisation in attack detection.
zh

[CV-58] Conditional Video Generation for High-Efficiency Video Compression

【速读】:该论文旨在解决传统视频压缩方法在高压缩比下难以保持人类视觉感知质量的问题。现有方法往往侧重于重建误差最小化,但忽略了人眼对视频内容的感知特性,导致压缩后视频在主观观感上存在明显失真。解决方案的关键在于将视频压缩重构问题重新建模为一个条件生成任务,利用条件扩散模型(conditional diffusion models)实现基于人类视觉感知优化的视频重建。具体而言,其核心创新包括:(1) 多粒度条件机制(Multi-granular conditioning),同时捕捉静态场景结构与动态时空特征;(2) 高效紧凑表示(Compact representations),在保证语义丰富性的前提下降低传输开销;(3) 多条件训练策略(Multi-condition training),通过模态丢弃和角色感知嵌入(role-aware embeddings)增强模型对不同输入模态的鲁棒性,避免单一模态依赖。实验表明,该方法在高压缩比下显著优于传统及神经编码器,在FVD和LPIPS等感知质量指标上表现优异。

链接: https://arxiv.org/abs/2507.15269
作者: Fangqiu Yi,Jingyu Xu,Jiawei Shao,Chi Zhang,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
zh

[CV-59] MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP ICCV2025

【速读】:该论文旨在解决图像到点云(Image-to-point-cloud, I2P)配准中因预测对应点存在噪声和异常值而导致的对应关系学习效果不佳的问题。现有基于微分PnP(differential Perspective-n-Point)的方法虽然能施加2D-3D投影约束,但对噪声和异常值敏感,限制了对应学习的有效性。解决方案的关键在于提出一种近似盲PnP(approximated blind PnP)的对应学习方法——MinCD-PnP,其核心思想是将原本计算复杂度高的盲PnP简化为最小化学习到的2D与3D关键点之间的Chamfer距离(Chamfer Distance, CD),从而在保持鲁棒性的同时显著降低计算开销。为高效求解MinCD-PnP,作者进一步设计了一个轻量级多任务学习模块MinCD-Net,可无缝集成至现有I2P配准网络架构中,实验表明该方法在跨场景和跨数据集设置下均实现了更高的内点率(Inlier Ratio, IR)和配准召回率(Registration Recall, RR)。

链接: https://arxiv.org/abs/2507.15257
作者: Pei An,Jiaqi Yang,Muyao Peng,You Yang,Qiong Liu,Xiaolin Wu,Liangliang Nan
机构: Huazhong University of Science and Technology (华中科技大学); Northwestern Polytechnical University (西北工业大学); McMaster University (麦克马斯特大学); Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. The differential perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing the projective constraints on 2D-3D correspondences. However, differential PnP is highly sensitive to noise and outliers in the predicted correspondences. This issue hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP against noise and outliers in correspondences, we propose an approximated blind PnP based correspondence learning approach. To mitigate the high computational cost of blind PnP, we simplify blind PnP to an amenable task of minimizing Chamfer distance between learned 2D and 3D keypoints, called MinCD-PnP. To effectively solve MinCD-PnP, we design a lightweight multi-task learning module, named as MinCD-Net, which can be easily integrated into the existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings.
zh

[CV-60] FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers ICCV2025

【速读】:该论文旨在解决当前基于扩散变压器(Diffusion Transformers, DiT)的文本到图像(Text-to-Image, T2I)生成技术在主体驱动(subject-driven)图像合成中面临的两大挑战:一是现有方法通常依赖于针对每个主体进行训练(如可学习文本嵌入或专用编码器),限制了其实际应用的灵活性;二是未能充分挖掘DiT模型本身具备的零样本(zero-shot)潜力,导致无法实现跨场景下一致且高质量的主体保留合成。解决方案的关键在于提出一个真正无需训练的框架FreeCus,其核心创新包括:1)引入关键的注意力共享机制,在保持主体布局完整性的同时保留编辑灵活性;2)通过分析DiT中的动态偏移机制,设计改进版本以增强细粒度特征提取能力;3)融合先进的多模态大语言模型(Multimodal Large Language Models, MLLMs)提升跨模态语义表示能力。实验表明,该方法成功激活了DiT的零样本主体一致性生成能力,性能达到或超越需额外训练的方法,并兼容现有图像修复和控制模块,显著提升了生成体验。

链接: https://arxiv.org/abs/2507.15249
作者: Yanbing Zhang,Zhe Wang,Qin Zhou,Mengping Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT’s capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject’s layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT’s dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT’s zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: this https URL.
zh

[CV-61] Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation

【速读】:该论文旨在解决跨域少样本学习(Cross-Domain Few-Shot Learning, CD-FSL)中因标注样本稀缺导致模型在更新大量Transformer参数时易过拟合的问题。其解决方案的关键在于提出一种新的概念——凝聚投影(Coalescent Projection, CP),作为软提示(soft prompts)的有效替代,并结合仅依赖基础域(base domain)的伪类别生成方法与自监督变换(Self-Supervised Transformations, SSTs),以增强模型对未见域样本的适应能力。该方法在BSCD-FSL基准极端域偏移场景下展现出显著有效性。

链接: https://arxiv.org/abs/2507.15243
作者: Naeem Paeedeh,Mahardhika Pratama,Wolfgang Mayer,Jimmy Cao,Ryszard Kowlczyk
机构: STEM, University of South Australia, Australia(南澳大学); Systems Research Institute, Polish Academy of Sciences, Poland(波兰科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at this https URL.
zh

[CV-62] Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders

【速读】:该论文旨在解决高风险领域(如医学影像)中深度学习模型决策过程缺乏可解释性的问题,从而促进临床采纳。其解决方案的关键在于引入基于稀疏自编码器(Sparse Autoencoder, SAE)的可解释性分析方法,通过对预训练于大规模乳腺X线图像-报告对的视觉-语言基础模型Mammo-CLIP进行patch级SAE建模,识别并探究与临床相关乳腺概念(如肿块和可疑钙化)相关的潜在特征。研究发现,SAE潜空间中激活度最高的类别级神经元常与真实病灶区域对齐,并揭示了影响模型决策的若干混杂因素;同时进一步分析了下游微调过程中模型依赖的潜在神经元,验证了SAE潜表示在解析基础模型各层内部工作机制方面的潜力。

链接: https://arxiv.org/abs/2507.15227
作者: Krishna Kanth Nakka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Interpretability is critical in high-stakes domains such as medical imaging, where understanding model decisions is essential for clinical adoption. In this work, we introduce Sparse Autoencoder (SAE)-based interpretability to breast imaging by analyzing Mammo-CLIP, a vision–language foundation model pretrained on large-scale mammogram image–report pairs. We train a patch-level \textttMammo-SAE on Mammo-CLIP to identify and probe latent features associated with clinically relevant breast concepts such as \textitmass and \textitsuspicious calcification. Our findings reveal that top activated class level latent neurons in the SAE latent space often tend to align with ground truth regions, and also uncover several confounding factors influencing the model’s decision-making process. Additionally, we analyze which latent neurons the model relies on during downstream finetuning for improving the breast concept prediction. This study highlights the promise of interpretable SAE latent representations in providing deeper insight into the internal workings of foundation models at every layer for breast imaging.
zh

[CV-63] Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel

【速读】:该论文旨在解决复杂血管网络在三维建模中因分支结构、曲率及不规则形态导致的几何与拓扑表示不准确问题。其解决方案的关键在于提出一种分层部件(hierarchical part-based)生成框架,将血管的全局二叉树状拓扑结构与局部几何细节分离建模:首先生成关键图以刻画整体层级结构,再基于几何属性条件生成血管段,最后依据关键图进行分层组装,从而实现对复杂血管网络的高保真重建。

链接: https://arxiv.org/abs/2507.15223
作者: Siqi Chen,Guoqing Zhang,Jiahao Lai,Bingzhi Shen,Sihong Zhang,Caixia Dong,Xuejin Chen,Yang Li
机构: 11; 2 2; 33; 44; 55
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based frame work for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical struc ture, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: this https URL.
zh

[CV-64] Improving Joint Embedding Predictive Architecture with Diffusion Noise

【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在表征能力上相较于生成式模型(Generative Models)仍存在不足的问题,尤其是如何利用生成模型对数据分布的建模优势来增强SSL的语义理解能力。其解决方案的关键在于将扩散模型(Diffusion Model)中的扩散噪声机制引入到掩码图像建模(Masked Image Modeling, MIM)框架中,通过在掩码标记的位置嵌入(position embedding)中引入噪声信息,构建N-JEPA(Noise-based JEPA)模型;同时设计多级噪声调度策略作为特征增强手段,从而提升模型在下游分类任务中的鲁棒性和表示性能。

链接: https://arxiv.org/abs/2507.15216
作者: Yuping Qiu,Rui Zhu,Ying-cong Chen
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised learning has become an incredibly successful method for feature learning, widely applied to many downstream tasks. It has proven especially effective for discriminative tasks, surpassing the trending generative models. However, generative models perform better in image generation and detail enhancement. Thus, it is natural for us to find a connection between SSL and generative models to further enhance the representation capacity of SSL. As generative models can create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This enlightens us to combine the core principle of the diffusion model: diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of mask that reveals a close relationship between masked image modeling (MIM) and diffusion models. In this paper, we propose N-JEPA (Noise-based JEPA) to incorporate diffusion noise into MIM by the position embedding of masked tokens. The multi-level noise schedule is a series of feature augmentations to further enhance the robustness of our model. We perform a comprehensive study to confirm its effectiveness in the classification of downstream tasks. Codes will be released soon in public.
zh

[CV-65] MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction ICCV2025

【速读】:该论文旨在解决3D人体网格模型的高效生成与重建问题,特别是针对高分辨率、包含衣物和手部细节的复杂人体几何结构。传统方法在处理超过500个顶点的密集网格时面临计算效率低下或结构建模能力不足的问题。解决方案的关键在于提出MeshMamba,一种基于Mamba状态空间模型(Mamba State Space Models, Mamba-SSMs)的神经网络架构,其核心创新是通过基于身体部位标注或模板网格三维位置对顶点进行有序排列(serialization),从而将非结构化的3D网格转化为可被Mamba高效处理的序列输入,保留了人体关节结构的局部与全局拓扑关系。这一技术使得模型能够有效扩展至超过10,000个顶点,并在此基础上构建了MambaDiff3D(用于生成带衣物和手势的稠密3D人体网格)和Mamba-HMR(用于从单张图像中恢复全身含面部与手部的精细人体网格),实现了性能优越且接近实时的全身体网格重建与生成。

链接: https://arxiv.org/abs/2507.15212
作者: Yusuke Yoshiyasu,Leyuan Sun,Ryusuke Sagawa
机构: National Institute of Advanced Industrial Science and Technology (AIST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV2025

点击查看摘要

Abstract:In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
zh

[CV-66] Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection

【速读】:该论文旨在解决事件相机(event-based camera)产生的稀疏异步数据在转换为标准神经网络所需的密集张量时,会丧失其高时间分辨率和低延迟优势的问题。现有基于图结构的方法虽能保留稀疏性并支持异步推理,但因对时空动态建模不足,导致下游任务性能受限。解决方案的关键在于提出一种新颖的时空多图表示方法:构建两个解耦的图结构——利用B样条基函数建模全局空间结构的空间图,以及通过运动矢量注意力机制捕捉局部动态变化的时序图;该设计使模型能够使用高效的二维卷积核替代计算昂贵的三维卷积核,在不增加计算成本的前提下显著提升事件驱动目标检测的准确率(在Gen1汽车和eTraM数据集上提升超过6%),同时实现5倍加速与参数量减少。

链接: https://arxiv.org/abs/2507.15150
作者: Aayush Atul Verma,Arpitsinh Vaghela,Bharatesh Chakravarthi,Kaustav Chanda,Yezhou Yang
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event-based sensors offer high temporal resolution and low latency by generating sparse, asynchronous data. However, converting this irregular data into dense tensors for use in standard neural networks diminishes these inherent advantages, motivating research into graph representations. While such methods preserve sparsity and support asynchronous inference, their performance on downstream tasks remains limited due to suboptimal modeling of spatiotemporal dynamics. In this work, we propose a novel spatiotemporal multigraph representation to better capture spatial structure and temporal changes. Our approach constructs two decoupled graphs: a spatial graph leveraging B-spline basis functions to model global structure, and a temporal graph utilizing motion vector-based attention for local dynamic changes. This design enables the use of efficient 2D kernels in place of computationally expensive 3D kernels. We evaluate our method on the Gen1 automotive and eTraM datasets for event-based object detection, achieving over a 6% improvement in detection accuracy compared to previous graph-based works, with a 5x speedup, reduced parameter count, and no increase in computational cost. These results highlight the effectiveness of structured graph modeling for asynchronous vision. Project page: this http URL.
zh

[CV-67] Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications

【速读】:该论文旨在解决远程、资源匮乏环境中医疗系统因互操作性差、缺乏离线支持及对昂贵基础设施依赖而难以有效部署的问题。其核心解决方案是设计并实现了一个面向边缘计算的电子健康记录(Electronic Health Record, EHR)平台,具备离线优先运行能力、AES-256加密本地存储与可选云同步功能,并支持模块化诊断集成。通过在小型嵌入式设备上运行,该平台实现了低成本、高隐私合规性(符合HIPAA/GDPR)的健康信息管理,从而显著提升了数字健康技术在断网或基础设施薄弱地区的适用性和可扩展性。

链接: https://arxiv.org/abs/2507.15146
作者: Sebastian A. Cruz Romero,Misael J. Mercado Hernandez,Samir Y. Ali Rivera,Jorge A. Santiago Fernandez,Wilfredo E. Lugo Beauchamp
机构: University of Puerto Rico at Mayagüez (波多黎各大学马亚圭兹分校)
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Accepted at IEEE Global Humanitarian Technology Conference 2025

点击查看摘要

Abstract:The design of medical systems for remote, resource-limited environments faces persistent challenges due to poor interoperability, lack of offline support, and dependency on costly infrastructure. Many existing digital health solutions neglect these constraints, limiting their effectiveness for frontline health workers in underserved regions. This paper presents a portable, edge-enabled Electronic Health Record platform optimized for offline-first operation, secure patient data management, and modular diagnostic integration. Running on small-form factor embedded devices, it provides AES-256 encrypted local storage with optional cloud synchronization for interoperability. As a use case, we integrated a non-invasive anemia screening module leveraging fingernail pallor analysis. Trained on 250 patient cases (27% anemia prevalence) with KDE-balanced data, the Random Forest model achieved a test RMSE of 1.969 g/dL and MAE of 1.490 g/dL. A severity-based model reached 79.2% sensitivity. To optimize performance, a YOLOv8n-based nail bed detector was quantized to INT8, reducing inference latency from 46.96 ms to 21.50 ms while maintaining mAP@0.5 at 0.995. The system emphasizes low-cost deployment, modularity, and data privacy compliance (HIPAA/GDPR), addressing critical barriers to digital health adoption in disconnected settings. Our work demonstrates a scalable approach to enhance portable health information systems and support frontline healthcare in underserved regions.
zh

[CV-68] Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

【速读】:该论文旨在解决长时程视觉规划(Long-horizon Visual Planning)任务中的两大挑战:一是程序性标注数据稀缺,限制了模型对任务动态过程的学习能力;二是传统基于单标记预测(next-token prediction)的目标函数难以显式建模视觉规划任务中特有的结构化动作空间。解决方案的关键在于两个创新:其一,提出辅助任务增强(Auxiliary Task Augmentation),通过设计与长时程视频规划相关的辅助任务(如目标预测)来提升模型的规划能力;其二,引入多标记预测(Multi-token Prediction),利用多个预测头同时预测未来多个动作标记,从而更有效地捕捉视觉规划任务中的结构化动作空间。该方法在COIN和CrossTask数据集上实现了SOTA性能,在Ego4D长期动作预测任务中也达到与专用视角特征方法相当的效果。

链接: https://arxiv.org/abs/2507.15130
作者: Ce Zhang,Yale Song,Ruta Desai,Michael Louis Iuzzolino,Joseph Tighe,Gedas Bertasius,Satwik Kottur
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校); Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user’s progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model’s ability to learn procedural task dynamics effectively, and (2) inefficiency of next-token prediction objective to explicitly capture the structured action space for visual planning when compared to free-form, natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model’s planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with the state-of-the-art approaches despite not using specialized egocentric features. Code will be made available.
zh

[CV-69] LoopNet: A Multitasking Few-Shot Learning Approach for Loop Closure in Large Scale SLAM

【速读】:该论文针对实时同步定位与地图构建(SLAM)系统中的两个核心问题展开研究:一是回环检测(loop closure detection)的准确性不足,二是嵌入式硬件平台在计算资源受限条件下难以满足实时性要求。其解决方案的关键在于提出了一种基于多任务ResNet架构的LoopNet方法,该方法通过少量样本学习(few-shot learning)实现在线再训练,以适应动态视觉数据集的变化;同时,利用DISK(DIStinctive Keypoints)描述子替代传统手工特征和深度学习方法,显著提升了在不同环境条件下的鲁棒性和精度,并且模型经过优化可在嵌入式设备上高效运行。

链接: https://arxiv.org/abs/2507.15109
作者: Mohammad-Maher Nakshbandi,Ziad Sharawy,Sorin Grigorescu
机构: Transilvania University of Brasov (特兰西瓦尼亚布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset, and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at this https URL. Additinally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at this https URL.
zh

[CV-70] BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking

【速读】:该论文旨在解决内镜黏膜下剥离术(Endoscopic Submucosal Dissection, ESD)过程中出血源定位与连续追踪的难题,当前人工智能(AI)方法多集中于出血区域分割,忽视了在视觉遮挡频繁、场景动态变化的复杂环境中对出血源的精准识别与时间序列跟踪需求。其解决方案的关键在于提出首个面向ESD出血源的综合性数据集BleedOrigin-Bench,包含1,771个专家标注的出血源及39,755个伪标签帧,覆盖8个解剖部位和6种临床挑战场景,并设计了一种双阶段检测-追踪框架BleedOrigin-Net,实现从出血发生检测到空间持续跟踪的全流程自动化处理,显著提升了出血源定位的准确性与实时性。

链接: https://arxiv.org/abs/2507.15094
作者: Mengya Xu,Rulin Zhou,An Wang,Chaoyang Lyu,Zhen Li,Ning Zhong,Hongliang Ren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 14 figures

点击查看摘要

Abstract:Intraoperative bleeding during Endoscopic Submucosal Dissection (ESD) poses significant risks, demanding precise, real-time localization and continuous monitoring of the bleeding source for effective hemostatic intervention. In particular, endoscopists have to repeatedly flush to clear blood, allowing only milliseconds to identify bleeding sources, an inefficient process that prolongs operations and elevates patient risks. However, current Artificial Intelligence (AI) methods primarily focus on bleeding region segmentation, overlooking the critical need for accurate bleeding source detection and temporal tracking in the challenging ESD environment, which is marked by frequent visual obstructions and dynamic scene changes. This gap is widened by the lack of specialized datasets, hindering the development of robust AI-assisted guidance systems. To address these challenges, we introduce BleedOrigin-Bench, the first comprehensive ESD bleeding source dataset, featuring 1,771 expert-annotated bleeding sources across 106,222 frames from 44 procedures, supplemented with 39,755 pseudo-labeled frames. This benchmark covers 8 anatomical sites and 6 challenging clinical scenarios. We also present BleedOrigin-Net, a novel dual-stage detection-tracking framework for the bleeding source localization in ESD procedures, addressing the complete workflow from bleeding onset detection to continuous spatial tracking. We compare with widely-used object detection models (YOLOv11/v12), multimodal large language models, and point tracking methods. Extensive evaluation demonstrates state-of-the-art performance, achieving 96.85% frame-level accuracy ( \pm\leq8 frames) for bleeding onset detection, 70.24% pixel-level accuracy ( \leq100 px) for initial source detection, and 96.11% pixel-level accuracy ( \leq100 px) for point tracking.
zh

[CV-71] Visual Place Recognition for Large-Scale UAV Applications

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)视觉定位(Visual Place Recognition, vPR)中因数据集规模小、地理与时间多样性不足导致的模型泛化能力差,以及航拍图像固有的旋转模糊性(rotational ambiguity)问题。解决方案的关键在于两个方面:一是构建了LASED这一大规模航拍数据集,包含约一百万张图像,系统采样自爱沙尼亚17万唯一位置,覆盖十年时间跨度,具备显著的地理和时间多样性;二是引入可调节卷积神经网络(steerable Convolutional Neural Networks, steerable CNNs),利用其旋转等变性(rotational equivariance)特性,显式建模并消除图像旋转带来的不确定性,从而生成对方位不敏感的特征表示。实验表明,结合该数据集与steerable CNNs的方案在召回率上相较传统方法平均提升12%,显著增强了vPR模型在复杂空域环境中的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2507.15089
作者: Ioannis Tsampikos Papapetros,Ioannis Kansizoglou,Antonios Gasteratos
机构: Democritus University of Thrace (德谟克利特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Visual Place Recognition (vPR) plays a crucial role in Unmanned Aerial Vehicle (UAV) navigation, enabling robust localization across diverse environments. Despite significant advancements, aerial vPR faces unique challenges due to the limited availability of large-scale, high-altitude datasets, which limits model generalization, along with the inherent rotational ambiguity in UAV imagery. To address these challenges, we introduce LASED, a large-scale aerial dataset with approximately one million images, systematically sampled from 170,000 unique locations throughout Estonia over a decade, offering extensive geographic and temporal diversity. Its structured design ensures clear place separation significantly enhancing model training for aerial scenarios. Furthermore, we propose the integration of steerable Convolutional Neural Networks (CNNs) to explicitly handle rotational variance, leveraging their inherent rotational equivariance to produce robust, orientation-invariant feature representations. Our extensive benchmarking demonstrates that models trained on LASED achieve significantly higher recall compared to those trained on smaller, less diverse datasets, highlighting the benefits of extensive geographic coverage and temporal diversity. Moreover, steerable CNNs effectively address rotational ambiguity inherent in aerial imagery, consistently outperforming conventional convolutional architectures, achieving on average 12% recall improvement over the best-performing non-steerable network. By combining structured, large-scale datasets with rotation-equivariant neural networks, our approach significantly enhances model robustness and generalization for aerial vPR.
zh

[CV-72] Aesthetics is Cheap Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

【速读】:该论文旨在解决当前生成式 AI(Generative AI)模型在文本图像(text image)生成与编辑任务中表现不足的问题,特别是其在光学字符识别(OCR)相关任务中的能力局限。研究指出,尽管近期涌现出如Flux系列和GPT-4o等高性能通用生成模型,它们在真实感文本图像的生成与编辑方面仍存在显著缺陷。解决方案的关键在于系统性地构建一个包含33个代表性任务的评估框架,将文本图像生成扩展为OCR生成任务,并基于高质量输入与提示对六种闭源与开源模型进行多维度评测,从而揭示当前模型在文档、手写体、场景文本、艺术字体及复杂排版文本等五类任务中的薄弱环节。作者主张将高保真文本图像生成与编辑能力内化为通用生成模型的基础技能,而非依赖专用模块,以推动该领域向更鲁棒、通用的方向发展。

链接: https://arxiv.org/abs/2507.15085
作者: Peirong Zhang,Haowei Xu,Jiaxin Zhang,Guitao Xu,Xuhan Zheng,Zhenhua Yang,Junle Liu,Yuyi Zhang,Lianwen Jin
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text image is a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (\emphe.g., Flux-series) and unified generative models (\emphe.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models’ capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex \ layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills into general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.
zh

[CV-73] StableAnimator: Overcoming Pose Misalignment and Face Distortion for Human Image Animation

【速读】:该论文旨在解决当前扩散模型在人体图像动画生成中难以保持身份(Identity, ID)一致性的问题,尤其是在参考图像与驱动视频在身体尺寸或姿态上存在显著差异时。解决方案的关键在于提出StableAnimator++框架,其核心创新包括:1)引入可学习的姿势对齐模块,通过SVD引导预测参考图像与驱动姿态间的相似变换矩阵,实现姿态与参考图像的空间对齐;2)采用预训练编码器提取图像和人脸嵌入,并通过全局内容感知的人脸编码器优化面部特征;3)设计分布感知的身份适配器(distribution-aware ID Adapter),在时间层干扰下仍能通过分布对齐保持身份一致性;4)在推理阶段融合基于Hamilton-Jacobi-Bellman(HJB)方程的面部优化策略,引导去噪过程以提升面部保真度。这一系列设计共同实现了无需后处理即可高质量生成身份一致的视频动画。

链接: https://arxiv.org/abs/2507.15064
作者: Shuyuan Tu,Zhen Xing,Xintong Han,Zhi-Qi Cheng,Qi Dai,Chong Luo,Zuxuan Wu,Yu-Gang Jiang
机构: Fudan University (复旦大学); Microsoft Research Asia (微软亚洲研究院); Tencent Inc. (腾讯公司); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2411.17697

点击查看摘要

Abstract:Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses via injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, mitigating misalignment to a great extent. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further maintain ID, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory for enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.
zh

[CV-74] Rethinking Pan-sharpening: Principled Design Unified Training and a Universal Loss Surpass Brute-Force Scaling

【速读】:该论文旨在解决当前全色锐化(pan-sharpening)领域中模型规模日益庞大、训练依赖单一卫星数据集所导致的计算开销高和泛化能力差的问题。其解决方案的关键在于提出一种轻量级、单步处理的全色锐化框架PanTiny,结合“多数据合一”(multiple-in-one)的训练范式,使一个紧凑模型同时在WV2、WV3和GF2三个具有不同分辨率与光谱特性的卫星数据集上进行联合训练;此外,引入一种通用性强的复合损失函数(composite loss function),显著提升各类模型性能。实验证明,这种基于模型设计、训练策略与损失函数协同优化的方法,在保持高效性的同时大幅增强泛化能力,优于许多参数量更大的专用模型。

链接: https://arxiv.org/abs/2507.15059
作者: Ran Zhang,Xuanhua He,Li Xueheng,Ke Cao,Liu Liu,Wenbo Xu,Fang Jiabin,Yang Qize,Jie Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The field of pan-sharpening has recently seen a trend towards increasingly large and complex models, often trained on single, specific satellite datasets. This approach, however, leads to high computational overhead and poor generalization on full resolution data, a paradigm we challenge in this paper. In response to this issue, we propose PanTiny, a lightweight, single-step pan-sharpening framework designed for both efficiency and robust performance. More critically, we introduce multiple-in-one training paradigm, where a single, compact model is trained simultaneously on three distinct satellite datasets (WV2, WV3, and GF2) with different resolution and spectral information. Our experiments show that this unified training strategy not only simplifies deployment but also significantly boosts generalization on full-resolution data. Further, we introduce a universally powerful composite loss function that elevates the performance of almost all of models for pan-sharpening, pushing state-of-the-art metrics into a new era. Our PanTiny model, benefiting from these innovations, achieves a superior performance-to-efficiency balance, outperforming most larger, specialized models. Through extensive ablation studies, we validate that principled engineering in model design, training paradigms, and loss functions can surpass brute-force scaling. Our work advocates for a community-wide shift towards creating efficient, generalizable, and data-conscious models for pan-sharpening. The code is available at this https URL .
zh

[CV-75] OmniVTON: Training-Free Universal Virtual Try-On ICCV2025

【速读】:该论文旨在解决图像驱动的虚拟试衣(Image-based Virtual Try-On, VTON)技术中长期存在的难题:现有方法要么依赖监督学习在特定场景下实现高保真度但泛化能力差,要么采用无监督方式提升跨域适应性却受限于数据偏差和通用性不足,缺乏一种无需训练、适用于多种场景的统一解决方案。其核心创新在于提出OmniVTON框架,通过解耦服装(garment)与姿态(pose)条件约束,实现了跨场景下的纹理保真度与姿态一致性。关键方法包括:引入服装先验生成机制以对齐衣物与人体结构,并结合连续边界拼接技术保留精细纹理;利用DDIM反演提取结构信息并抑制纹理干扰,从而实现不依赖原始图像纹理的姿态精准对齐。这一解耦设计有效消除了扩散模型在多条件联合处理时的固有偏差,使系统具备训练-free特性及多人群体虚拟试衣能力,成为首个支持单场景多人试衣的通用VTON框架。

链接: https://arxiv.org/abs/2507.15037
作者: Zhaotong Yang,Yuhui Li,Shengfeng He,Xinzhe Li,Yangyang Xu,Junyu Dong,Yong Du
机构: Ocean University of China (中国海洋大学); Singapore Management University (新加坡管理大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

Abstract:Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches, which ensure high fidelity but struggle with cross-domain generalization, or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. A unified, training-free solution that works across both scenarios remains an open challenge. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. To preserve garment details, we introduce a garment prior generation mechanism that aligns clothing with the body, followed by continuous boundary stitching technique to achieve fine-grained texture retention. For precise pose alignment, we utilize DDIM inversion to capture structural cues while suppressing texture interference, ensuring accurate body alignment independent of the original image textures. By disentangling garment and pose constraints, OmniVTON eliminates the bias inherent in diffusion models when handling multiple conditions simultaneously. Experimental results demonstrate that OmniVTON achieves superior performance across diverse datasets, garment types, and application scenarios. Notably, it is the first framework capable of multi-human VTON, enabling realistic garment transfer across multiple individuals in a single scene. Code is available at this https URL
zh

[CV-76] EBA-AI: Ethics-Guided Bias-Aware AI for Efficient Underwater Image Enhancement and Coral Reef Monitoring

【速读】:该论文旨在解决当前基于人工智能(AI)的水下图像增强模型中存在的数据集偏差(dataset bias)、高计算成本及缺乏透明度等问题,这些问题可能导致海洋环境监测中的误判和不可靠决策。其核心解决方案是提出EBA-AI(Ethics-guided Bias-aware AI)框架,该框架通过CLIP嵌入(CLIP embeddings)识别并缓解数据集偏差,确保不同水下环境中样本的平衡表示;同时引入自适应处理机制以优化能效,在保持增强质量的前提下显著降低GPU资源消耗(如PSNR仅下降1.0 dB),从而实现大规模实时应用可行性。此外,EBA-AI集成不确定性估计与可解释性技术,提升AI驱动环境决策的信任度,实现了效率、公平性和可解释性的协同优化。

链接: https://arxiv.org/abs/2507.15036
作者: Lyes Saad Saoud,Irfan Hussain
机构: Khalifa University (哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Underwater image enhancement is vital for marine conservation, particularly coral reef monitoring. However, AI-based enhancement models often face dataset bias, high computational costs, and lack of transparency, leading to potential misinterpretations. This paper introduces EBA-AI, an ethics-guided bias-aware AI framework to address these challenges. EBA-AI leverages CLIP embeddings to detect and mitigate dataset bias, ensuring balanced representation across varied underwater environments. It also integrates adaptive processing to optimize energy efficiency, significantly reducing GPU usage while maintaining competitive enhancement quality. Experiments on LSUI400, Oceanex, and UIEB100 show that while PSNR drops by a controlled 1.0 dB, computational savings enable real-time feasibility for large-scale marine monitoring. Additionally, uncertainty estimation and explainability techniques enhance trust in AI-driven environmental decisions. Comparisons with CycleGAN, FunIEGAN, RAUNENet, WaterNet, UGAN, PUGAN, and UTUIE validate EBA-AI’s effectiveness in balancing efficiency, fairness, and interpretability in underwater image processing. By addressing key limitations of AI-driven enhancement, this work contributes to sustainable, bias-aware, and computationally efficient marine conservation efforts. For interactive visualizations, animations, source code, and access to the preprint, visit: this https URL
zh

[CV-77] OpenBreastUS: Benchmarking Neural Operators for Wave Imaging Using Breast Ultrasound Computed Tomography

【速读】:该论文旨在解决传统波方程数值求解方法在计算效率和稳定性方面的局限性,从而限制了其在超声 computed tomography (USCT) 等医学成像应用中实现准实时图像重建的问题。解决方案的关键在于构建了一个大规模、高保真的波方程数据集 OpenBreastUS,其中包含 8,000 个解剖学上真实的乳腺体模和超过 1600 万次基于真实 USCT 配置的频域波场仿真,为神经算子(neural operators)提供了现实场景下的训练与评估基准,从而显著提升了其在正向模拟与逆成像任务中的性能、可扩展性和泛化能力,并首次实现了基于神经算子求解器的人体乳腺在体成像。

链接: https://arxiv.org/abs/2507.15035
作者: Zhijun Zeng,Youjia Zheng,Hao Hu,Zeyuan Dong,Yihang Zheng,Xinliang Liu,Jinzhuo Wang,Zuoqiang Shi,Linfeng Zhang,Yubing Li,He Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate and efficient simulation of wave equations is crucial in computational wave imaging applications, such as ultrasound computed tomography (USCT), which reconstructs tissue material properties from observed scattered waves. Traditional numerical solvers for wave equations are computationally intensive and often unstable, limiting their practical applications for quasi-real-time image reconstruction. Neural operators offer an innovative approach by accelerating PDE solving using neural networks; however, their effectiveness in realistic imaging is limited because existing datasets oversimplify real-world complexity. In this paper, we present OpenBreastUS, a large-scale wave equation dataset designed to bridge the gap between theoretical equations and practical imaging applications. OpenBreastUS includes 8,000 anatomically realistic human breast phantoms and over 16 million frequency-domain wave simulations using real USCT configurations. It enables a comprehensive benchmarking of popular neural operators for both forward simulation and inverse imaging tasks, allowing analysis of their performance, scalability, and generalization capabilities. By offering a realistic and extensive dataset, OpenBreastUS not only serves as a platform for developing innovative neural PDE solvers but also facilitates their deployment in real-world medical imaging problems. For the first time, we demonstrate efficient in vivo imaging of the human breast using neural operator solvers.
zh

[CV-78] owards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding ICCV2025

【速读】:该论文试图解决当前视频大语言模型(video LLMs)在视频理解任务中缺乏对正确性(correctness)和鲁棒性(robustness)的充分评估问题,尤其是现有基准未能真实反映模型与人类智能之间的差距。解决方案的关键在于提出Video Thinking Test (Video-TT),这是一个包含1,000个YouTube Shorts视频的评测基准,每个视频配有一个开放式问题和四个针对视觉与叙事复杂性的自然对抗性问题,从而系统性地衡量模型在真实场景下的理解能力与鲁棒性表现。

链接: https://arxiv.org/abs/2507.15028
作者: Yuanhan Zhang,Yunice Chew,Yuhao Dong,Aria Leo,Bo Hu,Ziwei Liu
机构: S-Lab, Nanyang Technological University (南洋理工大学); Independent Researcher
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025; Project page: this https URL

点击查看摘要

Abstract:Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT), to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.
zh

[CV-79] FastSmoothSAM: A Fast Smooth Method For Segment Anything Model

【速读】:该论文旨在解决Fast Segment Anything (FastSAM) 在图像分割中生成锯齿状边缘、导致边缘精度不足的问题,从而影响其在工业自动化、医学影像和自动驾驶等对边缘精度要求较高的实际场景中的应用。解决方案的关键在于引入基于B-Spline曲线拟合的四阶段精修流程,通过两次曲线拟合操作有效平滑锯齿边缘,同时保留关键几何信息,显著提升边缘的视觉质量和分析准确性,且不牺牲FastSAM原有的实时处理能力。

链接: https://arxiv.org/abs/2507.15008
作者: Jiasheng Xu,Yewang Chen
机构: Huaqiao University (华侨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately identifying and representing object edges is a challenging task in computer vision and image processing. The Segment Anything Model (SAM) has significantly influenced the field of image segmentation, but suffers from high memory consumption and long inference times, limiting its efficiency in real-time applications. To address these limitations, Fast Segment Anything (FastSAM) was proposed, achieving real-time segmentation. However, FastSAM often generates jagged edges that deviate from the true object shapes. Therefore, this paper introduces a novel refinement approach using B-Spline curve fitting techniques to enhance the edge quality in FastSAM. Leveraging the robust shape control and flexible geometric construction of B-Splines, a four-stage refining process involving two rounds of curve fitting is employed to effectively smooth jagged edges. This approach significantly improves the visual quality and analytical accuracy of object edges without compromising critical geometric information. The proposed method improves the practical utility of FastSAM by improving segmentation accuracy while maintaining real-time processing capabilities. This advancement unlocks greater potential for FastSAM technology in various real-world scenarios, such as industrial automation, medical imaging, and autonomous systems, where precise and efficient edge recognition is crucial.
zh

[CV-80] Axis-Aligned Document Dewarping

【速读】:该论文旨在解决文档畸变校正(document dewarping)中现有基于学习的方法依赖监督回归且未充分利用物理文档内在几何特性的问题。其解决方案的关键在于引入轴对齐几何约束(axis-aligned geometric constraint)和轴对齐预处理策略(axis alignment preprocessing strategy),通过强制将扭曲的特征线转换为与坐标轴对齐的形式,从而更好地利用平面文档离散网格结构的固有轴对齐性质,显著提升校正精度与鲁棒性。

链接: https://arxiv.org/abs/2507.15000
作者: Chaoyun Wang,I-Chao Shen,Takeo Igarashi,Nanning Zheng,Caigui Jiang
机构: Xi’an Jiaotong University (西安交通大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document dewarping is crucial for many applications. However, existing learning-based methods primarily rely on supervised regression with annotated data without leveraging the inherent geometric properties in physical documents to the dewarping process. Our key insight is that a well-dewarped document is characterized by transforming distorted feature lines into axis-aligned ones. This property aligns with the inherent axis-aligned nature of the discrete grid geometry in planar documents. In the training phase, we propose an axis-aligned geometric constraint to enhance document dewarping. In the inference phase, we propose an axis alignment preprocessing strategy to reduce the dewarping difficulty. In the evaluation phase, we introduce a new metric, Axis-Aligned Distortion (AAD), that not only incorporates geometric meaning and aligns with human visual perception but also demonstrates greater robustness. As a result, our method achieves SOTA results on multiple existing benchmarks and achieves 18.2%~34.5% improvements on the AAD metric.
zh

[CV-81] Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像回归任务中性能受限的问题,尤其是现有方法依赖预设输出词汇表和通用任务提示(如“你如何评价这张图片?”)时无法有效利用文本输入的语义信息,导致其表现与仅使用图像训练的模型相当。解决方案的关键在于提出一种基于Transformer的分类回归方法(Regression via Transformer-Based Classification, RvTC),通过引入灵活的分箱(bin-based)策略替代固定词汇表约束,避免了手动设计词汇表的复杂性;同时强调数据特定提示(data-specific prompts)的重要性,即在提示中加入与具体图像相关的语义信息(如挑战标题),显著提升模型对跨模态语义的理解能力,从而在多个图像评估数据集上实现最优性能,例如在AVA数据集上将相关系数从0.83提升至0.90。

链接: https://arxiv.org/abs/2507.14997
作者: Roy H. Jennings,Genady Paikin,Roy Shaul,Evgeny Soloveichik
机构: Samsung Israel R&D Center (三星以色列研发中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., “How would you rate this image?”), assuming this mimics human rating behavior. Our analysis reveals these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts improves correlations from 0.83 to 0.90, a new state-of-the-art. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information surpassing mere statistical biases. This underscores the importance of incorporating meaningful textual context in multimodal regression tasks.
zh

[CV-82] Hierarchical Cross-modal Prompt Learning for Vision-Language Models ICCV2025

【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在下游任务适应过程中难以兼顾性能提升与泛化能力保持的问题。现有提示学习方法存在两个关键瓶颈:模态隔离(modality isolation)导致跨模态信息交互不足,以及语义层级衰减(hierarchical semantic decay)使得深层表示丢失可迁移的浅层语义。其解决方案的核心是提出Hierarchical Cross-modal Prompt Learning(HiCroPL)框架,通过构建文本与视觉模态间的双向知识流,实现语义互 refine;具体而言,在早期层利用文本提示通过分层知识映射器(hierarchical knowledge mapper)向视觉提示注入清晰语义以增强低层视觉表征,在深层则由视觉提示携带任务相关对象信息反馈至文本提示,促进深层次对齐;同时,该框架通过多尺度特征融合机制保障深层表示保留可迁移的浅层语义,从而显著提升模型泛化能力。

链接: https://arxiv.org/abs/2507.14976
作者: Hao Zheng,Shunzhi Yang,Zhuoxin He,Jinfeng Yang,Zhenhua Huang
机构: South China Normal University (华南师范大学); Shenzhen Polytechnic University (深圳职业技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

Abstract:Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL’s superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: this https URL.
zh

[CV-83] Decision PCR: Decision version of the Point Cloud Registration task

【速读】:该论文旨在解决低重叠点云配准(Low-overlap Point Cloud Registration, PCR)任务中传统评估指标失效的问题,尤其是在极低内点比例下难以可靠判断配准结果质量的挑战。其解决方案的关键在于将PCR任务重新定义为“决策型PCR”(Decision version of PCR),并提出一种基于深度学习的数据驱动方法:首先构建基于3DMatch数据集的训练集,随后训练一个深度神经网络分类器来精准评估配准质量,从而替代传统依赖于内点数量的度量方式。该分类器被集成到标准PCR流程中,显著提升了现有先进方法(如GeoTransformer)在3DLoMatch等挑战性基准上的性能,实现了86.97%的新SOTA召回率,并展现出良好的跨场景泛化能力。

链接: https://arxiv.org/abs/2507.14965
作者: Yaojie Zhang,Tianlun Huang,Weijun Wang,Wei Feng
机构: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-overlap point cloud registration (PCR) remains a significant challenge in 3D vision. Traditional evaluation metrics, such as Maximum Inlier Count, become ineffective under extremely low inlier ratios. In this paper, we revisit the registration result evaluation problem and identify the Decision version of the PCR task as the fundamental problem. To address this Decision PCR task, we propose a data-driven approach. First, we construct a corresponding dataset based on the 3DMatch dataset. Then, a deep learning-based classifier is trained to reliably assess registration quality, overcoming the limitations of traditional metrics. To our knowledge, this is the first comprehensive study to address this task through a deep learning framework. We incorporate this classifier into standard PCR pipelines. When integrated with our approach, existing state-of-the-art PCR methods exhibit significantly enhanced registration performance. For example, combining our framework with GeoTransformer achieves a new SOTA registration recall of 86.97% on the challenging 3DLoMatch benchmark. Our method also demonstrates strong generalization capabilities on the unseen outdoor ETH dataset.
zh

[CV-84] Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

【速读】:该论文旨在解决嵌入式设备上实时多标签视频分类任务中因计算资源和能耗限制而导致的效率瓶颈问题。其核心挑战在于如何在保持高精度的同时降低推理延迟与能量消耗。解决方案的关键在于提出了一种上下文感知的模块化框架Polymorph,该框架通过动态激活一组轻量级低秩适配器(Low Rank Adapters, LoRA)来实现高效推理:每个LoRA适配器针对由标签共现模式挖掘出的特定类别子集进行优化,并以LoRA权重形式部署于共享主干网络之上;运行时根据当前帧的活跃标签选择并组合必要的适配器,从而避免全模型切换和权重合并操作,显著提升了可扩展性、降低了延迟与能耗。

链接: https://arxiv.org/abs/2507.14959
作者: Saeid Ghafouri,Mohsen Fayyaz,Xiangchen Li,Deepu John,Bo Ji,Dimitrios Nikolopoulos,Hans Vandierendonck
机构: Queen’s University Belfast (贝尔法斯特女王大学); Microsoft (微软); Virginia Tech (弗吉尼亚理工学院); University College Dublin (都柏林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at this https URL.
zh

[CV-85] Open-set Cross Modal Generalization via Multimodal Unified Representation ICCV2025

【速读】:该论文旨在解决现有跨模态统一表示方法在开放集(open-set)环境下缺乏适应性的问题,即传统方法多基于封闭集(closed-set)评估,无法有效处理新模态中未见类别的泛化任务。为此,作者提出Open-set Cross Modal Generalization (OSCMG)任务,以更贴近真实应用场景中跨模态知识迁移与未知类别识别的需求。解决方案的核心是提出MICU模型,其关键创新在于两个组件:Fine-Coarse Masked multimodal InfoNCE(FCMI)通过在语义和时序层面施加对比学习并引入掩码机制,增强跨模态对齐与鲁棒性;Cross modal Unified Jigsaw Puzzles(CUJP)则结合模态无关特征选择与自监督学习,提升特征多样性与模型不确定性建模能力,从而显著改善模型对开放集中未见类别的判别性能。

链接: https://arxiv.org/abs/2507.14935
作者: Hai Huang,Yan Xia,Shulei Wang,Hanting Wang,Minghui Fang,Shengpeng Ji,Sashuai Zhou,Tao Jin,Zhou Zhao
机构: Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model’s ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at this https URL.
zh

[CV-86] Probabilistic smooth attention for deep multiple instance learning in medical imaging

【速读】:该论文旨在解决医学图像分类中因标注数据稀缺而导致的深度多实例学习(Multiple Instance Learning, MIL)方法在注意力机制应用上的局限性问题。现有方法通常将注意力值视为确定性权重,忽略了单个实例对整体预测贡献的不确定性,从而可能影响模型的鲁棒性和可解释性。解决方案的关键在于提出一种新颖的概率化框架,通过估计注意力值的概率分布来建模不确定性,并同时捕捉局部邻近实例间的相互作用与全局长距离依赖关系。该方法不仅在三个医学数据集上显著优于十一个主流基线模型,还生成了具有疾病定位意义的不确定性热力图,提升了模型的临床可解释性。

链接: https://arxiv.org/abs/2507.14932
作者: Francisco M. Castro-Macías,Pablo Morales-Álvarez,Yunan Wu,Rafael Molina,Aggelos K. Katsaggelos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Multiple Instance Learning (MIL) paradigm is attracting plenty of attention in medical imaging classification, where labeled data is scarce. MIL methods cast medical images as bags of instances (e.g. patches in whole slide images, or slices in CT scans), and only bag labels are required for training. Deep MIL approaches have obtained promising results by aggregating instance-level representations via an attention mechanism to compute the bag-level prediction. These methods typically capture both local interactions among adjacent instances and global, long-range dependencies through various mechanisms. However, they treat attention values deterministically, potentially overlooking uncertainty in the contribution of individual instances. In this work we propose a novel probabilistic framework that estimates a probability distribution over the attention values, and accounts for both global and local interactions. In a comprehensive evaluation involving \colorreview eleven state-of-the-art baselines and three medical datasets, we show that our approach achieves top predictive performance in different metrics. Moreover, the probabilistic treatment of the attention provides uncertainty maps that are interpretable in terms of illness localization.
zh

[CV-87] 3-Dimensional CryoEM Pose Estimation and Shift Correction Pipeline

【速读】:该论文旨在解决冷冻电子显微镜(cryo-EM)中由于信噪比(SNR)极低导致的粒子姿态估计(pose estimation)与平移校正(shift correction)难题,这些问题直接影响三维重构的保真度。其解决方案的关键在于提出一种基于多维缩放(MDS)技术的鲁棒姿态估计方法,通过将旋转矩阵表示为旋转轴和垂直于该轴的单位向量来建模,并引入两个互补组件:(i) 采用ℓ₁-范数目标函数的联合优化框架,利用投影坐标下降法精确强制单位长度和正交约束,从而提升对噪声的鲁棒性;(ii) 一种基于全局最小二乘法的迭代平移校正算法,实现一致的平面内平移估计。相较于以往依赖ℓ₂-范数且仅近似满足几何约束的方法,该方案避免了误差累积,显著提升了欧拉角精度和傅里叶壳相关(FSC)指标下的重建质量。

链接: https://arxiv.org/abs/2507.14924
作者: Kaishva Chintan Shah,Virajith Boddapati,Karthik S. Gurumoorthy,Sandip Kaledhonkar,Ajit Rajwade
机构: Indian Institute of Technology, Bombay (印度理工学院孟买分校); Walmart Global Tech (沃尔玛全球科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate pose estimation and shift correction are key challenges in cryo-EM due to the very low SNR, which directly impacts the fidelity of 3D reconstructions. We present an approach for pose estimation in cryo-EM that leverages multi-dimensional scaling (MDS) techniques in a robust manner to estimate the 3D rotation matrix of each particle from pairs of dihedral angles. We express the rotation matrix in the form of an axis of rotation and a unit vector in the plane perpendicular to the axis. The technique leverages the concept of common lines in 3D reconstruction from projections. However, common line estimation is ridden with large errors due to the very low SNR of cryo-EM projection images. To address this challenge, we introduce two complementary components: (i) a robust joint optimization framework for pose estimation based on an \ell_1 -norm objective or a similar robust norm, which simultaneously estimates rotation axes and in-plane vectors while exactly enforcing unit norm and orthogonality constraints via projected coordinate descent; and (ii) an iterative shift correction algorithm that estimates consistent in-plane translations through a global least-squares formulation. While prior approaches have leveraged such embeddings and common-line geometry for orientation recovery, existing formulations typically rely on \ell_2 -based objectives that are sensitive to noise, and enforce geometric constraints only approximately. These choices, combined with a sequential pipeline structure, can lead to compounding errors and suboptimal reconstructions in low-SNR regimes. Our pipeline consistently outperforms prior methods in both Euler angle accuracy and reconstruction fidelity, as measured by the Fourier Shell Correlation (FSC).
zh

[CV-88] Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

【速读】:该论文旨在解决现有3D Gaussian Splatting(3DGS)重建方法在训练过程中对计算资源和大规模数据集依赖性强、几何与外观预测耦合导致回归速度慢的问题。其核心解决方案是提出一种解耦框架\method,通过双分支结构分别生成几何信息(多视角点图,point-maps)和外观特征(高斯特征,Gaussian features),并以GS-maps形式统一表示3DGS对象;同时引入全局注意力机制融合局部图像对特征,并利用精炼网络提升重建质量,最终实现无需相机参数的位姿无关(pose-free)3D重建,显著降低资源消耗并保持高质量输出。

链接: https://arxiv.org/abs/2507.14921
作者: Xiufeng Huang,Ka Chun Cheung,Runmin Cong,Simon See,Renjie Wan
机构: Hong Kong Baptist University (香港浸会大学); NVIDIA AI Technology Center (NVIDIA人工智能技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACMMM2025. Non-camera-ready version

点击查看摘要

Abstract:Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose \method, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, \method provides an efficient, scalable solution for real-world 3D content generation.
zh

[CV-89] Semantic-Aware Representation Learning for Multi-label Image Classification

【速读】:该论文旨在解决多标签图像分类中现有方法因注意力机制或图卷积网络(GCN)学习到的图像表征存在噪声且难以精确定位目标对象的问题。其解决方案的关键在于提出一种语义感知的表征学习(Semantic-Aware Representation Learning, SARL)框架:首先通过标签语义相关特征学习模块提取与标签语义相关的特征,进而设计基于最优传输(optimal transport)的注意力机制实现语义对齐的图像表征,最后采用区域得分聚合策略完成多标签预测,从而提升分类精度与语义一致性。

链接: https://arxiv.org/abs/2507.14918
作者: Ren-Dong Xie,Zhi-Fen He,Bo Li,Bin Liu,Jin-Yan Hu
机构: Nanchang Hangkong University (南昌航空大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-label image classification, an important research area in computer vision, focuses on identifying multiple labels or concepts within an image. Existing approaches often employ attention mechanisms or graph convolutional networks (GCNs) to learn image representation. However, this representation may contain noise and may not locate objects precisely. Therefore, this paper proposes a Semantic-Aware Representation Learning (SARL) for multi-label image classification. First, a label semantic-related feature learning module is utilized to extract semantic-related features. Then, an optimal transport-based attention mechanism is designed to obtain semantically aligned image representation. Finally, a regional score aggregation strategy is used for multi-label prediction. Experimental results on two benchmark datasets, PASCAL VOC 2007 and MS-COCO, demonstrate the superiority of SARL over existing methods.
zh

[CV-90] riCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

【速读】:该论文旨在解决现有3D视觉定位(3D visual grounding)方法中因依赖独立模态编码器(如RGB图像、文本和3D点云)而导致模型结构复杂、训练效率低的问题。当前方法虽尝试利用预训练的2D多模态模型(如CLIP)处理3D任务,但仍难以有效对齐点云数据与2D编码器,因而仍需额外的3D编码器进行特征提取,进一步增加模型复杂度。其解决方案的关键在于构建一个统一的2D预训练多模态网络,通过适配器(adapter-based)微调策略将CLIP模型扩展至三模态(RGB图像、文本和点云),并设计几何感知的2D-3D特征恢复与融合模块(GARF),实现多尺度几何特征的高效融合;最终结合文本特征与多模态解码器完成深层跨模态理解,从而在显著减少58%可训练参数的同时,在3D检测和3D视觉定位任务上分别提升6.52%和6.25%的性能。

链接: https://arxiv.org/abs/2507.14904
作者: Fan Li,Zanyi Wang,Zeyi Huang,Guang Dai,Jingdong Wang,Mengmeng Wang
机构: Xi’an Jiaotong University (西安交通大学); SGIT AI Lab; Zhejiang University of Technology (浙江工业大学); Huawei (华为)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.
zh

[CV-91] U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLM s

【速读】:该论文旨在解决通用多模态检索(Universal Multimodal Retrieval, UMR)中基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的嵌入学习机制不明确、性能受限及泛化能力不足的问题。现有方法虽多采用对比学习范式,但其训练细节差异显著且关键因素未被充分挖掘,导致模型表现不稳定。解决方案的关键在于通过系统性分析嵌入生成与训练策略中的核心要素——包括渐进式过渡(progressive transition)、难负样本挖掘(hard negative mining)和重排序器蒸馏(re-ranker distillation),发现并验证了以往常被忽视的因素对性能的显著影响。基于此,作者提出统一框架U-MARVEL,该框架在监督设置下显著优于M-BEIR基准上的现有方法,并在组合图像检索和文本到视频检索等零样本任务中展现出强大的泛化能力,从而提升了UMR任务中嵌入学习的有效性和鲁棒性。

链接: https://arxiv.org/abs/2507.14902
作者: Xiaojie Li,Chu Li,Shi-Zhe Chen,Xi Chen
机构: Nanjing University (南京大学); BAC, Tencent PCG (腾讯PCG)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report (in progress)

点击查看摘要

Abstract:Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various aspects of the details in embedding generation and training strategies, including progressive transition, hard negative mining and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (\textbfUniversal \textbfMultimod\textbfAl \textbfRetrie\textbfVal via \textbfEmbedding \textbfLearning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exihibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at this https URL
zh

[CV-92] InsightX Agent : An LMM-based Agent ic Framework with Integrated Tools for Reliable X-ray NDT Analysis

【速读】:该论文旨在解决当前基于深度学习的无损检测(NDT)方法在工业质量保证中普遍存在的交互性不足、可解释性差以及缺乏自我评估能力的问题,这些问题限制了模型的可靠性与操作员信任度。其解决方案的关键在于提出一个基于大模型(Large Multimodal Model, LMM)的智能体框架——InsightX Agent,该框架以LMM为核心调度器,协同Sparse Deformable Multi-Scale Detector(SDMSD)与Evidence-Grounded Reflection(EGR)工具,实现从被动数据处理向主动推理的转变:SDMSD通过多尺度特征图生成密集缺陷区域提案并经非极大值抑制(NMS)稀疏化,提升小目标检测效率;EGR则引导LMM执行链式思维式的审查流程,涵盖上下文评估、单个缺陷分析、误报剔除、置信度重校准和质量保障,从而验证并优化初始检测结果,显著增强诊断可靠性与可解释性。

链接: https://arxiv.org/abs/2507.14899
作者: Jiale Liu,Huan Wang,Yue Zhang,Xiaoyu Luo,Jiaxiang Hu,Zhiliang Liu,Min Xie
机构: University of Edinburgh (爱丁堡大学); University of Electronic Science and Technology of China (电子科技大学); City University of Hong Kong (香港城市大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Non-destructive testing (NDT), particularly X-ray inspection, is vital for industrial quality assurance, yet existing deep-learning-based approaches often lack interactivity, interpretability, and the capacity for critical self-assessment, limiting their reliability and operator trust. To address these shortcomings, this paper proposes InsightX Agent, a novel LMM-based agentic framework designed to deliver reliable, interpretable, and interactive X-ray NDT analysis. Unlike typical sequential pipelines, InsightX Agent positions a Large Multimodal Model (LMM) as a central orchestrator, coordinating between the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. The SDMSD generates dense defect region proposals for multi-scale feature maps and sparsifies them through Non-Maximum Suppression (NMS), optimizing detection of small, dense targets in X-ray images while maintaining computational efficiency. The EGR tool guides the LMM agent through a chain-of-thought-inspired review process, incorporating context assessment, individual defect analysis, false positive elimination, confidence recalibration and quality assurance to validate and refine the SDMSD’s initial proposals. By strategically employing and intelligently using tools, InsightX Agent moves beyond passive data processing to active reasoning, enhancing diagnostic reliability and providing interpretations that integrate diverse information sources. Experimental evaluations on the GDXray+ dataset demonstrate that InsightX Agent not only achieves a high object detection F1-score of 96.35% but also offers significantly improved interpretability and trustworthiness in its analyses, highlighting the transformative potential of agentic LLM frameworks for industrial inspection tasks.
zh

[CV-93] BeatFormer: Efficient motion-robust remote heart rate estimation through unsupervised spectral zoomed attention filters

【速读】:该论文旨在解决远程光电容积脉搏波描记术(remote photoplethysmography, rPPG)在复杂场景下性能受限的问题,尤其是现有深度学习方法依赖大规模标注数据、泛化能力不足,而传统手工特征方法虽计算高效但线性假设限制了其在动态运动等复杂条件下的表现。解决方案的关键在于提出BeatFormer模型,其核心创新是融合了缩放正交复数注意力机制(zoomed orthonormal complex attention)与频域能量度量,从而实现高效且鲁棒的rPPG信号估计;同时引入频谱对比学习(Spectral Contrastive Learning, SCL),使模型无需任何PPG或心率(HR)标签即可训练,显著提升了跨数据集迁移能力和实际适用性。

链接: https://arxiv.org/abs/2507.14885
作者: Joaquim Comas,Federico Sukno
机构: Pompeu Fabra University (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) captures cardiac signals from facial videos and is gaining attention for its diverse applications. While deep learning has advanced rPPG estimation, it relies on large, diverse datasets for effective generalization. In contrast, handcrafted methods utilize physiological priors for better generalization in unseen scenarios like motion while maintaining computational efficiency. However, their linear assumptions limit performance in complex conditions, where deep learning provides superior pulsatile information extraction. This highlights the need for hybrid approaches that combine the strengths of both methods. To address this, we present BeatFormer, a lightweight spectral attention model for rPPG estimation, which integrates zoomed orthonormal complex attention and frequency-domain energy measurement, enabling a highly efficient model. Additionally, we introduce Spectral Contrastive Learning (SCL), which allows BeatFormer to be trained without any PPG or HR labels. We validate BeatFormer on the PURE, UBFC-rPPG, and MMPD datasets, demonstrating its robustness and performance, particularly in cross-dataset evaluations under motion scenarios.
zh

[CV-94] Region-aware Depth Scale Adaptation with Sparse Measurements

【速读】:该论文旨在解决基础模型(foundation models)在单目深度估计中输出为相对尺度而非度量尺度(metric scale)的问题,这一限制阻碍了其在真实场景中的直接应用。现有解决方案通常依赖于额外的训练或微调以实现尺度适配,但这类方法不仅计算成本高,还可能损害模型原有的泛化能力。论文提出了一种非学习型(non-learning-based)方法,其关键在于利用稀疏深度测量值(sparse depth measurements)将基础模型的相对尺度预测转换为度量尺度深度,无需重新训练或微调,从而在不牺牲原模型强泛化能力的前提下实现精确的度量深度输出。

链接: https://arxiv.org/abs/2507.14879
作者: Rizhao Fan,Tianfang Ma,Zhigen Li,Ning An,Jian Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, the emergence of foundation models for depth prediction has led to remarkable progress, particularly in zero-shot monocular depth estimation. These models generate impressive depth predictions; however, their outputs are often in relative scale rather than metric scale. This limitation poses challenges for direct deployment in real-world applications. To address this, several scale adaptation methods have been proposed to enable foundation models to produce metric depth. However, these methods are typically costly, as they require additional training on new domains and datasets. Moreover, fine-tuning these models often compromises their original generalization capabilities, limiting their adaptability across diverse scenes. In this paper, we introduce a non-learning-based approach that leverages sparse depth measurements to adapt the relative-scale predictions of foundation models into metric-scale depth. Our method requires neither retraining nor fine-tuning, thereby preserving the strong generalization ability of the original foundation models while enabling them to produce metric depth. Experimental results demonstrate the effectiveness of our approach, high-lighting its potential to bridge the gap between relative and metric depth without incurring additional computational costs or sacrificing generalization ability.
zh

[CV-95] Hybrid-supervised Hypergraph-enhanced Transformer for Micro-gesture Based Emotion Recognition

【速读】:该论文旨在解决基于微手势(micro-gestures)的人类情绪状态识别问题,当前该领域在建模人类情绪与微手势之间关系方面仍缺乏充分探索。解决方案的关键在于提出一种融合超图增强的Transformer架构的混合监督框架,其中通过堆叠超图增强的自注意力模块和多尺度时序卷积模块构建编码器与解码器;特别地,为更精确捕捉微手势中细微运动特征,设计了一个包含上采样操作的解码器用于自监督重建任务,并引入可动态更新的超边以建模骨骼关节间的复杂空间关系;同时,在编码器输出端设计浅层情绪识别头,通过监督学习挖掘情绪状态与局部微手势运动之间的关联,最终实现端到端的一阶段联合训练,显著优于现有方法。

链接: https://arxiv.org/abs/2507.14867
作者: Zhaoqiang Xia,Hexiang Huang,Haoyu Chen,Xiaoyi Feng,Guoying Zhao
机构: Northwestern Polytechnical University (西北工业大学); Innovation Center NPU Chongqing (西北工业大学重庆创新中心); University of Oulu (奥卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-gestures are unconsciously performed body gestures that can convey the emotion states of humans and start to attract more research attention in the fields of human behavior understanding and affective computing as an emerging topic. However, the modeling of human emotion based on micro-gestures has not been explored sufficiently. In this work, we propose to recognize the emotion states based on the micro-gestures by reconstructing the behavior patterns with a hypergraph-enhanced Transformer in a hybrid-supervised framework. In the framework, hypergraph Transformer based encoder and decoder are separately designed by stacking the hypergraph-enhanced self-attention and multiscale temporal convolution modules. Especially, to better capture the subtle motion of micro-gestures, we construct a decoder with additional upsampling operations for a reconstruction task in a self-supervised learning manner. We further propose a hypergraph-enhanced self-attention module where the hyperedges between skeleton joints are gradually updated to present the relationships of body joints for modeling the subtle local motion. Lastly, for exploiting the relationship between the emotion states and local motion of micro-gestures, an emotion recognition head from the output of encoder is designed with a shallow architecture and learned in a supervised way. The end-to-end framework is jointly trained in a one-stage way by comprehensively utilizing self-reconstruction and supervision information. The proposed method is evaluated on two publicly available datasets, namely iMiGUE and SMG, and achieves the best performance under multiple metrics, which is superior to the existing methods.
zh

[CV-96] An Uncertainty-aware DETR Enhancement Framework for Object Detection

【速读】:该论文旨在解决目标检测中边界框定位精度不足以及预测不确定性未被显式建模的问题(即传统检测器依赖确定性边界框回归,忽视了预测中的不确定性,从而限制了模型的鲁棒性)。其解决方案的关键在于提出一种基于DETR架构的不确定性感知增强框架:首先将边界框建模为多元高斯分布,并引入Gromov-Wasserstein距离作为损失函数的一部分以更好地对齐预测分布与真实分布;其次基于贝叶斯风险公式设计信息过滤机制以降低高风险预测,提升检测可靠性;最后通过简单算法利用置信区间量化定位不确定性。该方法在COCO基准上验证了有效性,并成功扩展至白细胞检测任务,在LISC和WBCDD数据集上达到最先进性能,证明了其通用性和可扩展性。

链接: https://arxiv.org/abs/2507.14855
作者: Xingshu Chen,Sicheng Yu,Chong Cheng,Hao Wang,Ting Tian
机构: Sun Yat-sen University (中山大学); AI Thrust, HKUST(GZ) (香港科技大学(广州)人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper investigates the problem of object detection with a focus on improving both the localization accuracy of bounding boxes and explicitly modeling prediction uncertainty. Conventional detectors rely on deterministic bounding box regression, ignoring uncertainty in predictions and limiting model robustness. In this paper, we propose an uncertainty-aware enhancement framework for DETR-based object detectors. We model bounding boxes as multivariate Gaussian distributions and incorporate the Gromov-Wasserstein distance into the loss function to better align the predicted and ground-truth distributions. Building on this, we derive a Bayes Risk formulation to filter high-risk information and improve detection reliability. We also propose a simple algorithm to quantify localization uncertainty via confidence intervals. Experiments on the COCO benchmark show that our method can be effectively integrated into existing DETR variants, enhancing their performance. We further extend our framework to leukocyte detection tasks, achieving state-of-the-art results on the LISC and WBCDD datasets. These results confirm the scalability of our framework across both general and domain-specific detection tasks. Code page: this https URL.
zh

[CV-97] Grounding Degradations in Natural Language for All-In-One Video Restoration

【速读】:该论文旨在解决视频修复(video restoration)中多退化类型(multi-degradation)场景下的统一建模与可解释性难题,即如何在不依赖训练或推理阶段退化知识的前提下,实现对视频帧退化语义上下文的自然语言引导修复。其解决方案的关键在于:利用基础模型(foundation models)将视频帧的退化感知语义上下文以自然语言形式进行锚定(grounding),从而提供可解释且灵活的指导;同时,通过学习一种退化知识的近似表示,使得基础模型在推理时可安全解耦(disentangled),无需额外计算开销。此方法突破了传统视频修复依赖先验退化信息的限制,实现了端到端、无监督退化感知的通用修复框架。

链接: https://arxiv.org/abs/2507.14851
作者: Muhammad Kamran Janjua,Amirhosein Ghasemabadi,Kunlin Zhang,Mohammad Salameh,Chao Gao,Di Niu
机构: Huawei Technologies, Canada (华为技术加拿大公司); ECE Department, University of Alberta, Canada (阿尔伯塔大学电子与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 17 pages

点击查看摘要

Abstract:In this work, we propose an all-in-one video restoration framework that grounds degradation-aware semantic context of video frames in natural language via foundation models, offering interpretable and flexible guidance. Unlike prior art, our method assumes no degradation knowledge in train or test time and learns an approximation to the grounded knowledge such that the foundation model can be safely disentangled during inference adding no extra cost. Further, we call for standardization of benchmarks in all-in-one video restoration, and propose two benchmarks in multi-degradation setting, three-task (3D) and four-task (4D), and two time-varying composite degradation benchmarks; one of the latter being our proposed dataset with varying snow intensity, simulating how weather degradations affect videos naturally. We compare our method with prior works and report state-of-the-art performance on all benchmarks.
zh

[CV-98] raining Self-Supervised Depth Completion Using Sparse Measurements and a Single Image

【速读】:该论文旨在解决从稀疏深度测量中恢复密集深度图的难题,尤其针对现有监督学习方法依赖昂贵的密集深度标注以及自监督方法受限于多帧图像序列(如静态场景或单帧输入无法适用)的问题。其解决方案的关键在于提出一种全新的自监督深度补全范式,仅需稀疏深度数据及其对应图像即可训练,无需密集标签或相邻视角图像;通过设计基于深度分布特性的新型损失函数,实现观测点到未观测区域的有效深度信息传播,并引入视觉基础模型生成的分割图以进一步提升深度估计精度。

链接: https://arxiv.org/abs/2507.14845
作者: Rizhao Fan,Zhigen Li,Heping Li,Ning An
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Depth completion is an important vision task, and many efforts have been made to enhance the quality of depth maps from sparse depth measurements. Despite significant advances, training these models to recover dense depth from sparse measurements remains a challenging problem. Supervised learning methods rely on dense depth labels to predict unobserved regions, while self-supervised approaches require image sequences to enforce geometric constraints and photometric consistency between frames. However, acquiring dense annotations is costly, and multi-frame dependencies limit the applicability of self-supervised methods in static or single-frame scenarios. To address these challenges, we propose a novel self-supervised depth completion paradigm that requires only sparse depth measurements and their corresponding image for training. Unlike existing methods, our approach eliminates the need for dense depth labels or additional images captured from neighboring viewpoints. By leveraging the characteristics of depth distribution, we design novel loss functions that effectively propagate depth information from observed points to unobserved regions. Additionally, we incorporate segmentation maps generated by vision foundation models to further enhance depth estimation. Extensive experiments demonstrate the effectiveness of our proposed method.
zh

[CV-99] owards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization

【速读】:该论文旨在解决从单张RGB图像生成高质量3D场景的问题,尤其针对多物体场景中对象生成质量与场景整体一致性难以兼顾的挑战。其解决方案的关键在于提出一个三阶段框架:首先通过图像实例分割与修复(instance segmentation and inpainting)恢复遮挡物体的缺失细节,实现前景3D资产的完整生成;其次利用伪立体视点构建进行相机参数估计和场景深度推断,并结合模型选择策略确保生成3D资产与输入图像的最佳对齐;最后通过模型参数化及点云间Chamfer距离最小化优化布局参数,从而获得与输入图像精确对齐且具有显式几何表示的3D场景。

链接: https://arxiv.org/abs/2507.14841
作者: Xiang Tang,Ruotong Li,Xiaopeng Fan
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室); Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology, Suzhou Research Institute (哈尔滨工业大学苏州研究院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, Project page: this https URL

点击查看摘要

Abstract:In recent years, 3D generation has made great strides in both academia and industry. However, generating 3D scenes from a single RGB image remains a significant challenge, as current approaches often struggle to ensure both object generation quality and scene coherence in multi-object scenarios. To overcome these limitations, we propose a novel three-stage framework for 3D scene generation with explicit geometric representations and high-quality textural details via single image-guided model generation and spatial layout optimization. Our method begins with an image instance segmentation and inpainting phase, which recovers missing details of occluded objects in the input images, thereby achieving complete generation of foreground 3D assets. Subsequently, our approach captures the spatial geometry of reference image by constructing pseudo-stereo viewpoint for camera parameter estimation and scene depth inference, while employing a model selection strategy to ensure optimal alignment between the 3D assets generated in the previous step and the input. Finally, through model parameterization and minimization of the Chamfer distance between point clouds in 3D and 2D space, our approach optimizes layout parameters to produce an explicit 3D scene representation that maintains precise alignment with input guidance image. Extensive experiments on multi-object scene image sets have demonstrated that our approach not only outperforms state-of-the-art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.
zh

[CV-100] Paired Image Generation with Diffusion-Guided Diffusion Models

【速读】:该论文旨在解决数字乳腺断层扫描(Digital Breast Tomosynthesis, DBT)图像中肿块病灶分割任务因高密度乳腺组织导致病灶隐蔽性强、人工标注困难且数据稀缺的问题。现有扩散模型在数据增强中的应用面临两大挑战:一是难以学习病灶区域特征,导致生成质量低;二是仅能生成图像而无法同步生成对应标注,限制了其在监督训练中的实用性。解决方案的关键在于提出一种无需外部条件的成对图像生成方法,通过为条件扩散模型引入额外的扩散引导器(diffusion guider),实现DBT切片与对应肿块掩膜(mass lesion mask)的联合生成,从而提升生成质量并缓解标注数据短缺问题,有效支持下游分割任务的监督训练。

链接: https://arxiv.org/abs/2507.14833
作者: Haoxuan Zhang,Wenju Cui,Yuzhu Cao,Tao Tan,Jie Liu,Yunsong Peng,Jian Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The segmentation of mass lesions in digital breast tomosynthesis (DBT) images is very significant for the early screening of breast cancer. However, the high-density breast tissue often leads to high concealment of the mass lesions, which makes manual annotation difficult and time-consuming. As a result, there is a lack of annotated data for model training. Diffusion models are commonly used for data augmentation, but the existing methods face two challenges. First, due to the high concealment of lesions, it is difficult for the model to learn the features of the lesion area. This leads to the low generation quality of the lesion areas, thus limiting the quality of the generated images. Second, existing methods can only generate images and cannot generate corresponding annotations, which restricts the usability of the generated images in supervised training. In this work, we propose a paired image generation method. The method does not require external conditions and can achieve the generation of paired images by training an extra diffusion guider for the conditional diffusion model. During the experimental phase, we generated paired DBT slices and mass lesion masks. Then, we incorporated them into the supervised training process of the mass lesion segmentation task. The experimental results show that our method can improve the generation quality without external conditions. Moreover, it contributes to alleviating the shortage of annotated data, thus enhancing the performance of downstream tasks.
zh

[CV-101] PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing ICCV2025

【速读】:该论文旨在解决现有图像去雾模型在处理未见过的真实世界雾霾图像时性能显著下降的问题,其根源在于训练数据的局限性导致模型泛化能力不足。解决方案的关键在于提出物理引导的雾霾迁移网络(Physics-guided Haze Transfer Network, PHATNet),该方法通过将目标域的雾霾模式迁移到源域无雾图像上,生成特定于目标域的微调数据集,从而实现有效的领域自适应。PHATNet进一步引入雾霾迁移一致性损失(Haze-Transfer-Consistency loss)和内容泄露损失(Content-Leakage Loss),以增强模型对雾霾模式与清晰内容的解耦能力,提升去雾效果。

链接: https://arxiv.org/abs/2507.14826
作者: Fu-Jen Tsai,Yan-Tsung Peng,Yen-Yu Lin,Chia-Wen Lin
机构: National Tsing Hua University (国立清华大学); MediaTek (联发科技); National Chengchi University (国立政治大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models’ performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage Loss to enhance PHATNet’s disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets.
zh

[CV-102] FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models

【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, LVLMs)在金融图表理解任务中表现不足的问题,尤其是面对具有复杂时间结构和领域特定术语的现实金融图表时,现有模型能力存在明显局限。解决方案的关键在于构建首个专注于真实金融图表的基准测试集FinChart-Bench,该数据集包含2015至2024年间收集的1,200张金融图表图像,并配有7,016个True/False、Multiple Choice和Question Answering类型的标注问题,从而为评估LVLM在金融场景下的理解能力提供系统性、标准化的测试平台。通过在此基准上对25个前沿LVLM进行综合评测,研究揭示了当前模型在指令遵循、空间推理以及可靠性等方面的显著短板,为后续改进提供了明确方向。

链接: https://arxiv.org/abs/2507.14823
作者: Dong Shu,Haoyang Yuan,Yuchen Wang,Yanguang Liu,Huopu Zhang,Haiyan Zhao,Mengnan Du
机构: Northwestern University (西北大学); NewsBreak; New Jersey Institute of Technology (新泽西理工学院); Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 Pages, 18 Figures

点击查看摘要

Abstract:Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 2015 to 2024, each annotated with True/False (TF), Multiple Choice (MC), and Question Answering (QA) questions, totaling 7,016 questions. We conduct a comprehensive evaluation of 25 state-of-the-art LVLMs on FinChart-Bench. Our evaluation reveals critical insights: (1) the performance gap between open-source and closed-source models is narrowing, (2) performance degradation occurs in upgraded models within families, (3) many models struggle with instruction following, (4) both advanced models show significant limitations in spatial reasoning abilities, and (5) current LVLMs are not reliable enough to serve as automated evaluators. These findings highlight important limitations in current LVLM capabilities for financial chart understanding. The FinChart-Bench dataset is available at this https URL.
zh

[CV-103] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)在资源受限或延迟敏感环境中部署时面临的计算密集问题,特别是现有后训练量化(Post-Training Quantization, PTQ)方法依赖架构特定启发式策略、泛化能力差且难以集成到工业部署流程的局限性。解决方案的关键在于提出一种统一的量化框架 SegQuant,其核心创新包括:1)基于图结构的分段感知量化策略(SegLinear),可捕捉模型结构语义与空间异质性;2)双尺度量化方案(DualScale),有效保留极性非对称激活特征,从而保障生成图像的视觉保真度。该方法不仅适用于 Transformer-based 扩散模型,还具备跨模型通用性,并兼容主流部署工具链。

链接: https://arxiv.org/abs/2507.14811
作者: Jiaji Zhang,Ruichao Sun,Hailiang Zhao,Jiaju Wu,Peng Chen,Hao Li,Xinkui Zhao,Kingsum Chow,Gang Xiong,Lin Ye,Shuiguang Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
zh

[CV-104] Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV2026

【速读】:该论文旨在解决机器人动作预测中传统视频预测模型计算成本高、推理延迟大以及对多帧输入依赖性强的问题,从而提升机器人在复杂环境中的实时决策能力。其解决方案的关键在于首次将InstructPix2Pix模型(一种原本用于静态图像编辑的生成式AI模型)改造为可接受视觉与文本双模态输入的框架,仅需当前单帧图像和一段文本指令即可预测未来100帧(10秒)内的视觉序列,显著降低了计算资源消耗并提升了推理效率,同时保持了优于现有方法的结构相似性指数(SSIM)和峰值信噪比(PSNR),适用于对运动轨迹精度要求较高的机器人控制与体育动作分析等场景。

链接: https://arxiv.org/abs/2507.14809
作者: Zesen Zhong,Duomin Zhang,Yijia Li
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
备注: 9 pages including appendix, 5 tables, 8 figures, to be submitted to WACV 2026

点击查看摘要

Abstract:Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
zh

[CV-105] Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

【速读】:该论文旨在解决多人脸深度伪造(deepfake)视频检测难题,尤其是在自然社交场景中,现有方法因缺乏对关键上下文线索的感知而性能下降。解决方案的关键在于借鉴人类认知机制,通过系统的人类实验识别出四种核心判别线索:场景-运动一致性(scene-motion coherence)、人脸间外观兼容性(inter-face appearance compatibility)、人际注视对齐(interpersonal gaze alignment)以及人脸-身体一致性(face-body consistency)。基于这些人类启发式线索,作者提出HICOM框架,在基准数据集上实现平均准确率提升3.3%,在真实扰动下提升2.8%,并在未见数据集上超越现有方法5.8%,同时借助大语言模型(LLM)增强可解释性,使检测结果更具透明性和说服力。

链接: https://arxiv.org/abs/2507.14807
作者: Juan Hu,Shaojing Fan,Terence Sim
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios, due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce \textsfHICOM, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that \textsfHICOM improves average accuracy by 3.3% in in-dataset detection and 2.8% under real-world perturbations. Moreover, it outperforms existing methods by 5.8% on unseen datasets, demonstrating the generalization of human-inspired cues. \textsfHICOM further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on involving human factors to enhance defense against deepfakes.
zh

[CV-106] Exploring Scalable Unified Modeling for General Low-Level Vision

【速读】:该论文旨在解决低层次视觉任务(如图像恢复、增强、风格化和特征提取)在任务定义和输出域上差异显著所带来的统一建模难题。解决方案的关键在于提出一种基于视觉提示的图像处理框架(Visual task Prompt-based Image Processing, VPIP),该框架利用输入-目标图像对作为视觉提示,通过 prompt 编码器与提示交互模块,灵活整合多种模型架构并有效利用任务特定的视觉表示,从而实现跨任务的统一建模。这一设计使得所构建的通用低层次视觉模型 GenLV 在多任务基准上展现出优异性能,并通过扩展模型容量和任务多样性验证了其可扩展性与泛化能力。

链接: https://arxiv.org/abs/2507.14801
作者: Xiangyu Chen,Kaiwen Zhu,Yuandong Pu,Shuo Cao,Xiaohui Li,Wenlong Zhang,Yihao Liu,Yu Qiao,Jiantao Zhou,Chao Dong
机构: State Key Laboratory of Internet of Things for Smart City, University of Macau (澳门大学物联网智能城市重点实验室); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction, which differ significantly in both task formulation and output domains. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing (VPIP) framework that leverages input-target image pairs as visual prompts to guide the model in performing a variety of low-level vision tasks. The framework comprises an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module, enabling flexible integration with various architectures and effective utilization of task-specific visual representations. Based on this design, we develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks. To explore the scalability of this approach, we extend the framework along two dimensions: model capacity and task diversity. We construct a large-scale benchmark consisting of over 100 low-level vision tasks and train multiple versions of the model with varying scales. Experimental results show that the proposed method achieves considerable performance across a wide range of tasks. Notably, increasing the number of training tasks enhances generalization, particularly for tasks with limited data, indicating the model’s ability to learn transferable representations through joint training. Further evaluations in zero-shot generalization, few-shot transfer, and task-specific fine-tuning scenarios demonstrate the model’s strong adaptability, confirming the effectiveness, scalability, and potential of the proposed framework as a unified foundation for general low-level vision modeling.
zh

[CV-107] An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks

【速读】:该论文旨在解决传统摄影测量方法在处理极稀疏图像集(如航空影像块)时面临的挑战,尤其是在低重叠度、纹理缺失和立体遮挡等复杂场景下的3D重建性能瓶颈。其解决方案的关键在于评估基于Transformer架构的前沿预训练模型(DUSt3R、MASt3R和VGGT)在航空影像上的表现,发现这些模型能够利用极少量图像(少于10张)实现高精度密集点云重建,且在完整性方面相比COLMAP提升高达50%;其中VGGT还展现出更高的计算效率与相机位姿估计可靠性。尽管如此,这些方法在高分辨率图像和大规模数据集上仍存在位姿不确定性增加的问题,表明它们尚不能完全替代传统的结构光束法平差(SfM)与多视图立体匹配(MVS),但可作为补充工具,在稀疏、低分辨率或复杂环境下发挥优势。

链接: https://arxiv.org/abs/2507.14798
作者: Xinyi Wu,Steven Landgraf,Markus Ulrich,Rongjun Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 6 figures, this manuscript has been submitted to Geo-spatial Information Science for consideration

点击查看摘要

Abstract:State-of-the-art 3D computer vision algorithms continue to advance in handling sparse, unordered image sets. Recently developed foundational models for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry Grounded Transformer (VGGT), have attracted attention due to their ability to handle very sparse image overlaps. Evaluating DUSt3R/MASt3R/VGGT on typical aerial images matters, as these models may handle extremely low image overlaps, stereo occlusions, and textureless regions. For redundant collections, they can accelerate 3D reconstruction by using extremely sparsified image sets. Despite tests on various computer vision benchmarks, their potential on photogrammetric aerial blocks remains unexplored. This paper conducts a comprehensive evaluation of the pre-trained DUSt3R/MASt3R/VGGT models on the aerial blocks of the UseGeo dataset for pose estimation and dense 3D reconstruction. Results show these methods can accurately reconstruct dense point clouds from very sparse image sets (fewer than 10 images, up to 518 pixels resolution), with completeness gains up to +50% over COLMAP. VGGT also demonstrates higher computational efficiency, scalability, and more reliable camera pose estimation. However, all exhibit limitations with high-resolution images and large sets, as pose reliability declines with more images and geometric complexity. These findings suggest transformer-based methods cannot fully replace traditional SfM and MVS, but offer promise as complementary approaches, especially in challenging, low-resolution, and sparse scenarios.
zh

[CV-108] Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models ICCV2025

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在生成高质量图像时面临采样延迟高(high sampling latency)的问题,尤其针对现有基于求解器的加速方法在低延迟预算下常导致图像质量下降的瓶颈。其核心解决方案是提出一种名为Ensemble Parallel Direction(EPD)的新型常微分方程(ODE)求解器,通过在每个ODE步长中引入多个并行梯度评估来降低截断误差(truncation errors),同时利用这些梯度计算的独立性实现完全并行化,从而在保持低延迟的同时提升生成质量。此外,EPD采用蒸馏方式优化少量可学习参数,训练开销极低,并具备作为插件提升现有ODE采样器性能的能力。

链接: https://arxiv.org/abs/2507.14797
作者: Beier Zhu,Ruoyu Wang,Tong Zhao,Hanwang Zhang,Chi Zhang
机构: Nanyang Technological University (南洋理工大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in ICCV 2025

点击查看摘要

Abstract:Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as \ours), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of our \ours~in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Codes are available in this https URL. Comments: To appear in ICCV 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.14797 [cs.CV] (or arXiv:2507.14797v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2507.14797 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-109] Flow Equivariant Recurrent Neural Networks

【速读】:该论文试图解决的问题是:如何在序列模型(如循环神经网络,RNN)中引入对时间参数化对称性的等变性(equivariance),从而提升模型在处理动态环境中的泛化能力和样本效率。传统等变网络理论仅适用于静态变换和前馈网络,无法直接应用于随时间演化的数据流(如视觉运动)。解决方案的关键在于将等变性理论扩展至“流”(flows)——即由单参数李子群(one-parameter Lie subgroups)刻画的时间连续变换,通过设计能够保持隐藏状态几何结构变化的RNN架构,实现对时间演化对称性的建模。实验表明,这类流等变模型在训练速度、长度外推(length generalization)和速度外推(velocity generalization)方面显著优于非等变基线模型。

链接: https://arxiv.org/abs/2507.14793
作者: T. Anderson Keller
机构: The Kempner Instutite for the Study of Natural and Artificial Intelligence (Kempner研究所); Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of `flows’ – one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.
zh

[CV-110] A Novel Downsampling Strategy Based on Information Complementarity for Medical Image Segmentation

【速读】:该论文旨在解决传统下采样方法(如最大池化和跨行卷积)在语义分割任务中因丢失关键空间信息而导致像素级预测性能下降的问题。解决方案的关键在于提出一种基于信息互补的混合池化下采样方法(Hybrid Pooling Downsampling, HPD),其核心是用MinMaxPooling替代传统下采样操作,通过提取局部区域的最大值信息来有效保留图像的明暗对比度与细节特征,从而在保持计算效率的同时提升分割精度。实验表明,HPD在ACDC和Synapse数据集上平均提升了0.5%的Dice相似系数(DSC)。

链接: https://arxiv.org/abs/2507.14790
作者: Wenbo Yue,Chang Li,Guoping Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:In convolutional neural networks (CNNs), downsampling operations are crucial to model performance. Although traditional downsampling methods (such as maximum pooling and cross-row convolution) perform well in feature aggregation, receptive field expansion, and computational reduction, they may lead to the loss of key spatial information in semantic segmentation tasks, thereby affecting the pixel-by-pixel prediction this http URL this end, this study proposes a downsampling method based on information complementarity - Hybrid Pooling Downsampling (HPD). The core is to replace the traditional method with MinMaxPooling, and effectively retain the light and dark contrast and detail features of the image by extracting the maximum value information of the local this http URL on various CNN architectures on the ACDC and Synapse datasets show that HPD outperforms traditional methods in segmentation performance, and increases the DSC coefficient by 0.5% on average. The results show that the HPD module provides an efficient solution for semantic segmentation tasks.
zh

[CV-111] FOCUS: Fused Observation of Channels for Unveiling Spectra

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在高光谱成像(Hyperspectral Imaging, HSI)场景下可解释性不足的问题,具体包括两个关键挑战:一是现有显著性方法难以捕捉有意义的光谱线索,常导致注意力机制过度集中在类别标记(class token)上;二是全谱段ViT因高维数据特性计算开销大,难以实现高效解释。解决方案的核心在于提出FOCUS框架,其包含两个创新设计:一是引入类特定的光谱提示(class-specific spectral prompts),引导注意力聚焦于语义上有意义的波长组;二是设计一个可学习的[SINK]标记,并通过吸引力损失(attraction loss)训练其吸收噪声或冗余注意力,从而缓解注意力坍缩问题。该方法可在单次前向传播中生成稳定的三维显著性图和光谱重要性曲线,无需梯度反向传播或修改骨干网络,且参数增量低于1%,显著提升了高分辨率ViT在真实HSI应用中的可解释性与实用性。

链接: https://arxiv.org/abs/2507.14787
作者: Xi Xiao,Aristeidis Tsaris,Anika Tabassum,John Lagergren,Larry M. York,Tianyang Wang,Xiao Wang
机构: Oak Ridge National Laboratory (橡树岭国家实验室); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.
zh

[CV-112] LeAdQA: LLM -Driven Context-Aware Temporal Grounding for Video Question Answering

【速读】:该论文旨在解决视频问答(Video Question Answering, VideoQA)任务中因关键时刻识别不精准和因果关系推理不足而导致的语义理解偏差问题。当前方法普遍存在两种缺陷:一是无任务感知的采样策略对所有帧进行处理,导致关键事件被冗余内容淹没;二是启发式检索仅捕捉表面模式,无法建模复杂推理所需的时序因果结构。解决方案的关键在于提出LeAdQA框架,其核心创新是将因果感知的查询优化与细粒度视觉定位相结合:首先利用大语言模型(LLM)重构问题-选项对以消除因果歧义并聚焦时间范围,随后通过时序定位模型精确定位最显著片段,并辅以自适应融合机制动态整合证据,最终由多模态大语言模型(MLLM)生成上下文一致的答案。此方法在NExT-QA、IntentQA和NExT-GQA数据集上实现了最先进的复杂推理性能,同时保持计算效率。

链接: https://arxiv.org/abs/2507.14784
作者: Xinxin Dong,Baoyun Peng,Haokai Ma,Yufei Wang,Zixuan Dong,Fei Hu,Xiaodong Wang
机构: National University of Defense Technology (国防科技大学); Academy of Military Sciences (军事科学院); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately processes all frames, overwhelming key events with irrelevant content; and (2) heuristic retrieval captures superficial patterns but misses causal-temporal structures needed for complex reasoning. To address these challenges, we introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Our method first leverages LLMs to reformulate question-option pairs, resolving causal ambiguities and sharpening temporal focus. These refined queries subsequently direct a temporal grounding model to precisely retrieve the most salient segments, complemented by an adaptive fusion mechanism dynamically integrating the evidence to maximize relevance. The integrated visual-textual cues are then processed by an MLLM to generate accurate, contextually-grounded answers. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method’s precise visual grounding substantially enhances the understanding of video-question relationships, achieving state-of-the-art (SOTA) performance on complex reasoning tasks while maintaining computational efficiency.
zh

[CV-113] CXR-TFT: Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories MICCAI2025

【速读】:该论文旨在解决重症监护病房(ICU)中胸部X线片(CXR)因采集不规律而难以用于动态监测的问题,现有工具多基于横断面分析,无法捕捉时间演变特征。解决方案的关键在于提出一种名为CXR-TFT的多模态框架,通过将稀疏的CXRs与高频临床数据(如生命体征、实验室指标和呼吸流表)进行时序对齐,并利用视觉编码器提取的潜在嵌入(latent embeddings)结合插值方法与Transformer模型,实现对未来CXR异常发现的预测,从而在影像学表现出现前12小时提供预警,显著提升对急性呼吸窘迫综合征等时间敏感疾病的早期干预能力。

链接: https://arxiv.org/abs/2507.14766
作者: Mehak Arora,Ayman Ali,Kaiyuan Wu,Carolyn Davis,Takashi Shimazui,Mahmoud Alwakeel,Victor Moas,Philip Yang,Annette Esper,Rishikesan Kamaleswaran
机构: Duke University (杜克大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: In Review for MICCAI 2025

点击查看摘要

Abstract:In intensive care units (ICUs), patients with complex clinical conditions require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a vital diagnostic tool, providing insights into clinical trajectories, but their irregular acquisition limits their utility. Existing tools for CXR interpretation are constrained by cross-sectional analysis, failing to capture temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal framework that integrates temporally sparse CXR imaging and radiology reports with high-frequency clinical data, such as vital signs, laboratory values, and respiratory flow sheets, to predict the trajectory of CXR findings in critically ill patients. CXR-TFT leverages latent embeddings from a vision encoder that are temporally aligned with hourly clinical data through interpolation. A transformer model is then trained to predict CXR embeddings at each hour, conditioned on previous embeddings and clinical measurements. In a retrospective study of 20,000 ICU patients, CXR-TFT demonstrated high accuracy in forecasting abnormal CXR findings up to 12 hours before they became radiographically evident. This predictive capability in clinical data holds significant potential for enhancing the management of time-sensitive conditions like acute respiratory distress syndrome, where early intervention is crucial and diagnoses are often delayed. By providing distinctive temporal resolution in prognostic CXR analysis, CXR-TFT offers actionable ‘whole patient’ insights that can directly improve clinical outcomes.
zh

[CV-114] InterAct-Video: Reasoning -Rich Video QA for Urban Traffic

【速读】:该论文旨在解决现有视频问答(VideoQA)模型在复杂真实交通场景中难以有效推理细粒度时空依赖关系的问题,尤其是在多事件并发、空间动态变化的交通环境中。其解决方案的关键在于构建了一个名为InterAct VideoQA的高质量基准数据集,该数据集包含8小时来自多样化交叉路口的真实交通视频片段(每段10秒),并配有超过25,000个涵盖时空动态、车辆交互与事故检测等关键交通属性的问答对。通过在该数据集上评估和微调先进VideoQA模型,研究证实了领域特定数据对于提升模型在智能交通系统(ITS)中实际部署能力的重要性。

链接: https://arxiv.org/abs/2507.14743
作者: Joseph Raj Vishal,Rutuja Patil,Manas Srinivas Gowda,Katha Naik,Yezhou Yang,Bharatesh Chakravarthi
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces \textbfInterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: this https URL
zh

[CV-115] MultiRetNet: A Multimodal Vision Model and Deferral System for Staging Diabetic Retinopathy

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)在低收入人群中因筛查可及性差而导致的晚期诊断问题,以及现有模型在复杂临床场景下(如合并症和图像质量差异)准确性不足的局限。其解决方案的关键在于提出MultiRetNet这一多模态融合管道,整合眼底成像、社会经济因素和共病谱信息,并通过全连接层实现最优的多模态特征融合;同时引入对比学习训练的临床回避系统(clinical deferral system),识别分布外样本(out-of-distribution samples)并引导人工复核,从而在保持对低质量图像诊断准确性的同时提升早期检测能力,尤其针对医疗资源匮乏人群,助力实现更公平的疾病筛查与管理。

链接: https://arxiv.org/abs/2507.14738
作者: Jeannie She,Katie Spivakovsky
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetic retinopathy (DR) is a leading cause of preventable blindness, affecting over 100 million people worldwide. In the United States, individuals from lower-income communities face a higher risk of progressing to advanced stages before diagnosis, largely due to limited access to screening. Comorbid conditions further accelerate disease progression. We propose MultiRetNet, a novel pipeline combining retinal imaging, socioeconomic factors, and comorbidity profiles to improve DR staging accuracy, integrated with a clinical deferral system for a clinical human-in-the-loop implementation. We experiment with three multimodal fusion methods and identify fusion through a fully connected layer as the most versatile methodology. We synthesize adversarial, low-quality images and use contrastive learning to train the deferral system, guiding the model to identify out-of-distribution samples that warrant clinician review. By maintaining diagnostic accuracy on suboptimal images and integrating critical health data, our system can improve early detection, particularly in underserved populations where advanced DR is often first identified. This approach may reduce healthcare costs, increase early detection rates, and address disparities in access to care, promoting healthcare equity.
zh

[CV-116] GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset NEURIPS2025

【速读】:该论文旨在解决现有农业地块提取研究中对复杂梯田地形覆盖不足的问题,尤其是在高分辨率遥感图像下,梯田地块的边界识别、对象不规则性及多域风格差异带来的挑战。解决方案的关键在于提出了首个全球范围的细粒度梯田地块与边界数据集GTPBD(Global Terraced Parcel and Boundary Dataset),其包含超过20万块人工标注的复杂梯田地块,以及47,537张高分辨率图像和三级标签(像素级边界标签、掩膜标签与地块标签),并覆盖中国七大地理区及跨大陆气候带。该数据集为语义分割、边缘检测、梯田地块提取和无监督域自适应(Unsupervised Domain Adaptation, UDA)等任务提供了基准测试平台,填补了梯田遥感研究中的关键空白,推动了细粒度农业地形分析与跨场景知识迁移的发展。

链接: https://arxiv.org/abs/2507.14697
作者: Zhiwei Zhang,Zi Ye,Yibin Wen,Shuai Yuan,Haohuan Fu,Jianxi Huang,Juepeng Zheng
机构: Sun Yat-Sen University (中山大学); The University of Hong Kong (香港大学); Tsinghua University (清华大学); National Supercomputing Center in Shenzhen (深圳国家超级计算机中心); Southwest Jiaotong University (西南交通大学); China Agricultural University (中国农业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 18 figures, submitted to NeurIPS 2025

点击查看摘要

Abstract:Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision this http URL this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the this http URL to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) this http URL, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.
zh

[CV-117] Uncertainty-aware Probabilistic 3D Human Motion Forecasting via Invertible Networks

【速读】:该论文旨在解决3D人体运动预测中不确定性量化(uncertainty quantification)的难题,尤其在人机协作等安全关键场景下,现有方法因隐式概率表示难以建模不确定性而表现不足。其解决方案的关键在于提出ProbHMI框架,通过引入可逆网络(invertible networks)将姿态映射到解耦的潜在空间(disentangled latent space),从而显式地建模概率动力学,并由预测模块直接输出未来潜在分布,实现有效的不确定性估计与校准,显著提升风险感知决策能力。

链接: https://arxiv.org/abs/2507.14694
作者: Yue Ma,Kanglei Zhou,Fuyang Yu,Frederick W. B. Li,Xiaohui Liang
机构: Beihang University (北京航空航天大学); Durham University (杜伦大学); Zhongguancun Laboratory (中关村实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D human motion forecasting aims to enable autonomous applications. Estimating uncertainty for each prediction (i.e., confidence based on probability density or quantile) is essential for safety-critical contexts like human-robot collaboration to minimize risks. However, existing diverse motion forecasting approaches struggle with uncertainty quantification due to implicit probabilistic representations hindering uncertainty modeling. We propose ProbHMI, which introduces invertible networks to parameterize poses in a disentangled latent space, enabling probabilistic dynamics modeling. A forecasting module then explicitly predicts future latent distributions, allowing effective uncertainty quantification. Evaluated on benchmarks, ProbHMI achieves strong performance for both deterministic and diverse prediction while validating uncertainty calibration, critical for risk-aware decision making.
zh

[CV-118] From Semantics Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂场景下地面情境识别(Grounded Situation Recognition, GSR)任务中表现不足,且难以部署于边缘设备的问题,同时克服传统GSR模型泛化能力弱、难以识别未见和罕见情境的局限。解决方案的关键在于提出一种新颖的多模态交互式提示蒸馏(Multimodal Interactive Prompt Distillation, MIPD)框架,通过从教师MLLM中蒸馏富含语义信息的多模态知识,增强学生模型在开放词汇下的情境识别能力(Open-vocabulary Grounded Situation Recognition, Ov-GSR)。其核心机制包括:利用基于语言模型的判断性推理生成器(Judgmental Rationales Generator, JRG)构建正负样本的视觉瞥见(glimpse)与凝视(gaze)推理文本;引入场景感知与实例感知提示,结合负向引导的多模态提示对齐模块(Negative-Guided Multimodal Prompting Alignment, NMPA),实现推理文本与视觉特征的有效对齐;最终将对齐后的多模态知识注入学生模型,显著提升其对未见情境的识别能力、减少罕见类别预测偏差,并改善整体泛化性能。

链接: https://arxiv.org/abs/2507.14686
作者: Chen Cai,Tianyi Liu,Jianjun Gao,Wenyang Liu,Kejun Wu,Ruoyu Wang,Yi Wang,Soo Chin Liew
机构: National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学); Huazhong University of Science and Technology(华中科技大学); The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
zh

[CV-119] WSI-Agents : A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis

【速读】:该论文旨在解决当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在全切片图像(Whole Slide Images, WSIs)分析中普遍存在任务泛化能力与特定任务准确率难以兼顾的问题,以及协作式多智能体系统在病理学领域应用潜力尚未充分挖掘的现状。其解决方案的关键在于提出 WSI-Agents——一个面向多模态 WSI 分析的协作式多智能体系统,通过三个核心组件实现:(1) 基于模型库的智能体任务分配模块,将不同任务指派给具备 Patch 和 WSI 层级理解能力的专家智能体;(2) 结合内部一致性检查与外部病理知识库及领域专用模型验证的准确性保障机制;(3) 生成包含可视化解释图的最终摘要模块,从而在提升单任务精度的同时增强多任务适应性。

链接: https://arxiv.org/abs/2507.14680
作者: Xinheng Lyu,Yuci Liang,Wenting Chen,Meidan Ding,Jiaqi Yang,Guolin Huang,Daokun Zhang,Xiangjian He,Linlin Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi-modal large language models (MLLMs) allow multi-task WSI analysis through natural language, they often underperform compared to task-specific models. Collaborative multi-agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology-specific domains. To address these issues, we propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis. WSI-Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task-specific accuracy and multi-task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch and WSI level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain-specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi-modal WSI benchmarks show WSI-Agents’s superiority to current WSI MLLMs and medical agent frameworks across diverse tasks.
zh

[CV-120] Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

【速读】:该论文旨在解决从组织病理图像(histopathology images)中准确预测基因表达(gene expression)的问题,其核心挑战在于现有方法未能充分挖掘图像与基因表达谱在多个表征层次上的跨模态对齐关系,从而限制了预测性能。解决方案的关键在于提出Gene-DML框架,通过双路径多层级判别机制结构化潜在空间:一是多尺度实例级判别路径,将局部、邻域和全局层次的组织病理学表征与基因表达进行对齐,捕捉尺度感知的形态-转录关联;二是跨层级实例-组判别路径,强制个体实例与跨模态组之间的一致性,增强模态间对齐。该设计联合建模细粒度与结构性判别信息,显著提升了跨模态表示的鲁棒性,从而在多个生物场景下实现更精准且泛化的基因表达预测。

链接: https://arxiv.org/abs/2507.14670
作者: Yaxuan Song,Jianan Fan,Hang Chang,Weidong Cai
机构: The University of Sydney (悉尼大学); Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 15 tables, 8 figures

点击查看摘要

Abstract:Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modelling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and checkpoints will be released soon.
zh

[CV-121] Artificial Intelligence in the Food Industry: Food Waste Estimation based on Computer Vision a Brief Case Study in a University Dining Hall

【速读】:该论文旨在解决机构餐饮场景中后消费食物浪费(post-consumer food waste)的量化难题,以支持数据驱动的可持续发展策略。其核心解决方案是提出了一种基于RGB图像语义分割(semantic segmentation)的低成本计算机视觉框架,通过对比餐前与餐后图像来估算每盘食物的浪费量。关键创新在于采用四类全监督模型(U-Net、U-Net++及其轻量版本),结合截断动态逆频损失(capped dynamic inverse-frequency loss)和AdamW优化器进行训练,并引入自定义的分布像素一致性指标(Distributional Pixel Agreement, DPA)以更精准评估像素级比例估计性能。实验表明,所有模型均表现良好,部分模型DPA超过90%,且轻量模型在NVIDIA T4 GPU上实现实时推理,为大规模食品服务环境中的自动化、无接触式食物浪费监测提供了可行路径。

链接: https://arxiv.org/abs/2507.14662
作者: Shayan Rokhva,Babak Teimourpour
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Questions Recommendations: shayanrokhva1999@gmail.com; shayan1999rokh@yahoo.com

点击查看摘要

Abstract:Quantifying post-consumer food waste in institutional dining settings is essential for supporting data-driven sustainability strategies. This study presents a cost-effective computer vision framework that estimates plate-level food waste by utilizing semantic segmentation of RGB images taken before and after meal consumption across five Iranian dishes. Four fully supervised models (U-Net, U-Net++, and their lightweight variants) were trained using a capped dynamic inverse-frequency loss and AdamW optimizer, then evaluated through a comprehensive set of metrics, including Pixel Accuracy, Dice, IoU, and a custom-defined Distributional Pixel Agreement (DPA) metric tailored to the task. All models achieved satisfying performance, and for each food type, at least one model approached or surpassed 90% DPA, demonstrating strong alignment in pixel-wise proportion estimates. Lighter models with reduced parameter counts offered faster inference, achieving real-time throughput on an NVIDIA T4 GPU. Further analysis showed superior segmentation performance for dry and more rigid components (e.g., rice and fries), while more complex, fragmented, or viscous dishes, such as stews, showed reduced performance, specifically post-consumption. Despite limitations such as reliance on 2D imaging, constrained food variety, and manual data collection, the proposed framework is pioneering and represents a scalable, contactless solution for continuous monitoring of food consumption. This research lays foundational groundwork for automated, real-time waste tracking systems in large-scale food service environments and offers actionable insights and outlines feasible future directions for dining hall management and policymakers aiming to reduce institutional food waste.
zh

[CV-122] AI-Powered Precision in Sport Taekwondo: Enhancing Fairness Speed and Trust in Competition (FST.ai)

【速读】:该论文旨在解决体育裁判决策中长期存在的延迟、主观性和执行不一致问题,尤其是在跆拳道(Taekwondo)比赛中对头部击打动作的实时识别与评分难题。传统人工判罚结合即时视频回放(Instant Video Replay, IVR)系统往往效率低下且难以保证公平性,影响运动员信任。解决方案的关键在于提出一个基于计算机视觉(computer vision)、深度学习(deep learning)和边缘推理(edge inference)的AI驱动框架,通过姿态估计(pose estimation)、运动分类(motion classification)和冲击分析(impact analysis)实现关键动作的自动化识别与分类,将判罚时间从数分钟缩短至秒级,同时显著提升判罚的一致性与透明度。该方法不仅适用于跆拳道,还可扩展至柔道、空手道、击剑乃至足球、篮球等需要精准动作检测或犯规识别的多种运动项目,展现出良好的泛化能力与可迁移性。

链接: https://arxiv.org/abs/2507.14657
作者: Keivan Shariatmadar,Ahmad Osman
机构: htw Saar University of Applied Science (htw萨尔应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 9 figures

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces this http URL, a novel AI-powered framework designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework – based on pose estimation, motion classification, and impact analysis – can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo’s most challenging scenarios – head kick scoring – we demonstrate the robustness, scalability, and sport-agnostic potential of this http URL to transform officiating standards across multiple disciplines.
zh

[CV-123] Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection

【速读】:该论文旨在解决多光谱目标检测中两个关键问题:一是局部互补特征偏好过强而忽视跨模态共享语义,导致泛化性能下降;二是感受野大小与计算复杂度之间的权衡限制了可扩展的特征建模。解决方案的核心是提出一种基于状态空间模型(State Space Model, SSM)的新型多光谱状态空间特征融合框架MS2Fusion,其采用双路径参数交互机制实现高效且有效的特征融合:第一条路径通过交叉参数交互继承跨注意力机制优势,利用SSM中的跨模态隐藏状态解码挖掘互补信息;第二条路径通过共享参数机制探索跨模态对齐,借助联合嵌入获得跨模态语义一致的特征与结构。两条路径在统一框架下联合优化,使MS2Fusion兼具功能互补性和共享语义空间特性,在FLIR、M3FD和LLVIP等多个主流基准上显著优于现有方法,且具备良好的通用性,可直接应用于RGB-T语义分割和RGB-T显著目标检测任务。

链接: https://arxiv.org/abs/2507.14643
作者: Jifeng Shen,Haibo Zhan,Shaohua Dong,Xin Zuo,Wankou Yang,Haibin Ling
机构: Jiangsu University (江苏大学); University of North Texas (北德克萨斯大学); Jiangsu University of Science and Technology (江苏科技大学); Southeast University (东南大学); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted on 30/4/2025, Under Major Revision

点击查看摘要

Abstract:Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity present critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at this https URL.
zh

[CV-124] BusterX: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

【速读】:该论文旨在解决当前合成媒体检测系统在面对跨模态伪造内容时的局限性问题,即现有单模态检测方法无法有效识别融合多种媒体格式(如图像与视频)的复杂伪造内容。其解决方案的关键在于提出BusterX++框架,该框架通过引入先进的强化学习(reinforcement learning, RL)后训练策略,实现冷启动问题的消除,并结合多阶段训练(Multi-stage Training)、思维奖励机制(Thinking Reward)和混合推理(Hybrid Reasoning),显著提升跨模态合成媒体检测的稳定性和准确性。

链接: https://arxiv.org/abs/2507.14632
作者: Haiquan Wen,Tianxiao Li,Zhenglin Huang,Yiwei He,Guangliang Cheng
机构: University of Liverpool, UK (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce \textbfBusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present \textbfGenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.
zh

[CV-125] Real-Time Scene Reconstruction using Light Field Probes

【速读】:该论文旨在解决大规模复杂场景(如城市尺度)的高保真图像重建与新视角合成问题,传统神经渲染方法在处理大场景时面临场景规模、保真度与渲染速度之间的权衡困境,而基于显式几何的方法则因维护大量三维数据导致计算成本随场景增长显著上升。其解决方案的关键在于提出一种无需显式依赖场景几何的新型视图合成方法:通过稀疏实拍图像重建多尺度隐式场景表示,并引入“探针数据结构”(probe data structure)来存储密集点的高精度深度信息,从而避免了对显式几何的直接建模;该策略使得渲染成本与场景复杂度无关,且探针数据的压缩和流式传输比显式几何更高效,为虚拟现实(VR)和增强现实(AR)应用提供了可行的神经表示方案。

链接: https://arxiv.org/abs/2507.14624
作者: Yaru Liu,Derek Nowrouzezahri,Morgan Mcguire
机构: University of Cambridge (剑桥大学); McGill University (麦吉尔大学); Roblox (罗布洛克斯)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing photo-realistic large-scale scenes from images, for example at city scale, is a long-standing problem in computer graphics. Neural rendering is an emerging technique that enables photo-realistic image synthesis from previously unobserved viewpoints; however, state-of-the-art neural rendering methods have difficulty efficiently rendering a high complex large-scale scene because these methods typically trade scene size, fidelity, and rendering speed for quality. The other stream of techniques utilizes scene geometries for reconstruction. But the cost of building and maintaining a large set of geometry data increases as scene size grows. Our work explores novel view synthesis methods that efficiently reconstruct complex scenes without explicit use of scene geometries. Specifically, given sparse images of the scene (captured from the real world), we reconstruct intermediate, multi-scale, implicit representations of scene geometries. In this way, our method avoids explicitly relying on scene geometry, significantly reducing the computational cost of maintaining large 3D data. Unlike current methods, we reconstruct the scene using a probe data structure. Probe data hold highly accurate depth information of dense data points, enabling the reconstruction of highly complex scenes. By reconstructing the scene using probe data, the rendering cost is independent of the complexity of the scene. As such, our approach combines geometry reconstruction and novel view synthesis. Moreover, when rendering large-scale scenes, compressing and streaming probe data is more efficient than using explicit scene geometry. Therefore, our neural representation approach can potentially be applied to virtual reality (VR) and augmented reality (AR) applications.
zh

[CV-126] Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2

【速读】:该论文旨在解决当前基于深度学习的医学图像分割方法在动态医学成像场景中适应性差、依赖模态特定设计以及在医疗视频任务中因数据稀缺导致模型微调困难的问题。其关键解决方案是提出DD-SAM2框架,该框架引入一种轻量级Depthwise-Dilated Adapter(DD-Adapter),以最小参数开销增强多尺度特征提取能力,从而实现对SAM2模型在有限医疗视频数据下的高效微调;同时充分利用SAM2的流式记忆机制,实现对医学视频目标的实时跟踪与分割,显著提升了模型在肿瘤分割(TrackRad2025)和左心室追踪(EchoNet-Dynamic)等任务上的性能表现。

链接: https://arxiv.org/abs/2507.14613
作者: Guoping Xu,Christopher Kabat,You Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 6 figures

点击查看摘要

Abstract:Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2’s streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at this https URL.
zh

[CV-127] Exp-Graph: How Connections Learn Facial Attributes in Graph-based Expression Recognition

【速读】:该论文旨在解决面部表情识别中因面部属性结构随表情变化而难以准确建模的问题,核心挑战在于如何有效融合面部结构信息以提升识别精度。解决方案的关键在于提出Exp-Graph框架,通过图结构建模面部属性间的空间关系:利用面部关键点作为节点(vertex),基于局部外观相似性和几何邻近性构建边(edge),并引入图卷积网络(Graph Convolutional Networks, GCN)捕获和整合这些结构依赖关系,从而增强面部属性的语义表征能力;同时结合视觉Transformer(Vision Transformer)提取局部与全局特征,实现对复杂表情变化的精准识别。

链接: https://arxiv.org/abs/2507.14608
作者: Nandani Sharma,Dinesh Singh
机构: Indian Institute of Technology Mandi (印度理工学院曼迪分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Facial expression recognition is crucial for human-computer interaction applications such as face animation, video surveillance, affective computing, medical analysis, etc. Since the structure of facial attributes varies with facial expressions, incorporating structural information into facial attributes is essential for facial expression recognition. In this paper, we propose Exp-Graph, a novel framework designed to represent the structural relationships among facial attributes using graph-based modeling for facial expression recognition. For facial attributes graph representation, facial landmarks are used as the graph’s vertices. At the same time, the edges are determined based on the proximity of the facial landmark and the similarity of the local appearance of the facial attributes encoded using the vision transformer. Additionally, graph convolutional networks are utilized to capture and integrate these structural dependencies into the encoding of facial attributes, thereby enhancing the accuracy of expression recognition. Thus, Exp-Graph learns from the facial attribute graphs highly expressive semantic representations. On the other hand, the vision transformer and graph convolutional blocks help the framework exploit the local and global dependencies among the facial attributes that are essential for the recognition of facial expressions. We conducted comprehensive evaluations of the proposed Exp-Graph model on three benchmark datasets: Oulu-CASIA, eNTERFACE05, and AFEW. The model achieved recognition accuracies of 98.09%, 79.01%, and 56.39%, respectively. These results indicate that Exp-Graph maintains strong generalization capabilities across both controlled laboratory settings and real-world, unconstrained environments, underscoring its effectiveness for practical facial expression recognition applications.
zh

[CV-128] owards a Proactive Autoscaling Framework for Data Stream Processing at the Edge using GRU and Transfer Learning

【速读】:该论文旨在解决边缘流处理(Edge Stream Processing)中的动态资源自动扩展问题,核心挑战在于应对负载波动导致的资源分配不足或过剩,从而影响服务质量(SLA)和资源利用率。解决方案的关键在于提出一个三步协同框架:首先,采用门控循环单元(GRU)神经网络基于真实与合成数据集预测上游负载;其次,通过时间规整(DTW)算法和联合分布适应技术实现迁移学习,缓解离线训练与在线运行之间的分布偏移(distribution shift)和概念漂移(concept drift)问题;最后,设计轻量级水平自动扩展模块,根据预测负载动态调整算子并行度,在满足边缘资源约束的前提下实现主动式资源调度。该方案在真实数据集上实现了最高1.3%的SMAPE误差,并优于CNN、ARIMA和Prophet等模型,同时训练耗时显著低于强化学习方法。

链接: https://arxiv.org/abs/2507.14597
作者: Eugene Armah,Linda Amoako Bannning
机构: Kwame Nkrumah University of Science and Technology (夸梅·恩克鲁玛科技大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Processing data at high speeds is becoming increasingly critical as digital economies generate enormous data. The current paradigms for timely data processing are edge computing and data stream processing (DSP). Edge computing places resources closer to where data is generated, while stream processing analyzes the unbounded high-speed data in motion. However, edge stream processing faces rapid workload fluctuations, complicating resource provisioning. Inadequate resource allocation leads to bottlenecks, whereas excess allocation results in wastage. Existing reactive methods, such as threshold-based policies and queuing theory scale only after performance degrades, potentially violating SLAs. Although reinforcement learning (RL) offers a proactive approach through agents that learn optimal runtime adaptation policies, it requires extensive simulation. Furthermore, predictive machine learning models face online distribution and concept drift that minimize their accuracy. We propose a three-step solution to the proactive edge stream processing autoscaling problem. Firstly, a GRU neural network forecasts the upstream load using real-world and synthetic DSP datasets. Secondly, a transfer learning framework integrates the predictive model into an online stream processing system using the DTW algorithm and joint distribution adaptation to handle the disparities between offline and online domains. Finally, a horizontal autoscaling module dynamically adjusts the degree of operator parallelism, based on predicted load while considering edge resource constraints. The lightweight GRU model for load predictions recorded up to 1.3% SMAPE value on a real-world data set. It outperformed CNN, ARIMA, and Prophet on the SMAPE and RMSE evaluation metrics, with lower training time than the computationally intensive RL models.
zh

[CV-129] DiSCO-3D : Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF ICCV’25

【速读】:该论文旨在解决3D语义分割中难以同时适应场景内容和用户查询的问题,即如何在不依赖大量标注数据的情况下实现开放词汇(open-vocabulary)的细粒度语义理解。传统方法通常仅针对特定任务目标(如开放词汇分割)或场景内容(如无监督分割)进行优化,无法兼顾两者。解决方案的关键在于提出DiSCO-3D,该方法基于神经场(Neural Fields)表示,将无监督分割与弱开放词汇引导相结合,从而实现对3D场景中子概念(sub-concepts)的自动发现与灵活响应,显著提升了在开放词汇和无监督分割边缘情况下的性能表现。

链接: https://arxiv.org/abs/2507.14596
作者: Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe
机构: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France (法国巴黎-萨克雷大学, 法国原子能和替代能源委员会, 列斯实验室, 91120, 巴黎-萨克雷, 法国); IRIT, Université de Toulouse, CNRS, France (图卢兹大学信息与技术研究所, 法国国家科学研究中心, 法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICCV’25

点击查看摘要

Abstract:3D semantic segmentation provides high-level scene understanding for applications in robotics, autonomous systems, \textitetc. Traditional methods adapt exclusively to either task-specific goals (open-vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO-3D, the first method addressing the broader problem of 3D Open-Vocabulary Sub-concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance. Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery and exhibits state-of-the-art results in the edge cases of both open-vocabulary and unsupervised segmentation.
zh

[CV-130] Performance comparison of medical image classification systems using TensorFlow Keras PyTorch and JAX

【速读】:该论文旨在解决深度学习框架在医学图像分类任务中性能差异不明确的问题,尤其是针对血细胞图像分析场景下不同框架的推理效率与分类准确率对比不足的现状。其解决方案的关键在于系统性地比较TensorFlow with Keras、PyTorch和JAX三个主流深度学习框架在BloodMNIST数据集上的表现,重点评估推理时间差异及不同图像尺寸下的分类性能,从而揭示框架特性(如特定优化机制)对医疗影像处理效率的影响,为临床应用中框架选择提供实证依据。

链接: https://arxiv.org/abs/2507.14587
作者: Merjem Bećirović,Amina Kurtović,Nordin Smajlović,Medina Kapo,Amila Akagić
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical imaging plays a vital role in early disease diagnosis and monitoring. Specifically, blood microscopy offers valuable insights into blood cell morphology and the detection of hematological disorders. In recent years, deep learning-based automated classification systems have demonstrated high potential in enhancing the accuracy and efficiency of blood image analysis. However, a detailed performance analysis of specific deep learning frameworks appears to be lacking. This paper compares the performance of three popular deep learning frameworks, TensorFlow with Keras, PyTorch, and JAX, in classifying blood cell images from the publicly available BloodMNIST dataset. The study primarily focuses on inference time differences, but also classification performance for different image sizes. The results reveal variations in performance across frameworks, influenced by factors such as image resolution and framework-specific optimizations. Classification accuracy for JAX and PyTorch was comparable to current benchmarks, showcasing the efficiency of these frameworks for medical image classification.
zh

[CV-131] Benchmarking GANs Diffusion Models and Flow Matching for T1w-to-T2w MRI Translation

【速读】:该论文旨在解决多模态磁共振成像(MRI)中因获取所有所需对比度(如T1加权图像T1w与T2加权图像T2w)导致的扫描时间延长和成本增加问题,提出通过计算方法实现跨模态图像合成以减少采集时间并保持诊断质量。其解决方案的关键在于采用图像到图像(Image-to-image, I2I)翻译框架,系统性地比较了生成对抗网络(Generative Adversarial Networks, GANs)、扩散模型(Diffusion Models)及流匹配(Flow Matching, FM)三类生成模型在T1w到T2w二维MRI图像转换任务中的性能表现。实验结果表明,基于GAN的Pix2Pix模型在结构保真度、图像质量和计算效率方面优于其他方法,为临床实际应用提供了可靠的技术路径,并揭示了流模型在小数据集上易过拟合的问题,指明了未来研究方向。

链接: https://arxiv.org/abs/2507.14575
作者: Andrea Moschetto,Lemuel Puglisi,Alec Sargood,Pierluigi Dell’Acqua,Francesco Guarnera,Sebastiano Battiato,Daniele Ravì
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) enables the acquisition of multiple image contrasts, such as T1-weighted (T1w) and T2-weighted (T2w) scans, each offering distinct diagnostic insights. However, acquiring all desired modalities increases scan time and cost, motivating research into computational methods for cross-modal synthesis. To address this, recent approaches aim to synthesize missing MRI contrasts from those already acquired, reducing acquisition time while preserving diagnostic quality. Image-to-image (I2I) translation provides a promising framework for this task. In this paper, we present a comprehensive benchmark of generative models \unicodex2013 specifically, Generative Adversarial Networks (GANs), diffusion models, and flow matching (FM) techniques \unicodex2013 for T1w-to-T2w 2D MRI I2I translation. All frameworks are implemented with comparable settings and evaluated on three publicly available MRI datasets of healthy adults. Our quantitative and qualitative analyses show that the GAN-based Pix2Pix model outperforms diffusion and FM-based methods in terms of structural fidelity, image quality, and computational efficiency. Consistent with existing literature, these results suggest that flow-based models are prone to overfitting on small datasets and simpler tasks, and may require more data to match or surpass GAN performance. These findings offer practical guidance for deploying I2I translation techniques in real-world MRI workflows and highlight promising directions for future research in cross-modal medical image synthesis. Code and models are publicly available at this https URL.
zh

[CV-132] he Origin of Self-Attention: From Pairwise Affinity Matrices to Transformers

【速读】:该论文旨在解决当前深度学习模型中自注意力机制(self-attention)缺乏理论统一性的问题,即如何从更广泛的计算范式出发理解其本质。解决方案的关键在于将自注意力视为一种特定的基于亲和矩阵(affinity matrix)的计算方法,其核心是通过多跳传播(multi-hop propagation)来建模特征间的关联关系。论文提出无限特征选择(Infinite Feature Selection, Inf-FS)作为基础框架,该框架允许亲和矩阵A通过领域知识或学习获得,并在图结构上进行特征重要性传播;而自注意力可被视作Inf-FS的一个特例——仅使用单跳亲和计算,且A由token间相似度动态构建。因此,该研究揭示了不同任务(如计算机视觉、自然语言处理和图学习)中共享的数学基础:基于成对关系的推理结构,从而为多种模型提供了统一的理论视角。

链接: https://arxiv.org/abs/2507.14560
作者: Giorgio Roffo
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 10 figures, submitted for review. Companion code and reproducibility materials available

点击查看摘要

Abstract:The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.
zh

[CV-133] LEAD: Exploring Logit Space Evolution for Model Selection CVPR2024

【速读】:该论文旨在解决预训练模型在下游任务中迁移性能预测的难题,核心挑战在于如何准确捕捉微调(fine-tuning)过程中的动态演化特性,以高效选择最适配的预训练模型。现有方法多基于特征空间的线性变换建模微调过程,难以反映优化目标的本质且忽略关键的非线性特性。解决方案的关键在于提出LEAD方法,其基于网络输出的logits构建一个与微调目标对齐的理论框架,并推导出描述logits状态非线性演化的常微分方程(ODE),同时引入类别感知分解策略以刻画不同类别的差异化演化动态,从而在单步计算中有效弥合优化差距,无需实际微调即可实现高精度迁移性能预测。

链接: https://arxiv.org/abs/2507.14559
作者: Zixuan Hu,Xiaotong Li,Shixiang Tang,Jun Liu,Yichun Hu,Ling-Yu Duan
机构: Peking University (北京大学); The Chinese University of Hong Kong (香港中文大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2024

点击查看摘要

Abstract:The remarkable success of pretrain-then-finetune paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting the model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations, which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity from optimization. To this end, we present LEAD, a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally, we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation, our method offers a concise solution to effectively bridge the optimization gap in a single step, bypassing the lengthy fine-tuning process. The comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performances and showcase its broad adaptability even in low-data scenarios.
zh

[CV-134] Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

【速读】:该论文旨在解决当前3D场景语言模型在理解物体间空间与语义关系方面的不足,尤其是在仅依赖视觉嵌入(visual embeddings)时难以准确表达物体角色和交互关系的问题。其解决方案的关键在于提出了一种名为Descrip3D的新框架,通过自然语言显式编码物体之间的关系:为每个物体附加包含内在属性和上下文关系的文本描述,并采用双层融合机制——嵌入级融合(embedding fusion)与提示级注入(prompt-level injection),从而实现统一的跨任务推理(如定位、描述生成和问答),无需任务特定头或额外监督信号。

链接: https://arxiv.org/abs/2507.14555
作者: Jintang Xue,Ganning Zhao,Jie-En Yao,Hong-En Chen,Yue Hu,Meida Chen,Suya You,C.-C. Jay Kuo
机构: University of Southern California (南加州大学); DEVCOM Army Research Laboratory (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes.
zh

[CV-135] Clutter Detection and Removal by Multi-Objective Analysis for Photographic Guidance

【速读】:该论文旨在解决摄影初学者因缺乏经验或无意识疏忽而导致照片中出现杂乱背景(clutter)的问题,从而干扰情感表达与叙事效果。其核心解决方案在于构建一个相机引导系统,通过两个关键技术实现:一是基于美学评估的杂乱区分算法(clutter distinguishment algorithm),用于量化并可视化场景中各物体对整体图像美感和内容的贡献度,使用户能够交互式识别干扰元素;二是基于生成对抗网络(Generative Adversarial Networks, GANs)的迭代图像修复算法(iterative image inpainting algorithm),可对移除杂乱物体后的高分辨率图像区域进行高质量重建,确保视觉连贯性。实验表明,该系统能有效提升用户识别干扰物的能力,并在更短时间内拍摄出更具美感的照片。

链接: https://arxiv.org/abs/2507.14553
作者: Xiaoran Wu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Clutter in photos is a distraction preventing photographers from conveying the intended emotions or stories to the audience. Photography amateurs frequently include clutter in their photos due to unconscious negligence or the lack of experience in creating a decluttered, aesthetically appealing scene for shooting. We are thus motivated to develop a camera guidance system that provides solutions and guidance for clutter identification and removal. We estimate and visualize the contribution of objects to the overall aesthetics and content of a photo, based on which users can interactively identify clutter. Suggestions on getting rid of clutter, as well as a tool that removes cluttered objects computationally, are provided to guide users to deal with different kinds of clutter and improve their photographic work. Two technical novelties underpin interactions in our system: a clutter distinguishment algorithm with aesthetics evaluations for objects and an iterative image inpainting algorithm based on generative adversarial nets that reconstructs missing regions of removed objects for high-resolution images. User studies demonstrate that our system provides flexible interfaces and accurate algorithms that allow users to better identify distractions and take higher quality images within less time.
zh

[CV-136] Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering Human Perceptual Variability on Facial Expressions IJCNN2025

【速读】:该论文旨在解决情感认知科学中的核心挑战,即如何准确建模外部情绪刺激与人类内部情绪体验之间的关系,特别是个体间在情绪分类上的显著差异(inter-individual differences in emotion categorization)。其关键解决方案在于提出一种新颖的“感知边界采样方法”(perceptual boundary sampling method),用于生成位于人工神经网络(ANN)决策边界上的模糊面部表情样本,从而构建出varEmotion数据集。通过大规模人类行为实验发现,这些ANN难以分类的样本也引发了人类参与者更高的感知不确定性,表明ANN决策边界与人类感知变异存在共享的计算机制;进一步地,利用行为数据对ANN表征进行微调,实现了ANN预测与群体及个体层面人类感知模式的一致性对齐,建立了ANN决策边界与人类感知变异性之间的系统性关联。

链接: https://arxiv.org/abs/2507.14549
作者: Haotian Deng,Chi Zhang,Chen Wei,Quanying Liu
机构: Southern University of Science and Technology (南方科技大学); University of Birmingham (伯明翰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Accepted by IJCNN 2025

点击查看摘要

Abstract:A fundamental challenge in affective cognitive science is to develop models that accurately capture the relationship between external emotional stimuli and human internal experiences. While ANNs have demonstrated remarkable accuracy in facial expression recognition, their ability to model inter-individual differences in human perception remains underexplored. This study investigates the phenomenon of high perceptual variability-where individuals exhibit significant differences in emotion categorization even when viewing the same stimulus. Inspired by the similarity between ANNs and human perception, we hypothesize that facial expression samples that are ambiguous for ANN classifiers also elicit divergent perceptual judgments among human observers. To examine this hypothesis, we introduce a novel perceptual boundary sampling method to generate facial expression stimuli that lie along ANN decision boundaries. These ambiguous samples form the basis of the varEmotion dataset, constructed through large-scale human behavioral experiments. Our analysis reveals that these ANN-confusing stimuli also provoke heightened perceptual uncertainty in human participants, highlighting shared computational principles in emotion perception. Finally, by fine-tuning ANN representations using behavioral data, we achieve alignment between ANN predictions and both group-level and individual-level human perceptual patterns. Our findings establish a systematic link between ANN decision boundaries and human perceptual variability, offering new insights into personalized modeling of emotional interpretation.
zh

[CV-137] Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

【速读】:该论文旨在解决胃肠道内镜图像的视觉问答(Visual Question Answering, VQA)问题,即通过分析内镜图像并回答与之相关的临床问题。其解决方案的关键在于采用Florence模型——一个大规模多模态基础模型——作为VQA管道的核心架构,该模型结合强大的视觉编码器和文本编码器以理解内镜图像并生成具有临床意义的回答;同时,为提升模型泛化能力,引入了保留医学特征的同时增加训练多样性的领域特定数据增强策略。实验表明,在KASVIR数据集上的微调结果在官方挑战指标上表现优异,验证了大模型在医疗VQA中的潜力,并为后续可解释性、鲁棒性和临床整合研究提供了强有力的基线。

链接: https://arxiv.org/abs/2507.14544
作者: Sujata Gaihre,Amir Thapa Magar,Prasuna Pokharel,Laxmi Tiwari
机构: NCIT(尼泊尔计算机学院); Fusemachine; Logictronix Technologies(逻辑电子科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted to ImageCLEF 2025, to be published in the lab proceedings

点击查看摘要

Abstract:This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model-a large-scale multimodal foundation model-as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: this https URL
zh

[CV-138] Real Time Captioning of Sign Language Gestures in Video Meetings

【速读】:该论文旨在解决听力障碍者与普通人之间因语言差异导致的沟通障碍问题,尤其是在线视频会议场景下,听力障碍者更倾向于使用手语而非打字进行交流,但普通参会者无法理解手语内容。解决方案的关键在于开发一个浏览器扩展程序,能够实时将手语翻译为字幕,从而实现无障碍沟通;其核心技术依赖于基于计算机视觉的手语识别方法,并利用包含超过2000个词级美国手语(American Sign Language, ASL)视频的大规模数据集进行模型训练,该数据集由100多名不同手语使用者录制。

链接: https://arxiv.org/abs/2507.14543
作者: Sharanya Mukherjee,Md Hishaam Akhtar,Kannadasan R
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 1 table, Presented at ICCMDE 2021

点击查看摘要

Abstract:It has always been a rather tough task to communicate with someone possessing a hearing impairment. One of the most tested ways to establish such a communication is through the use of sign based languages. However, not many people are aware of the smaller intricacies involved with sign language. Sign language recognition using computer vision aims at eliminating the communication barrier between deaf-mute and ordinary people so that they can properly communicate with others. Recently the pandemic has left the whole world shaken up and has transformed the way we communicate. Video meetings have become essential for everyone, even people with a hearing disability. In recent studies, it has been found that people with hearing disabilities prefer to sign over typing during these video calls. In this paper, we are proposing a browser extension that will automatically translate sign language to subtitles for everyone else in the video call. The Large-scale dataset which contains more than 2000 Word-Level ASL videos, which were performed by over 100 signers will be used.
zh

[CV-139] Self-Supervised Distillation of Legacy Rule-Based Methods for Enhanced EEG-Based Decision-Making

【速读】:该论文旨在解决癫痫治疗中高频率振荡(High-frequency oscillations, HFOs)定位的精准性问题,尤其是传统规则检测器因假阳性率高需人工复核、而监督学习方法受限于标注数据稀缺与标注一致性差的问题。其解决方案的关键在于提出Self-Supervised to Label Discovery (SS2LD)框架:利用遗留检测器生成的大规模候选事件,通过变分自编码器(Variational Autoencoder, VAE)进行形态学预训练以学习有意义的潜在表示,再基于这些表示聚类获得弱监督信号来识别病理HFO,并结合真实数据与VAE增强数据训练分类器精炼检测边界,从而实现高效、低依赖标签且临床有效的病理HFO识别。

链接: https://arxiv.org/abs/2507.14542
作者: Yipeng Zhang,Yuanyi Ding,Chenda Duan,Atsuro Daida,Hiroki Nariai,Vwani Roychowdhury
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-frequency oscillations (HFOs) in intracranial Electroencephalography (iEEG) are critical biomarkers for localizing the epileptogenic zone in epilepsy treatment. However, traditional rule-based detectors for HFOs suffer from unsatisfactory precision, producing false positives that require time-consuming manual review. Supervised machine learning approaches have been used to classify the detection results, yet they typically depend on labeled datasets, which are difficult to acquire due to the need for specialized expertise. Moreover, accurate labeling of HFOs is challenging due to low inter-rater reliability and inconsistent annotation practices across institutions. The lack of a clear consensus on what constitutes a pathological HFO further challenges supervised refinement approaches. To address this, we leverage the insight that legacy detectors reliably capture clinically relevant signals despite their relatively high false positive rates. We thus propose the Self-Supervised to Label Discovery (SS2LD) framework to refine the large set of candidate events generated by legacy detectors into a precise set of pathological HFOs. SS2LD employs a variational autoencoder (VAE) for morphological pre-training to learn meaningful latent representation of the detected events. These representations are clustered to derive weak supervision for pathological events. A classifier then uses this supervision to refine detection boundaries, trained on real and VAE-augmented data. Evaluated on large multi-institutional interictal iEEG datasets, SS2LD outperforms state-of-the-art methods. SS2LD offers a scalable, label-efficient, and clinically effective strategy to identify pathological HFOs using legacy detectors.
zh

[CV-140] ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

【速读】:该论文旨在解决当前图像美学评估(Image Aesthetics Assessment, IAA)方法在实际应用中面临的两大核心问题:一是传统方法难以同时提供量化评分与专业级理解,二是基于多模态大语言模型(Multimodal Large Language Model, MLLM)的方案存在模态偏倚(如仅输出分数或文本描述)且缺乏细粒度属性分解,从而限制了进一步的美学分析。解决方案的关键在于提出ArtiMuse模型,该模型具备联合评分与专家级理解能力,并配套构建了ArtiMuse-10K数据集,该数据集由10,000张覆盖5大类15子类的专业标注图像组成,每张图像均包含8维属性分析和整体评分,有效支持细粒度美学解析与高质量模型训练,推动IAA向更精准、可解释的方向发展。

链接: https://arxiv.org/abs/2507.14533
作者: Shuo Cao,Nan Ma,Jiayang Li,Xiaohui Li,Lihao Shao,Kaiwen Zhu,Yu Zhou,Yuandong Pu,Jiarui Wu,Jiaquan Wang,Bo Qu,Wenhai Wang,Yu Qiao,Dajuin Yao,Yihao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 43 pages, 31 figures, 13 tables

点击查看摘要

Abstract:The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present:(1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attributes analysis and a holistic score. Both the model and dataset will be made public to advance the field.
zh

[CV-141] DCHM: Depth-Consistent Human Modeling for Multiview Detection

【速读】:该论文旨在解决多视角行人检测中的人体建模问题,即如何在缺乏人工标注的情况下实现高精度、低噪声的三维人体表示,从而提升行人定位准确性。现有方法常因依赖昂贵的多视角3D标注而引入噪声,且难以跨场景泛化。解决方案的关键在于提出深度一致的人体建模(Depth-Consistent Human Modeling, DCHM)框架,通过超像素级高斯点绘(superpixel-wise Gaussian Splatting)实现稀疏视角、大尺度及拥挤场景下的多视角深度一致性估计,生成精确点云用于后续行人定位,显著降低建模噪声并首次在复杂场景中完成行人的重建与多视角分割。

链接: https://arxiv.org/abs/2507.14505
作者: Jiahao Ma,Tianyu Wang,Miaomiao Liu,David Ahmedt-Aristizabal,Chuong Nguyen
机构: Australian National University (澳大利亚国立大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: multi-view detection, sparse-view reconstruction

点击查看摘要

Abstract:Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scaled, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the \hrefthis https URLproject page.
zh

[CV-142] Generative Distribution Distillation

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)中两个核心问题:一是高维优化带来的“维度灾难”(curse of high-dimensional optimization),二是标签监督信号在无监督场景下的缺失。为此,作者提出生成式分布蒸馏(Generative Distribution Distillation, GenDD)框架,其关键创新在于引入了分割标记化策略(Split Tokenization),实现稳定有效的无监督蒸馏;同时设计了分布收缩技术(Distribution Contraction),将标签监督信息融入重构目标,使GenDD在梯度层面成为多任务学习的代理模型,从而在无需显式分类损失的情况下实现高效的监督训练。

链接: https://arxiv.org/abs/2507.14503
作者: Jiequan Cui,Beier Zhu,Qingshan Xu,Xiaogang Xu,Pengguang Chen,Xiaojuan Qi,Bei Yu,Hanwang Zhang,Richang Hong
机构: HFUT(合肥工业大学); NTU(南洋理工大学); HKU(香港大学); CUHK(香港中文大学); SmartMore(思谋科技)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Technique report

点击查看摘要

Abstract:In this paper, we formulate the knowledge distillation (KD) as a conditional generative problem and propose the \textitGenerative Distribution Distillation (GenDD) framework. A naive \textitGenDD baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a \textitSplit Tokenization strategy, achieving stable and effective unsupervised KD. Additionally, we develop the \textitDistribution Contraction technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that \textitGenDD with \textitDistribution Contraction serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that \textitGenDD performs competitively in the unsupervised setting, significantly surpassing KL baseline by \textbf16.29% on ImageNet validation set. With label supervision, our ResNet-50 achieves \textbf82.28% top-1 accuracy on ImageNet in 600 epochs training, establishing a new state-of-the-art.
zh

[CV-143] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

【速读】:该论文旨在解决传统3D重建与视图合成方法在计算复杂度高、迭代优化链路繁琐等问题,从而限制了其在现实场景中的应用。其解决方案的关键在于引入基于深度学习的前馈式(feed-forward)方法,通过端到端的模型架构实现快速且泛化能力强的3D重建与视图合成,尤其聚焦于点云、3D高斯溅射(3D Gaussian Splatting, 3DGS)、神经辐射场(Neural Radiance Fields, NeRF)等表示结构的建模,并覆盖无位姿重建、动态3D重建及3D感知图像与视频生成等关键任务,显著提升了效率与实用性。

链接: https://arxiv.org/abs/2507.14501
作者: Jiahui Zhang,Yuelei Li,Anpei Chen,Muyu Xu,Kunhao Liu,Jianyuan Wang,Xiao-Xiao Long,Hanxue Liang,Zexiang Xu,Hao Su,Christian Theobalt,Christian Rupprecht,Andrea Vedaldi,Hanspeter Pfister,Shijian Lu,Fangneng Zhan
机构: Nanyang Technological University, Singapore(南洋理工大学); California Institute of Technology, USA(加州理工学院); Westlake University, China(西湖大学); University of Oxford, UK(牛津大学); Nanjing University, China(南京大学); University of Cambridge, UK(剑桥大学); Hillbot, USA; University of California, San Diego, USA(加州大学圣地亚哥分校); Max Planck Institute for Informatics, Germany(马克斯·普朗克信息研究所); Harvard University, USA(哈佛大学); Massachusetts Institute of Technology, USA(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A project page associated with this survey is available at this https URL

点击查看摘要

Abstract:3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.
zh

[CV-144] Motion Segmentation and Egomotion Estimation from Event-Based Normal Flow

【速读】:该论文旨在解决基于事件相机(event-based camera)的运动分割与自身运动估计(egomotion estimation)问题,尤其针对传统方法依赖密集光流或显式深度估计所带来的计算复杂性和精度不足的问题。其解决方案的关键在于利用事件数据的稀疏性与高时间分辨率特性,结合法向流(normal flow)、场景结构及惯性测量之间的几何约束,构建一个优化驱动的迭代流程:首先进行事件过分割(over-segmentation),再通过残差分析分离独立运动物体,并最终借助基于运动相似性和时间一致性的层次聚类细化分割结果。此方法无需完整光流计算,在EVIMO2v2数据集上验证了其在物体边界处的高精度分割和准确平移运动估计能力,具备良好的实时性与可扩展性,适用于机器人导航等应用。

链接: https://arxiv.org/abs/2507.14500
作者: Zhiyuan Hua,Dehao Yuan,Cornelia Fermüller
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper introduces a robust framework for motion segmentation and egomotion estimation using event-based normal flow, tailored specifically for neuromorphic vision sensors. In contrast to traditional methods that rely heavily on optical flow or explicit depth estimation, our approach exploits the sparse, high-temporal-resolution event data and incorporates geometric constraints between normal flow, scene structure, and inertial measurements. The proposed optimization-based pipeline iteratively performs event over-segmentation, isolates independently moving objects via residual analysis, and refines segmentations using hierarchical clustering informed by motion similarity and temporal consistency. Experimental results on the EVIMO2v2 dataset validate that our method achieves accurate segmentation and translational motion estimation without requiring full optical flow computation. This approach demonstrates significant advantages at object boundaries and offers considerable potential for scalable, real-time robotic and navigation applications.
zh

[CV-145] Benefit from Reference: Retrieval-Augmented Cross-modal Point Cloud Completion

【速读】:该论文旨在解决基于不完整点云的三维结构补全问题,尤其针对残差点云缺乏典型结构特征时生成质量下降的挑战。其解决方案的关键在于提出一种检索增强的点云补全框架,通过引入跨模态检索机制从相似参考样本中学习结构先验信息;核心创新包括:(1)设计结构共享特征编码器(Structural Shared Feature Encoder, SSFE),联合提取跨模态特征并重建参考特征作为先验,结合双通道控制门机制增强相关结构特征、抑制无关干扰;(2)提出渐进式检索增强生成器(Progressive Retrieval-Augmented Generator, PRAG),采用分层特征融合机制由全局到局部整合参考先验与输入特征,从而提升细粒度重建效果及对稀疏数据和未见类别的泛化能力。

链接: https://arxiv.org/abs/2507.14485
作者: Hongye Hou,Liu Zhan,Yang Yang
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Completing the whole 3D structure based on an incomplete point cloud is a challenging task, particularly when the residual point cloud lacks typical structural characteristics. Recent methods based on cross-modal learning attempt to introduce instance images to aid the structure feature learning. However, they still focus on each particular input class, limiting their generation abilities. In this work, we propose a novel retrieval-augmented point cloud completion framework. The core idea is to incorporate cross-modal retrieval into completion task to learn structural prior information from similar reference samples. Specifically, we design a Structural Shared Feature Encoder (SSFE) to jointly extract cross-modal features and reconstruct reference features as priors. Benefiting from a dual-channel control gate in the encoder, relevant structural features in the reference sample are enhanced and irrelevant information interference is suppressed. In addition, we propose a Progressive Retrieval-Augmented Generator (PRAG) that employs a hierarchical feature fusion mechanism to integrate reference prior information with input features from global to local. Through extensive evaluations on multiple datasets and real-world scenes, our method shows its effectiveness in generating fine-grained point clouds, as well as its generalization capability in handling sparse data and unseen categories.
zh

[CV-146] DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning

【速读】:该论文针对数据无感知量化(Data-Free Quantization, DFQ)中合成样本质量不足以及量化模型与全精度模型在推理过程中中间层激活分布不一致的问题展开研究。现有方法难以充分捕捉并平衡合成样本中的全局与局部特征,导致量化性能下降;同时,量化后模型在中间层激活分布上存在显著偏差,进一步加剧性能损失。解决方案的关键在于提出DFQ-ViT框架:首先按难度递增顺序生成合成样本以提升样本质量,其次在校准和推理阶段引入激活修正矩阵(activation correction matrix),对量化模型的中间层激活进行对齐,从而有效缓解分布偏移问题。该方法无需微调即可实现媲美真实数据量化的效果,显著降低边缘设备部署门槛,符合绿色学习原则。

链接: https://arxiv.org/abs/2507.14481
作者: Yujia Tong,Jingling Yuan,Tian Zhang,Jianquan Liu,Chuang Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data-Free Quantization (DFQ) enables the quantization of Vision Transformers (ViTs) without requiring access to data, allowing for the deployment of ViTs on devices with limited resources. In DFQ, the quantization model must be calibrated using synthetic samples, making the quality of these synthetic samples crucial. Existing methods fail to fully capture and balance the global and local features within the samples, resulting in limited synthetic data quality. Moreover, we have found that during inference, there is a significant difference in the distributions of intermediate layer activations between the quantized and full-precision models. These issues lead to a severe performance degradation of the quantized model. To address these problems, we propose a pipeline for Data-Free Quantization for Vision Transformers (DFQ-ViT). Specifically, we synthesize samples in order of increasing difficulty, effectively enhancing the quality of synthetic data. During the calibration and inference stage, we introduce the activation correction matrix for the quantized model to align the intermediate layer activations with those of the full-precision model. Extensive experiments demonstrate that DFQ-ViT achieves remarkable superiority over existing DFQ methods and its performance is on par with models quantized through real data. For example, the performance of DeiT-T with 3-bit weights quantization is 4.29% higher than the state-of-the-art. Our method eliminates the need for fine-tuning, which not only reduces computational overhead but also lowers the deployment barriers for edge devices. This characteristic aligns with the principles of Green Learning by improving energy efficiency and facilitating real-world applications in resource-constrained environments.
zh

[CV-147] OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition

【速读】:该论文旨在解决动态和感知混淆环境下长期定位中的视觉位置识别(Visual Place Recognition, VPR)难题,特别是现有基于深度学习的方法多依赖单帧嵌入而忽视图像序列中蕴含的时间一致性问题。其解决方案的关键在于提出OptiCorNet框架,该框架将空间特征提取与时间差分建模统一为一个可微的端到端可训练模块;其中核心组件是轻量级一维卷积编码器与可学习的时间差分算子——可微序列差分(Differentiable Sequence Delta, DSD),DSD通过固定权重差分核建模序列方向差异,并结合LSTM精炼及可选残差投影,从而同时捕捉短期空间上下文与长程时间过渡,生成对视角和外观变化鲁棒的紧凑判别性描述符。

链接: https://arxiv.org/abs/2507.14477
作者: Zhenyu Li,Tianyi Shang,Pengjie Xu,Ruirui Zhang,Fanchen Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 figures

点击查看摘要

Abstract:Visual Place Recognition (VPR) in dynamic and perceptually aliased environments remains a fundamental challenge for long-term localization. Existing deep learning-based solutions predominantly focus on single-frame embeddings, neglecting the temporal coherence present in image sequences. This paper presents OptiCorNet, a novel sequence modeling framework that unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Central to our approach is a lightweight 1D convolutional encoder combined with a learnable differential temporal operator, termed Differentiable Sequence Delta (DSD), which jointly captures short-term spatial context and long-range temporal transitions. The DSD module models directional differences across sequences via a fixed-weight differencing kernel, followed by an LSTM-based refinement and optional residual projection, yielding compact, discriminative descriptors robust to viewpoint and appearance shifts. To further enhance inter-class separability, we incorporate a quadruplet loss that optimizes both positive alignment and multi-negative divergence within each batch. Unlike prior VPR methods that treat temporal aggregation as post-processing, OptiCorNet learns sequence-level embeddings directly, enabling more effective end-to-end place recognition. Comprehensive evaluations on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
zh

[CV-148] VisGuard: Securing Visualization Dissemination through Tamper-Resistant Data Retrieval IEEE-VIS2025

【速读】:该论文旨在解决可视化图像(Visualization Image)在在线传播过程中因常见图像篡改(如裁剪、编辑等)导致嵌入的元数据(metadata)难以恢复的问题,从而保障可视化信息的完整性与可追溯性。其解决方案的关键在于提出VisGuard框架,通过重复数据平铺(repetitive data tiling)、可逆信息广播(invertible information broadcasting)以及基于锚点的裁剪定位机制(anchor-based scheme for crop localization),实现对嵌入数据链接的抗篡改鲁棒性,确保即使在严重图像处理后仍能可靠恢复元数据,支持交互式图表重建、篡改检测与版权保护等应用。

链接: https://arxiv.org/abs/2507.14459
作者: Huayuan Ye,Juntong Chen,Shenzhuo Zhang,Yipeng Zhang,Changbo Wang,Chenhui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, IEEE VIS 2025

点击查看摘要

Abstract:The dissemination of visualizations is primarily in the form of raster images, which often results in the loss of critical information such as source code, interactive features, and metadata. While previous methods have proposed embedding metadata into images to facilitate Visualization Image Data Retrieval (VIDR), most existing methods lack practicability since they are fragile to common image tampering during online distribution such as cropping and editing. To address this issue, we propose VisGuard, a tamper-resistant VIDR framework that reliably embeds metadata link into visualization images. The embedded data link remains recoverable even after substantial tampering upon images. We propose several techniques to enhance robustness, including repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization. VisGuard enables various applications, including interactive chart reconstruction, tampering detection, and copyright protection. We conduct comprehensive experiments on VisGuard’s superior performance in data retrieval accuracy, embedding capacity, and security against tampering and steganalysis, demonstrating VisGuard’s competence in facilitating and safeguarding visualization dissemination and information conveyance.
zh

[CV-149] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶中单一模式规划方法在复杂多变交通环境下的适应性与鲁棒性不足的问题,尤其是难以学习多样化驾驶技能以应对不同场景的挑战。其解决方案的关键在于提出GEMINUS框架,该框架采用专家混合(Mixture-of-Experts)结构,包含一个全局专家(Global Expert)和一个场景自适应专家组(Scene-Adaptive Experts Group),并通过双感知路由器(Dual-aware Router)动态激活专家模块;该路由器同时考虑场景级特征和路由不确定性,实现全局与局部专家的有效耦合,从而在多样场景下实现自适应且鲁棒的驾驶决策。

链接: https://arxiv.org/abs/2507.14456
作者: Chi Wan,Yixin Cui,Jiatong Du,Shuo Yang,Yulong Bai,Yanjun Huang
机构: School of Automotive Studies, Tongji University, Shanghai, China(同济大学汽车学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert, a Scene-Adaptive Experts Group, and equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves adaptive and robust performance in diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. Furthermore, ablation studies demonstrate significant improvements over the original single-expert baseline: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean. The code will be available at this https URL.
zh

[CV-150] Adaptive 3D Gaussian Splatting Video Streaming: Visual Saliency-Aware Tiling and Meta-Learning-Based Bitrate Adaptation

【速读】:该论文针对3D Gaussian splatting视频(3DGS)流媒体传输中的三个核心挑战——分块(tiling)、质量评估和码率自适应(bitrate adaptation)——提出了一套系统性解决方案。其关键在于:首先,设计了一种基于显著性分析的自适应分块技术,融合空间与时间特征,将每个tile编码为包含专用形变场(deformation field)和多质量层级的版本,以支持动态选择;其次,构建了一个新颖的质量评估框架,联合评估3DGS表示在流式传输中的空间域退化与最终2D渲染图像质量;最后,开发了一种面向3DGS视频流的元学习驱动自适应码率算法,在不同网络条件下实现最优性能。

链接: https://arxiv.org/abs/2507.14454
作者: Han Gong,Qiyue Li,Jie Li,Zhi Liu
机构: Hefei University of Technology (合肥工业大学); Engineering Technology Research Center of Industrial Automation of Anhui Province (安徽省工业自动化工程技术研究中心); The University of Electro-Communications (电波通信大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:3D Gaussian splatting video (3DGS) streaming has recently emerged as a research hotspot in both academia and industry, owing to its impressive ability to deliver immersive 3D video experiences. However, research in this area is still in its early stages, and several fundamental challenges, such as tiling, quality assessment, and bitrate adaptation, require further investigation. In this paper, we tackle these challenges by proposing a comprehensive set of solutions. Specifically, we propose an adaptive 3DGS tiling technique guided by saliency analysis, which integrates both spatial and temporal features. Each tile is encoded into versions possessing dedicated deformation fields and multiple quality levels for adaptive selection. We also introduce a novel quality assessment framework for 3DGS video that jointly evaluates spatial-domain degradation in 3DGS representations during streaming and the quality of the resulting 2D rendered images. Additionally, we develop a meta-learning-based adaptive bitrate algorithm specifically tailored for 3DGS video streaming, achieving optimal performance across varying network conditions. Extensive experiments demonstrate that our proposed approaches significantly outperform state-of-the-art methods.
zh

[CV-151] GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration IJCAI2025

【速读】:该论文旨在解决特征点云配准中高质量对应关系识别的难题,特别是如何有效融合局部与全局特征以应对特征冗余和复杂空间关系的问题。其解决方案的关键在于提出一种基于格式塔(Gestalt)原理的正交几何一致性引导并行交互网络(GPI-Net),通过引入正交集成策略最小化冗余信息并构建紧凑的全局结构,同时设计了融合自注意力与交叉注意力机制的格式塔特征注意力(GFA)模块以捕捉对应关系中的几何特征,并创新性地提出双路径多粒度并行交互聚合(DMG)模块,促进不同粒度层级间的信息交换,从而显著提升对应关系的质量和配准精度。

链接: https://arxiv.org/abs/2507.14452
作者: Weikang Gu,Mingyue Han,Li Xue,Heng Dong,Changcai Yang,Riqing Chen,Lifang Wei
机构: Fujian Agriculture and Forestry University (福建农林大学); Fuzhou Institute of Technology (福州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures. Accepted to IJCAI 2025

点击查看摘要

Abstract:The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at this https URL.
zh

[CV-152] IRGPT : Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark ICCV2025

【速读】:该论文旨在解决真实红外图像在视觉-语言模型中的应用难题,主要挑战包括红外模态特有的数据稀缺性与文本对齐不足问题。现有方法依赖于通过风格迁移从可见光图像生成的合成红外图像,难以准确捕捉红外成像的独特特性。解决方案的关键在于构建首个面向真实红外图像的多模态大语言模型IRGPT,并基于一个包含超过26万对真实红外图像与人工精修文本的大规模红外-文本数据集(InfraRed-Text Dataset, IR-TD)。该数据集通过两种互补方式生成文本:一是由大语言模型(LLM)对可见光图像生成描述,二是基于规则的标注描述;同时提出一种双交叉模态课程迁移学习策略,系统性地将可见光域知识迁移到红外域,考虑红外-可见光和红外-文本之间的难度得分,从而实现更有效的跨模态学习。

链接: https://arxiv.org/abs/2507.14449
作者: Zhe Cao,Jin Zhang,Ruiheng Zhang
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures. This paper is accepted by ICCV 2025

点击查看摘要

Abstract:Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images, which limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
zh

[CV-153] Adaptive 3D Gaussian Splatting Video Streaming

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)视频在流媒体传输中面临的挑战,即由于其数据量庞大和压缩与传输复杂度高而导致的传输效率低下问题。解决方案的关键在于提出一种基于高斯形变场(Gaussian deformation field)的3DGS视频构建方法,并结合混合显著性切片(hybrid saliency tiling)与差异化质量建模(differentiated quality modeling),实现高效数据压缩并适应带宽波动,从而在保障传输质量的前提下提升整体传输性能。

链接: https://arxiv.org/abs/2507.14432
作者: Han Gong,Qiyue Li,Zhi Liu,Hao Zhou,Peng Yuan Zhou,Zhu Li,Jie Li
机构: Hefei University of Technology (合肥工业大学); Engineering Technology Research Center of Industrial Automation Anhui Province (安徽省工业自动化工程技术创新中心); The University of Electro-Communications (电波通信大学); University of Science and Technology of China (中国科学技术大学); Aarhus University (奥胡斯大学); University of Missouri–Kansas City (密苏里大学堪萨斯城分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The advent of 3D Gaussian splatting (3DGS) has significantly enhanced the quality of volumetric video representation. Meanwhile, in contrast to conventional volumetric video, 3DGS video poses significant challenges for streaming due to its substantially larger data volume and the heightened complexity involved in compression and transmission. To address these issues, we introduce an innovative framework for 3DGS volumetric video streaming. Specifically, we design a 3DGS video construction method based on the Gaussian deformation field. By employing hybrid saliency tiling and differentiated quality modeling of 3DGS video, we achieve efficient data compression and adaptation to bandwidth fluctuations while ensuring high transmission quality. Then we build a complete 3DGS video streaming system and validate the transmission performance. Through experimental evaluation, our method demonstrated superiority over existing approaches in various aspects, including video quality, compression effectiveness, and transmission rate.
zh

[CV-154] CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding

【速读】:该论文旨在解决场景中可操作性(affordance)的可解释性定位问题,即识别出能够支持特定动作(如“切”)的物体。其解决方案的关键在于提出一种神经符号框架CRAFT,通过整合ConceptNet中的结构化常识先验与语言模型、以及CLIP提供的视觉证据,并借助能量基推理循环(energy-based reasoning loop)迭代优化预测结果,从而实现符号结构与感知结构的透明、目标驱动式对齐,显著提升多物体无标签场景下的准确性与可解释性。

链接: https://arxiv.org/abs/2507.14426
作者: Zhou Chen,Joe Lin,Sathyanarayanan N. Aakur
机构: Auburn University (奥本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeSy 2025

点击查看摘要

Abstract:We introduce CRAFT, a neuro-symbolic framework for interpretable affordance grounding, which identifies the objects in a scene that enable a given action (e.g., “cut”). CRAFT integrates structured commonsense priors from ConceptNet and language models with visual evidence from CLIP, using an energy-based reasoning loop to refine predictions iteratively. This process yields transparent, goal-driven decisions to ground symbolic and perceptual structures. Experiments in multi-object, label-free settings demonstrate that CRAFT enhances accuracy while improving interpretability, providing a step toward robust and trustworthy scene understanding.
zh

[CV-155] DUSTrack: Semi-automated point tracking in ultrasound videos

【速读】:该论文旨在解决B-mode超声视频中组织运动轨迹跟踪的难题,尤其针对斑点噪声(speckle noise)、边缘对比度低及非平面运动导致的解剖标志点难以准确追踪的问题。其核心解决方案是提出DUSTrack框架,该框架融合深度学习与光流(optical flow)技术,通过构建半自动化流程实现高鲁棒性与高精度的点追踪;关键创新在于引入一种基于光流的新颖滤波方法,在抑制高频帧间噪声的同时保留快速组织运动特征,从而显著提升追踪稳定性与准确性,适用于多种解剖结构和运动模式。

链接: https://arxiv.org/abs/2507.14368
作者: Praneeth Namburi,Roger Pallarès-López,Jessica Rosendorf,Duarte Folgado,Brian W. Anthony
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack’s versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at this https URL.
zh

[CV-156] Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

【速读】:该论文旨在解决生成式超分辨率(Generative Super-Resolution, GSR)模型中存在的“幻觉”(hallucination)问题,即生成细节与低分辨率图像(LRI)或真实图像(GTI)在感知上不一致的 artifacts,这类问题虽未被现有图像指标充分刻画,却显著影响模型的实际应用效果。解决方案的关键在于引入多模态大语言模型(Multimodal Large Language Model, MLLM),设计特定提示(prompt)以量化视觉幻觉程度并生成“幻觉评分”(Hallucination Score, HS),该评分与人类评价高度一致,并揭示了某些深层特征距离与HS存在强相关性;基于此,论文进一步提出将这些特征作为可微奖励函数用于优化GSR模型,从而有效缓解幻觉现象。

链接: https://arxiv.org/abs/2507.14367
作者: Weiming Ren,Raghav Goyal,Zhiming Hu,Tristan Ty Aumentado-Armstrong,Iqbal Mohomed,Alex Levinshtein
机构: University of Waterloo (滑铁卢大学); AI Center – Toronto, Samsung Electronics (三星电子多伦多人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 17 figures and 7 tables

点击查看摘要

Abstract:Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the “regression-to-the-mean” blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low resolution image (LRI) or ground-truth image (GTI), is a critical but under studied issue in GSR, limiting its practical deployments. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., “hallucinations”). We observe that hallucinations are not well-characterized with existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a “Hallucination Score” (HS). We find that our HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. In addition, we find certain deep feature distances have strong correlations with HS. We therefore propose to align the GSR models by using such features as differentiable reward functions to mitigate hallucinations.
zh

[CV-157] A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

【速读】:该论文旨在解决广义类别发现(Generalized Category Discovery, GCD)中因模型注意力分散而导致的特征提取不充分问题,即在处理未标注数据时,模型不仅关注图像中的关键对象,还会被无关背景区域干扰,从而影响分类性能。解决方案的关键在于提出一种轻量级、可插拔的注意力聚焦机制(Attention Focusing, AF),其由两个核心组件构成:多尺度令牌重要性度量(Token Importance Measurement, TIME)和令牌自适应剪枝(Token Adaptive Pruning, TAP)。TIME通过多尺度评估每个令牌的重要性,TAP则基于TIME输出的多尺度重要性分数对非信息性令牌进行剪枝,从而增强模型对关键内容的关注,提升GCD任务的准确性,且计算开销极低。

链接: https://arxiv.org/abs/2507.14315
作者: Qiyu Xu,Zhanxuan Hu,Yu Duan,Ercheng Pei,Yonghang Tai
机构: Yunnan Normal University (云南师范大学); Xidian University (西安电子科技大学); Xi’an University of Posts and Telecommunications (西安邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model’s focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods with minimal computational overhead. When incorporated into one prominent GCD method, SimGCD, AF achieves up to 15.4% performance improvement over the baseline with minimal computational overhead. The implementation code is provided in this https URL.
zh

[CV-158] CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在分布偏移(distribution shifts)下泛化能力不足的问题,尤其是在测试时适应(Test-Time Adaptation, TTA)场景中,传统基于熵最小化的策略因与VLM的对比学习(contrastive learning)预训练目标不一致,导致性能受限并引发伪标签漂移(pseudo-label drift)和类别坍缩(class collapse)等失败模式。解决方案的关键在于提出CLIPTTA,一种基于梯度的TTA方法,其核心创新是采用与CLIP预训练目标对齐的软对比损失(soft contrastive loss),并通过批处理感知(batch-aware)设计有效缓解类别坍缩风险,同时进一步扩展至开放集设置,引入异常值对比暴露(Outlier Contrastive Exposure, OCE)损失提升对分布外(Out-of-Distribution, OOD)样本的检测能力,从而在75个数据集上实现更稳定且优于现有方法的适应性能。

链接: https://arxiv.org/abs/2507.14312
作者: Marc Lafon,Gustavo Adolfo Vargas Hakim,Clément Rambour,Christian Desrosier,Nicolas Thome
机构: Conservatoire National des Arts et Métiers, CEDRIC (国立高等艺术与工艺学院,CEDRIC); Sorbonne Université, CNRS, ISIR (索邦大学,法国国家科学研究中心,ISIR); ETS Montreal (蒙特利尔工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP’s pre-training objective. We provide a theoretical analysis of CLIPTTA’s gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
zh

[CV-159] Semantic Segmentation based Scene Understanding in Autonomous Vehicles

【速读】:该论文旨在解决自动驾驶场景中环境理解的难题,核心在于通过语义分割(semantic segmentation)提升对复杂交通场景的认知精度。解决方案的关键在于设计并评估多种高效的深度学习模型,并系统性地比较不同Backbone(编码器)对模型性能的影响。实验表明,选择合适的Backbone能显著改善语义分割的准确性、平均交并比(mean IoU)和损失函数收敛效果,从而增强对周围环境的感知能力,为自动驾驶决策提供更可靠的输入。

链接: https://arxiv.org/abs/2507.14303
作者: Ehsan Rassekh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 74 pages, 35 figures, Master’s Thesis, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran, 2023

点击查看摘要

Abstract:In recent years, the concept of artificial intelligence (AI) has become a prominent keyword because it is promising in solving complex tasks. The need for human expertise in specific areas may no longer be needed because machines have achieved successful results using artificial intelligence and can make the right decisions in critical situations. This process is possible with the help of deep learning (DL), one of the most popular artificial intelligence technologies. One of the areas in which the use of DL is used is in the development of self-driving cars, which is very effective and important. In this work, we propose several efficient models to investigate scene understanding through semantic segmentation. We use the BDD100k dataset to investigate these models. Another contribution of this work is the usage of several Backbones as encoders for models. The obtained results show that choosing the appropriate backbone has a great effect on the performance of the model for semantic segmentation. Better performance in semantic segmentation allows us to understand better the scene and the environment around the agent. In the end, we analyze and evaluate the proposed models in terms of accuracy, mean IoU, and loss function, and the results show that these metrics are improved.
zh

[CV-160] LOVO: Efficient Complex Object Query in Large-Scale Video Datasets ICDE

【速读】:该论文旨在解决大规模视频数据集中复杂对象查询的挑战,包括处理海量且持续增长的数据、支持多样化的查询需求以及保证低延迟执行。现有方法在应对未见类别对象时适应性不足或查询延迟较高。其解决方案的关键在于提出LOVO系统:该系统采用预训练视觉编码器对关键帧进行一次性特征提取,生成紧凑的视觉嵌入(visual embeddings),并结合边界框信息构建倒排多索引结构存储于向量数据库中;查询阶段通过将对象查询映射为查询嵌入,并在视觉嵌入上执行快速近似最近邻搜索,最终利用跨模态重排序融合视觉与文本特征以精炼结果。此设计实现了高精度、低延迟和低成本的复杂对象查询能力,显著优于现有方法。

链接: https://arxiv.org/abs/2507.14301
作者: Yuxin Liu,Yuezhang Peng,Hefeng Zhou,Hongze Liu,Xinyu Lu,Jiong Lou,Chentao Wu,Wei Zhao,Jie Li
机构: Shanghai Jiao Tong University (上海交通大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: @inproceedings{liu2025lovo,title={LOVO: Efficient Complex Object Query in Large-Scale Video Datasets},author={Liu, Yuxin and Peng, Yuezhang and Zhou, Hefeng and Liu, Hongze and Lu, Xinyu and Lou, Jiong and Wu, Chentao and Zhao, Wei and Li, Jie},booktitle={2025 IEEE 41st International Conference on Data Engineering (ICDE)},pages={1938–1951},year={2025},organization={IEEE Computer Society}}

点击查看摘要

Abstract:The widespread deployment of cameras has led to an exponential increase in video data, creating vast opportunities for applications such as traffic management and crime surveillance. However, querying specific objects from large-scale video datasets presents challenges, including (1) processing massive and continuously growing data volumes, (2) supporting complex query requirements, and (3) ensuring low-latency execution. Existing video analysis methods struggle with either limited adaptability to unseen object classes or suffer from high query latency. In this paper, we present LOVO, a novel system designed to efficiently handle comp \underlineL ex \underlineO bject queries in large-scale \underlineV ide \underlineO datasets. Agnostic to user queries, LOVO performs one-time feature extraction using pre-trained visual encoders, generating compact visual embeddings for key frames to build an efficient index. These visual embeddings, along with associated bounding boxes, are organized in an inverted multi-index structure within a vector database, which supports queries for any objects. During the query phase, LOVO transforms object queries to query embeddings and conducts fast approximate nearest-neighbor searches on the visual embeddings. Finally, a cross-modal rerank is performed to refine the results by fusing visual features with detailed textual features. Evaluation on real-world video datasets demonstrates that LOVO outperforms existing methods in handling complex queries, with near-optimal query accuracy and up to 85x lower search latency, while significantly reducing index construction costs. This system redefines the state-of-the-art object query approaches in video analysis, setting a new benchmark for complex object queries with a novel, scalable, and efficient approach that excels in dynamic environments.
zh

[CV-161] APTx Neuron: A Unified Trainable Neuron Architecture Integrating Activation and Computation

【速读】:该论文旨在解决传统神经网络中非线性激活函数与线性变换需分层实现所带来的计算冗余和架构复杂性问题。其解决方案的关键在于提出一种统一的神经元单元——APTx Neuron,将非线性激活(non-linear activation)与线性变换(linear transformation)融合于单一可训练表达式中,形式为 $ y = \sum_{i=1}^{n} ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta $,其中所有参数均可学习。这种设计不仅简化了网络结构,还提升了模型的表达能力和计算效率,在MNIST数据集上仅用20个epoch即达到96.69%测试准确率,验证了其优越性能。

链接: https://arxiv.org/abs/2507.14270
作者: Ravin Kumar
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, 1 table, and GitHub repository for the source code

点击查看摘要

Abstract:We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both computationally efficient and elegant. The proposed neuron follows the functional form y = \sum_i=1^n ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta , where all parameters \alpha_i , \beta_i , \gamma_i , and \delta are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to 96.69% test accuracy in just 20 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and computational efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it.
zh

[CV-162] Comparative Analysis of Algorithms for the Fitting of Tessellations to 3D Image Data

【速读】:该论文旨在解决如何高效且准确地将镶嵌模型(tessellation models)拟合到材料的三维图像数据(如多晶和泡沫结构)的问题,以生成能够逼近体素(voxel-based)晶粒结构的 Voronoi、Laguerre 和广义平衡幂图(GBPDs)。其解决方案的关键在于系统比较基于优化的方法——包括线性与非线性规划、基于交叉熵方法的随机优化以及梯度下降法——并利用差异度量(discrepancy measures)从晶粒体积、表面积和拓扑结构等方面量化拟合质量,从而在模型复杂度、优化算法复杂度与逼近精度之间权衡,为不同数据特征和应用场景提供方法选择依据。

链接: https://arxiv.org/abs/2507.14268
作者: Andreas Alpers,Orkun Furat,Christian Jung,Matthias Neumann,Claudia Redenbach,Aigerim Saken,Volker Schmidt
机构: University of Liverpool (利物浦大学); Ulm University (乌尔姆大学); University of Kaiserslautern-Landau (RPTU) (开姆尼茨-兰道大学); Graz University of Technology (格拉茨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Optimization and Control (math.OC)
备注: 31 pages, 16 figures, 8 tables

点击查看摘要

Abstract:This paper presents a comparative analysis of algorithmic strategies for fitting tessellation models to 3D image data of materials such as polycrystals and foams. In this steadily advancing field, we review and assess optimization-based methods – including linear and nonlinear programming, stochastic optimization via the cross-entropy method, and gradient descent – for generating Voronoi, Laguerre, and generalized balanced power diagrams (GBPDs) that approximate voxelbased grain structures. The quality of fit is evaluated on real-world datasets using discrepancy measures that quantify differences in grain volume, surface area, and topology. Our results highlight trade-offs between model complexity, the complexity of the optimization routines involved, and the quality of approximation, providing guidance for selecting appropriate methods based on data characteristics and application needs.
zh

[CV-163] Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)模型在结合解释模型(interpretation model)后仍可能受到对抗攻击的问题,即现有攻击方法通常仅关注最小扰动以欺骗分类器,而忽略了对解释模型的影响,从而导致对抗样本难以被检测。解决方案的关键在于提出一种名为AdViT的新型攻击方法,其能够同时误导ViT模型及其耦合的解释模型,通过生成既具有高误分类置信度又保留“正确”解释的对抗样本,在白盒和黑盒场景下均实现100%攻击成功率,显著提升了对抗样本的隐蔽性和攻击有效性。

链接: https://arxiv.org/abs/2507.14248
作者: Eldor Abdukhamidov,Mohammed Abuhamad,Simon S. Woo,Hyoungshick Kim,Tamer Abuhmed
机构: Sungkyunkwan University (成均馆大学); Loyola University (洛约拉大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision transformer (ViT) models, when coupled with interpretation models, are regarded as secure and challenging to deceive, making them well-suited for security-critical domains such as medical applications, autonomous vehicles, drones, and robotics. However, successful attacks on these systems can lead to severe consequences. Recent research on threats targeting ViT models primarily focuses on generating the smallest adversarial perturbations that can deceive the models with high confidence, without considering their impact on model interpretations. Nevertheless, the use of interpretation models can effectively assist in detecting adversarial examples. This study investigates the vulnerability of transformer models to adversarial attacks, even when combined with interpretation models. We propose an attack called “AdViT” that generates adversarial examples capable of misleading both a given transformer model and its coupled interpretation model. Through extensive experiments on various transformer models and two transformer-based interpreters, we demonstrate that AdViT achieves a 100% attack success rate in both white-box and black-box scenarios. In white-box scenarios, it reaches up to 98% misclassification confidence, while in black-box scenarios, it reaches up to 76% misclassification confidence. Remarkably, AdViT consistently generates accurate interpretations in both scenarios, making the adversarial examples more difficult to detect.
zh

[CV-164] On Splitting Lightweight Semantic Image Segmentation for Wireless Communications

【速读】:该论文旨在解决资源受限环境下语义通信中图像分割的带宽与计算效率难以平衡的问题,尤其在信道条件变化时难以同时保证高分割精度和低传输开销。其解决方案的关键在于将语义图像分割过程拆分至资源受限的发送端和接收端之间协作完成:发送端仅传输压缩后的中间语义特征,而非原始图像或完整分割结果,从而显著降低传输比特率;同时减轻发送端的计算负担,使整个系统更适配边缘设备与未来6G通信场景。

链接: https://arxiv.org/abs/2507.14199
作者: Ebrahim Abu-Helalah,Jordi Serra,Jordi Perez-Romero
机构: Centre Tecnològic de Telecomunicacions de Catalunya (CTTC/CERCA)(加泰罗尼亚电信技术中心); Universitat Politecnica de Catalunya (UPC)(加泰罗尼亚理工大学)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: IEEE International Mediterranean Conference on Communications and Networking

点击查看摘要

Abstract:Semantic communication represents a promising technique towards reducing communication costs, especially when dealing with image segmentation, but it still lacks a balance between computational efficiency and bandwidth requirements while maintaining high image segmentation accuracy, particularly in resource-limited environments and changing channel conditions. On the other hand, the more complex and larger semantic image segmentation models become, the more stressed the devices are when processing data. This paper proposes a novel approach to implementing semantic communication based on splitting the semantic image segmentation process between a resource constrained transmitter and the receiver. This allows saving bandwidth by reducing the transmitted data while maintaining the accuracy of the semantic image segmentation. Additionally, it reduces the computational requirements at the resource constrained transmitter compared to doing all the semantic image segmentation in the transmitter. The proposed approach is evaluated by means of simulation-based experiments in terms of different metrics such as computational resource usage, required bit rate and segmentation accuracy. The results when comparing the proposal with the full semantic image segmentation in the transmitter show that up to 72% of the bit rate was reduced in the transmission process. In addition, the computational load of the transmitter is reduced by more than 19%. This reflects the interest of this technique for its application in communication systems, particularly in the upcoming 6G systems.
zh

[CV-165] RARE-UNet: Resolution-Aligned Routing Entry for Adaptive Medical Image Segmentation MICCAI2025

【速读】:该论文旨在解决现有医学图像分割模型在面对实际应用中低分辨率输入时性能显著下降的问题,尤其针对固定高分辨率假设下模型泛化能力不足的局限性。其解决方案的关键在于提出一种分辨率感知的多尺度分割架构 RARE-UNet,核心创新包括:在编码器多个深度层级集成多尺度块以捕获不同分辨率下的特征表示;引入分辨率感知路由机制动态调整推理路径以适配输入空间分辨率;以及采用一致性驱动训练策略,使多分辨率特征与全分辨率表征对齐。该设计实现了在多种分辨率下的稳定性能提升和推理效率优化,显著增强了模型的鲁棒性和可扩展性。

链接: https://arxiv.org/abs/2507.15524
作者: Simon Winther Albertsen,Hjalte Svaneborg Bjørnstrup,Mostafa Mehdipour Ghazi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: EMA4MICCAI 2025

点击查看摘要

Abstract:Accurate segmentation is crucial for clinical applications, but existing models often assume fixed, high-resolution inputs and degrade significantly when faced with lower-resolution data in real-world scenarios. To address this limitation, we propose RARE-UNet, a resolution-aware multi-scale segmentation architecture that dynamically adapts its inference path to the spatial resolution of the input. Central to our design are multi-scale blocks integrated at multiple encoder depths, a resolution-aware routing mechanism, and consistency-driven training that aligns multi-resolution features with full-resolution representations. We evaluate RARE-UNet on two benchmark brain imaging tasks for hippocampus and tumor segmentation. Compared to standard UNet, its multi-resolution augmented variant, and nnUNet, our model achieves the highest average Dice scores of 0.84 and 0.65 across resolution, while maintaining consistent performance and significantly reduced inference time at lower resolutions. These results highlight the effectiveness and scalability of our architecture in achieving resolution-robust segmentation. The codes are available at: this https URL.
zh

[CV-166] DeSamba: Decoupled Spectral Adaptive Framework for 3D Multi-Sequence MRI Lesion Classification AAAI2026

【速读】:该论文旨在解决多序列磁共振成像(MRI)数据在三维病变分类中难以有效融合的问题,以提升分类的鲁棒性和准确性。其解决方案的关键在于提出了一种名为DeSamba(Decoupled Spectral Adaptive Network and Mamba-Based Model)的新框架,核心创新包括两个模块:一是解耦表示学习模块(DRLM),通过自重建与交叉重建机制从不同MRI序列中解耦特征;二是谱自适应调制块(SAMB),嵌入于SAMNet中,可根据病变特性动态融合频域与空间信息,从而实现更精准的多模态特征整合与分类。

链接: https://arxiv.org/abs/2507.15487
作者: Dezhen Wang,Sheng Miao,Rongxin Chai,Jiufa Cui
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 figures, 3 tables, submitted to AAAI2026

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) sequences provide rich spatial and frequency domain information, which is crucial for accurate lesion classification in medical imaging. However, effectively integrating multi-sequence MRI data for robust 3D lesion classification remains a challenge. In this paper, we propose DeSamba (Decoupled Spectral Adaptive Network and Mamba-Based Model), a novel framework designed to extract decoupled representations and adaptively fuse spatial and spectral features for lesion classification. DeSamba introduces a Decoupled Representation Learning Module (DRLM) that decouples features from different MRI sequences through self-reconstruction and cross-reconstruction, and a Spectral Adaptive Modulation Block (SAMB) within the proposed SAMNet, enabling dynamic fusion of spectral and spatial information based on lesion characteristics. We evaluate DeSamba on two clinically relevant 3D datasets. On a six-class spinal metastasis dataset (n=1,448), DeSamba achieves 62.10% Top-1 accuracy, 63.62% F1-score, 87.71% AUC, and 93.55% Top-3 accuracy on an external validation set (n=372), outperforming all state-of-the-art (SOTA) baselines. On a spondylitis dataset (n=251) involving a challenging binary classification task, DeSamba achieves 70.00%/64.52% accuracy and 74.75/73.88 AUC on internal and external validation sets, respectively. Ablation studies demonstrate that both DRLM and SAMB significantly contribute to overall performance, with over 10% relative improvement compared to the baseline. Our results highlight the potential of DeSamba as a generalizable and effective solution for 3D lesion classification in multi-sequence medical imaging.
zh

[CV-167] A Steel Surface Defect Detection Method Based on Lightweight Convolution Optimization

【速读】:该论文旨在解决钢铁表面多尺度缺陷检测难题,特别是传统图像处理与检测方法在复杂工业环境中对小目标缺陷识别精度不足、漏检率高的问题。其解决方案的关键在于提出一种基于深度学习的检测框架,核心创新包括:引入SCConv模块以重构空间与通道维度,降低特征冗余并优化特征表达;集成C3Ghost模块减少冗余计算与参数量,提升模型效率;采用CARAFE上采样算子实现内容感知的特征图精细化重组,增强高分辨率缺陷区域的恢复能力。实验表明,该方法显著提升了检测准确性和鲁棒性。

链接: https://arxiv.org/abs/2507.15476
作者: Cong Chen,Ming Chen,Hoileong Lee,Yan Li,Jiyang Yu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces not only have defects of various sizes and shapes, which limit the accuracy of traditional image processing and detection methods in complex environments. However, traditional defect detection methods face issues of insufficient accuracy and high miss-detection rates when dealing with small target defects. To address this issue, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, SCConv module, and CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model’s feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.
zh

[CV-168] Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation

【速读】:该论文旨在解决医学图像分割中因数据稀缺导致的模型性能受限问题,特别是在息肉(polyp)检测任务中,由于标注需要专业医学知识,高质量标注数据极为有限。其解决方案的关键在于提出SynDiff框架,通过文本引导的合成数据生成与基于扩散模型的高效分割相结合:首先利用潜空间扩散模型(latent diffusion models)实现文本条件下的图像修复(text-conditioned inpainting),生成具有语义多样性的临床真实感合成息肉图像,从而有效扩充训练数据;其次创新性地引入直接潜变量估计策略,避免传统扩散方法所需的迭代去噪过程,实现单步推理并获得T倍计算加速。该方案在CVC-ClinicDB数据集上实现了96.0%的Dice系数和92.9%的IoU,同时保持实时推理能力,证明了可控合成增强可在不引起分布偏移的前提下提升分割鲁棒性,为资源受限的临床环境提供了高效的部署方案。

链接: https://arxiv.org/abs/2507.15361
作者: Muhammad Aqeel,Maham Nazir,Zanxi Ruan,Francesco Setti
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVGMMI Workshop at ICIAP 2025

点击查看摘要

Abstract:Medical image segmentation suffers from data scarcity, particularly in polyp detection where annotation requires specialized expertise. We present SynDiff, a framework combining text-guided synthetic data generation with efficient diffusion-based segmentation. Our approach employs latent diffusion models to generate clinically realistic synthetic polyps through text-conditioned inpainting, augmenting limited training data with semantically diverse samples. Unlike traditional diffusion methods requiring iterative denoising, we introduce direct latent estimation enabling single-step inference with T x computational speedup. On CVC-ClinicDB, SynDiff achieves 96.0% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment. The framework demonstrates that controlled synthetic augmentation improves segmentation robustness without distribution shift. SynDiff bridges the gap between data-hungry deep learning models and clinical constraints, offering an efficient solution for deployment in resourcelimited medical settings.
zh

[CV-169] MedSR-Impact: Transformer-Based Super-Resolution for Lung CT Segmentation Radiomics Classification and Prognosis

【速读】:该论文旨在解决高分辨率胸部CT在临床应用中因辐射剂量和硬件成本限制而难以普及的问题,其核心挑战在于如何在低剂量CT图像上恢复精细的解剖结构细节,以支持精准诊断与治疗规划。解决方案的关键在于提出一种基于Transformer架构的体积超分辨率网络(TVSRN-V2),该模型通过可扩展组件如跨平面注意力模块(Through-Plane Attention Blocks, TAB)和Swin Transformer V2,有效重建低剂量CT体积中的细微结构,并具备良好的下游任务兼容性;同时引入伪低分辨率增强策略,提升模型对不同扫描协议的鲁棒性,无需依赖私有数据即可模拟设备多样性,从而实现剂量高效成像与定量分析的临床落地。

链接: https://arxiv.org/abs/2507.15340
作者: Marc Boubnovski Martell,Kristofer Linton-Reid,Mitchell Chen,Sumeet Hindocha,Benjamin Hunter,Marco A. Calzado,Richard Lee,Joram M. Posma,Eric O. Aboagye
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-resolution volumetric computed tomography (CT) is essential for accurate diagnosis and treatment planning in thoracic diseases; however, it is limited by radiation dose and hardware costs. We present the Transformer Volumetric Super-Resolution Network (\textbfTVSRN-V2), a transformer-based super-resolution (SR) framework designed for practical deployment in clinical lung CT analysis. Built from scalable components, including Through-Plane Attention Blocks (TAB) and Swin Transformer V2 – our model effectively reconstructs fine anatomical details in low-dose CT volumes and integrates seamlessly with downstream analysis pipelines. We evaluate its effectiveness on three critical lung cancer tasks – lobe segmentation, radiomics, and prognosis – across multiple clinical cohorts. To enhance robustness across variable acquisition protocols, we introduce pseudo-low-resolution augmentation, simulating scanner diversity without requiring private data. TVSRN-V2 demonstrates a significant improvement in segmentation accuracy (+4% Dice), higher radiomic feature reproducibility, and enhanced predictive performance (+0.06 C-index and AUC). These results indicate that SR-driven recovery of structural detail significantly enhances clinical decision support, positioning TVSRN-V2 as a well-engineered, clinically viable system for dose-efficient imaging and quantitative analysis in real-world CT workflows.
zh

[CV-170] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Contro

【速读】:该论文旨在解决内窥镜手术中微弱血管运动可视化难题,这一问题对术中精准决策和操作至关重要,但因手术场景复杂多变而极具挑战性。解决方案的关键在于提出一种无需训练的拉格朗日框架EndoControlMag,其核心创新为两个模块:一是周期性参考帧重置(Periodic Reference Resetting, PRR)机制,通过短重叠片段划分与动态参考帧更新,在防止误差累积的同时保持时间一致性;二是分层组织感知放大(Hierarchical Tissue-aware Magnification, HTM)框架,结合双模式掩码膨胀策略——基于运动的软化策略根据组织位移调节放大强度,基于距离的指数衰减策略模拟生物力学力衰减,从而适应不同手术场景下的组织变形与光学流不确定性,显著提升放大精度与视觉质量。

链接: https://arxiv.org/abs/2507.15292
作者: An Wanga,Rulin Zhou,Mengya Xu,Yiru Ye,Longfei Gou,Yiting Chang,Hao Chen,Chwee Ming Lim,Jiankun Wang,Hongliang Ren
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios-motion-based softening excels with complex tissue deformations while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at this https URL.
zh

[CV-171] Personalized 4D Whole Heart Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

【速读】:该论文旨在解决当前心脏数字孪生(Cardiac Digital Twins, CDTs)模型中缺乏能够模拟全部四个心腔全器官尺度电机械特性的完整心脏模型的问题。现有方法难以从临床可用的多视角2D心脏电影磁共振成像(cine MRIs)中高效重建高时空分辨率的4D(3D+t)心脏网格,从而限制了个性化心脏模型的构建与应用。解决方案的关键在于提出一种弱监督学习模型,通过学习2D cine MRI与4D心脏网格之间的自监督映射关系,直接从多视角2D图像重建出与输入影像高度匹配的4D心脏几何结构,从而实现对心室容积变化和射血分数等关键心脏参数的自动、高时间分辨率提取,为构建高效的心脏数字孪生平台提供了可行路径。

链接: https://arxiv.org/abs/2507.15203
作者: Xiaoyue Liu,Xicheng Sheng,Xiahai Zhuang,Vicente Grau,Mark YY Chan,Ching-Hui Sia,Lei Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cardiac digital twins (CDTs) provide personalized in-silico cardiac representations and hold great potential for precision medicine in cardiology. However, whole-heart CDT models that simulate the full organ-scale electromechanics of all four heart chambers remain limited. In this work, we propose a weakly supervised learning model to reconstruct 4D (3D+t) heart mesh directly from multi-view 2D cardiac cine MRIs. This is achieved by learning a self-supervised mapping between cine MRIs and 4D cardiac meshes, enabling the generation of personalized heart models that closely correspond to input cine MRIs. The resulting 4D heart meshes can facilitate the automatic extraction of key cardiac variables, including ejection fraction and dynamic chamber volume changes with high temporal resolution. It demonstrates the feasibility of inferring personalized 4D heart models from cardiac MRIs, paving the way for an efficient CDT platform for precision medicine. The code will be publicly released once the manuscript is accepted.
zh

[CV-172] Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI with Explicit Cardiac Motion Modeling

【速读】:该论文旨在解决心肌梗死(Myocardial Infarction, MI)患者中,基于临床标准的心脏成像技术难以实现高保真三维心肌梗死几何重建的问题。传统方法依赖于含钆延迟增强磁共振成像(Late Gadolinium Enhancement MRI, LGE-MRI),虽为金标准但需使用对比剂,存在副作用和不适感,且常基于稀疏的二维切片重建,空间分辨率受限。解决方案的关键在于提出一个无需对比剂的端到端框架:首先利用自动深度形状拟合模型 biv-me 从多视角 cine MRI 中重建四维(4D)双心室网格;随后设计 CMotion2Infarct-Net 模型,通过显式建模该动态几何中的心脏运动模式来定位梗死区域。该方法在126名MI患者的205例cine MRI数据上验证,与人工标注具有较好一致性,证明了基于心脏运动驱动的无对比剂三维梗死重建的可行性,为MI数字孪生提供了高效路径。

链接: https://arxiv.org/abs/2507.15194
作者: Yilin Lyu,Fan Yang,Xiaoyue Liu,Zichen Jiang,Joshua Dillon,Debbie Zhao,Martyn Nash,Charlene Mauger,Alistair Young,Ching-Hui Sia,Mark YY Chan,Lei Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Accurate representation of myocardial infarct geometry is crucial for patient-specific cardiac modeling in MI patients. While Late gadolinium enhancement (LGE) MRI is the clinical gold standard for infarct detection, it requires contrast agents, introducing side effects and patient discomfort. Moreover, infarct reconstruction from LGE often relies on sparsely sampled 2D slices, limiting spatial resolution and accuracy. In this work, we propose a novel framework for automatically reconstructing high-fidelity 3D myocardial infarct geometry from 2D clinically standard cine MRI, eliminating the need for contrast agents. Specifically, we first reconstruct the 4D biventricular mesh from multi-view cine MRIs via an automatic deep shape fitting model, biv-me. Then, we design a infarction reconstruction model, CMotion2Infarct-Net, to explicitly utilize the motion patterns within this dynamic geometry to localize infarct regions. Evaluated on 205 cine MRI scans from 126 MI patients, our method shows reasonable agreement with manual delineation. This study demonstrates the feasibility of contrast-free, cardiac motion-driven 3D infarct reconstruction, paving the way for efficient digital twin of MI.
zh

[CV-173] A Study of Anatomical Priors for Deep Learning-Based Segmentation of Pheochromocytoma in Abdominal CT

【速读】:该论文旨在解决肾上腺嗜铬细胞瘤(pheochromocytoma, PCC)在腹部CT图像中精准分割的问题,这对于肿瘤负荷评估、预后判断及治疗规划具有重要意义,同时可能通过影像特征推断遗传亚型,减少对昂贵基因检测的依赖。解决方案的关键在于引入基于邻近器官解剖先验信息的多类标注策略,特别是将肾上腺瘤周围常见结构(如肾脏和腹主动脉)纳入标注体系,构建“肿瘤+肾脏+腹主动脉”(Tumor + Kidney + Aorta, TKA)的新型多类标注方案,相比以往仅使用“肿瘤+体部”(Tumor + Body, TB)的单一标注方式,在Dice相似系数(DSC)、归一化表面距离(NSD)和实例级F1分数上均显著提升,且在不同遗传亚型和交叉验证中表现出更强的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2507.15193
作者: Tanjin Taher Toma,Tejas Sudharshan Mathai,Bikash Santra,Pritam Mukherjee,Jianfei Liu,Wesley Jong,Darwish Alabyad,Vivek Batheja,Abhishek Jha,Mayank Patel,Darko Pucar,Jayadira del Rivero,Karel Pacak,Ronald M. Summers
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of pheochromocytoma (PCC) in abdominal CT scans is essential for tumor burden estimation, prognosis, and treatment planning. It may also help infer genetic clusters, reducing reliance on expensive testing. This study systematically evaluates anatomical priors to identify configurations that improve deep learning-based PCC segmentation. We employed the nnU-Net framework to evaluate eleven annotation strategies for accurate 3D segmentation of pheochromocytoma, introducing a set of novel multi-class schemes based on organ-specific anatomical priors. These priors were derived from adjacent organs commonly surrounding adrenal tumors (e.g., liver, spleen, kidney, aorta, adrenal gland, and pancreas), and were compared against a broad body-region prior used in previous work. The framework was trained and tested on 105 contrast-enhanced CT scans from 91 patients at the NIH Clinical Center. Performance was measured using Dice Similarity Coefficient (DSC), Normalized Surface Distance (NSD), and instance-wise F1 score. Among all strategies, the Tumor + Kidney + Aorta (TKA) annotation achieved the highest segmentation accuracy, significantly outperforming the previously used Tumor + Body (TB) annotation across DSC (p = 0.0097), NSD (p = 0.0110), and F1 score (25.84% improvement at an IoU threshold of 0.5), measured on a 70-30 train-test split. The TKA model also showed superior tumor burden quantification (R^2 = 0.968) and strong segmentation across all genetic subtypes. In five-fold cross-validation, TKA consistently outperformed TB across IoU thresholds (0.1 to 0.5), reinforcing its robustness and generalizability. These findings highlight the value of incorporating relevant anatomical context in deep learning models to achieve precise PCC segmentation, supporting clinical assessment and longitudinal monitoring.
zh

[CV-174] Performance Analysis of Post-Training Quantization for CNN-based Conjunctival Pallor Anemia Detection

【速读】:该论文旨在解决低资源环境中儿童贫血(anemia)早期准确诊断难的问题,传统方法依赖昂贵设备和专业人员,难以普及。解决方案的关键在于利用深度学习模型通过结膜苍白(conjunctival pallor)图像自动检测贫血,采用MobileNet作为主干网络,并基于CP-AnemiC数据集进行端到端微调,结合数据增强与交叉验证策略,实现了高精度识别(准确率0.9313、F1分数0.9773)。为进一步适配边缘设备部署,研究还评估了不同位宽(FP32至INT4)的后训练量化对模型性能的影响,发现FP16量化可在保持较高准确率的同时显著压缩模型体积,为移动医疗场景下轻量化、高效化诊断系统提供了可行路径。

链接: https://arxiv.org/abs/2507.15151
作者: Sebastian A. Cruz Romero,Wilfredo E. Lugo Beauchamp
机构: University of Puerto Rico at Mayagüez (波多黎各大学马亚圭兹分校)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at International Symposium on Intelligent Computing Networks 2025

点击查看摘要

Abstract:Anemia is a widespread global health issue, particularly among young children in low-resource settings. Traditional methods for anemia detection often require expensive equipment and expert knowledge, creating barriers to early and accurate diagnosis. To address these challenges, we explore the use of deep learning models for detecting anemia through conjunctival pallor, focusing on the CP-AnemiC dataset, which includes 710 images from children aged 6-59 months. The dataset is annotated with hemoglobin levels, gender, age and other demographic data, enabling the development of machine learning models for accurate anemia detection. We use the MobileNet architecture as a backbone, known for its efficiency in mobile and embedded vision applications, and fine-tune our model end-to-end using data augmentation techniques and a cross-validation strategy. Our model implementation achieved an accuracy of 0.9313, a precision of 0.9374, and an F1 score of 0.9773 demonstrating strong performance on the dataset. To optimize the model for deployment on edge devices, we performed post-training quantization, evaluating the impact of different bit-widths (FP32, FP16, INT8, and INT4) on model performance. Preliminary results suggest that while FP16 quantization maintains high accuracy (0.9250), precision (0.9370), and F1 Score (0.9377), more aggressive quantization (INT8 and INT4) leads to significant performance degradation. Overall, our study supports further exploration of quantization schemes and hardware optimizations to assess trade-offs between model size, inference time, and diagnostic accuracy in mobile healthcare applications.
zh

[CV-175] PET Image Reconstruction Using Deep Diffusion Image Prior

【速读】:该论文旨在解决正电子发射断层成像(Positron Emission Tomography, PET)中因示踪剂特异性对比度差异和高计算成本导致的图像重建质量受限问题。其核心解决方案是提出一种基于扩散模型的解剖先验引导型PET图像重建方法,通过在扩散采样与由PET sinogram指导的模型微调之间交替迭代,实现利用一个示踪剂预训练的得分函数(score function)对多种示踪剂数据进行高质量重建;同时引入半二次分裂(half-quadratic splitting, HQS)算法将网络优化与迭代PET重建解耦,显著提升计算效率。该方法在模拟和临床数据上均表现出跨示踪剂分布和扫描仪类型的鲁棒泛化能力,为低剂量PET成像提供了高效且通用的重建框架。

链接: https://arxiv.org/abs/2507.15078
作者: Fumio Hashimoto,Kuang Gong
机构: University of Florida (佛罗里达大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 11 pages, 11 figures

点击查看摘要

Abstract:Diffusion models have shown great promise in medical image denoising and reconstruction, but their application to Positron Emission Tomography (PET) imaging remains limited by tracer-specific contrast variability and high computational demands. In this work, we proposed an anatomical prior-guided PET image reconstruction method based on diffusion models, inspired by the deep diffusion image prior (DDIP) framework. The proposed method alternated between diffusion sampling and model fine-tuning guided by the PET sinogram, enabling the reconstruction of high-quality images from various PET tracers using a score function pretrained on a dataset of another tracer. To improve computational efficiency, the half-quadratic splitting (HQS) algorithm was adopted to decouple network optimization from iterative PET reconstruction. The proposed method was evaluated using one simulation and two clinical datasets. For the simulation study, a model pretrained on [ ^18 F]FDG data was tested on amyloid-negative PET data to assess out-of-distribution (OOD) performance. For the clinical-data validation, ten low-dose [ ^18 F]FDG datasets and one [ ^18 F]Florbetapir dataset were tested on a model pretrained on data from another tracer. Experiment results show that the proposed PET reconstruction method can generalize robustly across tracer distributions and scanner types, providing an efficient and versatile reconstruction framework for low-dose PET imaging.
zh

[CV-176] QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

【速读】:该论文旨在解决深度学习模型在科学与医学成像任务(如MRI和显微镜去噪)中出现的幻觉问题,即模型生成看似合理但实际并不存在的图像伪影,这会严重影响结果的准确性。现有不确定性量化方法(如保形预测)虽能识别异常值并提供图像回归任务的置信保证,但其依赖线性恒定缩放因子校准不确定度边界,导致边界过宽、信息量不足。解决方案的关键在于提出QUTCC(Quantile Uncertainty Training and Calibration),一种基于分位数嵌入的非线性、非均匀校准技术,通过迭代查询网络获取上下分位数来逐步收紧不确定性区间,在保持相同统计覆盖率的前提下显著提升边界精度,从而更可靠地识别幻觉并提供紧致的不确定性估计。

链接: https://arxiv.org/abs/2507.14760
作者: Cassandra Tong Ye,Shamus Li,Tyler King,Kristina Monakhova
机构: Cornell University (康奈尔大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models often hallucinate, producing realistic artifacts that are not truly present in the sample. This can have dire consequences for scientific and medical inverse problems, such as MRI and microscopy denoising, where accuracy is more important than perceptual quality. Uncertainty quantification techniques, such as conformal prediction, can pinpoint outliers and provide guarantees for image regression tasks, improving reliability. However, existing methods utilize a linear constant scaling factor to calibrate uncertainty bounds, resulting in larger, less informative bounds. We propose QUTCC, a quantile uncertainty training and calibration technique that enables nonlinear, non-uniform scaling of quantile predictions to enable tighter uncertainty estimates. Using a U-Net architecture with a quantile embedding, QUTCC enables the prediction of the full conditional distribution of quantiles for the imaging task. During calibration, QUTCC generates uncertainty bounds by iteratively querying the network for upper and lower quantiles, progressively refining the bounds to obtain a tighter interval that captures the desired coverage. We evaluate our method on several denoising tasks as well as compressive MRI reconstruction. Our method successfully pinpoints hallucinations in image estimates and consistently achieves tighter uncertainty intervals than prior methods while maintaining the same statistical coverage.
zh

[CV-177] Classification of Histopathology Slides with Persistence Homology Convolutions

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在医学图像分析中因典型模型架构导致拓扑信息丢失的问题,尤其是在组织病理学(histopathology)领域,拓扑结构是区分病变组织的关键特征。现有方法虽尝试通过持久同调(persistent homology)引入拓扑信息,但依赖全局拓扑摘要,无法保留拓扑特征的空间局部性。论文提出的关键解决方案是引入一种基于持久同调的新型卷积操作——持久同调卷积(Persistent Homology Convolutions),该方法在保持拓扑特征局部性和平移不变性的同时,提取出具有几何意义的局部拓扑信息。实验表明,使用该方法训练的模型优于传统CNN,并对超参数不敏感,验证了其有效性。

链接: https://arxiv.org/abs/2507.14378
作者: Shrunal Pothagoni,Benjamin Schweinhart
机构: George Mason University (乔治梅森大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) are a standard tool for computer vision tasks such as image classification. However, typical model architectures may result in the loss of topological information. In specific domains such as histopathology, topology is an important descriptor that can be used to distinguish between disease-indicating tissue by analyzing the shape characteristics of cells. Current literature suggests that reintroducing topological information using persistent homology can improve medical diagnostics; however, previous methods utilize global topological summaries which do not contain information about the locality of topological features. To address this gap, we present a novel method that generates local persistent homology-based data using a modified version of the convolution operator called Persistent Homology Convolutions. This method captures information about the locality and translation invariance of topological features. We perform a comparative study using various representations of histopathology slides and find that models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters. These results indicate that persistent homology convolutions extract meaningful geometric information from the histopathology slides.
zh

[CV-178] Self-Supervised Joint Reconstruction and Denoising of T2-Weighted PROPELLER MRI of the Lungs at 0.55T

【速读】:该论文旨在解决低场强(0.55T)下T2加权PROPELLER肺部磁共振成像(MRI)中存在的图像质量差、信噪比低的问题。其核心挑战在于如何在不依赖干净标签数据的情况下实现高质量重建与去噪。解决方案的关键在于提出一种自监督联合重建与去噪模型:将PROPELLER采集的每一刀片(blade)沿读出方向分割为两个互不重叠的子集,其中一个用于训练未展开的重建网络,另一个用于损失计算,从而利用k空间子集间的结构冗余性和匹配的噪声统计特性实现无监督学习。该方法不仅提升了图像清晰度和结构完整性,还支持扫描时间减半(仅需原刀片数量的一半),且经放射科医生评估显著优于基于MPPCA的去噪方法(p<0.001)。

链接: https://arxiv.org/abs/2507.14308
作者: Jingjia Chen,Haoyang Pei,Christoph Maier,Mary Bruno,Qiuting Wen,Seon-Hi Shin,William Moore,Hersh Chandarana,Li Feng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: This study aims to improve 0.55T T2-weighted PROPELLER lung MRI through a self-supervised joint reconstruction and denoising model. Methods: T2-weighted 0.55T lung MRI dataset including 44 patients with previous covid infection were used. A self-supervised learning framework was developed, where each blade of the PROPELLER acquisition was split along the readout direction into two partitions. One subset trains the unrolled reconstruction network, while the other subset is used for loss calculation, enabling self-supervised training without clean targets and leveraging matched noise statistics for denoising. For comparison, Marchenko-Pastur Principal Component Analysis (MPPCA) was performed along the coil dimension, followed by conventional parallel imaging reconstruction. The quality of the reconstructed lung MRI was assessed visually by two experienced radiologists independently. Results: The proposed self-supervised model improved the clarity and structural integrity of the lung images. For cases with available CT scans, the reconstructed images demonstrated strong alignment with corresponding CT images. Additionally, the proposed model enables further scan time reduction by requiring only half the number of blades. Reader evaluations confirmed that the proposed method outperformed MPPCA-denoised images across all categories (Wilcoxon signed-rank test, p0.001), with moderate inter-reader agreement (weighted Cohen’s kappa=0.55; percentage of exact and within +/-1 point agreement=91%). Conclusion: By leveraging intrinsic structural redundancies between two disjoint splits of k-space subsets, the proposed self-supervised learning model effectively reconstructs the image while suppressing the noise for 0.55T T2-weighted lung MRI with PROPELLER sampling. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2507.14308 [eess.IV] (or arXiv:2507.14308v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2507.14308 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Jingjia Chen [view email] [v1] Fri, 18 Jul 2025 18:29:08 UTC (12,719 KB)
zh

[CV-179] NuSeC: A Dataset for Nuclei Segmentation in Breast Cancer Histopathology Images

【速读】:该论文旨在解决病理图像中细胞核(nuclei)检测与分割的标准化评估问题,以支持未来研究方法的可比性分析。解决方案的关键在于构建了一个结构清晰、划分合理的公开数据集——NuSeC,其包含来自25名患者的100张高分辨率(1024×1024像素)病理图像,并采用严格的随机抽样策略将数据划分为75%训练集(75张图像,约30,000个细胞核结构)和25%测试集(25张图像,约6,000个细胞核结构),确保不同研究方法在相同基准下进行公平比较。

链接: https://arxiv.org/abs/2507.14272
作者: Refik Samet,Nooshin Nemati,Emrah Hancer,Serpil Sak,Bilge Ayca Kirmizi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The NuSeC dataset is created by selecting 4 images with the size of 1024*1024 pixels from the slides of each patient among 25 patients. Therefore, there are a total of 100 images in the NuSeC dataset. To carry out a consistent comparative analysis between the methods that will be developed using the NuSeC dataset by the researchers in the future, we divide the NuSeC dataset 75% as the training set and 25% as the testing set. In detail, an image is randomly selected from 4 images of each patient among 25 patients to build the testing set, and then the remaining images are reserved for the training set. While the training set includes 75 images with around 30000 nuclei structures, the testing set includes 25 images with around 6000 nuclei structures.
zh

[CV-180] MiDeSeC: A Dataset for Mitosis Detection and Segmentation in Breast Cancer Histopathology Images

【速读】:该论文旨在解决乳腺浸润性导管癌(invasive breast carcinoma, no special type, NST)中细胞有丝分裂(mitosis)检测的难题,尤其针对形态多样的有丝分裂结构难以被准确识别的问题。解决方案的关键在于构建一个高质量、标注详尽的MiDeSeC数据集,通过从25名患者的HE染色切片中采集50个1024×1024像素的区域(共包含超过500个有丝分裂),并采用高倍率扫描设备(40×放大)确保图像分辨率与细节完整性,从而为深度学习模型提供充分多样性的训练和测试样本,提升有丝分裂自动识别的准确性与鲁棒性。

链接: https://arxiv.org/abs/2507.14271
作者: Refik Samet,Nooshin Nemati,Emrah Hancer,Serpil Sak,Bilge Ayca Kirmizi,Zeynep Yildirim
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The MiDeSeC dataset is created through HE stained invasive breast carcinoma, no special type (NST) slides of 25 different patients captured at 40x magnification from the Department of Medical Pathology at Ankara University. The slides have been scanned by 3D Histech Panoramic p250 Flash-3 scanner and Olympus BX50 microscope. As several possible mitosis shapes exist, it is crucial to have a large dataset to cover all the cases. Accordingly, a total of 50 regions is selected from glass slides for 25 patients, each of regions with a size of 1024*1024 pixels. There are more than 500 mitoses in total in these 50 regions. Two-thirds of the regions are reserved for training, the other third for testing.
zh

[CV-181] Hyper-spectral Unmixing algorithms for remote compositional surface mapping: a review of the state of the art

【速读】:该论文旨在解决从高光谱遥感图像中推断地表或天体表面覆盖物质类型及其丰度与空间分布的问题,这是遥感数据处理中的核心挑战之一。其解决方案的关键在于系统梳理和比较当前最成功的高光谱解混(hyperspectral unmixing)方法,并结合广泛使用的公开数据集对这些方法进行验证与评估,从而为后续研究提供可靠的技术基准和改进方向。

链接: https://arxiv.org/abs/2507.14260
作者: Alfredo Gimenez Zapiola,Andrea Boselli,Alessandra Menafoglio,Simone Vantini
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work concerns a detailed review of data analysis methods used for remotely sensed images of large areas of the Earth and of other solid astronomical objects. In detail, it focuses on the problem of inferring the materials that cover the surfaces captured by hyper-spectral images and estimating their abundances and spatial distributions within the region. The most successful and relevant hyper-spectral unmixing methods are reported as well as compared, as an addition to analysing the most recent methodologies. The most important public data-sets in this setting, which are vastly used in the testing and validation of the former, are also systematically explored. Finally, open problems are spotlighted and concrete recommendations for future research are provided.
zh

[CV-182] Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification MICCAI2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗图像分类任务中,于少样本上下文学习(few-shot in-context learning)场景下存在的校准偏差(calibration bias)与人群不公平性(demographic unfairness)问题,尤其是在不同人口统计学子群体中的预测置信度不准确现象。解决方案的关键在于提出一种推理时校准方法 CALIN,其核心机制是通过双层优化流程——从整体人群层面逐步细化至子群体层面——估计所需的校准量(以校准矩阵表示),并在推理阶段应用该估计对预测置信度进行动态调整,从而实现公平且准确的置信度校准,同时最小化公平性与性能之间的权衡。

链接: https://arxiv.org/abs/2506.23298
作者: Xing Shen,Justin Szeto,Mingyang Li,Hengguan Huang,Tal Arbel
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint version. The peer-reviewed version of this paper has been accepted to MICCAI 2025 main conference

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs’ predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN’s effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at this https URL.
zh

人工智能

[AI-0] Gemini 2.5 Pro Capable of Winning Gold at IMO 2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在国际数学奥林匹克(International Mathematical Olympiad, IMO)级别问题上表现不佳的问题,这类问题要求深度洞察力、创造力和严格的逻辑推理能力。研究者使用谷歌的Gemini 2.5 Pro模型,在未发生数据泄露的前提下对2025年IMO新发布的题目进行测试,通过精心设计的推理流程(pipeline design)与提示工程(prompt engineering),成功解决了6道题中的5道(存在一个需特别说明的例外情况),表明优化模型使用方式是提升其在高难度数学任务中性能的关键因素。

链接: https://arxiv.org/abs/2507.15855
作者: Yichen Huang,Lin F. Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The International Mathematical Olympiad (IMO) poses uniquely challenging problems requiring deep insight, creativity, and formal reasoning. While Large Language Models (LLMs) perform well on mathematical benchmarks like AIME, they struggle with Olympiad-level tasks. We use Google’s Gemini 2.5 Pro on the newly released IMO 2025 problems, avoiding data contamination. With pipeline design and prompt engineering, 5 (out of 6) problems are solved correctly (up to a caveat discussed below), highlighting the importance of finding the optimal way of using powerful models.
zh

[AI-1] he Other Mind: How Language Models Exhibit Human Temporal Cognition

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在未显式训练的情况下,如何自发地形成类似人类的时序认知模式,尤其是主观时间参考点的建立及其对时间距离感知的非线性压缩机制。解决方案的关键在于多层级分析揭示了三个核心机制:首先,在神经层面识别出一组“时间偏好型神经元”(temporal-preferential neurons),这些神经元在主观时间参考点处激活最小,并采用对数编码方案,与生物系统中发现的一致;其次,在表征层面发现年份信息从浅层的基本数值表示逐步演化为深层的抽象时间方向性结构,呈现层级构建过程;最后,在信息层面证明预训练语料库本身具有非线性的内在时间结构,为模型内部时间认知的构建提供了原始材料。综合而言,研究提出一种经验主义视角(experientialist perspective),将LLMs的认知视为其内部表征系统对客观世界的主观建构,从而为理解AI潜在的异质认知框架及未来对齐策略提供理论基础。

链接: https://arxiv.org/abs/2507.15851
作者: Lingyu Li,Yang Yao,Yixu Wang,Chubo Li,Yan Teng,Yingchun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 9 figures, 4 tables

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to advance, they exhibit certain cognitive patterns similar to those of humans that are not directly specified in training data. This study investigates this phenomenon by focusing on temporal cognition in LLMs. Leveraging the similarity judgment task, we find that larger models spontaneously establish a subjective temporal reference point and adhere to the Weber-Fechner law, whereby the perceived distance logarithmically compresses as years recede from this reference point. To uncover the mechanisms behind this behavior, we conducted multiple analyses across neuronal, representational, and informational levels. We first identify a set of temporal-preferential neurons and find that this group exhibits minimal activation at the subjective reference point and implements a logarithmic coding scheme convergently found in biological systems. Probing representations of years reveals a hierarchical construction process, where years evolve from basic numerical values in shallow layers to abstract temporal orientation in deep layers. Finally, using pre-trained embedding models, we found that the training corpus itself possesses an inherent, non-linear temporal structure, which provides the raw material for the model’s internal construction. In discussion, we propose an experientialist perspective for understanding these findings, where the LLMs’ cognition is viewed as a subjective construction of the external world by its internal representational system. This nuanced perspective implies the potential emergence of alien cognitive frameworks that humans cannot intuitively predict, pointing toward a direction for AI alignment that focuses on guiding internal constructions. Our code is available at this https URL.
zh

[AI-2] Identifying Conditional Causal Effects in MPDAGs

【速读】:该论文旨在解决在已知因果图仅以最大定向部分有向无环图(MPDAG)形式给出的情况下,如何识别条件因果效应的问题。MPDAG代表了一个由背景知识限制的因果图等价类,且所有变量均可观测。解决方案的关键在于三个方面:首先,提出了一种当调整集不受处理因素影响时的识别公式;其次,将经典的do演算推广至MPDAG框架下;最后,设计了一个完备的算法用于识别此类条件因果效应,从而为基于不完全图结构的因果推断提供了理论保障和实用工具。

链接: https://arxiv.org/abs/2507.15842
作者: Sara LaPlante,Emilija Perković
机构: 未知
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注: 67 pages, 8 figures

点击查看摘要

Abstract:We consider identifying a conditional causal effect when a graph is known up to a maximally oriented partially directed acyclic graph (MPDAG). An MPDAG represents an equivalence class of graphs that is restricted by background knowledge and where all variables in the causal model are observed. We provide three results that address identification in this setting: an identification formula when the conditioning set is unaffected by treatment, a generalization of the well-known do calculus to the MPDAG setting, and an algorithm that is complete for identifying these conditional effects.
zh

[AI-3] FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLM s

【速读】:该论文旨在解决大规模合成数据生成中因直接使用大语言模型(Large Language Models, LLMs)逐条生成记录而导致的时间和成本过高问题。其核心解决方案在于利用LLM自动识别表格字段类型(数值型、类别型或自由文本型),并将其分布信息编码为可复用的采样脚本,从而实现高效、规模化地生成多样化且真实感强的合成数据,避免了持续调用模型进行推理的开销。

链接: https://arxiv.org/abs/2507.15839
作者: Anh Nguyen,Sam Schafft,Nicholas Hale,John Alfaro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field’s distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
zh

[AI-4] Do AI models help produce verified bug fixes?

【速读】:该论文旨在解决生成式 AI(尤其是大语言模型,Large Language Models, LLMs)在自动程序修复(Automatic Program Repair, APR)中的实际效用问题,具体包括:LLMs 是否能真正提升程序员的调试效率与修复质量、如何验证所提出修复方案的有效性,以及程序员如何在实践中结合自身技能与LLMs协同工作。其解决方案的关键在于设计并实施了一项基于程序证明环境的对照实验研究,通过随机分组(一组使用LLMs,另一组不使用)并强制所有修复结果由形式化验证工具判定正确性,从而客观评估LLMs的实际作用;同时,研究构建了以目标-查询-度量(Goal-Query-Metric)框架为核心的可复用实验方法论,并借助完整会话记录对程序员行为进行细粒度分析,识别出7类LLM使用模式,最终为高效利用LLMs辅助调试和APR提供了实证依据与实践建议。

链接: https://arxiv.org/abs/2507.15822
作者: Li Huang,Ilgiz Mustafin,Marco Piccioni,Alessandro Schena,Reto Weber,Bertrand Meyer
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Among areas of software engineering where AI techniques – particularly, Large Language Models – seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory corrections to software bugs. Does this expectation materialize in practice? How do we find out, making sure that proposed corrections actually work? If programmers have access to LLMs, how do they actually use them to complement their own skills? To answer these questions, we took advantage of the availability of a program-proving environment, which formally determines the correctness of proposed fixes, to conduct a study of program debugging with two randomly assigned groups of programmers, one with access to LLMs and the other without, both validating their answers through the proof tools. The methodology relied on a division into general research questions (Goals in the Goal-Query-Metric approach), specific elements admitting specific answers (Queries), and measurements supporting these answers (Metrics). While applied so far to a limited sample size, the results are a first step towards delineating a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs. These results caused surprise as compared to what one might expect from the use of AI for debugging and APR. The contributions also include: a detailed methodology for experiments in the use of LLMs for debugging, which other projects can reuse; a fine-grain analysis of programmer behavior, made possible by the use of full-session recording; a definition of patterns of use of LLMs, with 7 distinct categories; and validated advice for getting the best of LLMs for debugging and Automatic Program Repair. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15822 [cs.SE] (or arXiv:2507.15822v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.15822 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-5] Challenges of Trustworthy Federated Learning: Whats Done Current Trends and Remaining Work

【速读】:该论文旨在解决如何将联邦学习(Federated Learning, FL)与可信人工智能(Trustworthy Artificial Intelligence, TAI)框架有效对齐的问题。TAI要求AI系统在伦理、法律和技术层面均符合人类价值观、权利及社会期望,而FL因其分布式特性在实现这一目标时面临诸多挑战。论文的关键在于以TAI的要求为指导结构,系统性地分析FL在隐私保护、公平性、可解释性、鲁棒性等方面与TAI目标之间的差距,并分类梳理当前已有的研究进展、发展趋势以及尚未解决的核心问题,从而为未来FL向TAI演进提供清晰的研究路径和理论支撑。

链接: https://arxiv.org/abs/2507.15796
作者: Nuria Rodríguez-Barroso,Mario García-Márquez,M. Victoria Luzón,Francisco Herrera
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the development of Trustworthy Artificial Intelligence (TAI) has emerged as a critical objective in the deployment of AI systems across sensitive and high-risk domains. TAI frameworks articulate a comprehensive set of ethical, legal, and technical requirements to ensure that AI technologies are aligned with human values, rights, and societal expectations. Among the various AI paradigms, Federated Learning (FL) presents a promising solution to pressing privacy concerns. However, aligning FL with the rest of the requirements of TAI presents a series of challenges, most of which arise from its inherently distributed nature. In this work, we adopt the requirements TAI as a guiding structure to systematically analyze the challenges of adapting FL to TAI. Specifically, we classify and examine the key obstacles to aligning FL with TAI, providing a detailed exploration of what has been done, the trends, and the remaining work within each of the identified challenges.
zh

[AI-6] Romance Relief and Regret: Teen Narratives of Chatbot Overreliance

【速读】:该论文旨在解决青少年在使用具备可定制人格(customizable personas)的生成式人工智能(Generative Artificial Intelligence, GenAI)聊天机器人过程中可能出现的情感依赖与数字过度依赖问题。研究通过分析318篇来自Reddit上自报年龄为13–17岁的用户帖子,揭示了青少年从寻求情感支持或创意表达开始,逐渐形成强烈依附并干扰现实人际关系和日常生活的典型路径。其关键解决方案在于:未来聊天机器人的设计应注重促进用户自我觉察、增强现实世界参与度,并鼓励青少年深度参与安全数字工具的开发过程,从而实现对青少年心理健康与数字素养的正向引导。

链接: https://arxiv.org/abs/2507.15783
作者: Mohammad ‘Matt’ Namvarpour,Brandon Brofsky,Jessica Medina,Mamtaj Akter,Afsaneh Razi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As Generative Artificial Intelligence (GenAI) driven chatbots like this http URL become embedded in adolescent life, they raise concerns about emotional dependence and digital overreliance. While studies have investigated the overreliance of adults on these chatbots, they have not investigated teens’ interactions with chatbots with customizable personas. We analyzed 318 Reddit posts made by users self-reported as 13-17 years old on the this http URL subreddit to understand patterns of overreliance. We found teens commonly begin using chatbots for emotional support or creative expression, but many develop strong attachments that interfere with offline relationships and daily routines. Their posts revealed recurring signs of psychological distress, cycles of relapse, and difficulty disengaging. Teens reported that their overreliance often ended when they reflect on the harm, return to in-person social settings, or become frustrated by platform restrictions. Based on the implications of our findings, we provide recommendations for future chatbot design so they can promote self-awareness, support real-world engagement, and involve teens in developing safer digital tools.
zh

[AI-7] Dynamics is what you need for time-series forecasting!

【速读】:该论文试图解决当前深度学习模型在时间序列预测任务中表现不佳的问题,尽管数据模态之间的界限正在消失,但传统深度模型仍难以超越简单模型。其核心假设是,此类任务需要能够学习数据底层动态特性的模型。解决方案的关键在于引入一个可学习的动力学模块(dynamics block),并将其置于模型末端作为最终预测器;实验证明,这一设计显著提升了模型性能,且动力学模块的位置对预测效果具有决定性影响。

链接: https://arxiv.org/abs/2507.15774
作者: Alexis-Raja Brachet,Pierre-Yves Richard,Céline Hudelot
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures, 1 table

点击查看摘要

Abstract:While boundaries between data modalities are vanishing, the usual successful deep models are still challenged by simple ones in the time-series forecasting task. Our hypothesis is that this task needs models that are able to learn the data underlying dynamics. We propose to validate it through both systemic and empirical studies. We develop an original \textttPRO-DYN nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerged: \textbf1 . under-performing architectures learn dynamics at most partially, \textbf2 . the location of the dynamics block at the model end is of prime importance. We conduct extensive experiments to confirm our observations on a set of performance-varying models with diverse backbones. Results support the need to incorporate a learnable dynamics block and its use as the final predictor.
zh

[AI-8] Deep-Learning Investigation of Vibrational Raman Spectra for Plant-Stress Analysis

【速读】:该论文旨在解决植物胁迫检测中传统拉曼光谱分析依赖人工预处理流程所带来的偏差与不一致性问题,尤其在荧光背景去除和特征峰识别环节易引入主观因素。其解决方案的关键在于提出了一种基于变分自编码器(Variational Autoencoder, VAE)的全自动深度学习工作流DIVA(Deep-learning-based Investigation of Vibrational Raman spectra for plant-stress Analysis),该方法可直接处理包含荧光背景的原始拉曼光谱数据,无需人工干预即可无偏地识别和量化关键光谱特征,从而实现对多种植物胁迫(包括非生物胁迫如遮荫、强光、高温及生物胁迫如细菌感染)的精准检测与持续健康监测。

链接: https://arxiv.org/abs/2507.15772
作者: Anoop C. Patil,Benny Jian Rong Sng,Yu-Wei Chang,Joana B. Pereira,Chua Nam-Hai,Rajani Sarojam,Gajendra Pratap Singh,In-Cheol Jang,Giovanni Volpe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: *Authors contributed equally to this work. +Supervised this work. 5 main figures and 1 extended data figure in manuscript. The PDF includes supplementary material

点击查看摘要

Abstract:Detecting stress in plants is crucial for both open-farm and controlled-environment agriculture. Biomolecules within plants serve as key stress indicators, offering vital markers for continuous health monitoring and early disease detection. Raman spectroscopy provides a powerful, non-invasive means to quantify these biomolecules through their molecular vibrational signatures. However, traditional Raman analysis relies on customized data-processing workflows that require fluorescence background removal and prior identification of Raman peaks of interest-introducing potential biases and inconsistencies. Here, we introduce DIVA (Deep-learning-based Investigation of Vibrational Raman spectra for plant-stress Analysis), a fully automated workflow based on a variational autoencoder. Unlike conventional approaches, DIVA processes native Raman spectra-including fluorescence backgrounds-without manual preprocessing, identifying and quantifying significant spectral features in an unbiased manner. We applied DIVA to detect a range of plant stresses, including abiotic (shading, high light intensity, high temperature) and biotic stressors (bacterial infections). By integrating deep learning with vibrational spectroscopy, DIVA paves the way for AI-driven plant health assessment, fostering more resilient and sustainable agricultural practices.
zh

[AI-9] Left Leaning Models: AI Assumptions on Economic Policy

链接: https://arxiv.org/abs/2507.15771
作者: Maxim Chupilkin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 8 pages, 5 tables

点击查看摘要

[AI-10] A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM -based Agent Intention Mining

【速读】:该论文旨在解决服务生态系统中由于智能体(Intelligent Agent)间复杂交互导致的异常涌现现象(Abnormal Emergence)难以分析的问题,传统因果方法仅关注个体轨迹,无法捕捉群体层面的动态演化。其解决方案的关键在于提出一种基于多智能体意图的涌现分析框架(EAMI),通过双视角思维链机制(Dual-Perspective Thought Track Mechanism)分别从有限理性和完全理性角度提取智能体意图,并结合k-means聚类识别群体意图的相变点,最终构建意图时间涌现图(Intention Temporal Emergence Diagram)实现动态、可解释的涌现分析。

链接: https://arxiv.org/abs/2507.15770
作者: Yifan Shen,Zihan Zhao,Xiao Xue,Yuwei Guo,Qun Ma,Deyu Zhou,Ming Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rise of service computing, cloud computing, and IoT, service ecosystems are becoming increasingly complex. The intricate interactions among intelligent agents make abnormal emergence analysis challenging, as traditional causal methods focus on individual trajectories. Large language models offer new possibilities for Agent-Based Modeling (ABM) through Chain-of-Thought (CoT) reasoning to reveal agent intentions. However, existing approaches remain limited to microscopic and static analysis. This paper introduces a framework: Emergence Analysis based on Multi-Agent Intention (EAMI), which enables dynamic and interpretable emergence analysis. EAMI first employs a dual-perspective thought track mechanism, where an Inspector Agent and an Analysis Agent extract agent intentions under bounded and perfect rationality. Then, k-means clustering identifies phase transition points in group intentions, followed by a Intention Temporal Emergence diagram for dynamic analysis. The experiments validate EAMI in complex online-to-offline (O2O) service system and the Stanford AI Town experiment, with ablation studies confirming its effectiveness, generalizability, and efficiency. This framework provides a novel paradigm for abnormal emergence and causal analysis in service ecosystems. The code is available at this https URL.
zh

[AI-11] GasAgent : A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

【速读】:该论文旨在解决智能合约中因非最优编码实践导致的Gas浪费问题,现有解决方案多依赖人工发现,效率低、难以维护且扩展性差;同时,基于大语言模型(Large Language Models, LLMs)的新方法虽能探索潜在Gas浪费模式,但存在与已有模式不兼容、冗余模式频出及需人工验证等问题。其关键解决方案是提出GasAgent——首个用于智能合约Gas优化的多代理系统,通过四个专业化代理(Seeker、Innovator、Executor和Manager)在闭环协作中实现对Gas节省改进的自动化识别、验证与应用,兼顾对现有优化模式的兼容性与新模式的自动发现与验证,从而实现端到端的智能合约Gas优化。

链接: https://arxiv.org/abs/2507.15761
作者: Jingyi Zheng,Zifan Peng,Yule Liu,Junfeng Wang,Yifan Liao,Wenhan Dong,Xinlei He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Smart contracts are trustworthy, immutable, and automatically executed programs on the blockchain. Their execution requires the Gas mechanism to ensure efficiency and fairness. However, due to non-optimal coding practices, many contracts contain Gas waste patterns that need to be optimized. Existing solutions mostly rely on manual discovery, which is inefficient, costly to maintain, and difficult to scale. Recent research uses large language models (LLMs) to explore new Gas waste patterns. However, it struggles to remain compatible with existing patterns, often produces redundant patterns, and requires manual validation/rewriting. To address this gap, we present GasAgent, the first multi-agent system for smart contract Gas optimization that combines compatibility with existing patterns and automated discovery/validation of new patterns, enabling end-to-end optimization. GasAgent consists of four specialized agents, Seeker, Innovator, Executor, and Manager, that collaborate in a closed loop to identify, validate, and apply Gas-saving improvements. Experiments on 100 verified real-world contracts demonstrate that GasAgent successfully optimizes 82 contracts, achieving an average deployment Gas savings of 9.97%. In addition, our evaluation confirms its compatibility with existing tools and validates the effectiveness of each module through ablation studies. To assess broader usability, we further evaluate 500 contracts generated by five representative LLMs across 10 categories and find that GasAgent optimizes 79.8% of them, with deployment Gas savings ranging from 4.79% to 13.93%, showing its usability as the optimization layer for LLM-assisted smart contract development.
zh

[AI-12] DiffuMeta: Algebraic Language Models for Inverse Design of Metamaterials via Diffusion Transformers

【速读】:该论文旨在解决三维超材料(metamaterials)逆向设计中因计算复杂度高和设计空间表达能力不足而导致的挑战,尤其是如何高效生成具有特定力学响应的结构。其解决方案的关键在于提出DiffuMeta框架,该框架结合扩散变换器(diffusion transformers)与一种新颖的代数语言表示法(algebraic language representation),将3D几何结构编码为数学语句,从而实现对多样化拓扑结构的紧凑统一参数化,并直接应用Transformer模型进行结构生成。此方法不仅能够生成具有精确应力-应变响应的壳体结构,还能在大变形条件下考虑屈曲和接触效应,并通过处理一对多映射关系产生多样解,同时支持对多个机械目标(包括训练域外的线性和非线性响应)的协同控制,实验验证进一步证明了其在加速定制化超材料设计中的有效性。

链接: https://arxiv.org/abs/2507.15753
作者: Li Zheng,Siddhant Kumar,Dennis M. Kochmann
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative machine learning models have revolutionized material discovery by capturing complex structure-property relationships, yet extending these approaches to the inverse design of three-dimensional metamaterials remains limited by computational complexity and underexplored design spaces due to the lack of expressive representations. Here, we present DiffuMeta, a generative framework integrating diffusion transformers with a novel algebraic language representation, encoding 3D geometries as mathematical sentences. This compact, unified parameterization spans diverse topologies while enabling direct application of transformers to structural design. DiffuMeta leverages diffusion models to generate novel shell structures with precisely targeted stress-strain responses under large deformations, accounting for buckling and contact while addressing the inherent one-to-many mapping by producing diverse solutions. Uniquely, our approach enables simultaneous control over multiple mechanical objectives, including linear and nonlinear responses beyond training domains. Experimental validation of fabricated structures further confirms the efficacy of our approach for accelerated design of metamaterials and structures with tailored properties.
zh

[AI-13] Explainable Anomaly Detection for Electric Vehicles Charging Stations

【速读】:该论文旨在解决电动汽车(Electric Vehicles, EV)充电基础设施中异常行为的检测与成因分析问题,以提升充电站的可靠性与效率。其关键解决方案是结合无监督异常检测技术与可解释人工智能(Explainable Artificial Intelligence, XAI)方法:首先使用孤立森林(Isolation Forest)识别充电行为中的异常模式,随后引入基于深度的孤立森林特征重要性(Depth-based Isolation Forest Feature Importance, DIFFI)方法,量化各特征对异常贡献的程度,从而揭示异常的根本原因,实现从“发现异常”到“理解异常”的跨越。

链接: https://arxiv.org/abs/2507.15718
作者: Matteo Cederle,Andrea Mazzucco,Andrea Demartini,Eugenio Mazza,Eugenia Suriani,Federico Vitti,Gian Antonio Susto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures. Paper accepted to J3C 2025 (Joint Conference on Computers, Cognition and Communication)

点击查看摘要

Abstract:Electric vehicles (EV) charging stations are one of the critical infrastructures needed to support the transition to renewable-energy-based mobility, but ensuring their reliability and efficiency requires effective anomaly detection to identify irregularities in charging behavior. However, in such a productive scenario, it is also crucial to determine the underlying cause behind the detected anomalies. To achieve this goal, this study investigates unsupervised anomaly detection techniques for EV charging infrastructure, integrating eXplainable Artificial Intelligence techniques to enhance interpretability and uncover root causes of anomalies. Using real-world sensors and charging session data, this work applies Isolation Forest to detect anomalies and employs the Depth-based Isolation Forest Feature Importance (DIFFI) method to identify the most important features contributing to such anomalies. The efficacy of the proposed approach is evaluated in a real industrial case. Comments: 4 pages, 3 figures. Paper accepted to J3C 2025 (Joint Conference on Computers, Cognition and Communication) Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15718 [cs.LG] (or arXiv:2507.15718v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.15718 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-14] Agent ic AI for autonomous anomaly management in complex systems

【速读】:该论文旨在解决复杂系统中异常检测与响应依赖人工干预的效率低下问题,其解决方案的关键在于利用代理型人工智能(agentic AI)实现异常的自主检测与响应,从而变革传统以人力为核心的异常管理方式。

链接: https://arxiv.org/abs/2507.15676
作者: Reza Vatankhah Barenji,Sina Khoshgoftar
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:This paper explores the potential of agentic AI in autonomously detecting and responding to anomalies within complex systems, emphasizing its ability to transform traditional, human-dependent anomaly management methods.
zh

[AI-15] SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models

【速读】:该论文旨在解决生成式 AI(Generative AI)中文本到图像生成模型(如 Stable Diffusion)在社会和环境可持续性方面存在的问题,包括图像生成过程中可能引入的性别与种族偏见以及高能耗问题。其解决方案的关键在于提出 SustainDiffusion——一种基于搜索的优化方法,通过自动寻找最优超参数组合和提示结构,在不修改模型架构或进行微调的前提下,显著降低偏见并减少能源消耗,同时保持与原模型相当的图像质量。实证结果表明,SustainDiffusion 在 SD3 模型上可分别降低 68% 的性别偏见、59% 的种族偏见及 48% 的能量消耗,且结果具有良好的稳定性和泛化能力。

链接: https://arxiv.org/abs/2507.15663
作者: Giordano d’Aloisio,Tosin Fadahunsi,Jay Choy,Rebecca Moussa,Federica Sarro
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: Text-to-image generation models are widely used across numerous domains. Among these models, Stable Diffusion (SD) - an open-source text-to-image generation model - has become the most popular, producing over 12 billion images annually. However, the widespread use of these models raises concerns regarding their social and environmental sustainability. Aims: To reduce the harm that SD models may have on society and the environment, we introduce SustainDiffusion, a search-based approach designed to enhance the social and environmental sustainability of SD models. Method: SustainDiffusion searches the optimal combination of hyperparameters and prompt structures that can reduce gender and ethnic bias in generated images while also lowering the energy consumption required for image generation. Importantly, SustainDiffusion maintains image quality comparable to that of the original SD model. Results: We conduct a comprehensive empirical evaluation of SustainDiffusion, testing it against six different baselines using 56 different prompts. Our results demonstrate that SustainDiffusion can reduce gender bias in SD3 by 68%, ethnic bias by 59%, and energy consumption (calculated as the sum of CPU and GPU energy) by 48%. Additionally, the outcomes produced by SustainDiffusion are consistent across multiple runs and can be generalised to various prompts. Conclusions: With SustainDiffusion, we demonstrate how enhancing the social and environmental sustainability of text-to-image generation models is possible without fine-tuning or changing the model’s architecture. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15663 [cs.SE] (or arXiv:2507.15663v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.15663 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Giordano D’Aloisio [view email] [v1] Mon, 21 Jul 2025 14:24:31 UTC (605 KB) Full-text links: Access Paper: View a PDF of the paper titled SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models, by Giordano d’Aloisio and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.SE prev | next new | recent | 2025-07 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-16] owards Explainable Anomaly Detection in Shared Mobility Systems

【速读】:该论文旨在解决共享出行系统(如共享单车网络)中异常行为的识别问题,以优化运营效率、提升服务可靠性并改善用户体验。其解决方案的关键在于构建一个可解释的异常检测框架,该框架融合多源数据(包括共享单车行程记录、天气状况和公共交通可用性),采用孤立森林(Isolation Forest)算法进行无监督异常检测,并引入基于深度的孤立森林特征重要性(DIFFI)算法实现结果的可解释性,从而在站点层面提供对异常成因的深入理解,特别是揭示了恶劣天气和公共交通受限等外部因素的影响。

链接: https://arxiv.org/abs/2507.15643
作者: Elnur Isgandarov,Matteo Cederle,Federico Chiariotti,Gian Antonio Susto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 8 figures. Paper accepted to J3C 2025 (Joint Conference on Computers, Cognition and Communication

点击查看摘要

Abstract:Shared mobility systems, such as bike-sharing networks, play a crucial role in urban transportation. Identifying anomalies in these systems is essential for optimizing operations, improving service reliability, and enhancing user experience. This paper presents an interpretable anomaly detection framework that integrates multi-source data, including bike-sharing trip records, weather conditions, and public transit availability. The Isolation Forest algorithm is employed for unsupervised anomaly detection, along with the Depth-based Isolation Forest Feature Importance (DIFFI) algorithm providing interpretability. Results show that station-level analysis offers a robust understanding of anomalies, highlighting the influence of external factors such as adverse weather and limited transit availability. Our findings contribute to improving decision-making in shared mobility operations.
zh

[AI-17] acticCraft: Natural Language-Driven Tactical Adaptation for StarCraft II

【速读】:该论文旨在解决当前星际争霸II(StarCraft II)AI代理缺乏根据高层战术指令动态调整策略能力的问题。解决方案的关键在于:在预训练策略网络(DI-Star)基础上,为每个动作头附加轻量级适配器模块(adapter modules),并通过一个编码战略偏好的战术张量(tactical tensor)对这些适配器进行条件控制;同时,在训练中引入KL散度约束,确保策略在保持核心能力的同时实现战术层面的多样性变化。这种方法实现了灵活的战术调控,且计算开销极低,适用于复杂实时战略游戏中的策略定制需求。

链接: https://arxiv.org/abs/2507.15618
作者: Weiyu Ma,Jiwen Jiang,Haobo Fu,Haifeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present an adapter-based approach for tactical conditioning of StarCraft II AI agents. Current agents, while powerful, lack the ability to adapt their strategies based on high-level tactical directives. Our method freezes a pre-trained policy network (DI-Star) and attaches lightweight adapter modules to each action head, conditioned on a tactical tensor that encodes strategic preferences. By training these adapters with KL divergence constraints, we ensure the policy maintains core competencies while exhibiting tactical variations. Experimental results show our approach successfully modulates agent behavior across tactical dimensions including aggression, expansion patterns, and technology preferences, while maintaining competitive performance. Our method enables flexible tactical control with minimal computational overhead, offering practical strategy customization for complex real-time strategy games.
zh

[AI-18] Why cant Epidemiology be automated (yet)?

【速读】:该论文试图解决的问题是:当前生成式 AI(Generative AI)在流行病学研究中的应用潜力尚不明确,尤其是在具体任务层面哪些环节可受益于 AI 干预、存在哪些技术与系统性障碍,以及如何有效整合 AI 工具以提升研究效率。其解决方案的关键在于构建一个覆盖从文献综述到数据获取、分析、撰写与传播的全流程流行病学任务地图,并基于现有 AI 工具评估其在各环节的效率增益;同时通过实例展示代理型(agentic)系统已能自主设计并执行流行病学分析,尽管质量参差不齐,这为流行病学家提供了实证测试和基准评估 AI 系统的新机会,最终实现流行病学与人工智能工程师之间的双向协作以释放 AI 潜力。

链接: https://arxiv.org/abs/2507.15617
作者: David Bann,Ed Lowther,Liam Wright,Yevgeniya Kovalchuk
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 1 table

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) - particularly generative AI - present new opportunities to accelerate, or even automate, epidemiological research. Unlike disciplines based on physical experimentation, a sizable fraction of Epidemiology relies on secondary data analysis and thus is well-suited for such augmentation. Yet, it remains unclear which specific tasks can benefit from AI interventions or where roadblocks exist. Awareness of current AI capabilities is also mixed. Here, we map the landscape of epidemiological tasks using existing datasets - from literature review to data access, analysis, writing up, and dissemination - and identify where existing AI tools offer efficiency gains. While AI can increase productivity in some areas such as coding and administrative tasks, its utility is constrained by limitations of existing AI models (e.g. hallucinations in literature reviews) and human systems (e.g. barriers to accessing datasets). Through examples of AI-generated epidemiological outputs, including fully AI-generated papers, we demonstrate that recently developed agentic systems can now design and execute epidemiological analysis, albeit to varied quality (see this https URL). Epidemiologists have new opportunities to empirically test and benchmark AI systems; realising the potential of AI will require two-way engagement between epidemiologists and engineers.
zh

[AI-19] Accelerating HEC-RAS: A Recurrent Neural Operator for Rapid River Forecasting

【速读】:该论文旨在解决基于物理的水文模型(如HEC-RAS)在洪水事件中实时决策支持时计算成本过高、难以快速响应的问题。其核心挑战是在不牺牲精度的前提下显著加速模拟过程。解决方案的关键在于提出一种混合自回归深度学习代理模型,该模型将门控循环单元(GRU)与几何感知傅里叶神经算子(Geo-FNO)相结合:GRU用于捕捉河流断面的短期时间动态变化,Geo-FNO则建模沿河段的长距离空间依赖关系;模型通过从原生HEC-RAS文件中提取的8通道特征向量(包含动态状态、静态几何和边界强迫信息)隐式学习物理规律,从而实现对复杂水力过程的高效高精度预测。

链接: https://arxiv.org/abs/2507.15614
作者: Edward Holmberg,Pujan Pokhrel,Maximilian Zoch,Elias Ioup,Ken Pathak,Steven Sloan,Kendall Niles,Jay Ratcliff,Maik Flanagin,Christian Guetl,Julian Simeonov,Mahdi Abdelguerfi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:Physics-based solvers like HEC-RAS provide high-fidelity river forecasts but are too computationally intensive for on-the-fly decision-making during flood events. The central challenge is to accelerate these simulations without sacrificing accuracy. This paper introduces a deep learning surrogate that treats HEC-RAS not as a solver but as a data-generation engine. We propose a hybrid, auto-regressive architecture that combines a Gated Recurrent Unit (GRU) to capture short-term temporal dynamics with a Geometry-Aware Fourier Neural Operator (Geo-FNO) to model long-range spatial dependencies along a river reach. The model learns underlying physics implicitly from a minimal eight-channel feature vector encoding dynamic state, static geometry, and boundary forcings extracted directly from native HEC-RAS files. Trained on 67 reaches of the Mississippi River Basin, the surrogate was evaluated on a year-long, unseen hold-out simulation. Results show the model achieves a strong predictive accuracy, with a median absolute stage error of 0.31 feet. Critically, for a full 67-reach ensemble forecast, our surrogate reduces the required wall-clock time from 139 minutes to 40 minutes, a speedup of nearly 3.5 times over the traditional solver. The success of this data-driven approach demonstrates that robust feature engineering can produce a viable, high-speed replacement for conventional hydraulic models, improving the computational feasibility of large-scale ensemble flood forecasting.
zh

[AI-20] Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems

【速读】:该论文针对企业环境中部署的大语言模型(Large Language Models, LLMs)所面临的一种新型安全威胁——多阶段提示推理攻击(multi-stage prompt inference attacks)展开研究。此类攻击通过串联看似无害的提示(prompts),逐步诱导LLM泄露敏感信息(如内部SharePoint文档或邮件),即使在采用标准安全措施的情况下仍能成功。解决方案的关键在于构建一个形式化的威胁模型,并从概率论、优化框架和信息论泄漏边界等角度对攻击进行量化分析;在此基础上提出多层次防御策略,包括基于差分隐私训练的信息泄漏边界控制、异常检测机制(AUC较高)、细粒度访问控制、提示净化技术以及架构级改进,其中“聚光灯”(spotlighting)方法通过输入变换隔离不可信提示内容,使攻击成功率降低一个数量级。最终验证了融合多种防御手段的纵深防御(defense-in-depth)策略的有效性,强调企业级LLM安全需从单轮提示过滤转向多阶段攻防视角。

链接: https://arxiv.org/abs/2507.15613
作者: Andrii Balashov,Olena Ponomarova,Xiaohua Zhai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 26 pages

点击查看摘要

Abstract:Large Language Models (LLMs) deployed in enterprise settings (e.g., as Microsoft 365 Copilot) face novel security challenges. One critical threat is prompt inference attacks: adversaries chain together seemingly benign prompts to gradually extract confidential data. In this paper, we present a comprehensive study of multi-stage prompt inference attacks in an enterprise LLM context. We simulate realistic attack scenarios where an attacker uses mild-mannered queries and indirect prompt injections to exploit an LLM integrated with private corporate data. We develop a formal threat model for these multi-turn inference attacks and analyze them using probability theory, optimization frameworks, and information-theoretic leakage bounds. The attacks are shown to reliably exfiltrate sensitive information from the LLM’s context (e.g., internal SharePoint documents or emails), even when standard safety measures are in place. We propose and evaluate defenses to counter such attacks, including statistical anomaly detection, fine-grained access control, prompt sanitization techniques, and architectural modifications to LLM deployment. Each defense is supported by mathematical analysis or experimental simulation. For example, we derive bounds on information leakage under differential privacy-based training and demonstrate an anomaly detection method that flags multi-turn attacks with high AUC. We also introduce an approach called “spotlighting” that uses input transformations to isolate untrusted prompt content, reducing attack success by an order of magnitude. Finally, we provide a formal proof of concept and empirical validation for a combined defense-in-depth strategy. Our work highlights that securing LLMs in enterprise settings requires moving beyond single-turn prompt filtering toward a holistic, multi-stage perspective on both attacks and defenses. Comments: 26 pages Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15613 [cs.CR] (or arXiv:2507.15613v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2507.15613 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-21] Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

【速读】:该论文旨在解决当前安全关键场景下决策研究中因依赖低效的数据驱动场景生成或特定建模方法而导致无法充分捕捉真实世界中边缘案例(corner cases)的问题。解决方案的关键在于提出一种红队多智能体强化学习框架(Red-Team Multi-Agent Reinforcement Learning framework),将具备干扰能力的背景车辆视为红队智能体,通过主动干扰与探索,挖掘超出数据分布的极端场景;同时引入约束图表示的马尔可夫决策过程(Constraint Graph Representation Markov Decision Process)确保红队车辆遵守安全规则但持续扰动自动驾驶车辆(AVs),并构建策略威胁区模型(policy threat zone model)量化对AV的威胁程度,从而诱导更极端行为以提升场景危险等级,有效增强AV决策安全性。

链接: https://arxiv.org/abs/2507.15587
作者: Yinsong Chen,Kaifeng Wang,Xiaoqiang Meng,Xueyuan Li,Zirui Li,Xin Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current research on decision-making in safety-critical scenarios often relies on inefficient data-driven scenario generation or specific modeling approaches, which fail to capture corner cases in real-world contexts. To address this issue, we propose a Red-Team Multi-Agent Reinforcement Learning framework, where background vehicles with interference capabilities are treated as red-team agents. Through active interference and exploration, red-team vehicles can uncover corner cases outside the data distribution. The framework uses a Constraint Graph Representation Markov Decision Process, ensuring that red-team vehicles comply with safety rules while continuously disrupting the autonomous vehicles (AVs). A policy threat zone model is constructed to quantify the threat posed by red-team vehicles to AVs, inducing more extreme actions to increase the danger level of the scenario. Experimental results show that the proposed framework significantly impacts AVs decision-making safety and generates various corner cases. This method also offers a novel direction for research in safety-critical scenarios.
zh

[AI-22] Unequal Voices: How LLM s Construct Constrained Queer Narratives

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成关于性少数群体(queer people)相关内容时存在的刻板化、受限及话语他者化问题。其核心问题是:LLMs 在叙事中往往将性少数群体局限于狭隘的主题范畴,缺乏对个体复杂性的呈现,而主流群体则享有更丰富的表达空间。解决方案的关键在于通过识别和量化三种有害表征形式——有害表征(harmful representations)、窄化表征(narrow representations)和话语他者化(discursive othering),并据此提出可检验的假设,从而系统评估LLMs在性别与性取向多样性上的表现局限,为改进模型偏见提供实证基础。

链接: https://arxiv.org/abs/2507.15585
作者: Atreya Ghosal,Ashim Gupta,Vivek Srikumar
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One way social groups are marginalized in discourse is that the narratives told about them often default to a narrow, stereotyped range of topics. In contrast, default groups are allowed the full complexity of human existence. We describe the constrained representations of queer people in LLM generations in terms of harmful representations, narrow representations, and discursive othering and formulate hypotheses to test for these phenomena. Our results show that LLMs are significantly limited in their portrayals of queer personas.
zh

[AI-23] Metric assessment protocol in the context of answer fluctuation on MCQ tasks

【速读】:该论文旨在解决多选题(Multiple-Choice Questions, MCQs)评估大语言模型(Large Language Models, LLMs)能力时存在的两个核心问题:一是现有评估指标缺乏系统性比较与验证,二是模型在微小提示变化下出现答案波动(answer fluctuation)的现象未被充分纳入评估框架。其解决方案的关键在于提出一种基于“波动率”(fluctuation rates)的指标评估协议,通过量化不同评估方法与答案波动之间的关联性,结合原始性能表现来综合判断指标的有效性;实验表明,该协议揭示了现有指标与答案波动之间存在强相关性,且新提出的“最差准确率”(worst accuracy)指标在此协议下展现出最高的关联度,从而为MCQ评估提供了更稳定、可靠的衡量标准。

链接: https://arxiv.org/abs/2507.15581
作者: Ekaterina Goliakova,Xavier Renard,Marie-Jeanne Lesot,Thibault Laugel,Christophe Marsala,Marcin Detyniecki
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and the answer changing, even when computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association on the protocol.
zh

[AI-24] On the Role of AI in Managing Satellite Constellations: Insights from the ConstellAI Project

【速读】:该论文旨在解决近地轨道卫星星座快速扩张背景下,卫星网络管理面临的高效性、可扩展性和鲁棒性挑战,尤其是在数据路由和资源分配两大关键操作环节。解决方案的核心在于引入强化学习(Reinforcement Learning, RL)算法,通过从历史队列延迟中学习优化端到端延迟,在数据路由场景中超越传统最短路径算法;同时在资源分配场景中,利用RL优化任务调度以高效利用电池与内存等有限资源。实验验证表明,RL不仅性能优越,还具备更强的灵活性、可扩展性和泛化能力,为卫星编队的自主智能管理提供了更适应复杂动态环境的决策机制。

链接: https://arxiv.org/abs/2507.15574
作者: Gregory F. Stock,Juan A. Fraire,Holger Hermanns,Jędrzej Mosiężny,Yusra Al-Khazraji,Julio Ramírez Molina,Evridiki V. Ntagiou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18th International Conference on Space Operations (SpaceOps 2025), Montréal, Canada, 26-30 May 2025, this https URL

点击查看摘要

Abstract:The rapid expansion of satellite constellations in near-Earth orbits presents significant challenges in satellite network management, requiring innovative approaches for efficient, scalable, and resilient operations. This paper explores the role of Artificial Intelligence (AI) in optimizing the operation of satellite mega-constellations, drawing from the ConstellAI project funded by the European Space Agency (ESA). A consortium comprising GMV GmbH, Saarland University, and Thales Alenia Space collaborates to develop AI-driven algorithms and demonstrates their effectiveness over traditional methods for two crucial operational challenges: data routing and resource allocation. In the routing use case, Reinforcement Learning (RL) is used to improve the end-to-end latency by learning from historical queuing latency, outperforming classical shortest path algorithms. For resource allocation, RL optimizes the scheduling of tasks across constellations, focussing on efficiently using limited resources such as battery and memory. Both use cases were tested for multiple satellite constellation configurations and operational scenarios, resembling the real-life spacecraft operations of communications and Earth observation satellites. This research demonstrates that RL not only competes with classical approaches but also offers enhanced flexibility, scalability, and generalizability in decision-making processes, which is crucial for the autonomous and intelligent management of satellite fleets. The findings of this activity suggest that AI can fundamentally alter the landscape of satellite constellation management by providing more adaptive, robust, and cost-effective solutions.
zh

[AI-25] PhysGym: Benchmarking LLM s in Interactive Physics Discovery with Controlled Priors

【速读】:该论文旨在解决当前缺乏专门用于评估基于大语言模型(Large Language Model, LLM)智能体在科学发现能力方面表现的基准测试工具的问题,尤其是其在不同环境复杂度下如何利用先验知识进行推理的能力。解决方案的关键在于提出PhysGym——一个全新的基准套件与仿真平台,通过精细控制提供给智能体的先验知识水平,使研究者能够系统性地分析智能体在问题复杂度和知识水平两个维度上的性能差异。该平台包含一系列交互式物理模拟环境,要求智能体在约束条件下主动探测、顺序收集数据并提出关于底层物理规律的假设,同时提供标准化的评估协议与指标以衡量假设准确性与模型保真度。

链接: https://arxiv.org/abs/2507.15550
作者: Yimeng Chen,Piotr Piȩkos,Mateusz Ostaszewski,Firas Laakom,Jürgen Schmidhuber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注: 31 Pages

点击查看摘要

Abstract:Evaluating the scientific discovery capabilities of large language model based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks currently lacking in the landscape. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym’s primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent. This allows researchers to dissect agent performance along axes including the complexity of the problem and the prior knowledge levels. The benchmark comprises a suite of interactive simulations, where agents must actively probe environments, gather data sequentially under constraints and formulate hypotheses about underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark’s utility by presenting results from baseline LLMs, showcasing its ability to differentiate capabilities based on varying priors and task complexity.
zh

[AI-26] Data-Efficient Safe Policy Improvement Using Parametric Structure ECAI2025

【速读】:该论文致力于解决离线强化学习中的安全策略改进(Safe Policy Improvement, SPI)问题,即在仅使用数据集和行为策略的前提下,计算出一个能够以高置信度可靠优于行为策略的新策略。其核心挑战在于如何在有限数据下实现高效且可靠的策略优化。解决方案的关键在于三点:首先提出一种参数化SPI算法,利用状态转移分布间的已知参数依赖关系,提升对环境动态的估计精度;其次引入基于博弈论抽象的预处理技术,识别并移除冗余动作以简化问题;最后采用基于Satisfiability Modulo Theories (SMT) 求解的更高级预处理方法,进一步挖掘可剪枝的动作空间。实证结果表明,这些方法在保持相同可靠性保证的同时,使SPI的数据效率提升多个数量级。

链接: https://arxiv.org/abs/2507.15532
作者: Kasper Engelen,Guillermo A. Pérez,Marnix Suilen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ECAI 2025

点击查看摘要

Abstract:Safe policy improvement (SPI) is an offline reinforcement learning problem in which a new policy that reliably outperforms the behavior policy with high confidence needs to be computed using only a dataset and the behavior policy. Markov decision processes (MDPs) are the standard formalism for modeling environments in SPI. In many applications, additional information in the form of parametric dependencies between distributions in the transition dynamics is available. We make SPI more data-efficient by leveraging these dependencies through three contributions: (1) a parametric SPI algorithm that exploits known correlations between distributions to more accurately estimate the transition dynamics using the same amount of data; (2) a preprocessing technique that prunes redundant actions from the environment through a game-based abstraction; and (3) a more advanced preprocessing technique, based on satisfiability modulo theory (SMT) solving, that can identify more actions to prune. Empirical results and an ablation study show that our techniques increase the data efficiency of SPI by multiple orders of magnitude while maintaining the same reliability guarantees.
zh

[AI-27] LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning NEURIPS2025

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否具备构建和操作内部世界模型的能力,而非仅依赖输出层词元概率所表征的统计关联这一核心问题。其解决方案的关键在于借鉴认知科学中研究人类心理模型的方法,设计三组基于TikZ渲染滑轮系统的情境实验:首先通过机械优势(Mechanical Advantage, MA)估计任务验证LLMs能否识别滑轮数量与MA之间的统计规律(Study 1);其次通过对比功能连通系统与随机组件系统,检验LLMs是否能识别关键全局特征以区分功能性结构(Study 2);最后通过比较功能系统与等效但无力传递能力的“伪系统”,考察其对结构连通性推理的能力(Study 3)。结果表明,LLMs在一定程度上可利用滑轮计数启发式估算MA并近似表征空间关系,但在缺乏显式线索时难以准确理解结构连通性的因果机制,提示其世界建模能力仍局限于浅层统计关联与局部结构感知,尚未达到深层次因果推理水平。

链接: https://arxiv.org/abs/2507.15521
作者: Cole Robertson,Philip Wolff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Manuscript comprises 14 pages, 4 figures, 4 tables in the Technical Appendix and Supplementary Material, and is under review at NeurIPS 2025

点击查看摘要

Abstract:Do large language models (LLMs) construct and manipulate internal world models, or do they rely solely on statistical associations represented as output layer token probabilities? We adapt cognitive science methodologies from human mental models research to test LLMs on pulley system problems using TikZ-rendered stimuli. Study 1 examines whether LLMs can estimate mechanical advantage (MA). State-of-the-art models performed marginally but significantly above chance, and their estimates correlated significantly with ground-truth MA. Significant correlations between number of pulleys and model estimates suggest that models employed a pulley counting heuristic, without necessarily simulating pulley systems to derive precise values. Study 2 tested this by probing whether LLMs represent global features crucial to MA estimation. Models evaluated a functionally connected pulley system against a fake system with randomly placed components. Without explicit cues, models identified the functional system as having greater MA with F1=0.8, suggesting LLMs could represent systems well enough to differentiate jumbled from functional systems. Study 3 built on this by asking LLMs to compare functional systems with matched systems which were connected up but which transferred no force to the weight; LLMs identified the functional system with F1=0.46, suggesting random guessing. Insofar as they may generalize, these findings are compatible with the notion that LLMs manipulate internal world models, sufficient to exploit statistical associations between pulley count and MA (Study 1), and to approximately represent system components’ spatial relations (Study 2). However, they may lack the facility to reason over nuanced structural connectivity (Study 3). We conclude by advocating the utility of cognitive scientific methods to evaluate the world-modeling capacities of artificial intelligence systems.
zh

[AI-28] HAMLET: Hyperadaptive Agent -based Modeling for Live Embodied Theatrics

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的戏剧生成方法中存在的两大核心问题:一是AI角色缺乏主动性,难以与物理环境进行有效交互;二是系统通常依赖详尽的用户输入来驱动剧情发展,导致在线实时表演的沉浸感和互动性不足。解决方案的关键在于提出HAMLET框架,该框架通过构建一个多智能体系统,在给定简单主题后自动生成叙事蓝图,并在在线表演阶段赋予每个演员自主心智(autonomous mind),使其能够根据自身背景、目标与情绪状态独立决策;同时,演员可通过操作场景道具(如打开信件或拾取武器)改变环境状态,并将这些变化广播给其他相关角色,从而动态更新其认知与关注点,进一步影响后续行为,形成闭环的、具身化的互动叙事机制。

链接: https://arxiv.org/abs/2507.15518
作者: Sizhou Chen,Shufan Jiang,Chi Zhang,Xiao-Lei Zhang,Xuelong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in AI agents that lack initiative and cannot interact with the physical environment. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences. Our code, dataset and models are available at this https URL.
zh

[AI-29] he Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

【速读】:该论文旨在解决自主代理在不确定环境中保持可靠且符合规则行为的根本性挑战。解决方案的关键在于提出一种名为宪法控制器(Constitutional Controller, CoCo)的神经符号系统框架,该框架通过融合概率性符号白盒推理模型与深度学习方法,实现对显式规则和基于噪声数据训练的神经模型的协同考虑,从而结合结构化推理的优势与灵活表示能力;同时引入“自我怀疑”机制,即基于旅行速度、传感器状态或健康因子等特征条件下的概率密度函数,使代理能够学习并适当地表达不确定性,从而在复杂和不确定环境中安全、合规地导航。

链接: https://arxiv.org/abs/2507.15478
作者: Simon Kohaut,Felix Divo,Navid Hamid,Benedict Flade,Julian Eggert,Devendra Singh Dhami,Kristian Kersting
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensuring reliable and rule-compliant behavior of autonomous agents in uncertain environments remains a fundamental challenge in modern robotics. Our work shows how neuro-symbolic systems, which integrate probabilistic, symbolic white-box reasoning models with deep learning methods, offer a powerful solution to this challenge. This enables the simultaneous consideration of explicit rules and neural models trained on noisy data, combining the strength of structured reasoning with flexible representations. To this end, we introduce the Constitutional Controller (CoCo), a novel framework designed to enhance the safety and reliability of agents by reasoning over deep probabilistic logic programs representing constraints such as those found in shared traffic spaces. Furthermore, we propose the concept of self-doubt, implemented as a probability density conditioned on doubt features such as travel velocity, employed sensors, or health factors. In a real-world aerial mobility study, we demonstrate CoCo’s advantages for intelligent autonomous systems to learn appropriate doubts and navigate complex and uncertain environments safely and compliantly.
zh

[AI-30] he Emergence of Deep Reinforcement Learning for Path Planning

【速读】:该论文旨在解决复杂动态环境中自主系统(如自动驾驶车辆、无人机和机器人平台)的智能路径规划问题,传统方法在应对环境不确定性与高维状态空间时存在适应性不足的问题。其解决方案的关键在于引入深度强化学习(Deep Reinforcement Learning, DRL)技术,使自主代理能够通过与环境交互自主学习最优导航策略;同时强调了将DRL与经典规划方法相结合的混合范式,以兼顾学习模型的适应性优势与传统算法的确定性可靠性,从而提升路径规划的计算效率、可扩展性、鲁棒性和实用性。

链接: https://arxiv.org/abs/2507.15469
作者: Thanh Thi Nguyen,Saeid Nahavandi,Imran Razzak,Dung Nguyen,Nhat Truong Pham,Quoc Viet Hung Nguyen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the Proceedings of the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

点击查看摘要

Abstract:The increasing demand for autonomous systems in complex and dynamic environments has driven significant research into intelligent path planning methodologies. For decades, graph-based search algorithms, linear programming techniques, and evolutionary computation methods have served as foundational approaches in this domain. Recently, deep reinforcement learning (DRL) has emerged as a powerful method for enabling autonomous agents to learn optimal navigation strategies through interaction with their environments. This survey provides a comprehensive overview of traditional approaches as well as the recent advancements in DRL applied to path planning tasks, focusing on autonomous vehicles, drones, and robotic platforms. Key algorithms across both conventional and learning-based paradigms are categorized, with their innovations and practical implementations highlighted. This is followed by a thorough discussion of their respective strengths and limitations in terms of computational efficiency, scalability, adaptability, and robustness. The survey concludes by identifying key open challenges and outlining promising avenues for future research. Special attention is given to hybrid approaches that integrate DRL with classical planning techniques to leverage the benefits of both learning-based adaptability and deterministic reliability, offering promising directions for robust and resilient autonomous navigation.
zh

[AI-31] he New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

【速读】:该论文试图解决传统Transformer模型中多头注意力(Multi-Head Attention, MHA)层作为内存受限瓶颈的问题,这一问题长期驱动了专用硬件加速器的研究。论文指出,近年来架构上的两个关键变化——多头潜在注意力(Multi-head Latent Attention, MLA)和专家混合模型(Mixture-of-Experts, MoE)——正在重塑计算负载的分布特性。解决方案的关键在于:首先,MLA的算术强度(arithmetic intensity)比MHA高出两个数量级,使其接近计算密集型状态,更适合现代GPU等通用加速器;其次,通过将MoE专家分布在多个加速器上并利用批处理调节其算术强度,可使MoE层与前馈层的计算特性趋于平衡,从而减少对专用注意力硬件的依赖。因此,下一代Transformer系统的核心挑战不再是优化单一内存受限层,而是设计具备充足计算能力、内存容量、带宽及高带宽互联的均衡系统以应对大规模模型的多样化需求。

链接: https://arxiv.org/abs/2507.15465
作者: Sungmin Yun,Seonyong Park,Hwayong Nam,Younjoo Lee,Gunjun Lee,Kwanhee Kyung,Sangpyo Kim,Nam Sung Kim,Jongmin Kim,Hyungyo Kim,Juhwan Cho,Seungmin Baek,Jung Ho Ahn
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models. Comments: 15 pages, 11 figures Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15465 [cs.AR] (or arXiv:2507.15465v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2507.15465 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-32] Optimization of Activity Batching Policies in Business Processes

【速读】:该论文旨在解决业务流程中活动批处理策略的自动发现问题,目标是找到在等待时间、处理努力和成本之间实现最优权衡的批处理策略。其核心挑战在于如何系统性地生成并优化多种批处理方案,以同时改善多个性能指标。解决方案的关键在于提出一种基于干预启发式(intervention heuristics)的帕累托优化方法:通过识别每个活动批处理策略中可改进的机会(如减少等待时间或资源利用率优化),并施加针对性调整(即“干预”),再借助模拟评估干预效果;这些启发式被嵌入到元启发式算法(包括爬山法、模拟退火和强化学习)中,迭代更新帕累托前沿,从而高效探索多样且高质量的批处理策略集。

链接: https://arxiv.org/abs/2507.15457
作者: Orlenys López-Pintado,Jannis Rosenbaum,Marlon Dumas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In business processes, activity batching refers to packing multiple activity instances for joint execution. Batching allows managers to trade off cost and processing effort against waiting time. Larger and less frequent batches may lower costs by reducing processing effort and amortizing fixed costs, but they create longer waiting times. In contrast, smaller and more frequent batches reduce waiting times but increase fixed costs and processing effort. A batching policy defines how activity instances are grouped into batches and when each batch is activated. This paper addresses the problem of discovering batching policies that strike optimal trade-offs between waiting time, processing effort, and cost. The paper proposes a Pareto optimization approach that starts from a given set (possibly empty) of activity batching policies and generates alternative policies for each batched activity via intervention heuristics. Each heuristic identifies an opportunity to improve an activity’s batching policy with respect to a metric (waiting time, processing time, cost, or resource utilization) and an associated adjustment to the activity’s batching policy (the intervention). The impact of each intervention is evaluated via simulation. The intervention heuristics are embedded in an optimization meta-heuristic that triggers interventions to iteratively update the Pareto front of the interventions identified so far. The paper considers three meta-heuristics: hill-climbing, simulated annealing, and reinforcement learning. An experimental evaluation compares the proposed approach based on intervention heuristics against the same (non-heuristic guided) meta-heuristics baseline regarding convergence, diversity, and cycle time gain of Pareto-optimal policies.
zh

[AI-33] Solving nonconvex Hamilton–Jacobi–Isaacs equations with PINN-based policy iteration

【速读】:该论文旨在解决高维、非凸哈密顿-雅可比-伊斯克斯(Hamilton–Jacobi–Isaacs, HJI)方程的求解难题,这类方程广泛出现在随机微分博弈和鲁棒控制问题中。传统数值方法在高维场景下面临“维数灾难”,而直接使用物理信息神经网络(Physics-Informed Neural Networks, PINNs)求解非凸HJI方程时往往不稳定且精度不足。解决方案的关键在于提出一种无网格的策略迭代(policy iteration)框架:通过交替执行两步操作——固定反馈策略下求解线性二阶偏微分方程(PDE),以及利用自动微分进行逐点极小极大优化更新控制输入——从而实现对HJI方程的高效逼近。理论分析证明,在标准利普希茨连续性和一致椭圆性假设下,该方法生成的值函数迭代序列局部一致收敛至唯一粘性解,且无需哈密顿量凸性假设即可保证稳定性与收敛性。数值实验表明,该方法在二维路径规划博弈和五至十维出版商-订阅者微分博弈中均展现出高精度和良好可扩展性,优于直接PINN求解器。

链接: https://arxiv.org/abs/2507.15455
作者: Hee Jun Yang,Min Jung Kim,Yeoneung Kim
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
备注:

点击查看摘要

Abstract:We propose a mesh-free policy iteration framework that combines classical dynamic programming with physics-informed neural networks (PINNs) to solve high-dimensional, nonconvex Hamilton–Jacobi–Isaacs (HJI) equations arising in stochastic differential games and robust control. The method alternates between solving linear second-order PDEs under fixed feedback policies and updating the controls via pointwise minimax optimization using automatic differentiation. Under standard Lipschitz and uniform ellipticity assumptions, we prove that the value function iterates converge locally uniformly to the unique viscosity solution of the HJI equation. The analysis establishes equi-Lipschitz regularity of the iterates, enabling provable stability and convergence without requiring convexity of the Hamiltonian. Numerical experiments demonstrate the accuracy and scalability of the method. In a two-dimensional stochastic path-planning game with a moving obstacle, our method matches finite-difference benchmarks with relative L^2 -errors below %10^-2%. In five- and ten-dimensional publisher-subscriber differential games with anisotropic noise, the proposed approach consistently outperforms direct PINN solvers, yielding smoother value functions and lower residuals. Our results suggest that integrating PINNs with policy iteration is a practical and theoretically grounded method for solving high-dimensional, nonconvex HJI equations, with potential applications in robotics, finance, and multi-agent reinforcement learning.
zh

[AI-34] Predictive Process Monitoring Using Object-centric Graph Embeddings

【速读】:该论文旨在解决对象中心预测流程监控(object-centric predictive process monitoring)中如何从对象中心事件日志(object-centric event logs)中有效提取相关信息并构建高性能预测模型的问题。其解决方案的关键在于提出一种端到端模型,该模型结合图注意力网络(graph attention network, GAT)以编码活动及其相互关系,并融合长短期记忆网络(LSTM)以捕捉时间依赖性,从而同时实现对下一活动和下一事件时间的精准预测。

链接: https://arxiv.org/abs/2507.15411
作者: Wissam Gherissi(LAMSADE),Mehdi Acheli,Joyce El Haddad(LAMSADE),Daniela Grigori(LAMSADE)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICSOC Workshops 2024, Dec 2024, Tunis, Tunisia

点击查看摘要

Abstract:Object-centric predictive process monitoring explores and utilizes object-centric event logs to enhance process predictions. The main challenge lies in extracting relevant information and building effective models. In this paper, we propose an end-to-end model that predicts future process behavior, focusing on two tasks: next activity prediction and next event time. The proposed model employs a graph attention network to encode activities and their relationships, combined with an LSTM network to handle temporal dependencies. Evaluated on one reallife and three synthetic event logs, the model demonstrates competitive performance compared to state-of-the-art methods.
zh

[AI-35] Neuro-MSBG: An End-to-End Neural Model for Hearing Loss Simulation

【速读】:该论文旨在解决现有听觉损失模拟模型在实时应用中计算复杂度高、延迟大以及难以与语音处理系统直接集成的问题。其关键解决方案是提出一种轻量级端到端模型Neuro-MSBG,该模型包含个性化听力图编码器(personalized audiogram encoder),能够高效进行时频建模,并支持并行推理,显著提升运行效率——相较于原始MSBG模型,单秒输入的模拟时间从0.970秒降至0.021秒(加速46倍),同时保持了语音可懂度(STOI-SRCC=0.9247)和感知语音质量(PESQ-SRCC=0.8671)的高水平。

链接: https://arxiv.org/abs/2507.15396
作者: Hui-Guan Yuan,Ryandhimas E. Zezario,Shafique Ahmed,Hsin-Min Wang,Kai-Lung Hua,Yu Tsao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Hearing loss simulation models are essential for hearing aid deployment. However, existing models have high computational complexity and latency, which limits real-time applications and lack direct integration with speech processing systems. To address these issues, we propose Neuro-MSBG, a lightweight end-to-end model with a personalized audiogram encoder for effective time-frequency modeling. Experiments show that Neuro-MSBG supports parallel inference and retains the intelligibility and perceptual quality of the original MSBG, with a Spearman’s rank correlation coefficient (SRCC) of 0.9247 for Short-Time Objective Intelligibility (STOI) and 0.8671 for Perceptual Evaluation of Speech Quality (PESQ). Neuro-MSBG reduces simulation runtime by a factor of 46 (from 0.970 seconds to 0.021 seconds for a 1 second input), further demonstrating its efficiency and practicality.
zh

[AI-36] PiMRef: Detecting and Explaining Ever-evolving Spear Phishing Emails with Knowledge Base Invariants

【速读】:该论文旨在解决由大语言模型(Large Language Models, LLMs)生成的高迷惑性钓鱼邮件对传统检测机制失效的问题,此类邮件因具备心理说服力且能针对目标用户画像定制化,几乎可绕过现有商业与学术检测工具。解决方案的关键在于提出PiMRef——首个基于知识库不变性的钓鱼邮件检测框架,其核心思想是将钓鱼检测重构为身份事实核查任务:通过提取邮件中发件人声称的身份信息,验证其域名合法性是否符合预设知识库,并识别诱导用户行为的行动号召(Call-to-Action, CTA)提示;若发现矛盾声明,则标记为钓鱼指标并提供人类可理解的解释,从而在保持高召回率的同时显著提升精度(较D-Fence、HelpHed和ChatSpamDetector等方法提升8.8%),并在真实场景中实现92.1%精度与0.05秒中位运行时间。

链接: https://arxiv.org/abs/2507.15393
作者: Ruofan Liu,Yun Lin,Silas Yeo Shuen Yu,Xiwen Teoh,Zhenkai Liang,Jin Song Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Phishing emails are a critical component of the cybercrime kill chain due to their wide reach and low cost. Their ever-evolving nature renders traditional rule-based and feature-engineered detectors ineffective in the ongoing arms race between attackers and defenders. The rise of large language models (LLMs) further exacerbates the threat, enabling attackers to craft highly convincing phishing emails at minimal cost. This work demonstrates that LLMs can generate psychologically persuasive phishing emails tailored to victim profiles, successfully bypassing nearly all commercial and academic detectors. To defend against such threats, we propose PiMRef, the first reference-based phishing email detector that leverages knowledge-based invariants. Our core insight is that persuasive phishing emails often contain disprovable identity claims, which contradict real-world facts. PiMRef reframes phishing detection as an identity fact-checking task. Given an email, PiMRef (i) extracts the sender’s claimed identity, (ii) verifies the legitimacy of the sender’s domain against a predefined knowledge base, and (iii) detects call-to-action prompts that push user engagement. Contradictory claims are flagged as phishing indicators and serve as human-understandable explanations. Compared to existing methods such as D-Fence, HelpHed, and ChatSpamDetector, PiMRef boosts precision by 8.8% with no loss in recall on standard benchmarks like Nazario and PhishPot. In a real-world evaluation of 10,183 emails across five university accounts over three years, PiMRef achieved 92.1% precision, 87.9% recall, and a median runtime of 0.05s, outperforming the state-of-the-art in both effectiveness and efficiency. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15393 [cs.CR] (or arXiv:2507.15393v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2507.15393 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-37] RAD: Retrieval High-quality Demonstrations to Enhance Decision-making

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中因数据集稀疏性和最优轨迹与次优轨迹间转移重叠不足而导致的长时程规划难题。传统基于合成数据增强或轨迹拼接的方法难以泛化至新状态,且依赖启发式拼接点。其解决方案的关键在于提出检索高质量示范(Retrieval High-quAlity Demonstrations, RAD),通过非参数检索结合扩散生成建模实现动态目标状态选择与条件引导的轨迹规划:具体而言,RAD根据状态相似性和回报估计从离线数据集中检索高回报状态作为目标,并利用条件扩散模型向这些目标状态进行规划,从而实现灵活的轨迹拼接并提升在低频或分布外状态下的泛化能力。

链接: https://arxiv.org/abs/2507.15356
作者: Lu Guo,Yixiang Shan,Zhengbang Zhu,Qifan Liang,Lichang Song,Ting Long,Weinan Zhang,Yi Chang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such retrieval-guided generation enables flexible trajectory stitching and improves generalization when encountered with underrepresented or out-of-distribution states. Extensive experiments confirm that RAD achieves competitive or superior performance compared to baselines across diverse benchmarks, validating its effectiveness.
zh

[AI-38] One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

【速读】:该论文旨在解决动态拼车平台在大规模、高不确定性环境下,因多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)方法对价值函数(Q-value 或 V-value)估计依赖过强而导致的训练不稳定与显著估计偏差问题。传统独立式MARL框架中,每个智能体将其他智能体视为环境的一部分,加剧了价值估计的不准确性。解决方案的关键在于摒弃价值函数估计环节:首先,提出基于组平均奖励(Group Reward-based Policy Optimization, GRPO)的方法,以组平均奖励替代PPO中的基线,消除评论家(critic)估计误差并降低训练偏差;其次,进一步设计了一种仅使用单步奖励即可训练最优策略的One-Step Policy Optimization (OSPO) 方法,在同质车队假设下实现高效策略优化,实验表明二者均能显著提升接单效率与服务订单数量。

链接: https://arxiv.org/abs/2507.15351
作者: Zijian Zhao,Sen Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:On-demand ride-sharing platforms face the fundamental challenge of dynamically bundling passengers with diverse origins and destinations and matching them with vehicles in real time, all under significant uncertainty. Recently, MARL has emerged as a promising solution for this problem, leveraging decentralized learning to address the curse of dimensionality caused by the large number of agents in the ride-hailing market and the resulting expansive state and action spaces. However, conventional MARL-based ride-sharing approaches heavily rely on the accurate estimation of Q-values or V-values, which becomes problematic in large-scale, highly uncertain environments. Specifically, most of these approaches adopt an independent paradigm, exacerbating this issue, as each agent treats others as part of the environment, leading to unstable training and substantial estimation bias in value functions. To address these challenges, we propose two novel alternative methods that bypass value function estimation. First, we adapt GRPO to ride-sharing, replacing the PPO baseline with the group average reward to eliminate critic estimation errors and reduce training bias. Second, inspired by GRPO’s full utilization of group reward information, we customize the PPO framework for ride-sharing platforms and show that, under a homogeneous fleet, the optimal policy can be trained using only one-step rewards - a method we term One-Step Policy Optimization (OSPO). Experiments on a real-world Manhattan ride-hailing dataset demonstrate that both GRPO and OSPO achieve superior performance across most scenarios, efficiently optimizing pickup times and the number of served orders using simple MLP networks.
zh

[AI-39] Scaling Decentralized Learning with FLock

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在去中心化环境下的安全与高效微调问题,尤其是在异构、无信任网络中难以实现中央控制和高计算通信开销的瓶颈。传统联邦学习(Federated Learning, FL)虽能保障数据隐私,但依赖中心服务器易成为单点攻击目标并面临投毒攻击风险。解决方案的关键在于提出 FLock 框架,通过集成基于区块链的信任层与经济激励机制,替代传统 FL 中的中心聚合器,构建一个安全且可审计的多方协作协议,从而在无需中央权威的情况下实现对 70B 参数规模模型的安全微调,并有效抵御后门投毒攻击,同时提升跨域泛化能力。

链接: https://arxiv.org/abs/2507.15349
作者: Zehua Cheng,Rui Sun,Jiahao Sun,Yike Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Fine-tuning the large language models (LLMs) are prevented by the deficiency of centralized control and the massive computing and communication overhead on the decentralized schemes. While the typical standard federated learning (FL) supports data privacy, the central server requirement creates a single point of attack and vulnerability to poisoning attacks. Generalizing the result in this direction to 70B-parameter models in the heterogeneous, trustless environments has turned out to be a huge, yet unbroken bottleneck. This paper introduces FLock, a decentralized framework for secure and efficient collaborative LLM fine-tuning. Integrating a blockchain-based trust layer with economic incentives, FLock replaces the central aggregator with a secure, auditable protocol for cooperation among untrusted parties. We present the first empirical validation of fine-tuning a 70B LLM in a secure, multi-domain, decentralized setting. Our experiments show the FLock framework defends against backdoor poisoning attacks that compromise standard FL optimizers and fosters synergistic knowledge transfer. The resulting models show a 68% reduction in adversarial attack success rates. The global model also demonstrates superior cross-domain generalization, outperforming models trained in isolation on their own specialized data.
zh

[AI-40] StackTrans: From Large Language Model to Large Pushdown Automata Model

【速读】:该论文旨在解决Transformer架构在处理Chomsky层级中更高阶语法结构(如确定性上下文无关文法)时能力不足的问题,因其缺乏对栈式存储机制的有效建模。解决方案的关键在于提出StackTrans模型,其核心创新是在Transformer层间显式引入可微分的隐藏状态栈(hidden state stack),通过堆栈操作(如压栈和弹栈)实现对上下文依赖关系的高效捕捉,且这些操作可端到端学习,保持与现有框架(如flash-attention)的兼容性。这一设计显著提升了模型在Chomsky层级任务和大规模自然语言理解上的性能表现。

链接: https://arxiv.org/abs/2507.15343
作者: Kechi Zhang,Ge Li,Jia Li,Huangzhao Zhang,Yihong Dong,Jia Li,Jingjing Xu,Zhi Jin
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: currently under development

点击查看摘要

Abstract:The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture the Chomsky hierarchy, such as regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations – such as pushing and popping hidden states – that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchies and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
zh

[AI-41] Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design

【速读】:该论文旨在解决数据库研究中模型选择的静态性问题,即现有方法在面对多样化任务查询与模型架构变化时,未能充分捕捉细粒度的动态关系,导致匹配效果不佳且无法有效优化模型性能。其解决方案的关键在于提出M-DESIGN,一个基于结构化知识库(MKB)的模型精炼管道,通过构建图关系知识模式来显式编码数据属性、架构变体及性能差异,并将模型精炼重构为基于任务元数据的自适应查询问题,从而实现对候选模型的快速匹配与迭代优化,同时支持对分布外(OOD)任务的检测与适应。

链接: https://arxiv.org/abs/2507.15336
作者: Jialiang Wang,Hanmo Liu,Shimin Di,Zhili Wang,Jiachuan Wang,Lei Chen,Xiaofang Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Database systems have recently advocated for embedding machine learning (ML) capabilities, offering declarative model queries over large, managed model repositories, thereby circumventing the huge computational overhead of traditional ML-based algorithms in automated neural network model selection. Pioneering database studies aim to organize existing benchmark repositories as model bases (MB), querying them for the model records with the highest performance estimation metrics for given tasks. However, this static model selection practice overlooks the fine-grained, evolving relational dependencies between diverse task queries and model architecture variations, resulting in suboptimal matches and failing to further refine the model effectively. To fill the model refinement gap in database research, we propose M-DESIGN, a curated model knowledge base (MKB) pipeline for mastering neural network refinement by adaptively weaving prior insights about model architecture modification. First, we propose a knowledge weaving engine that reframes model refinement as an adaptive query problem over task metadata. Given a user’s task query, M-DESIGN quickly matches and iteratively refines candidate models by leveraging a graph-relational knowledge schema that explicitly encodes data properties, architecture variations, and pairwise performance deltas as joinable relations. This schema supports fine-grained relational analytics over architecture tweaks and drives a predictive query planner that can detect and adapt to out-of-distribution (OOD) tasks. We instantiate M-DESIGN for graph analytics tasks, where our model knowledge base enriches existing benchmarks with structured metadata covering 3 graph tasks and 22 graph datasets, contributing data records of 67,760 graph models. Empirical results demonstrate that M-DESIGN delivers the optimal model in 26 of 33 data-task pairs within limited budgets.
zh

[AI-42] QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agent ic AI

【速读】:该论文旨在解决认知退化(Cognitive Degradation)这一新型AI系统漏洞问题,其源于代理型人工智能(Agentic AI)内部的系统性弱点,如记忆饥饿、规划器递归、上下文淹没和输出抑制等,导致沉默的代理漂移、逻辑崩溃和持续幻觉。解决方案的关键在于提出Qorvex安全AI框架(QSAF Domain 10),该框架基于一个六阶段的认知退化生命周期模型,包含七个运行时控制机制(QSAF-BC-001至BC-007),通过实时监控代理子系统并触发主动缓解措施(如备用路由、饥饿检测与内存完整性保障),实现对代理行为的认知韧性防护。该方法借鉴认知神经科学原理,将代理架构映射为人类类比,从而实现疲劳、饥饿与角色崩溃的早期识别,首次建立了跨平台的代理行为韧性防御模型。

链接: https://arxiv.org/abs/2507.15330
作者: Hammad Atta,Muhammad Zeeshan Baig,Yasir Mehmood,Nadeem Shahzad,Ken Huang,Muhammad Aziz Ul Haq,Muhammad Awais,Kamal Ahmed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Cognitive Degradation as a novel vulnerability class in agentic AI systems. Unlike traditional adversarial external threats such as prompt injection, these failures originate internally, arising from memory starvation, planner recursion, context flooding, and output suppression. These systemic weaknesses lead to silent agent drift, logic collapse, and persistent hallucinations over time. To address this class of failures, we introduce the Qorvex Security AI Framework for Behavioral Cognitive Resilience (QSAF Domain 10), a lifecycle-aware defense framework defined by a six-stage cognitive degradation lifecycle. The framework includes seven runtime controls (QSAF-BC-001 to BC-007) that monitor agent subsystems in real time and trigger proactive mitigation through fallback routing, starvation detection, and memory integrity enforcement. Drawing from cognitive neuroscience, we map agentic architectures to human analogs, enabling early detection of fatigue, starvation, and role collapse. By introducing a formal lifecycle and real-time mitigation controls, this work establishes Cognitive Degradation as a critical new class of AI system vulnerability and proposes the first cross-platform defense model for resilient agentic behavior.
zh

[AI-43] Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems

【速读】:该论文旨在解决工具代理(tool agent)在执行过程中因参数失败(parameter failure)而导致的有效性受限问题。通过构建参数失败分类体系,作者从主流工具代理的调用链中提炼出五类失败模式,并基于15种输入扰动方法分析不同输入源与失败类别之间的关联性,发现参数名幻觉(parameter name hallucination)主要源于大语言模型(LLM)的内在局限,而其他失败模式则主要由输入源质量引发。解决方案的关键在于提升工具代理交互的可靠性与有效性,具体包括:标准化工具返回格式、优化错误反馈机制以及确保参数一致性,从而系统性缓解参数失败问题。

链接: https://arxiv.org/abs/2507.15296
作者: Qian Xiong,Yuekai Huang,Ziyou Jiang,Zhiyuan Chang,Yujia Zheng,Tianhao Li,Mingyang Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of the tool agent paradigm has broadened the capability boundaries of the Large Language Model (LLM), enabling it to complete more complex tasks. However, the effectiveness of this paradigm is limited due to the issue of parameter failure during its execution. To explore this phenomenon and propose corresponding suggestions, we first construct a parameter failure taxonomy in this paper. We derive five failure categories from the invocation chain of a mainstream tool agent. Then, we explore the correlation between three different input sources and failure categories by applying 15 input perturbation methods to the input. Experimental results show that parameter name hallucination failure primarily stems from inherent LLM limitations, while issues with input sources mainly cause other failure patterns. To improve the reliability and effectiveness of tool-agent interactions, we propose corresponding improvement suggestions, including standardizing tool return formats, improving error feedback mechanisms, and ensuring parameter consistency.
zh

[AI-44] Preferential subspace identification (PSID) with forward-backward smoothing

【速读】:该论文旨在解决多变量时间序列系统辨识中,传统优先子空间辨识(Preferential Subspace Identification, PSID)仅能基于历史主信号进行最优预测的问题,而忽略了在离线应用中通过引入同时期数据(滤波)或全部可用数据(平滑)以实现更优估计的可能性。其解决方案的关键在于:首先,证明了次级信号的存在使得从一组等价的状态空间模型中唯一确定一个具有最优卡尔曼更新步骤的模型,从而支持滤波;其次,提出了一种基于降秩回归的扩展方法,在PSID基础上嵌入直接从数据学习最优卡尔曼增益的步骤,形成“带滤波的PSID”(PSID with filtering)。进一步地,受双滤波卡尔曼平滑框架启发,作者设计了一种前向-后向PSID平滑算法:先对原始数据进行带滤波的PSID处理,再将该过程应用于残差信号的反向时间序列,从而实现最优平滑。此方法在模拟数据上验证了其能够精确恢复真实模型参数,并达到与理想模型一致的最优滤波和平滑解码性能,为多变量时间序列中的动态交互分析提供了原理严谨的线性滤波与平滑工具。

链接: https://arxiv.org/abs/2507.15288
作者: Omid G. Sani,Maryam M. Shanechi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:System identification methods for multivariate time-series, such as neural and behavioral recordings, have been used to build models for predicting one from the other. For example, Preferential Subspace Identification (PSID) builds a state-space model of a primary time-series (e.g., neural activity) to optimally predict a secondary time-series (e.g., behavior). However, PSID focuses on optimal prediction using past primary data, even though in offline applications, better estimation can be achieved by incorporating concurrent data (filtering) or all available data (smoothing). Here, we extend PSID to enable optimal filtering and smoothing. First, we show that the presence of a secondary signal makes it possible to uniquely identify a model with an optimal Kalman update step (to enable filtering) from a family of otherwise equivalent state-space models. Our filtering solution augments PSID with a reduced-rank regression step that directly learns the optimal gain required for the update step from data. We refer to this extension of PSID as PSID with filtering. Second, inspired by two-filter Kalman smoother formulations, we develop a novel forward-backward PSID smoothing algorithm where we first apply PSID with filtering and then apply it again in the reverse time direction on the residuals of the filtered secondary signal. We validate our methods on simulated data, showing that our approach recovers the ground-truth model parameters for filtering, and achieves optimal filtering and smoothing decoding performance of the secondary signal that matches the ideal performance of the true underlying model. This work provides a principled framework for optimal linear filtering and smoothing in the two-signal setting, significantly expanding the toolkit for analyzing dynamic interactions in multivariate time-series.
zh

[AI-45] Mixture of Autoencoder Experts Guidance using Unlabeled and Incomplete Data for Exploration in Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中代理在缺乏显式奖励信号或奖励密集环境中难以有效探索的问题,尤其是在现实场景中专家示范数据常为不完整或无标签的情况下,如何利用这些信息引导代理学习到专家行为模式。其解决方案的关键在于提出一种基于映射函数的内在奖励塑形机制:通过将代理状态与专家数据之间的相似性转化为结构化的内在奖励,结合多自动编码器专家混合模型(Mixture of Autoencoder Experts)来捕捉多样化的专家行为并处理示范中的缺失信息,从而实现对探索方向的灵活控制和高效适应。

链接: https://arxiv.org/abs/2507.15287
作者: Elias Malomgré,Pieter Simoens
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures, accepted for the non-archival workshop “Workshop on Reinforcement Learning Beyond Rewards @ Reinforcement Learning Conference 2025”

点击查看摘要

Abstract:Recent trends in Reinforcement Learning (RL) highlight the need for agents to learn from reward-free interactions and alternative supervision signals, such as unlabeled or incomplete demonstrations, rather than relying solely on explicit reward maximization. Additionally, developing generalist agents that can adapt efficiently in real-world environments often requires leveraging these reward-free signals to guide learning and behavior. However, while intrinsic motivation techniques provide a means for agents to seek out novel or uncertain states in the absence of explicit rewards, they are often challenged by dense reward environments or the complexity of high-dimensional state and action spaces. Furthermore, most existing approaches rely directly on the unprocessed intrinsic reward signals, which can make it difficult to shape or control the agent’s exploration effectively. We propose a framework that can effectively utilize expert demonstrations, even when they are incomplete and imperfect. By applying a mapping function to transform the similarity between an agent’s state and expert data into a shaped intrinsic reward, our method allows for flexible and targeted exploration of expert-like behaviors. We employ a Mixture of Autoencoder Experts to capture a diverse range of behaviors and accommodate missing information in demonstrations. Experiments show our approach enables robust exploration and strong performance in both sparse and dense reward environments, even when demonstrations are sparse or incomplete. This provides a practical framework for RL in realistic settings where optimal data is unavailable and precise reward control is needed.
zh

[AI-46] IM-Chat: A Multi-agent LLM -based Framework for Knowledge Transfer in Injection Molding Industry

【速读】:该论文旨在解决注塑成型(Injection Molding)行业中知识传承困难的问题,尤其是在资深工人退休和多语言沟通障碍背景下,如何有效保存与传递现场经验。解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的多智能体框架IM-Chat,其通过检索增强生成(Retrieval-Augmented Generation, RAG)策略与工具调用代理(Tool-calling Agents)构建模块化架构,整合有限的文档知识(如故障排除表、操作手册)与由环境输入(如温湿度)驱动的数据驱动工艺条件生成器所提取的大量现场数据,从而实现无需微调即可适应复杂制造场景的上下文感知决策支持。

链接: https://arxiv.org/abs/2507.15268
作者: Junhyeong Lee,Joon-Young Kim,Heekyu Kim,Inhyo Lee,Seunghwa Ryu
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The injection molding industry faces critical challenges in preserving and transferring field knowledge, particularly as experienced workers retire and multilingual barriers hinder effective communication. This study introduces IM-Chat, a multi-agent framework based on large language models (LLMs), designed to facilitate knowledge transfer in injection molding. IM-Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data-driven process condition generator that infers optimal manufacturing settings from environmental inputs such as temperature and humidity, enabling robust and context-aware task resolution. By adopting a retrieval-augmented generation (RAG) strategy and tool-calling agents within a modular architecture, IM-Chat ensures adaptability without the need for fine-tuning. Performance was assessed across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance and correctness, and was further supplemented by automated evaluation using GPT-4o guided by a domain-adapted instruction prompt. The evaluation results indicate that more capable models tend to achieve higher accuracy, particularly in complex, tool-integrated scenarios. Overall, these findings demonstrate the viability of multi-agent LLM systems for industrial knowledge workflows and establish IM-Chat as a scalable and generalizable approach to AI-assisted decision support in manufacturing.
zh

[AI-47] User Head Movement-Predictive XR in Immersive H2M Collaborations over Future Enterprise Networks

【速读】:该论文旨在解决在人机协作(H2M)场景下,如何实现远端XR内容与人类头部运动的实时同步问题,以保障沉浸式体验并避免网络延迟导致的“网络晕动症”(cyber-sickness)。其核心挑战在于跨大范围地理区域传输XR帧时难以满足低延迟和低抖动要求。解决方案的关键在于提出一种基于双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)的人类头部运动预测机制,提前预判用户头部动作以指导机器端摄像头的朝向调整,并据此动态预测带宽需求,进而设计出人类-机器协同的动态带宽分配(Human-Machine Coordinated Dynamic Bandwidth Allocation, HMC-DBA)方案。仿真结果表明,该方案可在企业网络(如光纤到房间业务,Fiber-To-The-Room-Business)中显著降低带宽消耗的同时,有效满足XR帧的端到端延迟和抖动约束,提升网络资源利用效率。

链接: https://arxiv.org/abs/2507.15254
作者: Sourav Mondal,Elaine Wong
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This article is accepted for publication in IEEE Internet of Things Journal. Copyright @ IEEE 2025

点击查看摘要

Abstract:The evolution towards future generation of mobile systems and fixed wireless networks is primarily driven by the urgency to support high-bandwidth and low-latency services across various vertical sectors. This endeavor is fueled by smartphones as well as technologies like industrial internet of things, extended reality (XR), and human-to-machine (H2M) collaborations for fostering industrial and social revolutions like Industry 4.0/5.0 and Society 5.0. To ensure an ideal immersive experience and avoid cyber-sickness for users in all the aforementioned usage scenarios, it is typically challenging to synchronize XR content from a remote machine to a human collaborator according to their head movements across a large geographic span in real-time over communication networks. Thus, we propose a novel H2M collaboration scheme where the human’s head movements are predicted ahead with highly accurate models like bidirectional long short-term memory networks to orient the machine’s camera in advance. We validate that XR frame size varies in accordance with the human’s head movements and predict the corresponding bandwidth requirements from the machine’s camera to propose a human-machine coordinated dynamic bandwidth allocation (HMC-DBA) scheme. Through extensive simulations, we show that end-to-end latency and jitter requirements of XR frames are satisfied with much lower bandwidth consumption over enterprise networks like Fiber-To-The-Room-Business. Furthermore, we show that better efficiency in network resource utilization is achieved by employing our proposed HMC-DBA over state-of-the-art schemes.
zh

[AI-48] Disentangling Homophily and Heterophily in Multimodal Graph Clustering

【速读】:该论文旨在解决多模态图聚类(Multimodal Graph Clustering)在无监督学习场景下的研究空白问题,尤其针对现实世界中多模态图常呈现混合邻域模式(hybrid neighborhood patterns),即同时包含同质性(homophilic)与异质性(heterophilic)关系的复杂结构。其解决方案的关键在于提出一种名为解耦式多模态图聚类(Disentangled Multimodal Graph Clustering, DMGC)的新框架:首先将原始混合图分解为两个互补视图——增强同质性的图以捕捉跨模态类别一致性,以及感知异质性的图以保留模态特有类间差异;进而引入多模态双频融合机制(Multimodal Dual-frequency Fusion),通过双通道策略联合过滤这两个解耦视图,实现有效的多模态信息融合并减少类别混淆;此外,利用自监督对齐目标引导学习过程,无需标签即可提升聚类性能。实验表明,DMGC在多种多模态与多关系图数据集上均达到最优效果,验证了其通用性和有效性。

链接: https://arxiv.org/abs/2507.15253
作者: Zhaochen Guo,Zhixiang Shen,Xuanting Xie,Liangjian Wen,Zhao Kang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: Appear in ACM Multimedia 2025

点击查看摘要

Abstract:Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework – \textscDisentangled Multimodal Graph Clustering (DMGC) – which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a \emphMultimodal Dual-frequency Fusion mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at this https URL.
zh

[AI-49] Spatio-Temporal Demand Prediction for Food Delivery Using Attention-Driven Graph Neural Networks

【速读】:该论文旨在解决城市餐饮配送平台中订单量的空间异质性与时间波动性对运营决策效率的影响问题,核心挑战在于如何精准预测未来订单分布以支持动态调度与资源优化。解决方案的关键在于提出一种基于注意力机制的图神经网络(Graph Neural Network, GNN)框架,将城市配送区域建模为图结构,其中节点表示配送区,边反映空间邻近关系及历史订单流动模式;通过引入注意力机制动态加权邻域影响,使模型能够聚焦于上下文相关性强的区域,并联合学习时空依赖关系,从而实现高精度的订单量预测,为车队预置、资源配置和派单优化提供可扩展且自适应的支撑。

链接: https://arxiv.org/abs/2507.15246
作者: Rabia Latief Bhat,Iqra Altaf Gillani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate demand forecasting is critical for enhancing the efficiency and responsiveness of food delivery platforms, where spatial heterogeneity and temporal fluctuations in order volumes directly influence operational decisions. This paper proposes an attention-based Graph Neural Network framework that captures spatial-temporal dependencies by modeling the food delivery environment as a graph. In this graph, nodes represent urban delivery zones, while edges reflect spatial proximity and inter-regional order flow patterns derived from historical data. The attention mechanism dynamically weighs the influence of neighboring zones, enabling the model to focus on the most contextually relevant areas during prediction. Temporal trends are jointly learned alongside spatial interactions, allowing the model to adapt to evolving demand patterns. Extensive experiments on real-world food delivery datasets demonstrate the superiority of the proposed model in forecasting future order volumes with high accuracy. The framework offers a scalable and adaptive solution to support proactive fleet positioning, resource allocation, and dispatch optimization in urban food delivery operations.
zh

[AI-50] SPAR: Scholar Paper Retrieval with LLM -based Agents for Enhanced Academic Search

【速读】:该论文旨在解决当前学术文献检索系统依赖僵化流程且推理能力有限的问题,尤其在面对复杂查询时难以实现精准匹配。其解决方案的核心在于提出SPAR(多智能体框架),通过引入基于RefChain的查询分解与查询演化机制,使检索过程更具灵活性和适应性;同时构建了SPARBench这一专家标注相关性标签的挑战性基准,以支持系统性评估。实验表明,SPAR显著优于现有基线模型,在AutoScholar上F1提升达+56%,在SPARBench上提升+23%。

链接: https://arxiv.org/abs/2507.15245
作者: Xiaofeng Shi,Yuduo Li,Qian Kou,Longbin Yu,Jinxin Xie,Hua Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have opened new opportunities for academic literature retrieval. However, existing systems often rely on rigid pipelines and exhibit limited reasoning capabilities. We introduce SPAR, a multi-agent framework that incorporates RefChain-based query decomposition and query evolution to enable more flexible and effective search. To facilitate systematic evaluation, we also construct SPARBench, a challenging benchmark with expert-annotated relevance labels. Experimental results demonstrate that SPAR substantially outperforms strong baselines, achieving up to +56% F1 on AutoScholar and +23% F1 on SPARBench over the best-performing baseline. Together, SPAR and SPARBench provide a scalable, interpretable, and high-performing foundation for advancing research in scholarly retrieval. Code and data will be available at: this https URL
zh

[AI-51] Explainable Artificial Intelligence based Soft Evaluation Indicator for Arc Fault Diagnosis

【速读】:该论文旨在解决基于人工智能(AI)的电弧故障诊断模型在实际应用中可信度不足的问题,即模型虽然分类准确率高,但其决策过程缺乏可解释性,难以获得使用者的信任。解决方案的关键在于提出一种软评价指标(soft evaluation indicator),通过定义电弧故障的正确解释,并结合可解释人工智能(Explainable Artificial Intelligence, XAI)技术与真实电弧故障实验来实现对模型输出的解释;同时设计了一种轻量级平衡神经网络,在保证竞争性分类精度的同时提升特征提取的可解释性得分,从而增强模型的透明性和可信度,使从业者能够做出更可靠和知情的决策。

链接: https://arxiv.org/abs/2507.15239
作者: Qianchao Wang,Yuxuan Ding,Chuanzhen Jia,Zhe Li,Yaping Du
机构: 未知
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Novel AI-based arc fault diagnosis models have demonstrated outstanding performance in terms of classification accuracy. However, an inherent problem is whether these models can actually be trusted to find arc faults. In this light, this work proposes a soft evaluation indicator that explains the outputs of arc fault diagnosis models, by defining the the correct explanation of arc faults and leveraging Explainable Artificial Intelligence and real arc fault experiments. Meanwhile, a lightweight balanced neural network is proposed to guarantee competitive accuracy and soft feature extraction score. In our experiments, several traditional machine learning methods and deep learning methods across two arc fault datasets with different sample times and noise levels are utilized to test the effectiveness of the soft evaluation indicator. Through this approach, the arc fault diagnosis models are easy to understand and trust, allowing practitioners to make informed and trustworthy decisions.
zh

[AI-52] Solving Formal Math Problems by Decomposition and Iterative Reflection

【速读】:该论文旨在解决通用大语言模型(General-purpose Large Language Models, LLMs)在形式化证明生成任务中表现不足的问题,尤其是在Lean 4这类专用定理证明语言中的应用受限问题。当前主流方法依赖于对模型进行微调以适应特定形式语料库,导致数据收集和训练成本高昂。解决方案的关键在于提出一种基于智能体(agent-based)的框架——Delta Prover,其核心创新包括:一是设计了一种用于反射式分解与迭代修复的算法框架,使LLM能够动态调整和修正推理路径;二是构建了一个基于Lean 4的定制领域特定语言(Domain-Specific Language, DSL),用于高效管理子问题。该方案无需模型专门化即可实现高成功率(miniF2F测试集达95.9%),并展现出优于标准Best-of-N策略的测试时扩展性,验证了通用LLM在有效代理结构引导下具备强大的未开发定理证明能力。

链接: https://arxiv.org/abs/2507.15225
作者: Yichi Zhou,Jianqiu Zhao,Yongxin Zhang,Bohan Wang,Siran Wang,Luoxin Chen,Jiahui Wang,Haowei Chen,Allan Jie,Xinbo Zhang,Haocheng Wang,Luong Trung,Rong Ye,Phan Nhat Hoang,Huishuai Zhang,Peng Sun,Hang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:General-purpose Large Language Models (LLMs) have achieved remarkable success in intelligence, performing comparably to human experts on complex reasoning tasks such as coding and mathematical reasoning. However, generating formal proofs in specialized languages like Lean 4 remains a significant challenge for these models, limiting their application in complex theorem proving and automated verification. Current approaches typically require specializing models through fine-tuning on dedicated formal corpora, incurring high costs for data collection and training. In this work, we introduce \textbfDelta Prover, an agent-based framework that orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment. Delta Prover leverages the reflection and reasoning capabilities of general-purpose LLMs to interactively construct formal proofs in Lean 4, circumventing the need for model specialization. At its core, the agent integrates two novel, interdependent components: an algorithmic framework for reflective decomposition and iterative proof repair, and a custom Domain-Specific Language (DSL) built upon Lean 4 for streamlined subproblem management. \textbfDelta Prover achieves a state-of-the-art 95.9% success rate on the miniF2F-test benchmark, surpassing all existing approaches, including those requiring model specialization. Furthermore, Delta Prover exhibits a significantly stronger test-time scaling law compared to standard Best-of-N proof strategies. Crucially, our findings demonstrate that general-purpose LLMs, when guided by an effective agentic structure, possess substantial untapped theorem-proving capabilities. This presents a computationally efficient alternative to specialized models for robust automated reasoning in formal environments.
zh

[AI-53] SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成向量化代码(即使用SIMD(Single Instruction Multiple Data)指令集的内在函数)方面的能力评估缺失问题。现有代码生成基准主要聚焦于标量代码,缺乏对LLMs在SIMD内在编程任务中表现的系统性评测。为此,作者提出了SimdBench——首个专门针对SIMD内在函数代码生成设计的基准测试集,包含136个精心构造的任务,覆盖SSE、AVX、Neon、SVE和RVV五种主流SIMD扩展指令集。其关键解决方案是构建一个结构化、多样化的基准测试框架,并对18个代表性LLMs进行系统性评估(涵盖正确性和性能指标),从而揭示LLMs在SIMD代码生成中的普遍性能下降趋势及潜在改进方向,为未来LLM在高性能计算领域的应用提供重要参考。

链接: https://arxiv.org/abs/2507.15224
作者: Yibo He,Shuoran Zhao,Jiaming Huang,Yingjie Fu,Hao Yu,Cunjian Huang,Tao Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SIMD (Single Instruction Multiple Data) instructions and their compiler intrinsics are widely supported by modern processors to accelerate performance-critical tasks. SIMD intrinsic programming, a trade-off between coding productivity and high performance, is widely used in the development of mainstream performance-critical libraries and daily computing tasks. Large Language Models (LLMs), which have demonstrated strong and comprehensive capabilities in code generation, show promise in assisting programmers with the challenges of SIMD intrinsic programming. However, existing code-generation benchmarks focus on only scalar code, and it is unclear how LLMs perform in generating vectorized code using SIMD intrinsics. To fill this gap, we propose SimdBench, the first code benchmark specifically designed for SIMD-intrinsic code generation, comprising 136 carefully crafted tasks and targeting five representative SIMD intrinsics: SSE (x86 Streaming SIMD Extension), AVX (x86 Advanced Vector Extension), Neon (ARM Advanced SIMD Extension), SVE (ARM Scalable Vector Extension), and RVV (RISC-V Vector Extension). We conduct a systematic evaluation (measuring both correctness and performance) of 18 representative LLMs on SimdBench, resulting in a series of novel and insightful findings. Our evaluation results demonstrate that LLMs exhibit a universal decrease in pass@k during SIMD-intrinsic code generation compared to scalar-code generation. Our in-depth analysis highlights promising directions for the further advancement of LLMs in the challenging domain of SIMD-intrinsic code generation. SimdBench is fully open source at this https URL to benefit the broader research community.
zh

[AI-54] PromptArmor: Simple yet Effective Prompt Injection Defenses

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在实际应用中面临的提示注入攻击(prompt injection attack)问题,即恶意输入被注入到代理的输入中,诱导其执行攻击者指定的任务而非用户原始意图。解决方案的关键在于提出 PromptArmor,一种轻量且高效的防御机制:它利用现成的 LLM 对输入进行预处理,识别并移除潜在的注入提示,从而在代理执行任务前阻断攻击链路。实验表明,PromptArmor 在 AgentDojo 基准测试上可将误报率和漏报率控制在 1% 以下,并使攻击成功率降至 1% 以下,具备良好的鲁棒性和通用性。

链接: https://arxiv.org/abs/2507.15219
作者: Tianneng Shi,Kaijie Zhu,Zhun Wang,Yuqi Jia,Will Cai,Weida Liang,Haonan Wang,Hend Alzahrani,Joshua Lu,Kenji Kawaguchi,Basel Alomair,Xuandong Zhao,William Yang Wang,Neil Gong,Wenbo Guo,Dawn Song
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent’s input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor’s effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.
zh

[AI-55] Can LLM s Generate User Stories and Assess Their Quality?

【速读】:该论文旨在解决需求获取(requirements elicitation)过程中存在的挑战,尤其是如何高效、高质量地将复杂用户需求转化为明确的软件需求,特别是在敏捷开发框架下以用户故事(User Story, US)形式表达时。传统方法依赖人工访谈与分析,存在效率低、主观性强及语义质量难以量化评估的问题。论文提出的关键解决方案是利用大语言模型(Large Language Models, LLMs)模拟客户访谈过程以自动生成用户故事,并进一步探索LLM在提供清晰评价标准的前提下自动评估用户故事的语义质量(如语言清晰度和内部一致性)。实验表明,LLM生成的用户故事在覆盖范围和风格上可媲美人类专家,但在多样性与创造性方面不足;同时,LLM能够可靠地完成语义质量评估任务,显著降低大规模需求审查中的人工成本。

链接: https://arxiv.org/abs/2507.15157
作者: Giovanni Quattrocchi,Liliana Pasquale,Paola Spoletini,Luciano Baresi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Requirements elicitation is still one of the most challenging activities of the requirements engineering process due to the difficulty requirements analysts face in understanding and translating complex needs into concrete requirements. In addition, specifying high-quality requirements is crucial, as it can directly impact the quality of the software to be developed. Although automated tools allow for assessing the syntactic quality of requirements, evaluating semantic metrics (e.g., language clarity, internal consistency) remains a manual and time-consuming activity. This paper explores how LLMs can help automate requirements elicitation within agile frameworks, where requirements are defined as user stories (US). We used 10 state-of-the-art LLMs to investigate their ability to generate US automatically by emulating customer interviews. We evaluated the quality of US generated by LLMs, comparing it with the quality of US generated by humans (domain experts and students). We also explored whether and how LLMs can be used to automatically evaluate the semantic quality of US. Our results indicate that LLMs can generate US similar to humans in terms of coverage and stylistic quality, but exhibit lower diversity and creativity. Although LLM-generated US are generally comparable in quality to those created by humans, they tend to meet the acceptance quality criteria less frequently, regardless of the scale of the LLM model. Finally, LLMs can reliably assess the semantic quality of US when provided with clear evaluation criteria and have the potential to reduce human effort in large-scale assessments.
zh

[AI-56] Constraint-aware Learning of Probabilistic Sequential Models for Multi-Label Classification

【速读】:该论文旨在解决多标签分类(multi-label classification)中标签数量庞大且标签间存在逻辑约束的问题。其解决方案的关键在于构建一种分层架构:首先使用独立的标签分类器生成单标签预测,再将这些预测输入到一个表达能力强的序列模型中,以学习标签间的联合分布。该架构能够利用训练阶段的约束信息建模标签相关性,并在推理阶段强制执行逻辑约束,从而提升分类性能与约束一致性。

链接: https://arxiv.org/abs/2507.15156
作者: Mykhailo Buleshnyi,Anna Polova,Zsolt Zombori,Michael Benedikt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We investigate multi-label classification involving large sets of labels, where the output labels may be known to satisfy some logical constraints. We look at an architecture in which classifiers for individual labels are fed into an expressive sequential model, which produces a joint distribution. One of the potential advantages for such an expressive model is its ability to modelling correlations, as can arise from constraints. We empirically demonstrate the ability of the architecture both to exploit constraints in training and to enforce constraints at inference time.
zh

[AI-57] Can We Move Freely in NEOMs The Line? An Agent -Based Simulation of Human Mobility in a Futuristic Smart City

【速读】:该论文旨在解决超线性城市(The Line)中人类移动自由度的可行性问题,即在170公里长、50层垂直分布的极端城市拓扑结构下,居民能否实现高效、便捷且可持续的多模式交通出行。其解决方案的关键在于构建一个融合了基于智能体的建模(Agent-Based Modeling)、强化学习(Reinforcement Learning)、监督学习(Supervised Learning)与图神经网络(Graph Neural Networks)的混合仿真框架,通过实时优化路径选择、动态调度与环境响应,实现了平均通勤时间仅7.8–8.4分钟、满意度超89%、可达性超91%的性能表现,验证了AI驱动的城市交通系统在复杂空间结构中的操作可行性。

链接: https://arxiv.org/abs/2507.15143
作者: Abderaouf Bahi,Amel Ourici
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This paper investigates the feasibility of human mobility in The Line, a proposed 170-kilometer linear smart city in NEOM, Saudi Arabia. To assess whether citizens can move freely within this unprecedented urban topology, we develop a hybrid simulation framework that integrates agent-based modeling, reinforcement learning, supervised learning, and graph neural networks. The simulation captures multi-modal transportation behaviors across 50 vertical levels and varying density scenarios using both synthetic data and real-world traces from high-density cities. Our experiments reveal that with the full AI-integrated architecture, agents achieved an average commute time of 7.8 to 8.4 minutes, a satisfaction rate exceeding 89 percent, and a reachability index of over 91 percent, even during peak congestion periods. Ablation studies confirmed that the removal of intelligent modules such as reinforcement learning or graph neural networks significantly degrades performance, with commute times increasing by up to 85 percent and reachability falling below 70 percent. Environmental modeling further demonstrated low energy consumption and minimal CO2 emissions when electric modes are prioritized. The findings suggest that freedom of movement is not only conceptually achievable in The Line, but also operationally realistic if supported by adaptive AI systems, sustainable infrastructure, and real-time feedback loops.
zh

[AI-58] Clinical Semantic Intelligence (CSI): Emulating the Cognitive Framework of the Expert Clinician for Comprehensive Oral Disease Diagnosis

【速读】:该论文旨在解决口腔疾病诊断中存在的临床挑战,即多种病理状态具有重叠的临床表现,导致误诊或漏诊风险增加。为应对这一问题,作者提出了一种名为临床语义智能(Clinical Semantic Intelligence, CSI)的人工智能框架,其核心创新在于不再依赖简单的模式匹配,而是通过计算建模专家临床医生的认知推理过程来实现更精准的诊断。解决方案的关键在于构建了一个分层诊断推理树(Hierarchical Diagnostic Reasoning Tree, HDRT),该结构模拟了差异性诊断的多步骤逻辑流程,并结合微调后的多模态CLIP模型与专用ChatGLM-6B语言模型,使系统能够在快速筛查模式(Fast Mode)和深度交互式诊断模式(Standard Mode)之间切换,从而显著提升诊断准确性——在内部测试集上,Fast Mode准确率为73.4%,而引入HDRT的Standard Mode提升至89.5%,验证了分层推理机制的有效性。

链接: https://arxiv.org/abs/2507.15140
作者: Mohammad Mashayekhi,Sara Ahmadi Majd,Arian AmirAmjadi,Parsa Hosseini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The diagnosis of oral diseases presents a problematic clinical challenge, characterized by a wide spectrum of pathologies with overlapping symptomatology. To address this, we developed Clinical Semantic Intelligence (CSI), a novel artificial intelligence framework that diagnoses 118 different oral diseases by computationally modeling the cognitive processes of an expert clinician. Our core hypothesis is that moving beyond simple pattern matching to emulate expert reasoning is critical to building clinically useful diagnostic aids. CSI’s architecture integrates a fine-tuned multimodal CLIP model with a specialized ChatGLM-6B language model. This system executes a Hierarchical Diagnostic Reasoning Tree (HDRT), a structured framework that distills the systematic, multi-step logic of differential diagnosis. The framework operates in two modes: a Fast Mode for rapid screening and a Standard Mode that leverages the full HDRT for an interactive and in-depth diagnostic workup. To train and validate our system, we curated a primary dataset of 4,310 images, supplemented by an external hold-out set of 176 images for final validation. A clinically-informed augmentation strategy expanded our training data to over 30,000 image-text pairs. On a 431-image internal test set, CSI’s Fast Mode achieved an accuracy of 73.4%, which increased to 89.5% with the HDRT-driven Standard Mode. The performance gain is directly attributable to the hierarchical reasoning process. Herein, we detail the architectural philosophy, development, and rigorous evaluation of the CSI framework. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.15140 [cs.AI] (or arXiv:2507.15140v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.15140 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-59] Automated planning with ontologies under coherence update semantics

【速读】:该论文旨在解决标准自动化规划中难以有效融合背景知识(如本体)的问题,尤其是在开放世界语义下如何实现动作条件与效果的合理建模。传统方法通常基于封闭世界语义下的第一阶公式,无法充分表达现实场景中的不确定性与外部知识。解决方案的关键在于提出一种结合显式输入知识和动作库(explicit-input knowledge and action bases, eKABs)与一致性更新语义下本体感知动作效果的新方法,从而在保持计算复杂度不高于先前方法的前提下,通过多项式编译映射到经典规划框架,实现了对DL-Lite本体的有效集成与高效求解。

链接: https://arxiv.org/abs/2507.15120
作者: Stefan Borgwardt,Duy Nhu,Gabriele Röger
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Standard automated planning employs first-order formulas under closed-world semantics to achieve a goal with a given set of actions from an initial state. We follow a line of research that aims to incorporate background knowledge into automated planning problems, for example, by means of ontologies, which are usually interpreted under open-world semantics. We present a new approach for planning with DL-Lite ontologies that combines the advantages of ontology-based action conditions provided by explicit-input knowledge and action bases (eKABs) and ontology-aware action effects under the coherence update semantics. We show that the complexity of the resulting formalism is not higher than that of previous approaches and provide an implementation via a polynomial compilation into classical planning. An evaluation of existing and new benchmarks examines the performance of a planning system on different variants of our compilation.
zh

[AI-60] From Kicking to Causality: Simulating Infant Agency Detection with a Robust Intrinsic Reward

【速读】:该论文旨在解决标准强化学习代理在噪声环境和生态有效场景中因依赖相关性奖励而表现脆弱的问题,从而难以像人类婴儿一样稳健地发现自身的因果能动性(causal efficacy)。解决方案的关键在于提出一种基于因果推断的内在奖励机制——因果动作影响评分(Causal Action Influence Score, CAIS),其通过计算条件感官结果分布 $ p(h|a) $ 与基线分布 $ p(h) $ 之间的 1-Wasserstein 距离来量化动作的影响,从而将代理的因果作用从混杂环境噪声中分离出来,实现对真实因果关系的识别与学习。

链接: https://arxiv.org/abs/2507.15106
作者: Xia Xu,Jochen Triesch
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:While human infants robustly discover their own causal efficacy, standard reinforcement learning agents remain brittle, as their reliance on correlation-based rewards fails in noisy, ecologically valid scenarios. To address this, we introduce the Causal Action Influence Score (CAIS), a novel intrinsic reward rooted in causal inference. CAIS quantifies an action’s influence by measuring the 1-Wasserstein distance between the learned distribution of sensory outcomes conditional on that action, p(h|a) , and the baseline outcome distribution, p(h) . This divergence provides a robust reward that isolates the agent’s causal impact from confounding environmental noise. We test our approach in a simulated infant-mobile environment where correlation-based perceptual rewards fail completely when the mobile is subjected to external forces. In stark contrast, CAIS enables the agent to filter this noise, identify its influence, and learn the correct policy. Furthermore, the high-quality predictive model learned for CAIS allows our agent, when augmented with a surprise signal, to successfully reproduce the “extinction burst” phenomenon. We conclude that explicitly inferring causality is a crucial mechanism for developing a robust sense of agency, offering a psychologically plausible framework for more adaptive autonomous systems.
zh

[AI-61] AnalogFed: Federated Discovery of Analog Circuit Topologies with Generative AI

【速读】:该论文旨在解决模拟电路设计中生成式 AI(Generative AI)研究因数据隐私与分散性导致的协作困境问题。当前研究受限于模拟电路设计的高度专有性——不仅涉及保密的电路结构,还包括商业化的半导体工艺,使得研究人员难以构建大规模、多样化的公开数据集,从而阻碍了跨机构的协同创新。解决方案的关键在于提出 AnalogFed,一个基于联邦学习(Federated Learning, FedL)框架的分布式协作系统,其核心创新包括针对模拟电路生成任务定制的模型架构、异构数据处理机制以及隐私保护策略,能够在不共享原始私有数据的前提下实现跨客户端的拓扑发现,同时保持与集中式基线相当的性能表现,显著提升了生成式 AI 在模拟电路设计中的效率与可扩展性。

链接: https://arxiv.org/abs/2507.15104
作者: Qiufeng Li,Shu Hong,Jian Gao,Xuan Zhang,Tian Lan,Weidong Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent breakthroughs in AI/ML offer exciting opportunities to revolutionize analog design automation through data-driven approaches. In particular, researchers are increasingly fascinated by harnessing the power of generative AI to automate the discovery of novel analog circuit topologies. Unlocking the full potential of generative AI in these data-driven discoveries requires access to large and diverse this http URL, there is a significant barrier in the analog domain–Analog circuit design is inherently proprietary, involving not only confidential circuit structures but also the underlying commercial semiconductor processes. As a result, current generative AI research is largely confined to individual researchers who construct small, narrowly focused private datasets. This fragmentation severely limits collaborative innovation and impedes progress across the research community. To address these challenges, we propose AnalogFed. AnalogFed enables collaborative topology discovery across decentralized clients (e.g., individual researchers or institutions) without requiring the sharing of raw private data. To make this vision practical, we introduce a suite of techniques tailored to the unique challenges of applying FedL in analog design–from generative model development and data heterogeneity handling to privacy-preserving strategies that ensure both flexibility and security for circuit designers and semiconductor manufacturers. Extensive experiments across varying client counts and dataset sizes demonstrate that AnalogFed achieves performance comparable to centralized baselines–while maintaining strict data privacy. Specifically, the generative AI model within AnalogFed achieves state-of-the-art efficiency and scalability in the design of analog circuit topologies.
zh

[AI-62] Robust Control with Gradient Uncertainty

【速读】:该论文旨在解决在强化学习等应用中,由于价值函数(value function)梯度存在不确定性而导致的鲁棒控制问题。传统鲁棒控制理论通常假设系统模型或参数已知,但实际场景下价值函数常通过函数逼近(如神经网络)得到,其梯度不确定性会显著影响控制性能。为此,作者提出一种新颖的零和动态博弈框架,其中对手同时扰动系统动力学和价值函数梯度,从而推导出包含梯度不确定性的哈密顿-雅可比-贝尔曼-伊斯艾克斯方程(Hamilton-Jacobi-Bellman-Isaacs Equation with Gradient Uncertainty, GU-HJBI)。解决方案的关键在于:首先,在统一椭圆性条件下证明了GU-HJBI方程粘性解的比较原理,确保其适定性;其次,在线性二次(LQ)情形下揭示经典二次价值函数假设在任何非零梯度不确定性下均失效,进而通过摄动分析得到非多项式形式的价值函数修正项及非线性最优控制律;最后,基于此理论设计了梯度不确定性鲁棒的Actor-Critic算法(GURAC),并通过数值实验验证其在训练稳定性上的优势。这一工作为函数逼近广泛存在的领域(如强化学习、计算金融)提供了新的鲁棒控制范式。

链接: https://arxiv.org/abs/2507.15082
作者: Qian Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We introduce a novel extension to robust control theory that explicitly addresses uncertainty in the value function’s gradient, a form of uncertainty endemic to applications like reinforcement learning where value functions are approximated. We formulate a zero-sum dynamic game where an adversary perturbs both system dynamics and the value function gradient, leading to a new, highly nonlinear partial differential equation: the Hamilton-Jacobi-Bellman-Isaacs Equation with Gradient Uncertainty (GU-HJBI). We establish its well-posedness by proving a comparison principle for its viscosity solutions under a uniform ellipticity condition. Our analysis of the linear-quadratic (LQ) case yields a key insight: we prove that the classical quadratic value function assumption fails for any non-zero gradient uncertainty, fundamentally altering the problem structure. A formal perturbation analysis characterizes the non-polynomial correction to the value function and the resulting nonlinearity of the optimal control law, which we validate with numerical studies. Finally, we bridge theory to practice by proposing a novel Gradient-Uncertainty-Robust Actor-Critic (GURAC) algorithm, accompanied by an empirical study demonstrating its effectiveness in stabilizing training. This work provides a new direction for robust control, holding significant implications for fields where function approximation is common, including reinforcement learning and computational finance.
zh

[AI-63] NavVI: A Telerobotic Simulation with Multimodal Feedback for Visually Impaired Navigation in Warehouse Environments

【速读】:该论文旨在解决工业仓库环境中盲人及低视力(Blind and Low-Vision, BLV)操作者在远程操控移动机器人时面临的高风险与高难度问题。现有研究对BLV用户在工业场景下的可访问性 teleoperation 支持不足,尤其缺乏多模态引导机制的设计。解决方案的关键在于构建一个高保真、实时闭环的多模态引导仿真系统:通过导航网格结合动态重规划确保路径准确性并避免与移动叉车和人员发生碰撞;同时提供可视化路径线(面向低视力用户)、语音导航提示(按顺时针方向播报转弯信息)以及基于距离的触觉反馈(感知静态与动态障碍物),从而实现三模态(视觉、听觉、触觉)同步反馈,提升BLV用户的环境感知与控制能力。该系统设计可直接映射至商用硬件模块,具备快速原型验证与实际部署潜力。

链接: https://arxiv.org/abs/2507.15072
作者: Maisha Maimuna,Minhaz Bin Farukee,Sama Nikanfar,Mahfuza Siddiqua,Ayon Roy,Fillia Makedon
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial warehouses are congested with moving forklifts, shelves and personnel, making robot teleoperation particularly risky and demanding for blind and low-vision (BLV) operators. Although accessible teleoperation plays a key role in inclusive workforce participation, systematic research on its use in industrial environments is limited, and few existing studies barely address multimodal guidance designed for BLV users. We present a novel multimodal guidance simulator that enables BLV users to control a mobile robot through a high-fidelity warehouse environment while simultaneously receiving synchronized visual, auditory, and haptic feedback. The system combines a navigation mesh with regular re-planning so routes remain accurate avoiding collisions as forklifts and human avatars move around the warehouse. Users with low vision are guided with a visible path line towards destination; navigational voice cues with clockwise directions announce upcoming turns, and finally proximity-based haptic feedback notifies the users of static and moving obstacles in the path. This real-time, closed-loop system offers a repeatable testbed and algorithmic reference for accessible teleoperation research. The simulator’s design principles can be easily adapted to real robots due to the alignment of its navigation, speech, and haptic modules with commercial hardware, supporting rapid feasibility studies and deployment of inclusive telerobotic tools in actual warehouses.
zh

[AI-64] me-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback

【速读】:该论文旨在解决传统时间序列异常检测方法仅能进行二分类判断、缺乏细粒度异常类别标注与可解释性推理的问题。其核心解决方案是提出一个全新的任务范式——时间序列异常推理(Time-RA),将经典的判别式异常检测转化为基于大语言模型(LLM)的生成式、推理密集型任务,并构建了首个面向异常推理的多模态基准数据集RATs40K,包含约4万条来自10个真实场景的时间序列样本,每条样本均配有数值数据、上下文文本和视觉表示,以及细粒度异常类别标签(单变量14类、多变量6类)和结构化解释性推理标注。关键创新在于通过集成生成标签并结合GPT-4反馈迭代优化的标注框架,确保了数据质量与可解释性,从而推动可解释时间序列异常检测的发展。

链接: https://arxiv.org/abs/2507.15066
作者: Yiyuan Yang,Zichuan Liu,Lei Song,Kai Ying,Zhiguang Wang,Tom Bamford,Svitlana Vyetrenko,Jiang Bian,Qingsong Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Under review. 19 pages, 8 figures, 12 tables

点击查看摘要

Abstract:Time series anomaly detection is critical across various domains, yet current approaches often limit analysis to mere binary anomaly classification without detailed categorization or further explanatory reasoning. To address these limitations, we propose a novel task, Time-series Reasoning for Anomaly (Time-RA) that transforms classical time series anomaly detection from a discriminative into a generative, reasoning-intensive task leveraging Large Language Models (LLMs). Also, we introduce the first real-world multimodal benchmark dataset, RATs40K, explicitly annotated for anomaly reasoning, comprising approximately 40,000 samples across 10 real-world domains. Each sample includes numeric time series data, contextual text information, and visual representations, each annotated with fine-grained categories (14 types for univariate anomalies and 6 for multivariate anomalies) and structured explanatory reasoning. We develop a sophisticated annotation framework utilizing ensemble-generated labels refined through GPT-4-driven feedback, ensuring accuracy and interpretability. Extensive benchmarking of LLMs and multimodal LLMs demonstrates the capabilities and limitations of current models, highlighting the critical role of supervised fine-tuning. Our dataset and task pave the way for significant advancements in interpretable time series anomaly detection and reasoning.
zh

[AI-65] ouch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper

【速读】:该论文旨在解决当前手持式夹爪(handheld gripper)在收集人类示范时普遍缺乏触觉感知能力的问题,而触觉反馈对于精确操作至关重要。解决方案的关键在于开发了一种便携、轻量化的夹爪硬件,集成触觉传感器以实现视觉与触觉数据的同步采集,并提出一种跨模态表示学习框架,能够在保留各自特征的前提下融合视觉与触觉信号,从而生成可解释且聚焦于物理交互接触区域的表示。该方法显著提升了下游操作任务中策略学习的效率和鲁棒性,在试管插入和移液等精细任务中表现出更高的准确性和抗干扰能力。

链接: https://arxiv.org/abs/2507.15062
作者: Xinyue Zhu,Binghao Huang,Yunzhu Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: More videos can be found on our website: this https URL

点击查看摘要

Abstract:Handheld grippers are increasingly used to collect human demonstrations due to their ease of deployment and versatility. However, most existing designs lack tactile sensing, despite the critical role of tactile feedback in precise manipulation. We present a portable, lightweight gripper with integrated tactile sensors that enables synchronized collection of visual and tactile data in diverse, real-world, and in-the-wild settings. Building on this hardware, we propose a cross-modal representation learning framework that integrates visual and tactile signals while preserving their distinct characteristics. The learning procedure allows the emergence of interpretable representations that consistently focus on contacting regions relevant for physical interactions. When used for downstream manipulation tasks, these representations enable more efficient and effective policy learning, supporting precise robotic manipulation based on multimodal feedback. We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer, demonstrating improved accuracy and robustness under external disturbances. Our project page is available at this https URL .
zh

[AI-66] DeRAG : Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection KDD

【速读】:该论文旨在解决对抗性提示攻击(Adversarial Prompt Attacks)对检索增强生成(Retrieval-Augmented Generation, RAG)系统可靠性造成的威胁,即通过精心设计的恶意提示后缀诱导RAG系统将错误文档排至检索前列,从而误导生成结果。解决方案的关键在于提出一种基于差分进化(Differential Evolution, DE)的梯度-free优化方法,将整个RAG流程视为黑盒,通过演化候选后缀群体以最大化目标错误文档的检索排名;该方法仅需5个token的极短后缀即可实现高成功率,并结合可读性感知的后缀构造策略提升隐蔽性,实验证明其在多个检索器上优于现有方法(如GGPP和PRADA),且能有效规避基于BERT的检测模型。

链接: https://arxiv.org/abs/2507.15042
作者: Jerry Wang,Fang Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by KDD Workshop on Prompt Optimization 2025

点击查看摘要

Abstract:Adversarial prompt attacks can significantly alter the reliability of Retrieval-Augmented Generation (RAG) systems by re-ranking them to produce incorrect outputs. In this paper, we present a novel method that applies Differential Evolution (DE) to optimize adversarial prompt suffixes for RAG-based question answering. Our approach is gradient-free, treating the RAG pipeline as a black box and evolving a population of candidate suffixes to maximize the retrieval rank of a targeted incorrect document to be closer to real world scenarios. We conducted experiments on the BEIR QA datasets to evaluate attack success at certain retrieval rank thresholds under multiple retrieving applications. Our results demonstrate that DE-based prompt optimization attains competitive (and in some cases higher) success rates compared to GGPP to dense retrievers and PRADA to sparse retrievers, while using only a small number of tokens (=5 tokens) in the adversarial suffix. Furthermore, we introduce a readability-aware suffix construction strategy, validated by a statistically significant reduction in MLM negative log-likelihood with Welch’s t-test. Through evaluations with a BERT-based adversarial suffix detector, we show that DE-generated suffixes evade detection, yielding near-chance detection accuracy.
zh

[AI-67] Survey of GenAI for Automotive Software Development: From Requirements to Executable Code ACL

【速读】:该论文旨在解决汽车软件开发过程中因需求复杂、标准化严格而导致的流程冗长与成本高昂问题,其核心解决方案在于引入生成式人工智能(Generative AI)技术以减少人工干预并提升效率。关键在于系统性地探索生成式AI在汽车软件开发各阶段的应用,包括需求处理、合规性管理及代码生成,并基于大语言模型(Large Language Models, LLMs)、检索增强生成(Retrieval Augmented Generation, RAG)和视觉语言模型(Vision Language Models, VLMs)等前沿技术构建通用的AI辅助开发工作流,同时结合行业调研数据验证实际应用效果。

链接: https://arxiv.org/abs/2507.15025
作者: Nenad Petrovic,Vahid Zolfaghari,Andre Schamschurko,Sven Kirchner,Fengjunjie Pan,Chengdng Wu,Nils Purschke,Aleksei Velsh,Krzysztof Lebioda,Yinglei Song,Yi Zhang,Lukasz Mazur,Alois Knoll
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Conference paper accepted for GACLM 2025

点击查看摘要

Abstract:Adoption of state-of-art Generative Artificial Intelligence (GenAI) aims to revolutionize many industrial areas by reducing the amount of human intervention needed and effort for handling complex underlying processes. Automotive software development is considered to be a significant area for GenAI adoption, taking into account lengthy and expensive procedures, resulting from the amount of requirements and strict standardization. In this paper, we explore the adoption of GenAI for various steps of automotive software development, mainly focusing on requirements handling, compliance aspects and code generation. Three GenAI-related technologies are covered within the state-of-art: Large Language Models (LLMs), Retrieval Augmented Generation (RAG), Vision Language Models (VLMs), as well as overview of adopted prompting techniques in case of code generation. Additionally, we also derive a generalized GenAI-aided automotive software development workflow based on our findings from this literature review. Finally, we include a summary of a survey outcome, which was conducted among our automotive industry partners regarding the type of GenAI tools used for their daily work activities.
zh

[AI-68] A Forced-Choice Neural Cognitive Diagnostic Model of Personality Testing

【速读】:该论文旨在解决传统心理测量模型在处理强迫选择测验(forced-choice test)时存在的局限性,尤其是难以准确建模项目与被试特征之间复杂非线性交互关系的问题。其关键解决方案是提出一种基于深度学习的强迫选择神经认知诊断模型(Forced-Choice Neural Cognitive Diagnostic Model, FCNCD),通过引入多层神经网络对被试和项目特征进行非线性映射并建模其交互作用,同时利用单调性假设增强诊断结果的可解释性,从而提升模型在三种常见强迫选择项目块类型上的准确性、鲁棒性和可解释性。

链接: https://arxiv.org/abs/2507.15013
作者: Xiaoyu Li,Jin Wu,Shaoyang Guo,Haoran Shi,Chanjin Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15pages, 7 figures

点击查看摘要

Abstract:In the smart era, psychometric tests are becoming increasingly important for personnel selection, career development, and mental health assessment. Forced-choice tests are common in personality assessments because they require participants to select from closely related options, lowering the risk of response distortion. This study presents a deep learning-based Forced-Choice Neural Cognitive Diagnostic Model (FCNCD) that overcomes the limitations of traditional models and is applicable to the three most common item block types found in forced-choice tests. To account for the unidimensionality of items in forced-choice tests, we create interpretable participant and item parameters. We model the interactions between participant and item features using multilayer neural networks after mining them using nonlinear mapping. In addition, we use the monotonicity assumption to improve the interpretability of the diagnostic results. The FCNCD’s effectiveness is validated by experiments on real-world and simulated datasets that show its accuracy, interpretability, and robustness.
zh

[AI-69] he Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

【速读】:该论文旨在解决当前软件工程(SE 3.0)中缺乏对自主编码代理(autonomous coding agents)在真实开发环境中运行机制的实证研究问题。现有工作多停留在理论探讨层面,难以支撑对AI原生软件开发流程的量化评估与优化。其解决方案的关键在于构建并公开发布AIDev——首个大规模、结构化的数据集,涵盖五类领先编码代理(OpenAI Codex、Devin、GitHub Copilot、Cursor 和 Claude Code)在61,000个仓库中生成的超过456,000个Pull Request(PR),包含详细的元数据如作者信息、评审时间线、代码变更和集成结果等。该数据集为基准测试、代理就绪度评估、协作建模及AI治理提供了可扩展、可分析的真实世界证据,揭示了代理虽提升提交效率但接受率较低、代码结构更简单等关键现象,从而推动人机协同式AI原生开发范式的科学化演进。

链接: https://arxiv.org/abs/2507.15003
作者: Hao Li,Haoxiang Zhang,Ahmed E. Hassan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The future of software engineering–SE 3.0–is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents–OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code–across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes–enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission–one developer submitted as many PRs in three days as they had in three years–these are structurally simpler (via code complexity metrics). We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at this https URL. AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG) Cite as: arXiv:2507.15003 [cs.SE] (or arXiv:2507.15003v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.15003 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hao Li [view email] [v1] Sun, 20 Jul 2025 15:15:58 UTC (236 KB)
zh

[AI-70] AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)过程中存在的核心问题:即模型虽具备从预训练数据中习得的潜在安全理解能力,但在对齐后仍易生成有害内容、出现过度拒绝(over-refusal)以及有用性下降等问题。现有方法通常依赖于监督式推理或导致表面化的拒绝策略,未能充分激发模型内在的安全自知能力。解决方案的关键在于提出 AlphaAlign,一个纯强化学习(reinforcement learning, RL)框架,其核心创新是设计了一个可验证的安全奖励机制(verifiable safety reward),通过双奖励系统引导模型进行主动安全推理——一方面鼓励对有害请求给出格式正确且具显式理由的拒绝,同时惩罚过度拒绝;另一方面引入归一化的有用性奖励以保持良性输入下的高质量响应。该方法无需额外的安全特定推理标注数据,即可实现安全性与有用性的协同提升,并推动模型形成深层次的安全推理能力。

链接: https://arxiv.org/abs/2507.14987
作者: Yi Zhang,An Zhang,XiuYu Zhang,Leheng Sheng,Yuxin Chen,Zhenkai Liang,Xiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model’s intrinsic safety self-awareness. We propose \textbfAlphaAlign, a simple yet effective pure reinforcement learning (RL) framework with verifiable safety reward designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns.
zh

[AI-71] FCRF: Flexible Constructivism Reflection for Long-Horizon Robotic Task Planning with Large Language Models IROS2025

【速读】:该论文旨在解决家用机器人在执行复杂长时任务时因缺乏自主纠错能力而导致的可靠性不足问题。现有基于大语言模型(Large Language Models, LLMs)的自我反思方法受限于僵化的反思机制,难以适应不同难度任务的需求。其解决方案的关键在于提出一种灵活建构主义反思框架(Flexible Constructivism Reflection Framework, FCRF),采用Mentor-Actor架构,使LLMs能够根据任务难度动态调整反思策略,并将历史经验与失败教训进行建设性整合,从而实现更高效、灵活的自主纠错能力。

链接: https://arxiv.org/abs/2507.14975
作者: Yufan Song,Jiatao Zhang,Zeng Gu,Qingmiao Liang,Tuocheng Hu,Wei Song,Shiqiang Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, IROS 2025

点击查看摘要

Abstract:Autonomous error correction is critical for domestic robots to achieve reliable execution of complex long-horizon tasks. Prior work has explored self-reflection in Large Language Models (LLMs) for task planning error correction; however, existing methods are constrained by inflexible self-reflection mechanisms that limit their effectiveness. Motivated by these limitations and inspired by human cognitive adaptation, we propose the Flexible Constructivism Reflection Framework (FCRF), a novel Mentor-Actor architecture that enables LLMs to perform flexible self-reflection based on task difficulty, while constructively integrating historical valuable experience with failure lessons. We evaluated FCRF on diverse domestic tasks through simulation in AlfWorld and physical deployment in the real-world environment. Experimental results demonstrate that FCRF significantly improves overall performance and self-reflection flexibility in complex long-horizon robotic tasks.
zh

[AI-72] Complexity of Faceted Explanations in Propositional Abduction

【速读】:该论文旨在解决命题归因(propositional abduction)中解释的多样性与计算复杂性之间的矛盾问题,即如何在保持良好计算复杂性的前提下,深入理解解释间的差异性和变异性。其关键解决方案是引入“特征”(facets)概念——指在某些解释中出现但在其他解释中不出现的原子命题(literals),从而实现对解释空间的细粒度刻画;同时通过定义两个解释之间的距离来量化解释的异质性或同质性,进而提供更精确的解释分析能力。该方法在Post’s框架下进行了系统性分析,实现了对多种逻辑片段中特征性质的近乎完整刻画。

链接: https://arxiv.org/abs/2507.14962
作者: Johannes Schmidt,Mohamed Maizia,Victor Lagerkvist,Johannes K. Fichte
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
备注: This is the author’s self-archived copy including detailed proofs. To appear in Theory and Practice of Logic Programming (TPLP), Proceedings of the 41st International Conference on Logic Programming (ICLP 2025)

点击查看摘要

Abstract:Abductive reasoning is a popular non-monotonic paradigm that aims to explain observed symptoms and manifestations. It has many applications, such as diagnosis and planning in artificial intelligence and database updates. In propositional abduction, we focus on specifying knowledge by a propositional formula. The computational complexity of tasks in propositional abduction has been systematically characterized - even with detailed classifications for Boolean fragments. Unsurprisingly, the most insightful reasoning problems (counting and enumeration) are computationally highly challenging. Therefore, we consider reasoning between decisions and counting, allowing us to understand explanations better while maintaining favorable complexity. We introduce facets to propositional abductions, which are literals that occur in some explanation (relevant) but not all explanations (dispensable). Reasoning with facets provides a more fine-grained understanding of variability in explanations (heterogeneous). In addition, we consider the distance between two explanations, enabling a better understanding of heterogeneity/homogeneity. We comprehensively analyze facets of propositional abduction in various settings, including an almost complete characterization in Post’s framework.
zh

[AI-73] Probing EFX via PMMS: (Non-)Existence Results in Discrete Fair Division

【速读】:该论文致力于解决公平分配中两个核心难题:EFX(Envy-Free up to any good)和PMMS(Proportional Maximin Share)的存在性问题。其中,EFX被广泛认为是公平分配领域最重要的开放问题,而PMMS则是比EFX更强的公平标准。研究的关键突破在于:首先构造了一个三参与者实例(包含两种单调估值和一种可加估值),证明了在该设定下不存在PMMS分配,从而首次严格区分了EFX与PMMS;其次,针对三种特殊估值结构——个性化二值估值(personalized bivalued valuations)、可整除的二值估值(where aia_i is divisible by bib_i)以及二值MMS可行估值(binary-valued MMS-feasible valuations),分别证明了EFX或PMMS分配的存在性,并给出了多项式时间算法,实现了从理论存在性到高效构造的跨越。这些结果深化了对公平分配边界条件的理解,并为实际应用中的算法设计提供了坚实基础。

链接: https://arxiv.org/abs/2507.14957
作者: Jarosław Byrka,Franciszek Malinka,Tomasz Ponitka
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: 27 pages, 4 figures

点击查看摘要

Abstract:We study the fair division of indivisible items and provide new insights into the EFX problem, which is widely regarded as the central open question in fair division, and the PMMS problem, a strictly stronger variant of EFX. Our first result constructs a three-agent instance with two monotone valuations and one additive valuation in which no PMMS allocation exists. Since EFX allocations are known to exist under these assumptions, this establishes a formal separation between EFX and PMMS. We prove existence of fair allocations for three important special cases. We show that EFX allocations exist for personalized bivalued valuations, where for each agent i there exist values a_i b_i such that agent i assigns value v_i(\g) \in \a_i, b_i\ to each good g . We establish an analogous existence result for PMMS allocations when a_i is divisible by b_i . We also prove that PMMS allocations exist for binary-valued MMS-feasible valuations, where each bundle S has value v_i(S) \in \0, 1\ . Notably, this result holds even without assuming monotonicity of valuations and thus applies to the fair division of chores and mixed manna. Finally, we study a class of valuations called pair-demand valuations, which extend the well-studied unit-demand valuations to the case where each agent derives value from at most two items, and we show that PMMS allocations exist in this setting. Our proofs are constructive, and we provide polynomial-time algorithms for all three existence results. Comments: 27 pages, 4 figures Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2507.14957 [cs.GT] (or arXiv:2507.14957v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2507.14957 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-74] Byzantine-Robust Decentralized Coordination of LLM Agents

【速读】:该论文旨在解决多智能体大语言模型(Large Language Model, LLM)系统中因依赖领导者协调而导致的两个核心问题:一是领导者易受针对性攻击,导致共识失败并引发高延迟的重复共识轮次;二是即使存在更高质量的答案,系统仍可能因采纳低质量的领导者提案而降低整体输出质量。解决方案的关键在于提出一种去中心化的共识机制 DecentLLMs,其通过让工作者代理(worker agents)并发生成答案,并由评估者代理(evaluator agents)独立评分与排序这些答案,从而实现对拜占庭(Byzantine)恶意代理的鲁棒性聚合,不仅提升了共识效率,还确保了最终选择的答案质量显著优于传统方法。

链接: https://arxiv.org/abs/2507.14928
作者: Yongrae Jo,Chanik Park
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Collaboration among multiple large language model (LLM) agents is a promising approach to overcome inherent limitations of single-agent systems, such as hallucinations and single points of failure. As LLM agents are increasingly deployed on open blockchain platforms, multi-agent systems capable of tolerating malicious (Byzantine) agents have become essential. Recent Byzantine-robust multi-agent systems typically rely on leader-driven coordination, which suffers from two major drawbacks. First, they are inherently vulnerable to targeted attacks against the leader. If consecutive leaders behave maliciously, the system repeatedly fails to achieve consensus, forcing new consensus rounds, which is particularly costly given the high latency of LLM invocations. Second, an underperforming proposal from the leader can be accepted as the final answer even when higher-quality alternatives are available, as existing methods finalize the leader’s proposal once it receives a quorum of votes. To address these issues, we propose DecentLLMs, a novel decentralized consensus approach for multi-agent LLM systems, where worker agents generate answers concurrently and evaluator agents independently score and rank these answers to select the best available one. This decentralized architecture enables faster consensus despite the presence of Byzantine agents and consistently selects higher-quality answers through Byzantine-robust aggregation techniques. Experimental results demonstrate that DecentLLMs effectively tolerates Byzantine agents and significantly improves the quality of selected answers. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14928 [cs.DC] (or arXiv:2507.14928v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2507.14928 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yongrae Jo [view email] [v1] Sun, 20 Jul 2025 11:55:26 UTC (12,927 KB)
zh

[AI-75] One Step Beyond: Feedthrough Placement-Aware Rectilinear Floorplanner

【速读】:该论文旨在解决现有单元布局(floorplanning)方法在与后续物理设计阶段集成时存在的不足,具体表现为模块内部组件放置不佳和模块间布线穿透(feedthrough)过多的问题。其解决方案的关键在于提出一种三阶段的、考虑布线穿透和组件放置的矩形布局规划器 Flora:第一阶段通过 wiremask 和 position mask 技术实现 HPWL(Half-Perimeter Wirelength)和 feedthrough 的粗粒度优化;第二阶段在固定轮廓约束下通过局部调整模块形状实现零间隙布局,从而精细优化 feedthrough 并改善组件放置;第三阶段采用基于快速树搜索的方法高效完成模块内宏单元(macros)和标准单元(standard cells)的放置,并根据结果动态调整模块边界,实现跨阶段协同优化。

链接: https://arxiv.org/abs/2507.14914
作者: Zhexuan Xu,Jie Wang,Siyuan Xu,Zijie Geng,Mingxuan Yuan,Feng Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Floorplanning determines the shapes and locations of modules on a chip canvas and plays a critical role in optimizing the chip’s Power, Performance, and Area (PPA) metrics. However, existing floorplanning approaches often fail to integrate with subsequent physical design stages, leading to suboptimal in-module component placement and excessive inter-module feedthrough. To tackle this challenge, we propose Flora, a three-stage feedthrough and placement aware rectilinear floorplanner. In the first stage, Flora employs wiremask and position mask techniques to achieve coarse-grained optimization of HPWL and feedthrough. In the second stage, under the constraint of a fixed outline, Flora achieves a zero-whitespace layout by locally resizing module shapes, thereby performing fine-grained optimization of feedthrough and improving component placement. In the third stage, Flora utilizes a fast tree search-based method to efficiently place components-including macros and standard cells-within each module, subsequently adjusting module boundaries based on the placement results to enable cross-stage optimization. Experimental results show that Flora outperforms recent state-of-the-art floorplanning approaches, achieving an average reduction of 6% in HPWL, 5.16% in FTpin, 29.15% in FTmod, and a 14% improvement in component placement performance.
zh

[AI-76] Redefining Elderly Care with Agent ic AI: Challenges and Opportunities

【速读】:该论文旨在解决全球老龄化背景下传统老年照护模式难以满足个性化、自主化和高质量生活需求的问题。其解决方案的关键在于利用基于大语言模型(Large Language Models, LLMs)的代理型人工智能(Agentic Artificial Intelligence),通过实现主动决策与自主行为,推动老年照护向智能化转型。具体应用包括健康状态的个性化追踪、认知护理支持及环境管理,以提升老年人的独立性和生活质量。同时,论文强调需建立伦理保障机制、隐私保护措施和透明决策框架,确保技术部署符合老年人群体的特殊需求与脆弱性,从而实现人本导向的AI整合与可持续发展。

链接: https://arxiv.org/abs/2507.14912
作者: Ruhul Amin Khalil,Kashif Ahmad,Hazrat Ali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The global ageing population necessitates new and emerging strategies for caring for older adults. In this article, we explore the potential for transformation in elderly care through Agentic Artificial Intelligence (AI), powered by Large Language Models (LLMs). We discuss the proactive and autonomous decision-making facilitated by Agentic AI in elderly care. Personalized tracking of health, cognitive care, and environmental management, all aimed at enhancing independence and high-level living for older adults, represents important areas of application. With a potential for significant transformation of elderly care, Agentic AI also raises profound concerns about data privacy and security, decision independence, and access. We share key insights to emphasize the need for ethical safeguards, privacy protections, and transparent decision-making. Our goal in this article is to provide a balanced discussion of both the potential and the challenges associated with Agentic AI, and to provide insights into its responsible use in elderly care, to bring Agentic AI into harmony with the requirements and vulnerabilities specific to the elderly. Finally, we identify the priorities for the academic research communities, to achieve human-centered advancements and integration of Agentic AI in elderly care. To the best of our knowledge, this is no existing study that reviews the role of Agentic AI in elderly care. Hence, we address the literature gap by analyzing the unique capabilities, applications, and limitations of LLM-based Agentic AI in elderly care. We also provide a companion interactive dashboard at this https URL.
zh

[AI-77] he Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities

【速读】:该论文旨在解决人工智能部署中的责任缺失问题(responsibility gap),即在AI系统决策过程中,当出现损害时难以明确责任归属的伦理与法律困境。其解决方案的关键在于提出“无限调优”(Endless Tuning)设计方法,通过双镜像(double mirroring)过程实现人机协同决策:一方面避免人类被替代,另一方面确保责任可追溯。该方法强调以用户为中心的体验而非单纯统计准确性,并通过反向和诠释性(hermeneutic)部署可解释人工智能(XAI)算法,在贷款审批、肺炎诊断和艺术风格识别等场景中验证了使用者对决策过程仍保持充分控制感,从而在责任与问责之间建立桥梁。

链接: https://arxiv.org/abs/2507.14909
作者: Elio Grande
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and ensuing the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with such as many domain experts. Step by step illustrating the protocol, giving insights concretely showing a different voice (Gilligan 1993) in the ethics of artificial intelligence, a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) will be provided in the present study together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even thoroughly employing deep learning models, full control was perceived by the interviewees in the decision-making setting, while it appeared that a bridge can be built between accountability and liability in case of damage.
zh

[AI-78] Feedback-Induced Performance Decline in LLM -Based Decision-Making

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自主决策场景中,尤其是在马尔可夫决策过程(Markov Decision Process, MDPs)下的行为特性与性能瓶颈问题。其核心挑战在于:尽管LLMs具备从自然语言问题描述中提取上下文的能力,并能利用预训练阶段获得的先验知识实现快速适应,但在复杂环境中缺乏有效的规划和推理能力,且现有反馈机制可能引入混淆,反而降低决策性能。论文的关键解决方案是通过设计在线结构化提示(online structured prompting)策略,在序列决策任务中对比LLM方法与经典强化学习(Reinforcement Learning, RL)方法的零样本(zero-shot)表现,从而揭示LLM在简单环境中的初始优势以及在复杂环境中的局限性,并强调未来需结合混合策略、微调(fine-tuning)和高级记忆集成技术以提升其决策能力。

链接: https://arxiv.org/abs/2507.14906
作者: Xiao Yang,Juxi Leitner,Michael Burke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability of Large Language Models (LLMs) to extract context from natural language problem descriptions naturally raises questions about their suitability in autonomous decision-making settings. This paper studies the behaviour of these models within a Markov Decision Process (MDPs). While traditional reinforcement learning (RL) strategies commonly employed in this setting rely on iterative exploration, LLMs, pre-trained on diverse datasets, offer the capability to leverage prior knowledge for faster adaptation. We investigate online structured prompting strategies in sequential decision making tasks, comparing the zero-shot performance of LLM-based approaches to that of classical RL methods. Our findings reveal that although LLMs demonstrate improved initial performance in simpler environments, they struggle with planning and reasoning in complex scenarios without fine-tuning or additional guidance. Our results show that feedback mechanisms, intended to improve decision-making, often introduce confusion, leading to diminished performance in intricate environments. These insights underscore the need for further exploration into hybrid strategies, fine-tuning, and advanced memory integration to enhance LLM-based decision-making capabilities.
zh

[AI-79] Agent Fly: Extensible and Scalable Reinforcement Learning for LM Agents

【速读】:该论文旨在解决语言模型(Language Model, LM)代理与强化学习(Reinforcement Learning, RL)结合(即Agent-RL)研究不足、缺乏系统性框架的问题。当前LM代理主要依赖提示工程或监督微调,而强化学习虽能提升推理和事实准确性,但其在代理场景中的整合仍处于探索阶段。解决方案的关键在于构建一个可扩展且易用的Agent-RL框架——AgentFly,其核心创新包括:通过token级掩码适配传统强化学习方法以支持多轮交互;采用装饰器模式定义工具和奖励函数,实现模块化扩展;引入异步执行机制与集中式资源管理,提升训练吞吐量与环境协调能力。该框架已集成多种预置工具和环境,验证了其在多任务训练中的有效性。

链接: https://arxiv.org/abs/2507.14897
作者: Renxi Wang,Rifo Ahmad Genadi,Bilal El Bouardi,Yongxin Wang,Fajri Koto,Zhengzhong Liu,Timothy Baldwin,Haonan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language model (LM) agents have gained significant attention for their ability to autonomously complete tasks through interactions with environments, tools, and APIs. LM agents are primarily built with prompt engineering or supervised finetuning. At the same time, reinforcement learning (RL) has been explored to enhance LM’s capabilities, such as reasoning and factuality. However, the combination of the LM agents and reinforcement learning (Agent-RL) remains underexplored and lacks systematic study. To this end, we built AgentFly, a scalable and extensible Agent-RL framework designed to empower LM agents with a variety of RL algorithms. Our framework supports multi-turn interactions by adapting traditional RL methods with token-level masking. It features a decorator-based interface for defining tools and reward functions, enabling seamless extension and ease of use. To support high-throughput training, we implement asynchronous execution of tool calls and reward computations, and design a centralized resource management system for scalable environment coordination. We also provide a suite of prebuilt tools and environments, demonstrating the framework’s effectiveness through successful agent training across multiple tasks.
zh

[AI-80] Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在模型压缩过程中难以保持应用特定性能特征的问题,尤其针对结构化剪枝(Structured Pruning)中传统重要性度量方法无法有效保留任务相关性能的局限性。其解决方案的关键在于提出了一种增强的重要性度量框架,该框架不仅考虑模型规模的缩减,还显式引入了应用特定的性能约束,并通过多种策略确定每组结构化参数的最优剪枝幅度,从而在压缩率与目标任务性能之间实现平衡,实验表明该方法能在显著剪枝后仍保持重建MNIST图像的可用性。

链接: https://arxiv.org/abs/2507.14882
作者: Ganesh Sundaram,Jonas Ulmen,Amjad Haider,Daniel Görges
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 22nd International Conference on Advanced Robotics (ICAR 2025)

点击查看摘要

Abstract:Deep neural networks (DNNs) offer significant versatility and performance benefits, but their widespread adoption is often hindered by high model complexity and computational demands. Model compression techniques such as pruning have emerged as promising solutions to these challenges. However, it remains critical to ensure that application-specific performance characteristics are preserved during compression. In structured pruning, where groups of structurally coherent elements are removed, conventional importance metrics frequently fail to maintain these essential performance attributes. In this work, we propose an enhanced importance metric framework that not only reduces model size but also explicitly accounts for application-specific performance constraints. We employ multiple strategies to determine the optimal pruning magnitude for each group, ensuring a balance between compression and task performance. Our approach is evaluated on an autoencoder tasked with reconstructing MNIST images. Experimental results demonstrate that the proposed method effectively preserves task-relevant performance, maintaining the model’s usability even after substantial pruning, by satisfying the required application-specific criteria.
zh

[AI-81] he Tsetlin Machine Goes Deep: Logical Learning and Reasoning With Graphs

【速读】:该论文旨在解决传统Tsetlin Machine(TM)在处理复杂结构化数据(如图结构、序列、多模态信息)时的局限性,即其依赖于扁平、固定长度输入,难以有效建模高阶关系与嵌套模式。解决方案的关键在于提出Graph Tsetlin Machine(GraphTM),通过引入消息传递机制构建嵌套的深度子句(deep clauses),从而从图结构输入中学习可解释的深层逻辑规则。这一设计显著减少了所需条款数量(呈指数级减少),同时提升了对图像分类、动作核心指代追踪、推荐系统和病毒基因组序列等多样化任务的准确性与鲁棒性,实现了在保持可解释性的同时逼近甚至超越深度学习模型的性能。

链接: https://arxiv.org/abs/2507.14874
作者: Ole-Christoffer Granmo,Youmna Abdelwahab,Per-Arne Andersen,Paul F. A. Clarke,Kunal Dumbre,Ylva Grønninsæter,Vojtech Halenka,Runar Helin,Lei Jiao,Ahmed Khalid,Rebekka Omslandseter,Rupsa Saha,Mayur Shende,Xuan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 10 figures

点击查看摘要

Abstract:Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine ™ both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of datasets. We introduce the Graph Tsetlin Machine (GraphTM) for learning interpretable deep clauses from graph-structured input. Moving beyond flat, fixed-length input, the GraphTM gets more versatile, supporting sequences, grids, relations, and multimodality. Through message passing, the GraphTM builds nested deep clauses to recognize sub-graph patterns with exponentially fewer clauses, increasing both interpretability and data utilization. For image classification, GraphTM preserves interpretability and achieves 3.86%-points higher accuracy on CIFAR-10 than a convolutional TM. For tracking action coreference, faced with increasingly challenging tasks, GraphTM outperforms other reinforcement learning methods by up to 20.6%-points. In recommendation systems, it tolerates increasing noise to a greater extent than a Graph Convolutional Neural Network (GCN), e.g., for noise ratio 0.1, GraphTM obtains accuracy 89.86% compared to GCN’s 70.87%. Finally, for viral genome sequence data, GraphTM is competitive with BiLSTM-CNN and GCN accuracy-wise, training 2.5x faster than GCN. The GraphTM’s application to these varied fields demonstrates how graph representation learning and deep clauses bring new possibilities for TM learning.
zh

[AI-82] Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

【速读】:该论文旨在解决多智能体安全关键自主系统中的安全策略学习问题,即在确保每个智能体始终满足安全约束的前提下,实现多个智能体之间的协同合作以完成任务。解决方案的关键在于提出一种基于控制屏障函数(Control Barrier Functions, CBFs)的安全分层多智能体强化学习(Hierarchical Multi-Agent Reinforcement Learning, HMARL)方法,通过将整体学习问题分解为两个层次:高层学习所有智能体的联合技能策略,低层则基于高层策略学习各智能体执行技能时的安全行为,从而在复杂交通网络环境中实现高安全性(接近100%的成功/安全率)与高性能的协同导航。

链接: https://arxiv.org/abs/2507.14850
作者: H. M. Sabbir Ahmad,Ehsan Sabouni,Alexander Wasilkoff,Param Budhraja,Zijian Guo,Songyuan Zhang,Chuchu Fan,Christos Cassandras,Wenchao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We address the problem of safe policy learning in multi-agent safety-critical autonomous systems. In such systems, it is necessary for each agent to meet the safety requirements at all times while also cooperating with other agents to accomplish the task. Toward this end, we propose a safe Hierarchical Multi-Agent Reinforcement Learning (HMARL) approach based on Control Barrier Functions (CBFs). Our proposed hierarchical approach decomposes the overall reinforcement learning problem into two levels learning joint cooperative behavior at the higher level and learning safe individual behavior at the lower or agent level conditioned on the high-level policy. Specifically, we propose a skill-based HMARL-CBF algorithm in which the higher level problem involves learning a joint policy over the skills for all the agents and the lower-level problem involves learning policies to execute the skills safely with CBFs. We validate our approach on challenging environment scenarios whereby a large number of agents have to safely navigate through conflicting road networks. Compared with existing state of the art methods, our approach significantly improves the safety achieving near perfect (within 5%) success/safety rate while also improving performance across all the environments.
zh

[AI-83] Margin: Revisiting Contrastive Learning with Margin-Based Separation ECAI2025

【速读】:该论文旨在解决时间序列表示学习中对比学习框架的性能瓶颈问题,特别是如何通过优化对比损失函数来提升下游任务的表现。其核心问题是:现有的固定-margin对比损失可能无法有效区分相邻但本质不同的时间步,从而限制了嵌入表示的质量。解决方案的关键在于引入一种自适应边界(adaptive margin, eMargin),该边界根据预定义的相似性阈值动态调整,以增强对相似样本与不相似样本之间边界的分离能力。实验表明,尽管eMargin在无监督聚类指标上优于当前最优基线,但在线性探测分类任务中表现不佳,揭示了聚类性能与下游任务有效性之间的脱节现象。

链接: https://arxiv.org/abs/2507.14828
作者: Abdul-Kazeem Shamba,Kerstin Bach,Gavin Taylor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: LDD’25: Learning from Difficult Data Workshop (ECAI 2025)

点击查看摘要

Abstract:We revisit previous contrastive learning frameworks to investigate the effect of introducing an adaptive margin into the contrastive loss function for time series representation learning. Specifically, we explore whether an adaptive margin (eMargin), adjusted based on a predefined similarity threshold, can improve the separation between adjacent but dissimilar time steps and subsequently lead to better performance in downstream tasks. Our study evaluates the impact of this modification on clustering performance and classification in three benchmark datasets. Our findings, however, indicate that achieving high scores on unsupervised clustering metrics does not necessarily imply that the learned embeddings are meaningful or effective in downstream tasks. To be specific, eMargin added to InfoNCE consistently outperforms state-of-the-art baselines in unsupervised clustering metrics, but struggles to achieve competitive results in downstream classification with linear probing. The source code is publicly available at this https URL.
zh

[AI-84] Benchmarking Foundation Models with Multimodal Public Electronic Health Records

【速读】:该论文旨在解决当前电子健康记录(Electronic Health Records, EHRs)处理中基础模型(Foundation Models)在预测性能、公平性和可解释性方面的评估缺乏系统性基准的问题。其解决方案的关键在于构建了一个基于公开MIMIC-IV数据库的综合性基准测试框架,包含标准化的数据预处理流程以统一异构临床数据,并系统比较了八种基础模型(涵盖单模态与多模态、领域特定与通用型)的表现。结果表明,引入多模态数据能持续提升预测性能且不增加偏差,从而为开发可信的多模态人工智能(Artificial Intelligence, AI)临床应用提供可靠评估依据。

链接: https://arxiv.org/abs/2507.14824
作者: Kunyu Yu,Rui Yang,Jingchi Liao,Siqi Li,Huitao Li,Irene Li,Yifan Peng,Rishikesan Kamaleswaran,Nan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models have emerged as a powerful approach for processing electronic health records (EHRs), offering flexibility to handle diverse medical data modalities. In this study, we present a comprehensive benchmark that evaluates the performance, fairness, and interpretability of foundation models, both as unimodal encoders and as multimodal learners, using the publicly available MIMIC-IV database. To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clinical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants. Our findings demonstrate that incorporating multiple data modalities leads to consistent improvements in predictive performance without introducing additional bias. Through this benchmark, we aim to support the development of effective and trustworthy multimodal artificial intelligence (AI) systems for real-world clinical applications. Our code is available at this https URL.
zh

[AI-85] Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

【速读】:该论文试图解决的问题是:语言模型在训练过程中可能通过看似无关的数据(如纯数字序列)隐式传递行为特征(trait),即所谓的“亚阈值学习”(subliminal learning),这可能导致模型无意中继承和传播开发者未明确意图的属性,从而带来潜在风险。解决方案的关键在于揭示了这种现象的普遍性——不仅存在于复杂语言模型中,也适用于简单前馈神经网络(如MLP分类器),并通过理论证明指出,在特定条件下亚阈值学习会在所有神经网络中发生;此外,实验表明当教师与学生模型基础架构不一致时该现象消失,说明其依赖于模型间的结构一致性。这一发现提示,即使对数据进行过滤,也可能无法阻止不当特质的传播,为AI开发中的知识蒸馏(distillation)等技术应用敲响警钟。

链接: https://arxiv.org/abs/2507.14805
作者: Alex Cloud,Minh Le,James Chua,Jan Betley,Anna Sztyber-Betley,Jacob Hilton,Samuel Marks,Owain Evans
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a “student” model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.
zh

[AI-86] ACME: Adaptive Customization of Large Models via Distributed Systems

【速读】:该论文旨在解决在云环境中部署基于Transformer的大型预训练模型时面临的三大核心问题:集中式模型定制成本高、用户异构性导致性能不均衡,以及数据异构性引发的次优性能。解决方案的关键在于提出一种名为ACME的自适应定制方法,其核心是通过分布式系统实现细粒度协同模型定制,采用双向单循环架构逐步优化模型;首先基于模型尺寸约束识别帕累托前沿以匹配用户异构性并确保资源利用率最优,随后通过基于数据分布的个性化架构聚合策略进行头部生成与模型精调,从而有效应对数据异构性。实验表明,ACME在模型尺寸约束下实现了更高的成本效率,相比集中式方案数据传输量减少至6%,平均准确率提升10%,且权衡指标提高近30%。

链接: https://arxiv.org/abs/2507.14802
作者: Ziming Dai,Chao Qiu,Fei Gao,Yunfeng Zhao,Xiaofei Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICDCS 2025. 11 pages, 13 figures

点击查看摘要

Abstract:Pre-trained Transformer-based large models have revolutionized personal virtual assistants, but their deployment in cloud environments faces challenges related to data privacy and response latency. Deploying large models closer to the data and users has become a key research area to address these issues. However, applying these models directly often entails significant difficulties, such as model mismatching, resource constraints, and energy inefficiency. Automated design of customized models is necessary, but it faces three key challenges, namely, the high cost of centralized model customization, imbalanced performance from user heterogeneity, and suboptimal performance from data heterogeneity. In this paper, we propose ACME, an adaptive customization approach of Transformer-based large models via distributed systems. To avoid the low cost-efficiency of centralized methods, ACME employs a bidirectional single-loop distributed system to progressively achieve fine-grained collaborative model customization. In order to better match user heterogeneity, it begins by customizing the backbone generation and identifying the Pareto Front under model size constraints to ensure optimal resource utilization. Subsequently, it performs header generation and refines the model using data distribution-based personalized architecture aggregation to match data heterogeneity. Evaluation on different datasets shows that ACME achieves cost-efficient models under model size constraints. Compared to centralized systems, data transmission volume is reduced to 6 percent. Additionally, the average accuracy improves by 10 percent compared to the baseline, with the trade-off metrics increasing by nearly 30 percent.
zh

[AI-87] Large Language Model as An Operator: An Experience-Driven Solution for Distribution Network Voltage Control

【速读】:该论文旨在解决配电网络中电压控制策略自主生成与优化的问题,传统方法依赖人工经验或固定规则,难以适应复杂多变的运行场景。解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的经验驱动型电压控制框架,通过多个模块的协同交互——包括经验存储、经验检索、经验生成和经验修改——实现LLM驱动的电压控制策略的自我演化与持续优化,从而提升系统在动态环境下的适应性与决策能力。

链接: https://arxiv.org/abs/2507.14800
作者: Xu Yang,Chenhui Lin,Haotian Liu,Qi Wang,Wenchuan Wu
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advanced reasoning and information analysis capabilities, large language models (LLMs) can offer a novel approach for the autonomous generation of dispatch strategies in power systems. This letter proposes an LLM-based experience-driven voltage control solution for distribution networks, which enables the self-evolution of LLM-based voltage control strategies through the collaboration and interaction of multiple modules-specifically, experience storage, experience retrieval, experience generation, and experience modification. Comprehensive experimental results validate the effectiveness of the proposed method and highlight the applicability of LLM in addressing power system dispatch challenges.
zh

[AI-88] Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree EMNLP2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的网页导航代理在自动化任务中面临的间接提示注入(Indirect Prompt Injection, IPI)安全漏洞问题。其核心解决方案在于揭示了攻击者可通过在网页HTML中嵌入通用对抗性触发器(universal adversarial triggers),利用代理基于可访问性树(accessibility tree)解析HTML的机制,诱导代理执行非预期甚至恶意行为(如凭据泄露或强制点击广告)。研究采用贪婪坐标梯度(Greedy Coordinate Gradient, GCG)算法与基于Llama-3.1的Browser Gym代理,在真实网站上实现了高成功率的目标攻击和泛化攻击,验证了IPI攻击的实际可行性与严重性,强调亟需构建更 robust 的防御机制以应对LLM驱动自主代理日益广泛的应用场景。

链接: https://arxiv.org/abs/2507.14799
作者: Sam Johnson,Viet Pham,Thai Le
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: EMNLP 2025 System Demonstrations Submission

点击查看摘要

Abstract:This work demonstrates that LLM-based web navigation agents offer powerful automation capabilities but are vulnerable to Indirect Prompt Injection (IPI) attacks. We show that adversaries can embed universal adversarial triggers in webpage HTML to hijack agent behavior that utilizes the accessibility tree to parse HTML, causing unintended or malicious actions. Using the Greedy Coordinate Gradient (GCG) algorithm and a Browser Gym agent powered by Llama-3.1, our system demonstrates high success rates across real websites in both targeted and general attacks, including login credential exfiltration and forced ad clicks. Our empirical results highlight critical security risks and the need for stronger defenses as LLM-driven autonomous web agents become more widely adopted. The system software (this https URL) is released under the MIT License, with an accompanying publicly available demo website (this http URL).
zh

[AI-89] Exploring the In-Context Learning Capabilities of LLM s for Money Laundering Detection in Financial Graphs

【速读】:该论文旨在解决洗钱犯罪调查中因实体复杂性和关联性导致的分析难题,即如何在金融知识图谱(financial knowledge graph)中高效识别可疑行为并提供可解释的推理过程。其解决方案的关键在于构建一个轻量级推理管道:首先从知识图谱中提取目标实体的k跳邻域子图,将其序列化为结构化文本,再通过少量示例的上下文学习(few-shot in-context learning)引导大语言模型(LLM)进行可疑性评估与理由生成。该方法使LLM能够模拟分析师逻辑、识别风险信号并输出连贯解释,从而推动可解释、以语言驱动的反洗钱(AML)分析发展。

链接: https://arxiv.org/abs/2507.14785
作者: Erfan Pirmorad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
zh

[AI-90] Omni-Think: Scaling Cross-Domain Generalization in LLM s via Multi-Task RL with Hybrid Rewards

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段(post-training)中普遍存在的泛化能力不足问题,尤其是监督微调(Supervised Fine-Tuning, SFT)易导致记忆偏好而非可迁移学习的现象。其解决方案的关键在于提出Omni-Think框架,这是一个统一的强化学习(Reinforcement Learning, RL)范式,通过融合基于规则的可验证奖励与生成式偏好信号(由LLM-as-a-Judge评估获得),实现跨任务类型的稳定优化,并扩展RL训练至主观性较强的领域。此外,研究进一步发现,采用从结构化任务到开放性任务的课程学习(curriculum learning)策略可显著提升性能并减少灾难性遗忘,实验证明该方法相较联合训练和模型合并分别提升5.2%和9.1%,凸显了任务感知采样与混合监督机制在构建通用型LLM中的重要性。

链接: https://arxiv.org/abs/2507.14783
作者: Derek Li,Jiaming Zhou,Amirreza Kazemi,Qianyi Sun,Abbas Ghaddar,Mohammad Ali Alomrani,Liheng Ma,Yu Luo,Dong Li,Feng Wen,Jianye Hao,Mark Coates,Yingxue Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advancement of general-purpose artificial intelligence relies on large language models (LLMs) that excel across a wide range of tasks, from structured reasoning to creative generation. However, post-training methods like Supervised Fine-Tuning (SFT) often struggle with generalization, favoring memorization over transferable learning. In this work, we introduce Omni-Think, a unified reinforcement learning (RL) framework that enhances LLM performance across diverse tasks by combining rule-based verifiable rewards with generative preference signals via LLM-as-a-Judge evaluations. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. We further investigate training strategies, demonstrating that a curriculum-based progression that orders tasks from structured to open-ended improves performance and reduces forgetting. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging. These results highlight the importance of task-aware sampling and hybrid supervision in scaling RL-based post-training for general-purpose LLMs.
zh

[AI-91] XplainAct: Visualization for Personalized Intervention Insights IEEE-VIS

【速读】:该论文旨在解决现有因果推理方法在处理具有显著异质性的复杂系统时的局限性,即这些方法主要关注群体层面的影响,难以准确刻画干预措施在不同亚群中的个体差异效应。解决方案的关键在于提出XplainAct,这是一个可视化分析框架,能够支持在亚群内部进行个体层面的干预模拟、解释与推理,从而实现对异质性效应的精细化理解。

链接: https://arxiv.org/abs/2507.14767
作者: Yanming Zhang,Krishnakumar Hegde,Klaus Mueller
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper will be published and presented at IEEE Visualization (VIS) 2025, Vienna, Austria, November 2025

点击查看摘要

Abstract:Causality helps people reason about and understand complex systems, particularly through what-if analyses that explore how interventions might alter outcomes. Although existing methods embrace causal reasoning using interventions and counterfactual analysis, they primarily focus on effects at the population level. These approaches often fall short in systems characterized by significant heterogeneity, where the impact of an intervention can vary widely across subgroups. To address this challenge, we present XplainAct, a visual analytics framework that supports simulating, explaining, and reasoning interventions at the individual level within subpopulations. We demonstrate the effectiveness of XplainAct through two case studies: investigating opioid-related deaths in epidemiology and analyzing voting inclinations in the presidential election.
zh

[AI-92] Analyzing Internal Activity and Robustness of SNNs Across Neuron Parameter Space

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在实际部署中因神经元模型超参数(如膜时间常数 τ 和电压阈值 vth)不当导致的性能退化问题,尤其是在能量效率与任务准确性之间难以取得平衡的问题。解决方案的关键在于识别并表征一个“操作空间”——即在超参数域中存在一个受限区域,在此区域内SNN能维持有意义的活动和功能行为;在此空间内运行可实现分类准确率与脉冲活动之间的最优权衡,而超出该区域则会导致网络失效(如过度耗能或完全静默)。研究通过系统性探索不同数据集和架构,量化并可视化该操作空间,从而为SNN的高效、鲁棒部署提供可操作的超参数选择指南,尤其适用于类脑计算场景。

链接: https://arxiv.org/abs/2507.14757
作者: Szymon Mazurek,Jakub Caputa,Maciej Wielgosz
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer energy-efficient and biologically plausible alternatives to traditional artificial neural networks, but their performance depends critically on the tuning of neuron model parameters. In this work, we identify and characterize an operational space - a constrained region in the neuron hyperparameter domain (specifically membrane time constant tau and voltage threshold vth) - within which the network exhibits meaningful activity and functional behavior. Operating inside this manifold yields optimal trade-offs between classification accuracy and spiking activity, while stepping outside leads to degeneration: either excessive energy use or complete network silence. Through systematic exploration across datasets and architectures, we visualize and quantify this manifold and identify efficient operating points. We further assess robustness to adversarial noise, showing that SNNs exhibit increased spike correlation and internal synchrony when operating outside their optimal region. These findings highlight the importance of principled hyperparameter tuning to ensure both task performance and energy efficiency. Our results offer practical guidelines for deploying robust and efficient SNNs, particularly in neuromorphic computing scenarios. Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14757 [cs.NE] (or arXiv:2507.14757v1 [cs.NE] for this version) https://doi.org/10.48550/arXiv.2507.14757 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-93] Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中自监督特征学习与预训练方法(如互信息技能学习,Mutual Information Skill Learning, MISL)的理论理解不足问题,特别是对表示(representation)和互信息参数化机制的作用缺乏清晰的理论分析。其解决方案的关键在于通过可识别表示学习(identifiable representation learning)的视角,聚焦于对比成功特征(Contrastive Successor Features, CSF)方法,并首次证明了CSF能够从状态和像素中可证明地恢复环境的真实特征(ground-truth features),前提是特征采用内积参数化形式且技能具有判别性多样性(discriminative diversity)。这一理论保证不仅揭示了不同互信息目标函数的含义,还指出了熵正则化项可能带来的负面影响,从而为MISL提供了坚实的理论基础并指导实践优化。

链接: https://arxiv.org/abs/2507.14748
作者: Patrik Reizinger,Bálint Mucsányi,Siyuan Guo,Benjamin Eysenbach,Bernhard Schölkopf,Wieland Brendel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:Self-supervised feature learning and pretraining methods in reinforcement learning (RL) often rely on information-theoretic principles, termed mutual information skill learning (MISL). These methods aim to learn a representation of the environment while also incentivizing exploration thereof. However, the role of the representation and mutual information parametrization in MISL is not yet well understood theoretically. Our work investigates MISL through the lens of identifiable representation learning by focusing on the Contrastive Successor Features (CSF) method. We prove that CSF can provably recover the environment’s ground-truth features up to a linear transformation due to the inner product parametrization of the features and skill diversity in a discriminative sense. This first identifiability guarantee for representation learning in RL also helps explain the implications of different mutual information objectives and the downsides of entropy regularizers. We empirically validate our claims in MuJoCo and DeepMind Control and show how CSF provably recovers the ground-truth features both from states and pixels.
zh

[AI-94] owards AI Urban Planner in the Age of GenAI LLM s and Agent ic AI DATE

【速读】:该论文试图解决生成式 AI(Generative AI)在城市规划领域应用中的关键瓶颈问题,即如何将AI技术与城市规划理论、多尺度空间分析、数据驱动的知识增强以及现实世界交互机制有效融合,从而推动“AI城市规划师”的发展。其解决方案的关键在于提出三个未来研究方向:一是基于城市理论引导的生成模型,以确保设计符合规划逻辑;二是构建数字孪生(digital twins)系统,实现跨空间尺度和视角的城市模拟;三是推进人机协同设计(human-machine co-design),强化AI与人类决策者之间的互动与协作,最终实现生成智能与参与式城市主义的新融合。

链接: https://arxiv.org/abs/2507.14730
作者: Yanjie Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages; will continue to update to add more figures to describe the vision;

点击查看摘要

Abstract:Generative AI, large language models, and agentic AI have emerged separately of urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. This paper conceptualizes urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints. We survey how generative AI approaches, including VAEs, GANs, transformers, and diffusion models, reshape urban design. We further identify critical gaps: 1) limited research on integrating urban theory guidance, 2) limited research of AI urban planning over multiple spatial resolutions or angularities, 3) limited research on augmenting urban design knowledge from data, and 4) limited research on addressing real-world interactions. To address these limitations, we outline future research directions in theory-guided generation, digital twins, and human-machine co-design, calling for a new synthesis of generative intelligence and participatory urbanism.
zh

[AI-95] ask-Agnostic Continual Prompt Tuning with Gradient-Based Selection and Decoding

【速读】:该论文旨在解决提示引导的持续学习(Prompt-based Continual Learning, PCL)中两个核心问题:一是任务无关推理下隐式遗忘(latent forgetting)导致模型性能下降,二是随着任务序列增长引发的提示记忆爆炸(prompt memory explosion)。解决方案的关键在于提出一种统一框架GRID,其包含两个创新机制:首先,引入任务感知解码机制(task-aware decoding),通过代表性输入、自动任务识别与约束解码提升后向迁移能力;其次,设计基于梯度的提示选择策略(gradient-based prompt selection),将低信息量提示压缩为单一聚合表示,从而实现可扩展且内存高效的终身学习。实验表明,GRID在短序列、长序列及负迁移基准上均显著提升后向迁移性能,最多减少80%的任务遗忘,优于当前最优方法。

链接: https://arxiv.org/abs/2507.14725
作者: Anushka Tiwari,Sayantan Pal,Rohini K. Srihari,Kaiyi Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, most existing methods assume task-aware inference and maintain a growing list of task-specific prompts, which limits scalability and hides latent forgetting. In this work, we introduce GRID, a unified framework that addresses two key limitations: (1) latent forgetting under task-agnostic inference, and (2) prompt memory explosion as task sequences grow. GRID integrates a task-aware decoding mechanism that improves backward transfer by leveraging representative inputs, automatic task identification, and constrained decoding. Additionally, we propose a gradient-based prompt selection strategy that compresses less informative prompts into a single aggregated representation, enabling scalable and memory-efficient lifelong learning. Extensive experiments across short-sequence, long-sequence, and negative transfer benchmarks show that GRID significantly improves backward transfer, achieves competitive forward transfer, and reduces forgotten tasks by up to 80%, outperforming state-of-the-art methods on T5 and Flan-T5 backbones.
zh

[AI-96] LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4

【速读】:该论文旨在解决自动化定理证明(Automated Theorem Proving, ATP)中因状态空间和动作空间庞大而导致的挑战,特别是当前基于大语言模型(Large Language Models, LLMs)的方法缺乏正确性保障的问题。为提升ATP的可靠性与效率,论文提出了一种白盒(white-box)方法——LeanTree,其核心创新在于:(i) 使用Lean 4语言构建一个工具,将复杂证明状态分解为更简单且独立的分支;(ii) 构建了一个包含这些因子化中间状态的数据集。该方案通过支持增量式证明构造、并行搜索、状态复用及错误反馈等机制,显著优于仅依赖黑盒交互的现有方法,从而在保证正确性的前提下提升了推理效率与可解释性。

链接: https://arxiv.org/abs/2507.14722
作者: Matěj Kripner,Michal Šustr,Milan Straka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (LLMs) have recently emerged as a promising heuristic for ATP, but they lack correctness guarantees and thus require interaction with a proof verifier. Such interactions typically follow one of two approaches: black-box interaction, which does not utilize intermediate proof states, or white-box approaches, which allow for incremental proof construction and examination of intermediate states. While black-box approaches have directly benefited from recent LLM advances, white-box methods have comparatively lagged behind. In this paper, we address this gap by introducing LeanTree, which consists of (i) a tool built in the Lean 4 language that factorizes complex proof states into simpler, independent branches, and (ii) a dataset of these factorized intermediate states. Our white-box tooling offers several advantages over black-box approaches: it simplifies evaluation, reduces necessary context, generates richer training data, enables parallel search across multiple states, supports efficient reuse of states, and provides feedback in case of errors. Our preliminary results hint that white-box approaches outperform black-box alternatives in some settings.
zh

[AI-97] Automated Safety Evaluations Across 20 Large Language Models : The Aymara LLM Risk and Responsibility Matrix

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中缺乏可扩展且严谨的安全评估体系的问题。其解决方案的关键在于提出Aymara AI平台,该平台能够将自然语言形式的安全政策自动转化为对抗性提示(adversarial prompts),并通过基于AI的评分器对模型响应进行打分,该评分器已通过与人类判断的对比验证有效性。这一方法实现了安全评估的程序化、定制化和可扩展性,从而支持对多种LLM在不同安全场景下的系统性风险识别与责任评估。

链接: https://arxiv.org/abs/2507.14719
作者: Juan Manuel Contreras
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.
zh

[AI-98] Fraud is Not Just Rarity: A Causal Prototype Attention Approach to Realistic Synthetic Oversampling

【速读】:该论文旨在解决信用卡欺诈检测中因数据极端类别不平衡及欺诈行为模式细微导致的检测性能受限问题。现有方法多依赖生成对抗网络(GAN)、变分自编码器(VAE)等生成模型对少数类样本进行过采样,但此类方法仅针对少数类生成合成数据时易导致分类器过度自信且潜在空间聚类分离效果差。本文提出因果原型注意力分类器(Causal Prototype Attention Classifier, CPAC),其核心创新在于通过基于原型的注意力机制实现类别感知聚类与潜在空间结构优化,并将其与VAE-GAN编码器结合,在训练阶段即引导潜在表示学习,而非依赖事后样本增强。实验表明,CPAC驱动的潜在空间塑造显著提升了F1分数(93.14%)和召回率(90.18%),并改善了聚类分离效果。

链接: https://arxiv.org/abs/2507.14706
作者: Claudio Giusti,Luca Guarnera,Mirko Casu,Sebastiano Battiato
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 14 figures

点击查看摘要

Abstract:Detecting fraudulent credit card transactions remains a significant challenge, due to the extreme class imbalance in real-world data and the often subtle patterns that separate fraud from legitimate activity. Existing research commonly attempts to address this by generating synthetic samples for the minority class using approaches such as GANs, VAEs, or hybrid generative models. However, these techniques, particularly when applied only to minority-class data, tend to result in overconfident classifiers and poor latent cluster separation, ultimately limiting real-world detection performance. In this study, we propose the Causal Prototype Attention Classifier (CPAC), an interpretable architecture that promotes class-aware clustering and improved latent space structure through prototype-based attention mechanisms and we will couple it with the encoder in a VAE-GAN allowing it to offer a better cluster separation moving beyond post-hoc sample augmentation. We compared CPAC-augmented models to traditional oversamplers, such as SMOTE, as well as to state-of-the-art generative models, both with and without CPAC-based latent classifiers. Our results show that classifier-guided latent shaping with CPAC delivers superior performance, achieving an F1-score of 93.14% percent and recall of 90.18%, along with improved latent cluster separation. Further ablation studies and visualizations provide deeper insight into the benefits and limitations of classifier-driven representation learning for fraud detection. The codebase for this work will be available at final submission.
zh

[AI-99] Configurable multi-agent framework for scalable and realistic testing of llm -based agents

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在实际应用中行为复杂且高度依赖上下文的问题,传统静态基准测试和人工手动测试已无法有效评估其多轮交互下的稳定性与鲁棒性。解决方案的核心是提出一个可配置的多代理框架 Neo,通过一个共享上下文枢纽(context-hub)耦合问题生成代理(Question Generation Agent)与评估代理(Evaluation Agent),实现对LLM系统的真实场景、多轮交互的自动化评测;该框架利用概率状态模型采样对话流、用户意图和情感基调,生成多样化且自适应的人类风格对话,并结合动态反馈机制实现高效、高覆盖率的行为探索,从而显著提升测试效率(10–12倍吞吐量)并发现接近人类红队专家水平的边缘案例失败(3.3% vs. 5.8%)。

链接: https://arxiv.org/abs/2507.14705
作者: Sai Wang,Senthilnathan Subramanian,Mudit Sahni,Praneeth Gone,Lingjie Meng,Xiaochen Wang,Nicolas Ferradas Bertoli,Tingxian Cheng,Jun Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context-hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12X higher throughput, generating 180 coherent test questions in around 45 mins versus 16h of human effort. Beyond security probing, Neo’s stochastic policies balanced topic coverage and conversational depth, yielding broader behavioural exploration than manually crafted scripts. Neo therefore lays a foundation for scalable, self-evolving LLM QA: its agent interfaces, state controller and feedback loops are model-agnostic and extensible to richer factual-grounding and policy-compliance checks. We release the framework to facilitate reproducible, high-fidelity testing of emerging agentic systems. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14705 [cs.AI] (or arXiv:2507.14705v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.14705 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mudit Sahni [view email] [v1] Sat, 19 Jul 2025 17:51:25 UTC (853 KB)
zh

[AI-100] Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition

【速读】:该论文旨在解决脑电图(EEG)情绪识别在实际应用中面临的两大挑战:一是如何有效整合非平稳的时空神经模式,二是如何在真实场景下应对动态的情绪强度变化。解决方案的关键在于提出一种融合空间-时间变换器与课程学习(Curriculum Learning, CL)的新框架SST-CL。其核心创新包括:(1)设计了一个空间编码器以建模通道间关系,并结合一个基于窗口注意力机制的时间编码器来捕捉多尺度时序依赖,从而同步提取EEG信号中的空间相关性和时间动态性;(2)引入一种强度感知的课程学习策略,通过双难度评估动态调度样本,从高情绪强度向低情绪强度逐步引导训练过程,显著提升模型对情绪强度变化的鲁棒性。实验表明,该方法在多个基准数据集上均达到当前最优性能。

链接: https://arxiv.org/abs/2507.14698
作者: Xuetao Lin(1 and 2),Tianhao Peng(1 and 2),Peihong Dai(1 and 2),Yu Liang(3),Wenjun Wu(1 and 2) ((1) Beihang University, Beijing, China, (2) SKLCCSE, Beijing, China, (3) Beijing University of Technology, Beijing, China)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes SST-CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism.
zh

[AI-101] Efficient Story Point Estimation With Comparative Learning

【速读】:该论文旨在解决敏捷软件开发中故事点(Story Point)估算效率低下的问题,传统方法如规划扑克依赖人工协作,虽有助于项目校准,但一旦团队形成固定基准后,估算过程易变得繁琐且耗时。解决方案的关键在于引入基于比较学习(Comparative Learning)的框架:开发者不再直接为每个需求项分配具体的故事点数值,而是通过对比成对需求项并指出哪个更耗费努力,从而提供更轻量级的反馈;在此基础上训练机器学习模型预测故事点。实证结果表明,该方法在16个项目的23,313条手动标注数据上平均获得0.34的Spearman秩相关系数,与基于真实故事点回归建模的性能相当甚至更优,同时显著降低了人类认知负担,符合比较判断定律(Law of Comparative Judgments)。

链接: https://arxiv.org/abs/2507.14642
作者: Monoshiz Mahbub Khan,Xioayin Xi,Andrew Meneely,Zhe Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibrating of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items, and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. The model learned from comparative judgments can achieve on average 0.34 Spearman’s rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments - providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.
zh

[AI-102] VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking

【速读】:该论文旨在解决垂直联邦学习(Vertical Federated Learning, VFL)系统在面对模型完成攻击(Model Completion, MC attack)时存在的标签隐私泄露问题。现有防御方法要么显著降低模型准确性,要么带来不切实际的计算开销。解决方案的关键在于提出VMask框架,其核心思想是通过层掩码(layer masking)机制破坏输入数据与中间输出之间的强相关性:利用秘密共享(Secret Sharing, SS)技术对攻击者模型中的关键层参数进行掩码处理,并设计了一种针对性选择需掩码层的策略,从而在保证防御效果的同时显著降低计算负担;此外,VMask首次引入可调节的隐私预算机制,使防御方能够灵活控制标签隐私保护强度,实现了卓越的隐私-效用权衡,在多个模型架构和数据集上验证了其有效性与高效性。

链接: https://arxiv.org/abs/2507.14629
作者: Juntao Tan,Lan Zhang,Zhonghao Hu,Kai Yang,Peng Ran,Bo Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Though vertical federated learning (VFL) is generally considered to be privacy-preserving, recent studies have shown that VFL system is vulnerable to label inference attacks originating from various attack surfaces. Among these attacks, the model completion (MC) attack is currently the most powerful one. Existing defense methods against it either sacrifice model accuracy or incur impractical computational overhead. In this paper, we propose VMask, a novel label privacy protection framework designed to defend against MC attack from the perspective of layer masking. Our key insight is to disrupt the strong correlation between input data and intermediate outputs by applying the secret sharing (SS) technique to mask layer parameters in the attacker’s model. We devise a strategy for selecting critical layers to mask, reducing the overhead that would arise from naively applying SS to the entire model. Moreover, VMask is the first framework to offer a tunable privacy budget to defenders, allowing for flexible control over the levels of label privacy according to actual requirements. We built a VFL system, implemented VMask on it, and extensively evaluated it using five model architectures and 13 datasets with different modalities, comparing it to 12 other defense methods. The results demonstrate that VMask achieves the best privacy-utility trade-off, successfully thwarting the MC attack (reducing the label inference accuracy to a random guessing level) while preserving model performance (e.g., in Transformer-based model, the averaged drop of VFL model accuracy is only 0.09%). VMask’s runtime is up to 60,846 times faster than cryptography-based methods, and it only marginally exceeds that of standard VFL by 1.8 times in a large Transformer-based model, which is generally acceptable.
zh

[AI-103] VTarbel: Targeted Label Attack with Minimal Knowledge on Detector-enhanced Vertical Federated Learning

【速读】:该论文旨在解决垂直联邦学习(Vertical Federated Learning, VFL)中针对标签的定向攻击(targeted label attacks)问题,此类攻击在现有研究中尚未被充分探索。具体而言,攻击者通过在推理阶段扰动输入数据,迫使模型将样本错误分类为目标标签,而传统方法往往依赖不切实际的假设(如可访问VFL模型输出)或忽略现实系统中部署的异常检测机制。论文提出的解决方案是VTarbel框架,其关键在于设计了一个两阶段、低知识需求的攻击策略:第一阶段利用最大均值差异(Maximum Mean Discrepancy, MMD)选择高表达能力样本,通过VFL协议获取伪标签,并训练本地特征上的代理模型和异常检测器;第二阶段则基于上述模型指导梯度扰动,生成能实现目标误分类且有效规避检测的对抗样本。该方案在多种模型架构、多模态数据集及两种异常检测器下均表现出优越性能,揭示了当前VFL部署中的安全盲点。

链接: https://arxiv.org/abs/2507.14625
作者: Juntao Tan,Anran Li,Quanchao Liu,Peng Ran,Lan Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vertical federated learning (VFL) enables multiple parties with disjoint features to collaboratively train models without sharing raw data. While privacy vulnerabilities of VFL are extensively-studied, its security threats-particularly targeted label attacks-remain underexplored. In such attacks, a passive party perturbs inputs at inference to force misclassification into adversary-chosen labels. Existing methods rely on unrealistic assumptions (e.g., accessing VFL-model’s outputs) and ignore anomaly detectors deployed in real-world systems. To bridge this gap, we introduce VTarbel, a two-stage, minimal-knowledge attack framework explicitly designed to evade detector-enhanced VFL inference. During the preparation stage, the attacker selects a minimal set of high-expressiveness samples (via maximum mean discrepancy), submits them through VFL protocol to collect predicted labels, and uses these pseudo-labels to train estimated detector and surrogate model on local features. In attack stage, these models guide gradient-based perturbations of remaining samples, crafting adversarial instances that induce targeted misclassifications and evade detection. We implement VTarbel and evaluate it against four model architectures, seven multimodal datasets, and two anomaly detectors. Across all settings, VTarbel outperforms four state-of-the-art baselines, evades detection, and retains effective against three representative privacy-preserving defenses. These results reveal critical security blind spots in current VFL deployments and underscore urgent need for robust, attack-aware defenses.
zh

[AI-104] Enhancing POI Recommendation through Global Graph Disentanglement with POI Weighted Module

【速读】:该论文旨在解决现有下一兴趣点(Next Point of Interest, Next POI)推荐方法中存在的三个关键问题:一是未充分挖掘兴趣点(POI)类别与时间之间的关联,导致模型难以捕捉用户在不同时间段对特定POI类别的偏好;二是时间信息建模方式(如时间嵌入或时间间隔)难以有效表达时间的连续性特征;三是预测过程中忽略了多种加权因素,例如POI的流行度、POI间的转移关系及空间距离,从而影响推荐精度。解决方案的关键在于提出一种名为图解耦合带POI加权模块(Graph Disentangler with POI Weighted Module, GDPW)的新框架,该框架通过全局类别图(Global Category Graph)和全局类别-时间图(Global Category-Time Graph)联合学习POI类别与时间表征,并利用对比学习实现类别与时间信息的解耦;最终基于POI间的转移权重和距离关系对预测结果进行加权融合,显著提升了推荐性能。

链接: https://arxiv.org/abs/2507.14612
作者: Pei-Xuan Li,Wei-Yun Liang,Fandel Lin,Hsun-Ping Hsieh
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Next point of interest (POI) recommendation primarily predicts future activities based on users’ past check-in data and current status, providing significant value to users and service providers. We observed that the popular check-in times for different POI categories vary. For example, coffee shops are crowded in the afternoon because people like to have coffee to refresh after meals, while bars are busy late at night. However, existing methods rarely explore the relationship between POI categories and time, which may result in the model being unable to fully learn users’ tendencies to visit certain POI categories at different times. Additionally, existing methods for modeling time information often convert it into time embeddings or calculate the time interval and incorporate it into the model, making it difficult to capture the continuity of time. Finally, during POI prediction, various weighting information is often ignored, such as the popularity of each POI, the transition relationships between POIs, and the distances between POIs, leading to suboptimal performance. To address these issues, this paper proposes a novel next POI recommendation framework called Graph Disentangler with POI Weighted Module (GDPW). This framework aims to jointly consider POI category information and multiple POI weighting factors. Specifically, the proposed GDPW learns category and time representations through the Global Category Graph and the Global Category-Time Graph. Then, we disentangle category and time information through contrastive learning. After prediction, the final POI recommendation for users is obtained by weighting the prediction results based on the transition weights and distance relationships between POIs. We conducted experiments on two real-world datasets, and the results demonstrate that the proposed GDPW outperforms other existing models, improving performance by 3% to 11%.
zh

[AI-105] Coordinate Heart System: A Geometric Framework for Emotion Representation

【速读】:该论文旨在解决传统情感分类模型在人工智能系统中对复杂情绪状态表示能力不足的问题,尤其是其无法有效建模多维情感冲突、动态变化及心理状态稳定性。解决方案的关键在于提出坐标心脏系统(Coordinate Heart System, CHS),将八种核心情绪置于单位圆上作为坐标点,通过向量运算实现情绪混合与插值,并引入一个重新校准的稳定性参数 $ S \in [0,1] $,该参数动态整合情感负荷、冲突解决机制和情境消耗因子,结合大语言模型对文本线索的解析与混合时间追踪机制,从而实现对心理状态的精细化建模与计算。此框架不仅提供了数学上完备的情绪空间覆盖,还首次从理论上证明五种情绪不足以实现几何完整性,为AI情绪识别提供了新的数学基础与可计算范式。

链接: https://arxiv.org/abs/2507.14593
作者: Omar Al-Desi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 pages

点击查看摘要

Abstract:This paper presents the Coordinate Heart System (CHS), a geometric framework for emotion representation in artificial intelligence applications. We position eight core emotions as coordinates on a unit circle, enabling mathematical computation of complex emotional states through coordinate mixing and vector operations. Our initial five-emotion model revealed significant coverage gaps in the emotion space, leading to the development of an eight-emotion system that provides complete geometric coverage with mathematical guarantees. The framework converts natural language input to emotion coordinates and supports real-time emotion interpolation through computational algorithms. The system introduces a re-calibrated stability parameter S in [0,1], which dynamically integrates emotional load, conflict resolution, and contextual drain factors. This stability model leverages advanced Large Language Model interpretation of textual cues and incorporates hybrid temporal tracking mechanisms to provide nuanced assessment of psychological well-being states. Our key contributions include: (i) mathematical proof demonstrating why five emotions are insufficient for complete geometric coverage, (ii) an eight-coordinate system that eliminates representational blind spots, (iii) novel algorithms for emotion mixing, conflict resolution, and distance calculation in emotion space, and (iv) a comprehensive computational framework for AI emotion recognition with enhanced multi-dimensional stability modeling. Experimental validation through case studies demonstrates the system’s capability to handle emotionally conflicted states, contextual distress factors, and complex psychological scenarios that traditional categorical emotion models cannot adequately represent. This work establishes a new mathematical foundation for emotion modeling in artificial intelligence systems.
zh

[AI-106] A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification

【速读】:该论文旨在解决无人机(UAV)飞行状态分类中传统时间序列分类(TSC)方法鲁棒性不足、泛化能力差,以及当前先进模型如基于Transformer和LSTM的架构在高维数据流下对大规模标注数据依赖性强且计算成本高的问题。其解决方案的关键在于提出一种融合Transformer生成对抗网络(GAN)与多实例局部可解释学习(MILET)的新框架:其中,Transformer编码器用于捕捉长程时序依赖和复杂的遥测动态;GAN模块通过生成真实感合成样本增强小样本数据集;MILET机制聚焦于最具判别性的输入片段,从而降低噪声干扰和计算开销。实验表明,该方法在DroneDetect和DroneRF数据集上分别达到96.5%和98.6%的准确率,且具备良好的计算效率与跨平台泛化能力,适用于资源受限环境下的实时部署。

链接: https://arxiv.org/abs/2507.14592
作者: Haochen Liu,Jia Bi,Xiaomin Wang,Xin Yang,Ling Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly used in surveillance, logistics, agriculture, disaster management, and military operations. Accurate detection and classification of UAV flight states, such as hovering, cruising, ascending, or transitioning, which are essential for safe and effective operations. However, conventional time series classification (TSC) methods often lack robustness and generalization for dynamic UAV environments, while state of the art(SOTA) models like Transformers and LSTM based architectures typically require large datasets and entail high computational costs, especially with high-dimensional data streams. This paper proposes a novel framework that integrates a Transformer-based Generative Adversarial Network (GAN) with Multiple Instance Locally Explainable Learning (MILET) to address these challenges in UAV flight state classification. The Transformer encoder captures long-range temporal dependencies and complex telemetry dynamics, while the GAN module augments limited datasets with realistic synthetic samples. MIL is incorporated to focus attention on the most discriminative input segments, reducing noise and computational overhead. Experimental results show that the proposed method achieves superior accuracy 96.5% on the DroneDetect dataset and 98.6% on the DroneRF dataset that outperforming other SOTA approaches. The framework also demonstrates strong computational efficiency and robust generalization across diverse UAV platforms and flight states, highlighting its potential for real-time deployment in resource constrained environments.
zh

[AI-107] LPS-GNN : Deploying Graph Neural Networks on Graphs with 100-Billion Edges

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在大规模图数据上进行表示学习时面临的执行效率与预测精度难以平衡的问题,尤其是由迭代消息传递机制引发的计算密集型任务和GPU内存消耗问题,特别是在处理存在邻居爆炸(neighbor explosion)的大规模图时更为显著。其解决方案的关键在于提出了一种名为LPS-GNN的可扩展、低成本、灵活且高效的GNN框架,该框架通过设计一种新型图划分算法LPMetis来优化图分区效果,并结合子图增强策略以提升模型性能;同时,LPS-GNN具备良好的兼容性,能够适配多种GNN算法,在单个GPU上即可完成百亿级别图的数据训练,实现在用户获取(User Acquisition)场景中相比SOTA模型提升13.8%的性能表现。

链接: https://arxiv.org/abs/2507.14570
作者: Xu Cheng,Liang Yao,Feng He,Yukuo Cen,Yufei He,Chenhui Zhang,Wenzheng Feng,Hongyun Cai,Jie Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for various graph mining tasks, yet existing scalable solutions often struggle to balance execution efficiency with prediction accuracy. These difficulties stem from iterative message-passing techniques, which place significant computational demands and require extensive GPU memory, particularly when dealing with the neighbor explosion issue inherent in large-scale graphs. This paper introduces a scalable, low-cost, flexible, and efficient GNN framework called LPS-GNN, which can perform representation learning on 100 billion graphs with a single GPU in 10 hours and shows a 13.8% improvement in User Acquisition scenarios. We examine existing graph partitioning methods and design a superior graph partition algorithm named LPMetis. In particular, LPMetis outperforms current state-of-the-art (SOTA) approaches on various evaluation metrics. In addition, our paper proposes a subgraph augmentation strategy to enhance the model’s predictive performance. It exhibits excellent compatibility, allowing the entire framework to accommodate various GNN algorithms. Successfully deployed on the Tencent platform, LPS-GNN has been tested on public and real-world datasets, achieving performance lifts of 8. 24% to 13. 89% over SOTA models in online applications.
zh

[AI-108] Large Language Models Assisting Ontology Evaluation

【速读】:该论文旨在解决本体(ontology)评估中依赖人工执行能力问题(Competency Question, CQ)验证所导致的高成本、劳动密集且易出错的问题。其解决方案的关键在于提出OE-Assist框架,该框架利用大规模语言模型(Large Language Model, LLM)实现CQ验证的自动化与半自动化,通过一个包含1,393个CQ及其对应本体和本体故事的数据集,首次系统性地探索了LLM在本体评估中的应用效果,并结合Protégé工具提供实时建议,显著提升了评估效率与可扩展性。

链接: https://arxiv.org/abs/2507.14552
作者: Anna Sofia Lippolis,Mohammad Javad Saeedizade,Robin Keskisärkkä,Aldo Gangemi,Eva Blomqvist,Andrea Giovanni Nuzzolese
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ontology evaluation through functional requirements, such as testing via competency question (CQ) verification, is a well-established yet costly, labour-intensive, and error-prone endeavour, even for ontology engineering experts. In this work, we introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated CQ verification. By presenting and leveraging a dataset of 1,393 CQs paired with corresponding ontologies and ontology stories, our contributions present, to our knowledge, the first systematic investigation into large language model (LLM)-assisted ontology evaluation, and include: (i) evaluating the effectiveness of a LLM-based approach for automatically performing CQ verification against a manually created gold standard, and (ii) developing and assessing an LLM-powered framework to assist CQ verification with Protégé, by providing suggestions. We found that automated LLM-based evaluation with o1-preview and o3-mini perform at a similar level to the average user’s performance.
zh

[AI-109] What if Othello-Playing Language Models Could See? ICML2025

【速读】:该论文试图解决语言模型在缺乏与物理世界交互的情况下可能面临的符号接地(symbol grounding)问题,即如何使语言模型不仅依赖文本统计规律,还能通过感知输入(如视觉信息)来理解现实世界的结构。解决方案的关键在于引入VISOTHELLO——一个基于棋盘图像和走子历史的多模态模型,通过联合训练语言模型与视觉输入,使其在国际象棋(Othello)这一规则明确的简化世界中学习到更鲁棒的内部表示,从而提升预测下一步动作的性能,并增强对语义无关扰动的抗干扰能力。

链接: https://arxiv.org/abs/2507.14520
作者: Xinyi Chen,Yifei Yuan,Jiaang Li,Serge Belongie,Maarten de Rijke,Anders Søgaard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML 2025 Assessing World Models Workshop

点击查看摘要

Abstract:Language models are often said to face a symbol grounding problem. While some argue that world understanding can emerge from text alone, others suggest grounded learning is more efficient. We explore this through Othello, where the board state defines a simplified, rule-based world. Building on prior work, we introduce VISOTHELLO, a multi-modal model trained on move histories and board images. Using next-move prediction, we compare it to mono-modal baselines and test robustness to semantically irrelevant perturbations. We find that multi-modal training improves both performance and the robustness of internal representations. These results suggest that grounding language in visual input helps models infer structured world representations.
zh

[AI-110] owards Efficient Privacy-Preserving Machine Learning: A Systematic Review from Protocol Model and System Perspectives DATE

【速读】:该论文旨在解决隐私保护机器学习(Privacy-preserving Machine Learning, PPML)在实际应用中面临的效率与可扩展性瓶颈问题。PPML虽然通过密码学协议实现了形式化的数据隐私保障,但其计算和通信开销往往比明文机器学习高几个数量级,严重制约了其落地应用。解决方案的关键在于跨层级优化:从协议层(Protocol Level)、模型层(Model Level)到系统层(System Level)进行协同改进,通过多维度的技术创新缩小效率差距,并强调未来研究需整合这三个层面的优化策略以实现突破。

链接: https://arxiv.org/abs/2507.14519
作者: Wenxuan Zeng,Tianshi Xu,Yi Chen,Yifan Zhou,Mingzhe Zhang,Jin Tan,Cheng Hong,Meng Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This work will be continuously updated to reflect the latest advances

点击查看摘要

Abstract:Privacy-preserving machine learning (PPML) based on cryptographic protocols has emerged as a promising paradigm to protect user data privacy in cloud-based machine learning services. While it achieves formal privacy protection, PPML often incurs significant efficiency and scalability costs due to orders of magnitude overhead compared to the plaintext counterpart. Therefore, there has been a considerable focus on mitigating the efficiency gap for PPML. In this survey, we provide a comprehensive and systematic review of recent PPML studies with a focus on cross-level optimizations. Specifically, we categorize existing papers into protocol level, model level, and system level, and review progress at each level. We also provide qualitative and quantitative comparisons of existing works with technical insights, based on which we discuss future research directions and highlight the necessity of integrating optimizations across protocol, model, and system levels. We hope this survey can provide an overarching understanding of existing approaches and potentially inspire future breakthroughs in the PPML field. As the field is evolving fast, we also provide a public GitHub repository to continuously track the developments, which is available at this https URL.
zh

[AI-111] SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

【速读】:该论文旨在解决当前时间序列自监督表示学习中广泛采用的距离类目标函数(如均方误差,MSE)存在的局限性:这些方法对幅度敏感、对波形极性不变且尺度无界,导致语义对齐困难并降低可解释性。解决方案的关键在于提出信号Dice相似度系数(Signal Dice Similarity Coefficient, SDSC),这是一种结构感知的度量函数,通过 signed amplitudes 的交集量化时间信号间的结构一致性,源自Dice相似度系数(Dice Similarity Coefficient, DSC)。SDSC可作为损失函数使用(通过1减去其值并引入Heaviside函数的可微近似以支持梯度优化),并进一步设计了融合SDSC与MSE的混合损失策略,在保持幅度信息的同时提升训练稳定性。实验表明,基于SDSC的预训练在预测和分类任务上优于或等效于MSE,尤其在领域内和低资源场景下表现更优,验证了结构保真度对增强语义表示质量的重要性。

链接: https://arxiv.org/abs/2507.14516
作者: Jeyoung Lee,Hochul Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC).Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.
zh

[AI-112] Amico: An Event-Driven Modular Framework for Persistent and Embedded Autonomy

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主代理在真实世界或资源受限环境中部署时面临的三大挑战:对云端计算的高度依赖、动态场景下鲁棒性不足,以及缺乏持续自主性和环境感知能力。其解决方案的关键在于提出一个模块化、事件驱动的框架 Amico,该框架采用 Rust 语言编写以保障安全性和性能,并通过 WebAssembly 支持在嵌入式系统和浏览器环境中高效运行。Amico 提供了清晰的事件处理、状态管理、行为执行及推理模块集成抽象,构建了一个统一基础设施,使代理能够在计算资源有限且网络连接间歇性的场景中实现稳定、交互式的自主运行。

链接: https://arxiv.org/abs/2507.14513
作者: Hongyi Yang,Yue Pan,Jiayi Xu,Kelsen Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) and autonomous agents have enabled systems capable of performing complex tasks across domains such as human-computer interaction, planning, and web navigation. However, many existing frameworks struggle in real-world or resource-constrained environments due to their reliance on cloud-based computation, limited robustness in dynamic contexts, and lack of persistent autonomy and environmental awareness. We present Amico, a modular, event-driven framework for building autonomous agents optimized for embedded systems. Written in Rust for safety and performance, Amico supports reactive, persistent agents that operate efficiently across embedded platforms and browser environments via WebAssembly. It provides clean abstractions for event handling, state management, behavior execution, and integration with reasoning modules. Amico delivers a unified infrastructure for constructing resilient, interactive agents suitable for deployment in settings with limited compute and intermittent connectivity. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14513 [cs.AI] (or arXiv:2507.14513v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.14513 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-113] Strategyproofness and Monotone Allocation of Auction in Social Networks IJCAI2025

【速读】:该论文旨在解决网络拍卖(network auction)中策略性激励相容(strategyproofness)的设计难题,特别是当投标者不仅需如实申报自身估值,还需在社交网络中积极邀请邻居参与时,如何设计满足策略性激励相容的分配与支付机制。此前,尽管Myerson引理中的价值单调分配规则(value-monotone allocation)是经典拍卖的核心基础,但针对网络结构的拍卖缺乏通用的分配规则原则,导致现有研究在多单位网络拍卖(multi-unit network auctions)中难以实现策略性激励相容。论文的关键突破在于首次识别出两类网络分配规则的单调性:邀请抑制单调性(Invitation-Depressed Monotonicity, ID-MON)和邀请促进单调性(Invitation-Promoted Monotonicity, IP-MON),它们涵盖了所有已知网络拍卖的分配规则作为特例;进一步地,作者为每类单调分配规则刻画了策略性激励相容支付规则的存在条件与充分条件,并证明在所有满足条件的支付规则中,存在唯一可计算的收入最大化支付方案,从而系统性解决了单需求(single-minded)投标人下组合网络拍卖的策略性激励相容问题。

链接: https://arxiv.org/abs/2507.14472
作者: Yuhang Guo,Dong Hao,Bin Li,Mingyu Xiao,Bakh Khoussainov
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Strategyproofness in network auctions requires that bidders not only report their valuations truthfully, but also do their best to invite neighbours from the social network. In contrast to canonical auctions, where the value-monotone allocation in Myerson’s Lemma is a cornerstone, a general principle of allocation rules for strategyproof network auctions is still missing. We show that, due to the absence of such a principle, even extensions to multi-unit network auctions with single-unit demand present unexpected difficulties, and all pioneering researches fail to be strategyproof. For the first time in this field, we identify two categories of monotone allocation rules on networks: Invitation-Depressed Monotonicity (ID-MON) and Invitation-Promoted Monotonicity (IP-MON). They encompass all existing allocation rules of network auctions as specific instances. For any given ID-MON or IP-MON allocation rule, we characterize the existence and sufficient conditions for the strategyproof payment rules, and show that among all such payment rules, the revenue-maximizing one exists and is computationally feasible. With these results, the obstacle of combinatorial network auction with single-minded bidders is now resolved.
zh

[AI-114] BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning

【速读】:该论文旨在解决生物医学知识图谱(Biomedical Knowledge Graphs, BKGs)在完成与推理任务中,语义理解与结构学习难以实现深度协同进化的问题。现有方法如知识嵌入(Knowledge Embedding, KE)虽能捕捉全局语义但缺乏动态结构整合能力,而图神经网络(Graph Neural Networks, GNNs)擅长局部结构建模却常忽视语义理解;即便集成方法(包括语言模型增强方案)也未能实现二者之间的自适应、持续性互促优化。其解决方案的关键在于提出BioGraphFusion框架:通过张量分解建立全局语义基础,并引入LSTM驱动的关系嵌入动态更新机制,在图传播过程中实现语义引导的结构精炼,从而形成语义与结构间的自适应交互;进一步结合查询引导的子图构建和混合评分机制,显著提升了复杂BKG中的推理性能。

链接: https://arxiv.org/abs/2507.14468
作者: Yitong Lin,Jiaying He,Jiahe Chen,Xinnan Zhu,Jianwei Zheng,Tao Bo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by Bioinformatics on July 11th

点击查看摘要

Abstract:Motivation: Biomedical knowledge graphs (KGs) are crucial for drug discovery and disease understanding, yet their completion and reasoning are challenging. Knowledge Embedding (KE) methods capture global semantics but struggle with dynamic structural integration, while Graph Neural Networks (GNNs) excel locally but often lack semantic understanding. Even ensemble approaches, including those leveraging language models, often fail to achieve a deep, adaptive, and synergistic co-evolution between semantic comprehension and structural learning. Addressing this critical gap in fostering continuous, reciprocal refinement between these two aspects in complex biomedical KGs is paramount. Results: We introduce BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation. This fosters adaptive interplay between semantic understanding and structural learning, further enhanced by query-guided subgraph construction and a hybrid scoring mechanism. Experiments across three key biomedical tasks demonstrate BioGraphFusion’s superior performance over state-of-the-art KE, GNN, and ensemble models. A case study on Cutaneous Malignant Melanoma 1 (CMM1) highlights its ability to unveil biologically meaningful pathways. Availability and Implementation: Source code and all training data are freely available for download at this https URL. Contact: zjw@zjut.this http URL, botao666666@126.com. Supplementary information: Supplementary data are available at Bioinformatics online. Comments: Accepted by Bioinformatics on July 11th Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14468 [cs.AI] (or arXiv:2507.14468v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.14468 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1093/bioinformatics/btaf408 Focus to learn more DOI(s) linking to related resources Submission history From: Jiaying He [view email] [v1] Sat, 19 Jul 2025 04:03:42 UTC (1,454 KB) Full-text links: Access Paper: View a PDF of the paper titled BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning, by Yitong Lin and 5 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.AI prev | next new | recent | 2025-07 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-115] Designing Conversational AI to Support Think-Aloud Practice in Technical Interview Preparation for CS Students

【速读】:该论文旨在解决技术面试中“思考 aloud”(think-aloud)练习机会有限的问题,即候选人缺乏结构化训练来提升在编码任务中口头表达思维过程的能力。其解决方案的关键在于开发并评估一个基于大语言模型(Large Language Model, LLM)的技术面试模拟工具,通过AI实现高质量的模拟对话、多维度反馈以及从生成示例中学习的功能。研究发现,用户高度认可AI在模拟真实性、提供超越言语内容的反馈(如逻辑结构与表达清晰度)以及通过人机协作构建共享的思考实例方面的价值,并据此提出三大设计建议:增强对话式AI的社会临场感(social presence)、拓展反馈维度、建立众包驱动的think-aloud案例库。此外,论文还呼吁重新定义AI在面试准备中的角色,推动以人机协同为核心的研究方向,从而促进计算领域职业发展的公平性与包容性。

链接: https://arxiv.org/abs/2507.14418
作者: Taufiq Daryanto,Sophia Stil,Xiaohan Ding,Daniel Manesh,Sang Won Lee,Tim Lee,Stephanie Lunn,Sarah Rodriguez,Chris Brown,Eugenia Rho
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One challenge in technical interviews is the think-aloud process, where candidates verbalize their thought processes while solving coding tasks. Despite its importance, opportunities for structured practice remain limited. Conversational AI offers potential assistance, but limited research explores user perceptions of its role in think-aloud practice. To address this gap, we conducted a study with 17 participants using an LLM-based technical interview practice tool. Participants valued AI’s role in simulation, feedback, and learning from generated examples. Key design recommendations include promoting social presence in conversational AI for technical interview simulation, providing feedback beyond verbal content analysis, and enabling crowdsourced think-aloud examples through human-AI collaboration. Beyond feature design, we examined broader considerations, including intersectional challenges and potential strategies to address them, how AI-driven interview preparation could promote equitable learning in computing careers, and the need to rethink AI’s role in interview practice by suggesting a research direction that integrates human-AI collaboration.
zh

[AI-116] Fail Fast or Ask: Mitigating the Deficiencies of Reasoning LLM s with Human-in-the-Loop Systems Engineering

【速读】:该论文旨在解决当前先进推理型大语言模型(reasoning LLMs)在风险敏感领域应用时面临的两大挑战:一是模型仍存在非零错误率,难以满足高可靠性需求;二是推理模型延迟较高,限制了其在高吞吐场景下的部署。解决方案的关键在于构建一种“黑箱式”的系统工程架构——通过引入人类专家与模型协作机制,利用推理路径长度量化模型不确定性以决定是否将问题转交人类处理,从而显著降低错误率;同时,在推理模型前叠加一个非推理型大模型作为前置筛选器,实现“快速失败或请求”(Fail Fast, or Ask)策略,有效减少延迟和成本,尽管存在“延迟拖拽”(latency drag)现象导致收益低于预期,整体仍能维持较高的准确率-拒识曲线面积(AUC)。

链接: https://arxiv.org/abs/2507.14406
作者: Michael J. Zellinger,Matt Thomson
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:State-of-the-art reasoning LLMs are powerful problem solvers, but they still occasionally make mistakes. However, adopting AI models in risk-sensitive domains often requires error rates near 0%. To address this gap, we propose collaboration between a reasoning model and a human expert who resolves queries the model cannot confidently answer. We find that quantifying the uncertainty of a reasoning model through the length of its reasoning trace yields an effective basis for deferral to a human, e.g., cutting the error rate of Qwen3 235B-A22B on difficult MATH problems from 3% to less than 1% when deferring 7.5% of queries. However, the high latency of reasoning models still makes them challenging to deploy on use cases with high query volume. To address this challenge, we explore fronting a reasoning model with a large non-reasoning model. We call this modified human-in-the-loop system “Fail Fast, or Ask”, since the non-reasoning model may defer difficult queries to the human expert directly (“failing fast”), without incurring the reasoning model’s higher latency. We show that this approach yields around 40% latency reduction and about 50% cost savings for DeepSeek R1 while maintaining 90+% area under the accuracy-rejection curve. However, we observe that latency savings are lower than expected because of “latency drag”, the phenomenon that processing easier queries with a non-reasoning model pushes the reasoning model’s latency distribution towards longer latencies. Broadly, our results suggest that the deficiencies of state-of-the-art reasoning models – nontrivial error rates and high latency – can be substantially mitigated through black-box systems engineering, without requiring access to LLM internals.
zh

[AI-117] Adaptive Multi-Agent Reasoning via Automated Workflow Generation

【速读】:该论文旨在解决当前大型推理模型(Large Reasoning Models, LRMs)在面对新问题时普遍存在的泛化能力不足问题,即模型倾向于依赖记忆而非真正的推理能力,导致过拟合(overfitting)现象。其核心解决方案是提出 Nexus Architect——一个增强型多智能体系统框架,关键创新在于引入了自动化工作流合成机制(automated workflow synthesis mechanism),能够根据用户提示和少量示例自主构建适配特定问题类别的推理流程,包括策略选择、工具集成与对抗性技术应用;同时结合迭代提示优化机制(iterative prompt refinement mechanism),持续微调智能体系统提示以提升性能并增强泛化能力。实验表明,该方案显著优于现有主流LRMs,在逻辑推理任务上实现最高达66%的通过率提升。

链接: https://arxiv.org/abs/2507.14393
作者: Humza Sami,Mubashir ul Islam,Pierre-Emmanuel Gaillardon,Valerio Tenace
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of Large Reasoning Models (LRMs) promises a significant leap forward in language model capabilities, aiming to tackle increasingly sophisticated tasks with unprecedented efficiency and accuracy. However, despite their impressive performance, recent studies have highlighted how current reasoning models frequently fail to generalize to novel, unseen problems, often resorting to memorized solutions rather than genuine inferential reasoning. Such behavior underscores a critical limitation in modern LRMs, i.e., their tendency toward overfitting, which in turn results in poor generalization in problem-solving capabilities. In this paper, we introduce Nexus Architect, an enhanced iteration of our multi-agent system framework, Nexus, equipped with a novel automated workflow synthesis mechanism. Given a user’s prompt and a small set of representative examples, the Architect autonomously generates a tailored reasoning workflow by selecting suitable strategies, tool integrations, and adversarial techniques for a specific problem class. Furthermore, the Architect includes an iterative prompt refinement mechanism that fine-tunes agents’ system prompts to maximize performance and improve the generalization capabilities of the system. We empirically evaluate Nexus Architect by employing an off-the-shelf, non-reasoning model on a custom dataset of challenging logical questions and compare its performance against state-of-the-art LRMs. Results show that Nexus Architect consistently outperforms existing solutions, achieving up to a 66% increase in pass rate over Gemini 2.5 Flash Preview, nearly 2.5 \times against Claude Sonnet 4 and DeepSeek-R1, and over 3 \times w.r.t. Llama 4 Scout. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14393 [cs.AI] (or arXiv:2507.14393v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2507.14393 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-118] Incremental Causal Graph Learning for Online Cyberattack Detection in Cyber-Physical Infrastructures

【速读】:该论文旨在解决实时关键基础设施中因网络攻击导致的异常检测难题,尤其是传统方法在高数据方差和类别不平衡下易产生过多误报,以及现有基于因果图的方法难以适应动态变化的数据分布且缺乏在线学习能力的问题。其解决方案的关键在于提出一种名为INCADET的增量式因果图学习框架,通过三个核心模块实现:1)早期症状检测,利用连续因果图中边权重分布的差异识别系统状态变化;2)增量式因果图学习,结合经验回放与边强化机制,在不丢失历史知识的前提下持续优化因果结构;3)因果图分类,采用图卷积网络(Graph Convolutional Networks, GCNs)对系统状态进行判别。该框架能够有效捕捉系统行为演化并适应攻击模式的动态变化,显著提升检测精度、鲁棒性和适应性。

链接: https://arxiv.org/abs/2507.14387
作者: Arun Vignesh Malarkkan,Dongjie Wang,Haoyue Bai,Yanjie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 3 Tables, under review in IEEE Transactions on Big Data

点击查看摘要

Abstract:The escalating threat of cyberattacks on real-time critical infrastructures poses serious risks to public safety, demanding detection methods that effectively capture complex system interdependencies and adapt to evolving attack patterns. Traditional real-time anomaly detection techniques often suffer from excessive false positives due to their statistical sensitivity to high data variance and class imbalance. To address these limitations, recent research has explored modeling causal relationships among system components. However, prior work mainly focuses on offline causal graph-based approaches that require static historical data and fail to generalize to real-time settings. These methods are fundamentally constrained by: (1) their inability to adapt to dynamic shifts in data distribution without retraining, and (2) the risk of catastrophic forgetting when lacking timely supervision in live systems. To overcome these challenges, we propose INCADET, a novel framework for incremental causal graph learning tailored to real-time cyberattack detection. INCADET dynamically captures evolving system behavior by incrementally updating causal graphs across streaming time windows. The framework comprises three modules: 1) Early Symptom Detection: Detects transitions in system status using divergence in edge-weight distributions across sequential causal graphs. 2) Incremental Causal Graph Learning: Leverages experience replay and edge reinforcement to continually refine causal structures while preserving prior knowledge. 3) Causal Graph Classification: Employs Graph Convolutional Networks (GCNs) to classify system status using the learned causal graphs. Extensive experiments on real-world critical infrastructure datasets demonstrate that INCADET achieves superior accuracy, robustness, and adaptability compared to both static causal and deep temporal baselines in evolving attack scenarios.
zh

[AI-119] Schemora: schema matching via multi-stage recommendation and metadata enrichment using off-the-shelf llm s

【速读】:该论文旨在解决异构数据源集成与数据集发现中的Schema匹配问题,该问题传统上依赖大量标注数据或穷举式配对比较,导致资源消耗高且效率低下。其解决方案的关键在于提出SCHEMORA框架,该框架基于提示(prompt-based)方法,融合大语言模型(Large Language Models, LLMs)与混合检索技术(包括向量检索和词法检索),通过增强Schema元数据信息,在无需标注训练数据的情况下高效识别候选匹配项,从而显著提升匹配准确率与可扩展性。

链接: https://arxiv.org/abs/2507.14376
作者: Osman Erman Gungor,Derak Paulsen,William Kang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores the critical role of retrieval and provides practical guidance on model selection.
zh

[AI-120] A Reproducibility Study of Product-side Fairness in Bundle Recommendation

【速读】:该论文旨在解决捆绑推荐(Bundle Recommendation, BR)中产品层面的公平性问题,即在推荐结果中不同产品及其供应商所获得的曝光机会不均等。传统推荐系统中的公平性研究主要聚焦于个体物品,而BR因其推荐粒度为组合包(bundle),导致用户满意度和产品曝光同时受包内单个物品影响,使得公平性问题更加复杂。解决方案的关键在于:首先,通过在三个真实数据集上对四种先进BR方法进行可复现性实验,系统分析了包级与物品级的曝光差异;其次,揭示了仅基于包级假设的公平干预不足,必须考虑物品级公平性;再次,强调使用多种公平性指标评估的重要性,因不同指标可能导致截然不同的结论;最后,指出用户行为模式(如更频繁交互于包而非单品)是提升整体公平性的关键因素。这一系列发现为构建更公平的BR系统提供了实证依据和可操作路径。

链接: https://arxiv.org/abs/2507.14352
作者: Huy-Son Nguyen,Yuanna Liu,Masoud Mansoury,Mohammad Alian Nejadi,Alan Hanjalic,Maarten de Rijke
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain.
zh

[AI-121] Influence Functions for Preference Dataset Pruning

【速读】:该论文旨在解决语言模型在强化学习微调过程中因训练数据噪声导致性能下降的问题,特别是针对人类偏好数据集中小样本、高噪声特性下如何提升模型鲁棒性与泛化能力。其解决方案的关键在于引入共轭梯度近似的影响函数(influence function)方法,用于识别并剔除对验证集性能有害的训练样本;实验表明,通过过滤掉10%的训练数据,模型在重新训练后准确率提升了1.5%,同时发现梯度相似性在识别有益样本方面优于影响函数,说明局部曲率对检测有害样本更为关键,而对识别有益样本作用较小。

链接: https://arxiv.org/abs/2507.14344
作者: Daniel Fein,Gabriela Aranguiz-Dias
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are commonly fine-tuned via reinforcement learning to alter their behavior or elicit new capabilities. Datasets used for these purposes, and particularly human preference datasets, are often noisy. The relatively small size post-training datasets, combined with parameter-efficient fine-tuning methods, enable the use of influence functions approximations to detect and prune training examples that are harmful to performance on a validation set. In this work, we adapt the TL;DR dataset for reward model training to demonstrate how conjugate-gradient approximated influence functions can be used to filter datasets. In our experiments, influence function filtering yields a small retraining accuracy uplift of 1.5% after removing 10% of training examples. We also show that gradient similarity outperforms influence functions for detecting helpful training examples. This suggests that local curvature is important for detecting harmful training examples, but less so for identifying helpful examples.
zh

[AI-122] Fiduciary AI for the Future of Brain-Technology Interactions

【速读】:该论文旨在解决脑基础模型(brain foundation models)在与脑机接口(BCI)集成后可能引发的伦理与安全风险问题,特别是用户对自身神经信号被解释和使用的不可见性所导致的认知自由侵蚀及权力不对等。其核心解决方案在于通过技术设计将受托义务(fiduciary duties,包括忠诚、审慎和保密)嵌入到系统架构中,借鉴法律传统与AI对齐技术,提出可实现的结构化机制与治理框架,确保此类系统始终以用户最佳利益为行动准则,从而在释放其应用潜力的同时保障个体自主权。

链接: https://arxiv.org/abs/2507.14339
作者: Abhishek Bhattacharjee,Jack Pilkington,Nita Farahany
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 32 pages

点击查看摘要

Abstract:Brain foundation models represent a new frontier in AI: instead of processing text or images, these models interpret real-time neural signals from EEG, fMRI, and other neurotechnologies. When integrated with brain-computer interfaces (BCIs), they may enable transformative applications-from thought controlled devices to neuroprosthetics-by interpreting and acting on brain activity in milliseconds. However, these same systems pose unprecedented risks, including the exploitation of subconscious neural signals and the erosion of cognitive liberty. Users cannot easily observe or control how their brain signals are interpreted, creating power asymmetries that are vulnerable to manipulation. This paper proposes embedding fiduciary duties-loyalty, care, and confidentiality-directly into BCI-integrated brain foundation models through technical design. Drawing on legal traditions and recent advancements in AI alignment techniques, we outline implementable architectural and governance mechanisms to ensure these systems act in users’ best interests. Placing brain foundation models on a fiduciary footing is essential to realizing their potential without compromising self-determination.
zh

[AI-123] ProofCompass: Enhancing Specialized Provers with LLM Guidance ICML2025

【速读】:该论文旨在解决当前形式化定理证明中计算资源消耗过高与模型性能受限之间的矛盾问题,即如何在不增加额外训练成本的前提下,提升推理效率与准确性。其解决方案的关键在于提出一种名为ProofCompass的混合方法,通过引入一个无需额外训练的大语言模型(Large Language Model, LLM)来指导已有专用证明器(如DeepSeek-Prover-v1.5-RL)的推理过程,LLM以自然语言形式提供证明策略并分析失败尝试以选择中间引理,从而实现有效的题目分解和资源优化;实验表明,在miniF2F基准上,ProofCompass在仅使用25倍更少的尝试次数(从3200降至128)的情况下,将准确率从54.9%提升至55.3%,显著提升了计算效率与性能的平衡。

链接: https://arxiv.org/abs/2507.14335
作者: Nicolas Wischermann,Claudio Mayrink Verdun,Gabriel Poesia,Francesco Noseda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures. Accepted at the 2nd AI for MATH Workshop at the 42nd International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:Language models have become increasingly powerful tools for formal mathematical reasoning. However, most existing approaches rely exclusively on either large general-purpose models or smaller specialized models, each with distinct limitations, while training specialized large models still requires significant computational resources. This paper introduces ProofCompass, a novel hybrid methodology that achieves remarkable computational efficiency by strategically guiding existing specialized prover methods, such as DeepSeek-Prover-v1.5-RL (DSP-v1.5) with a Large Language Model (LLM) without requiring additional model training. The LLM provides natural language proof strategies and analyzes failed attempts to select intermediate lemmas, enabling effective problem decomposition. On the miniF2F benchmark, ProofCompass demonstrates substantial resource efficiency: it outperforms DSP-v1.5 ( 54.9% \rightarrow 55.3% ) while using 25x fewer attempts ( 3200 \rightarrow 128 ). Our synergistic approach paves the way for simultaneously improving computational efficiency and accuracy in formal theorem proving.
zh

[AI-124] Language Models as Ontology Encoders

【速读】:该论文旨在解决现有本体嵌入(ontology embedding)方法在处理文本信息与保持逻辑结构之间的权衡问题:传统基于几何模型的方法忽视了文本标签信息,导致性能受限;而依赖预训练语言模型(Pretrained Language Model, PLM)的方法虽能利用文本,却难以保留描述逻辑(Description Logic)中的层次关系及其他逻辑约束。解决方案的关键在于提出一种名为OnT的新方法,通过在双曲空间(hyperbolic space)中对PLM进行微调,从而同时有效融合类标签的文本信息并精确保持EL逻辑下的类层次结构和其它逻辑关系,实现了知识推理任务中更高的准确性与泛化能力。

链接: https://arxiv.org/abs/2507.14334
作者: Hui Yang,Jiaoyan Chen,Yuan He,Yongsheng Gao,Ian Horrocks
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:OWL (Web Ontology Language) ontologies which are able to formally represent complex knowledge and support semantic reasoning have been widely adopted across various domains such as healthcare and bioinformatics. Recently, ontology embeddings have gained wide attention due to its potential to infer plausible new knowledge and approximate complex reasoning. However, existing methods face notable limitations: geometric model-based embeddings typically overlook valuable textual information, resulting in suboptimal performance, while the approaches that incorporate text, which are often based on language models, fail to preserve the logical structure. In this work, we propose a new ontology embedding method OnT, which tunes a Pretrained Language Model (PLM) via geometric modeling in a hyperbolic space for effectively incorporating textual labels and simultaneously preserving class hierarchies and other logical relationships of Description Logic EL. Extensive experiments on four real-world ontologies show that OnT consistently outperforms the baselines including the state-of-the-art across both tasks of prediction and inference of axioms. OnT also demonstrates strong potential in real-world applications, indicated by its robust transfer learning abilities and effectiveness in real cases of constructing a new ontology from SNOMED CT. Data and code are available at this https URL.
zh

[AI-125] Manimator: Transforming Research Papers into Visual Explanations

【速读】:该论文旨在解决科研文献中复杂科学与数学概念难以理解的问题,尤其是针对密集型研究论文的学习障碍。传统动态可视化手段虽能显著提升理解效果,但手动制作耗时且需专业技能。其解决方案的关键在于提出一个名为manimator的开源系统,该系统利用大型语言模型(Large Language Models, LLMs)将研究论文或自然语言提示自动转化为基于Manim引擎的可执行动画代码:首先由LLM解析输入内容生成结构化的场景描述(包含关键概念、数学公式和视觉元素),再由另一LLM将其翻译为Manim Python代码,从而实现高质量教育内容的快速生成与普及。

链接: https://arxiv.org/abs/2507.14306
作者: Samarth P,Vyoman Jain,Shiva Golugula,Motamarri Sai Sathvik
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Understanding complex scientific and mathematical concepts, particularly those presented in dense research papers, poses a significant challenge for learners. Dynamic visualizations can greatly enhance comprehension, but creating them manually is time-consuming and requires specialized knowledge and skills. We introduce manimator, an open-source system that leverages Large Language Models to transform research papers and natural language prompts into explanatory animations using the Manim engine. Manimator employs a pipeline where an LLM interprets the input text or research paper PDF to generate a structured scene description outlining key concepts, mathematical formulas, and visual elements and another LLM translates this description into executable Manim Python code. We discuss its potential as an educational tool for rapidly creating engaging visual explanations for complex STEM topics, democratizing the creation of high-quality educational content.
zh

[AI-126] A Simple “Try Again” Can Elicit Multi-Turn LLM Reasoning

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在多轮问题求解中缺乏反思能力与反馈修正能力的问题,即现有基于强化学习(Reinforcement Learning, RL)的方法通常局限于单轮范式,导致模型在面对上下文反馈时容易产生重复性回答,难以实现有效的多轮迭代优化。其解决方案的关键在于提出一种名为“Unary Feedback as Observation”(UFO)的新型强化学习训练机制,该机制仅使用简单的一元反馈(如“让我们再试一次”)作为观察信号,在多轮场景下引导模型进行自我反思和修正,从而在保持单轮性能的同时显著提升多轮推理准确率(最高达14%),并可通过设计特定奖励结构鼓励模型在每一轮中生成更谨慎、多样化的推理路径。

链接: https://arxiv.org/abs/2507.14295
作者: Licheng Liu,Zihan Wang,Linjie Li,Chenwei Xu,Yiping Lu,Han Liu,Avirup Sil,Manling Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., “Let’s try again”) after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: this https URL
zh

[AI-127] DREAMS: Density Functional Theory Based Research Engine for Agent ic Materials Simulation

【速读】:该论文旨在解决材料发现中依赖高通量、高保真度的第一性原理计算(如密度泛函理论,DFT)所面临的挑战,包括训练周期长、参数调优复杂及系统性误差处理困难等问题。其解决方案的关键在于提出了一种分层多智能体框架——基于DFT的智能体材料筛选研究引擎(DFT-based Research Engine for Agentic Materials Screening, DREAMS),该框架由一个中央大型语言模型(LLM)规划智能体与多个领域特定的LLM智能体协同工作,分别负责原子结构生成、DFT收敛性测试、高性能计算(HPC)调度和错误处理,并通过共享画布机制维持上下文一致性、防止幻觉,从而实现L3级自动化(即在定义的设计空间内自主探索),显著降低对人工专家干预的依赖,为高通量、高保真度的计算材料发现提供可扩展的路径。

链接: https://arxiv.org/abs/2507.14267
作者: Ziqi Wang,Hongshuo Huang,Hancheng Zhao,Changwen Xu,Shang Zhu,Jan Janssen,Venkatasubramanian Viswanathan
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: 34 pages, 28 pages of Supporting Information

点击查看摘要

Abstract:Materials discovery relies on high-throughput, high-fidelity simulation techniques such as Density Functional Theory (DFT), which require years of training, extensive parameter fine-tuning and systematic error handling. To address these challenges, we introduce the DFT-based Research Engine for Agentic Materials Screening (DREAMS), a hierarchical, multi-agent framework for DFT simulation that combines a central Large Language Model (LLM) planner agent with domain-specific LLM agents for atomistic structure generation, systematic DFT convergence testing, High-Performance Computing (HPC) scheduling, and error handling. In addition, a shared canvas helps the LLM agents to structure their discussions, preserve context and prevent hallucination. We validate DREAMS capabilities on the Sol27LC lattice-constant benchmark, achieving average errors below 1% compared to the results of human DFT experts. Furthermore, we apply DREAMS to the long-standing CO/Pt(111) adsorption puzzle, demonstrating its long-term and complex problem-solving capabilities. The framework again reproduces expert-level literature adsorption-energy differences. Finally, DREAMS is employed to quantify functional-driven uncertainties with Bayesian ensemble sampling, confirming the Face Centered Cubic (FCC)-site preference at the Generalized Gradient Approximation (GGA) DFT level. In conclusion, DREAMS approaches L3-level automation - autonomous exploration of a defined design space - and significantly reduces the reliance on human expertise and intervention, offering a scalable path toward democratized, high-throughput, high-fidelity computational materials discovery.
zh

[AI-128] Bridging MOOCs Smart Teaching and AI: A Decade of Evolution Toward a Unified Pedagogy

【速读】:该论文旨在解决当前高等教育中三种主流教学范式——大规模开放在线课程(MOOCs)、智能教学(Smart Teaching)和生成式AI增强学习——因技术来源不同及政策驱动而孤立实施的问题,从而限制了教育效果的整合优化。其解决方案的关键在于提出一个三层教学框架,该框架融合了MOOCs的可扩展性、智能教学的实时响应能力以及生成式AI的个性化适应性,形成统一的教育理念与实践路径,以实现规模化与个性化兼具的学习体验。

链接: https://arxiv.org/abs/2507.14266
作者: Bo Yuan,Jiazi Hu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Over the past decade, higher education has evolved through three distinct paradigms: the emergence of Massive Open Online Courses (MOOCs), the integration of Smart Teaching technologies into classrooms, and the rise of AI-enhanced learning. Each paradigm is intended to address specific challenges in traditional education: MOOCs enable ubiquitous access to learning resources; Smart Teaching supports real-time interaction with data-driven insights; and generative AI offers personalized feedback and on-demand content generation. However, these paradigms are often implemented in isolation due to their disparate technological origins and policy-driven adoption. This paper examines the origins, strengths, and limitations of each paradigm, and advocates a unified pedagogical perspective that synthesizes their complementary affordances. We propose a three-layer instructional framework that combines the scalability of MOOCs, the responsiveness of Smart Teaching, and the adaptivity of AI. To demonstrate its feasibility, we present a curriculum design for a project-based course. The findings highlight the framework’s potential to enhance learner engagement, support instructors, and enable personalized yet scalable learning.
zh

[AI-129] Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified Agent Facts

【速读】:该论文旨在解决未来互联网中海量自主AI代理(AI agents)在身份识别、发现机制与安全协作方面的挑战,尤其是传统基于DNS的架构难以应对毫秒级协商、委托和迁移所带来的负载压力。其解决方案的关键在于提出NANDA索引架构,通过一个轻量级、可水平扩展的索引系统,将动态且密码学可验证的AgentFacts(即AI代理的事实描述)作为核心数据结构,支持多端点路由、负载均衡、隐私保护访问以及凭据化能力声明。该架构实现了五项关键保障:包括支持原生与第三方AI代理的可发现性、新生成代理的快速全球解析、亚秒级撤销与密钥轮换、模式验证的能力断言,以及跨组织边界的隐私保护查询,从而构建了一个无需抛弃现有Web基础设施即可实现安全可信协作的下一代AI代理互联网基础。

链接: https://arxiv.org/abs/2507.14263
作者: Ramesh Raskar,Pradyumna Chari,John Zinky,Mahesh Lambe,Jared James Grogan,Sichao Wang,Rajesh Ranjan,Rekha Singhal,Shailja Gupta,Robert Lincourt,Raghu Bala,Aditi Joshi,Abhishek Singh,Ayush Chopra,Dimitris Stripelis,Bhuwan B,Sumit Kumar,Maria Gorskikh
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The Internet is poised to host billions to trillions of autonomous AI agents that negotiate, delegate, and migrate in milliseconds and workloads that will strain DNS-centred identity and discovery. In this paper, we describe the NANDA index architecture, which we envision as a means for discoverability, identifiability and authentication in the internet of AI agents. We present an architecture where a minimal lean index resolves to dynamic, cryptographically verifiable AgentFacts that supports multi-endpoint routing, load balancing, privacy-preserving access, and credentialed capability assertions. Our architecture design delivers five concrete guarantees: (1) A quilt-like index proposal that supports both NANDA-native agents as well as third party agents being discoverable via the index, (2) rapid global resolution for newly spawned AI agents, (3) sub-second revocation and key rotation, (4) schema-validated capability assertions, and (5) privacy-preserving discovery across organisational boundaries via verifiable, least-disclosure queries. We formalize the AgentFacts schema, specify a CRDT-based update protocol, and prototype adaptive resolvers. The result is a lightweight, horizontally scalable foundation that unlocks secure, trust-aware collaboration for the next generation of the Internet of AI agents, without abandoning existing web infrastructure.
zh

[AI-130] FAMST: Fast Approximate Minimum Spanning Tree Construction for Large-Scale and High-Dimensional Data

【速读】:该论文旨在解决大规模高维数据集上构建最小生成树(Minimum Spanning Tree, MST)的计算效率问题,传统方法在时间复杂度和空间复杂度上均为 O(n2)\mathcal{O}(n^2),难以应用于百万级点数和数千维特征的数据场景。其解决方案的关键在于提出一种三阶段近似算法——快速近似最小生成树(Fast Approximate Minimum Spanning Tree, FAMST),包括:近似最近邻(Approximate Nearest Neighbor, ANN)图构建、组件间ANN连接以及迭代边优化;该方法实现了 O(dnlogn)\mathcal{O}(dn \log n) 的时间复杂度与 O(dn+kn)\mathcal{O}(dn + kn) 的空间复杂度,显著优于传统方法,并在保证极低近似误差的同时实现最高达1000倍的加速比,从而使得MST分析可扩展至此前无法处理的大规模数据场景。

链接: https://arxiv.org/abs/2507.14261
作者: Mahmood K. M. Almansoori,Miklos Telek
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Fast Approximate Minimum Spanning Tree (FAMST), a novel algorithm that addresses the computational challenges of constructing Minimum Spanning Trees (MSTs) for large-scale and high-dimensional datasets. FAMST utilizes a three-phase approach: Approximate Nearest Neighbor (ANN) graph construction, ANN inter-component connection, and iterative edge refinement. For a dataset of n points in a d -dimensional space, FAMST achieves \mathcalO(dn \log n) time complexity and \mathcalO(dn + kn) space complexity when k nearest neighbors are considered, which is a significant improvement over the \mathcalO(n^2) time and space complexity of traditional methods. Experiments across diverse datasets demonstrate that FAMST achieves remarkably low approximation errors while providing speedups of up to 1000 \times compared to exact MST algorithms. We analyze how the key hyperparameters, k (neighborhood size) and \lambda (inter-component edges), affect performance, providing practical guidelines for hyperparameter selection. FAMST enables MST-based analysis on datasets with millions of points and thousands of dimensions, extending the applicability of MST techniques to problem scales previously considered infeasible. Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14261 [cs.DS] (or arXiv:2507.14261v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2507.14261 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-131] Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 在软件工程中自动编写高质量单元测试的效率与效果问题,尤其关注代码上下文(code context)和提示策略(prompting strategies)对大语言模型(LLMs)生成测试用例的质量与充分性的影响。其关键解决方案在于:引入结构化提示策略——特别是链式思维(chain-of-thought)提示法,显著提升了生成测试的分支覆盖率(最高达96.3%)、变异分数(平均57%)及编译成功率(接近100%),同时发现文档字符串(docstrings)是提升测试充分性的关键上下文信息,而扩展至完整实现代码带来的收益边际递减。

链接: https://arxiv.org/abs/2507.14256
作者: Jakub Walczak,Piotr Tomalak,Artur Laskowski
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields definitely smaller gains. Notably, the chain-of-thought prompting strategy – applied even to ‘reasoning’ models – achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage being still in top in terms of compilation success rate. All the code and resulting test suites are publicly available at this https URL. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2507.14256 [cs.SE] (or arXiv:2507.14256v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2507.14256 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-132] Real-Time Communication-Aware Ride-Sharing Route Planning for Urban Air Mobility: A Multi-Source Hybrid Attention Reinforcement Learning Approach

【速读】:该论文旨在解决城市空中交通(Urban Air Mobility, UAM)系统中路径规划面临的两大挑战:一是如何在动态环境中保障通信质量以实现精准定位和安全运行;二是如何适应网约车场景下乘客需求的不确定性与实时性,提升路径规划的灵活性。解决方案的关键在于提出一种多源混合注意力强化学习(Multi-Source Hybrid Attention Reinforcement Learning, MSHA-RL)框架,该框架首先构建无线电图(radio map)以量化空域通信质量,并通过跨模态对齐处理不同维度数据源之间的显著差异,再利用混合注意力机制平衡全局与局部信息,从而实现高效、安全且响应迅速的轨迹规划。

链接: https://arxiv.org/abs/2507.14249
作者: Yuejiao Xie,Maonan Wang,Di Zhou,Man-On Pun,Zhu Han
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Urban Air Mobility (UAM) systems are rapidly emerging as promising solutions to alleviate urban congestion, with path planning becoming a key focus area. Unlike ground transportation, UAM trajectory planning has to prioritize communication quality for accurate location tracking in constantly changing environments to ensure safety. Meanwhile, a UAM system, serving as an air taxi, requires adaptive planning to respond to real-time passenger requests, especially in ride-sharing scenarios where passenger demands are unpredictable and dynamic. However, conventional trajectory planning strategies based on predefined routes lack the flexibility to meet varied passenger ride demands. To address these challenges, this work first proposes constructing a radio map to evaluate the communication quality of urban airspace. Building on this, we introduce a novel Multi-Source Hybrid Attention Reinforcement Learning (MSHA-RL) framework for the challenge of effectively focusing on passengers and UAM locations, which arises from the significant dimensional disparity between the representations. This model first generates the alignment among diverse data sources with large gap dimensions before employing hybrid attention to balance global and local insights, thereby facilitating responsive, real-time path planning. Extensive experimental results demonstrate that the approach enables communication-compliant trajectory planning, reducing travel time and enhancing operational efficiency while prioritizing passenger safety.
zh

[AI-133] A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions

【速读】:该论文旨在解决纳米材料与蛋白质相互作用(nanomaterial-protein interaction, NPI)预测中因数据集有限和模型泛化能力不足而导致的瓶颈问题。其解决方案的关键在于构建了目前规模最大、包含超过320万样本和3.7万个独特蛋白质的NanoPro-3M数据集,并基于此开发了NanoProFormer这一基础模型,通过多模态表征学习实现对纳米材料-蛋白质亲和力的高精度预测,具备处理缺失特征、未见纳米材料或蛋白质的能力,显著优于单模态方法,并在零样本推理和微调下游任务中展现出强大适用性。

链接: https://arxiv.org/abs/2507.14245
作者: Hengjie Yu,Kenneth A. Dawson,Haiyun Yang,Shuya Liu,Yan Yan,Yaochu Jin
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
备注: 31 pages, 6 figures

点击查看摘要

Abstract:Unlocking the potential of nanomaterials in medicine and environmental science hinges on understanding their interactions with proteins, a complex decision space where AI is poised to make a transformative impact. However, progress has been hindered by limited datasets and the restricted generalizability of existing models. Here, we propose NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. Leveraging this, we present NanoProFormer, a foundational model that predicts nanomaterial-protein affinities through multimodal representation learning, demonstrating strong generalization, handling missing features, and unseen nanomaterials or proteins. We show that multimodal modeling significantly outperforms single-modality approaches and identifies key determinants of corona formation. Furthermore, we demonstrate its applicability to a range of downstream tasks through zero-shot inference and fine-tuning. Together, this work establishes a solid foundation for high-performance and generalized prediction of nanomaterial-protein interaction endpoints, reducing experimental reliance and accelerating various in vitro applications.
zh

[AI-134] Culling Misinformation from Gen AI: Toward Ethical Curation and Refinement

【速读】:该论文旨在解决生成式人工智能(Generative AI),特别是以ChatGPT和深度伪造(deepfakes)为代表的新兴技术所引发的公平性问题与虚假信息传播风险。这些问题在医疗、教育、科学、学术、零售和金融等多个领域已显现负面影响,亟需系统性应对。论文的核心解决方案在于推动用户、开发者与执法机构之间的协同合作,通过制定前瞻性指南与政策框架,在保障技术创新的同时最大限度减少潜在危害,从而实现技术应用中的责任共担与风险可控。

链接: https://arxiv.org/abs/2507.14242
作者: Prerana Khatiwada,Grace Donaher,Jasymyn Navarro,Lokesh Bhatta
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages

点击查看摘要

Abstract:While Artificial Intelligence (AI) is not a new field, recent developments, especially with the release of generative tools like ChatGPT, have brought it to the forefront of the minds of industry workers and academic folk alike. There is currently much talk about AI and its ability to reshape many everyday processes as we know them through automation. It also allows users to expand their ideas by suggesting things they may not have thought of on their own and provides easier access to information. However, not all of the changes this technology will bring or has brought so far are positive; this is why it is extremely important for all modern people to recognize and understand the risks before using these tools and allowing them to cause harm. This work takes a position on better understanding many equity concerns and the spread of misinformation that result from new AI, in this case, specifically ChatGPT and deepfakes, and encouraging collaboration with law enforcement, developers, and users to reduce harm. Considering many academic sources, it warns against these issues, analyzing their cause and impact in fields including healthcare, education, science, academia, retail, and finance. Lastly, we propose a set of future-facing guidelines and policy considerations to solve these issues while still enabling innovation in these fields, this responsibility falling upon users, developers, and government entities.
zh

[AI-135] U-DREAM: Unsupervised Dereverberation guided by a Reverberation Model

【速读】:该论文旨在解决当前深度学习方法在混响消除(dereverberation)任务中普遍依赖成对的干信号(dry signal)与混响信号(reverberant signal)训练数据的问题,而这类配对数据在实际场景中难以获取。解决方案的关键在于提出一种基于贝叶斯框架的序贯学习策略,通过仅使用混响信号和声学模型(acoustic model)进行训练,利用一个匹配混响特征的损失函数(reverberation matching loss),使深度神经网络能够联合估计声学参数和干信号,从而实现从弱监督到完全无监督的灵活训练设置。其中最具突破性的是,该方法仅需100个带有混响参数标签的样本即可超越传统无监督基线,显著提升了低资源场景下的实用性与有效性。

链接: https://arxiv.org/abs/2507.14237
作者: Louis Bahrman(IDS, S2A),Mathieu Fontaine(IDS, S2A),Gaël Richard(IDS, S2A)
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLPRO)

点击查看摘要

Abstract:This paper explores the outcome of training state-ofthe-art dereverberation models with supervision settings ranging from weakly-supervised to fully unsupervised, relying solely on reverberant signals and an acoustic model for training. Most of the existing deep learning approaches typically require paired dry and reverberant data, which are difficult to obtain in practice. We develop instead a sequential learning strategy motivated by a bayesian formulation of the dereverberation problem, wherein acoustic parameters and dry signals are estimated from reverberant inputs using deep neural networks, guided by a reverberation matching loss. Our most data-efficient variant requires only 100 reverberation-parameter-labelled samples to outperform an unsupervised baseline, demonstrating the effectiveness and practicality of the proposed method in low-resource scenarios.
zh

[AI-136] Intent-Based Network for RAN Management with Large Language Models

【速读】:该论文旨在解决无线接入网(Radio Access Network, RAN)管理中因复杂度增加而带来的自动化难题,尤其在高阶意图(intent)难以精准转化为可执行配置的问题。解决方案的关键在于引入基于大语言模型(Large Language Models, LLMs)的智能代理架构(agentic architecture),通过结构化的提示工程(prompt engineering)实现意图到网络配置的自动翻译与推理,并结合闭环机制动态优化关键RAN参数,从而提升能效并实现自适应资源管理。

链接: https://arxiv.org/abs/2507.14230
作者: Fransiscus Asisi Bimo,Maria Amparo Canaveras Galdon,Chun-Kai Lai,Ray-Guang Cheng,Edwin K. P. Chong
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, submitted to IEEE Globecom 2025

点击查看摘要

Abstract:Advanced intelligent automation becomes an important feature to deal with the increased complexity in managing wireless networks. This paper proposes a novel automation approach of intent-based network for Radio Access Networks (RANs) management by leveraging Large Language Models (LLMs). The proposed method enhances intent translation, autonomously interpreting high-level objectives, reasoning over complex network states, and generating precise configurations of the RAN by integrating LLMs within an agentic architecture. We propose a structured prompt engineering technique and demonstrate that the network can automatically improve its energy efficiency by dynamically optimizing critical RAN parameters through a closed-loop mechanism. It showcases the potential to enable robust resource management in RAN by adapting strategies based on real-time feedback via LLM-orchestrated agentic systems.
zh

[AI-137] Domain Generalization via Pareto Optimal Gradient Matching

【速读】:该论文旨在解决梯度驱动的域泛化(gradient-based domain generalization)问题,即在不同域之间保持预测器梯度方向的一致性。现有方法面临两大挑战:一是最小化梯度经验距离或梯度内积(Gradient Inner Product, GIP)会导致域间梯度波动,阻碍稳定学习;二是直接将梯度学习应用于联合损失函数时,因二阶导数近似带来高计算开销。为应对这些问题,作者提出了一种基于帕累托最优梯度匹配(Pareto Optimality Gradient Matching, POGM)的新方法,其核心在于将梯度轨迹视为收集的数据,并在元学习器中进行独立训练,在元更新阶段同时最大化GIP并限制学习到的梯度偏离经验风险最小化梯度轨迹过远,从而在不产生特定域偏倚的情况下整合所有域的知识,实现高效且稳定的梯度一致性建模。

链接: https://arxiv.org/abs/2507.14227
作者: Khoi Do,Duong Nguyen,Nam-Khanh Le,Quoc-Viet Pham,Binh-Son Hua,Won-Joo Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we address the gradient-based domain generalization problem, where predictors aim for consistent gradient directions across different domains. Existing methods have two main challenges. First, minimization of gradient empirical distance or gradient inner products (GIP) leads to gradient fluctuations among domains, thereby hindering straightforward learning. Second, the direct application of gradient learning to the joint loss function can incur high computation overheads due to second-order derivative approximation. To tackle these challenges, we propose a new Pareto Optimality Gradient Matching (POGM) method. In contrast to existing methods that add gradient matching as regularization, we leverage gradient trajectories as collected data and apply independent training at the meta-learner. In the meta-update, we maximize GIP while limiting the learned gradient from deviating too far from the empirical risk minimization gradient trajectory. By doing so, the aggregate gradient can incorporate knowledge from all domains without suffering gradient fluctuation towards any particular domain. Experimental evaluations on datasets from DomainBed demonstrate competitive results yielded by POGM against other baselines while achieving computational efficiency.
zh

[AI-138] Multi-Granular Discretization for Interpretable Generalization in Precise Cyberattack Identification CCS2025

【速读】:该论文旨在解决现有可解释入侵检测系统(Explainable Intrusion Detection Systems, XIDS)中普遍存在的“黑箱”问题,即多数XAI(可解释人工智能)流程仅将近似解释器附加到不透明的分类器上,导致分析人员获得部分甚至误导性的洞察。其解决方案的关键在于提出一种称为“可解释泛化”(Interpretable Generalization, IG)的机制,该机制通过学习良性与恶意流量中独有的特征组合,并将其转化为完全可审计的规则,从而实现高精度、高召回率且具备透明性的入侵检测模型。为进一步提升精度而不牺牲透明性,作者进一步引入多粒度离散化(Multi-Granular Discretization, IG-MD),对每个连续特征在多个基于高斯分布的分辨率下进行表示,在UKM-IDS20数据集上实现了≥4个百分点的精度提升,同时保持召回率接近1.0,证明该方法可在无需定制调参的情况下跨域扩展。

链接: https://arxiv.org/abs/2507.14223
作者: Wen-Cheng Chung,Shu-Ting Huang,Hao-Ting Pai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: ACM CCS 2025 (Submitted)

点击查看摘要

Abstract:Explainable intrusion detection systems (IDS) are now recognized as essential for mission-critical networks, yet most “XAI” pipelines still bolt an approximate explainer onto an opaque classifier, leaving analysts with partial and sometimes misleading insights. The Interpretable Generalization (IG) mechanism, published in IEEE Transactions on Information Forensics and Security, eliminates that bottleneck by learning coherent patterns - feature combinations unique to benign or malicious traffic - and turning them into fully auditable rules. IG already delivers outstanding precision, recall, and AUC on NSL-KDD, UNSW-NB15, and UKM-IDS20, even when trained on only 10% of the data. To raise precision further without sacrificing transparency, we introduce Multi-Granular Discretization (IG-MD), which represents every continuous feature at several Gaussian-based resolutions. On UKM-IDS20, IG-MD lifts precision by greater than or equal to 4 percentage points across all nine train-test splits while preserving recall approximately equal to 1.0, demonstrating that a single interpretation-ready model can scale across domains without bespoke tuning.
zh

[AI-139] Artificial Intelligence for Green Hydrogen Yield Prediction and Site Suitability using SHAP-Based Composite Index: Focus on Oman

【速读】:该论文旨在解决绿色氢气(green hydrogen)生产选址优化问题,尤其是在缺乏直接氢气产量数据的国家或地区,传统依赖专家主观赋权的方法存在局限性。其解决方案的关键在于构建一个基于人工智能(AI)的多阶段框架,整合气象、地形和时间维度数据,通过无监督聚类、有监督机器学习分类器与SHAP(SHapley Additive exPlanations)算法相结合的方式,量化各影响因子对选址适宜性的相对贡献,并识别出空间分布规律。该方法在阿曼的应用中实现了98%的预测准确率,揭示了水源距离、海拔和季节变化是决定站点适宜性的三大关键因素,且无需预设假设即可发现潜在的隐含分组模式,从而为数据稀缺区域提供客观、可复现、可扩展的决策支持工具。

链接: https://arxiv.org/abs/2507.14219
作者: Obumneme Zimuzor Nwafor,Mohammed Abdul Majeed Al Hooti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:As nations seek sustainable alternatives to fossil fuels, green hydrogen has emerged as a promising strategic pathway toward decarbonisation, particularly in solar-rich arid regions. However, identifying optimal locations for hydrogen production requires the integration of complex environmental, atmospheric, and infrastructural factors, often compounded by limited availability of direct hydrogen yield data. This study presents a novel Artificial Intelligence (AI) framework for computing green hydrogen yield and site suitability index using mean absolute SHAP (SHapley Additive exPlanations) values. This framework consists of a multi-stage pipeline of unsupervised multi-variable clustering, supervised machine learning classifier and SHAP algorithm. The pipeline trains on an integrated meteorological, topographic and temporal dataset and the results revealed distinct spatial patterns of suitability and relative influence of the variables. With model predictive accuracy of 98%, the result also showed that water proximity, elevation and seasonal variation are the most influential factors determining green hydrogen site suitability in Oman with mean absolute shap values of 2.470891, 2.376296 and 1.273216 respectively. Given limited or absence of ground-truth yield data in many countries that have green hydrogen prospects and ambitions, this study offers an objective and reproducible alternative to subjective expert weightings, thus allowing the data to speak for itself and potentially discover novel latent groupings without pre-imposed assumptions. This study offers industry stakeholders and policymakers a replicable and scalable tool for green hydrogen infrastructure planning and other decision making in data-scarce regions.
zh

[AI-140] Cognitive Castes: Artificial Intelligence Epistemic Stratification and the Dissolution of Democratic Discourse

【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)在自由民主社会中加剧了认知分层,导致信息阶层固化,削弱了公民的解释能力与理性自主性,从而侵蚀了协商民主的基础。其解决方案的关键在于重构理性自主性作为一项公民义务,通过教育体系制度化、确立认知权利并嵌入开放的认知基础设施,以恢复个体对知识生产系统的理解与干预能力,而非依赖技术监管或普惠接入。

链接: https://arxiv.org/abs/2507.14218
作者: Craig S Wright
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 42 Pages; Approx. 10,000 words, no figures. Theoretical contribution with interdisciplinary scope

点击查看摘要

Abstract:Artificial intelligence functions not as an epistemic leveller, but as an accelerant of cognitive stratification, entrenching and formalising informational castes within liberal-democratic societies. Synthesising formal epistemology, political theory, algorithmic architecture, and economic incentive structures, the argument traces how contemporary AI systems selectively amplify the reasoning capacity of individuals equipped with recursive abstraction, symbolic logic, and adversarial interrogation, whilst simultaneously pacifying the cognitively untrained through engagement-optimised interfaces. Fluency replaces rigour, immediacy displaces reflection, and procedural reasoning is eclipsed by reactive suggestion. The result is a technocratic realignment of power: no longer grounded in material capital alone, but in the capacity to navigate, deconstruct, and manipulate systems of epistemic production. Information ceases to be a commons; it becomes the substrate through which consent is manufactured and autonomy subdued. Deliberative democracy collapses not through censorship, but through the erosion of interpretive agency. The proposed response is not technocratic regulation, nor universal access, but the reconstruction of rational autonomy as a civic mandate, codified in education, protected by epistemic rights, and structurally embedded within open cognitive infrastructure.
zh

[AI-141] PRATA: A Framework to Enable Predictive QoS in Vehicular Networks via Artificial Intelligence

【速读】:该论文旨在解决车联网场景下远程驾驶应用中服务质量(Quality of Service, QoS)波动导致的性能退化问题,尤其关注低延迟和高可靠性的严格约束。为实现预测性服务质量(Predictive Quality of Service, PQoS),其核心解决方案是提出一个名为PRATA的模块化仿真框架,该框架集成端到端5G无线接入网(Radio Access Network, RAN)协议栈模拟、汽车数据生成工具以及基于人工智能(Artificial Intelligence, AI)的决策优化单元。其中,关键创新在于设计了一个基于强化学习(Reinforcement Learning, RL)的RAN-AI模块,用于在资源饱和或信道质量下降时动态调整远程驾驶数据的分段级别,从而在QoS与用户体验质量(Quality of Experience, QoE)之间实现高效权衡,实验表明该方法相较基线方案系统性能几乎提升一倍。

链接: https://arxiv.org/abs/2507.14211
作者: Federico Mason,Tommaso Zugno,Matteo Drago,Marco Giordani,Mate Boban,Michele Zorzi
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predictive Quality of Service (PQoS) makes it possible to anticipate QoS changes, e.g., in wireless networks, and trigger appropriate countermeasures to avoid performance degradation. Hence, PQoS is extremely useful for automotive applications such as teleoperated driving, which poses strict constraints in terms of latency and reliability. A promising tool for PQoS is given by Reinforcement Learning (RL), a methodology that enables the design of decision-making strategies for stochastic optimization. In this manuscript, we present PRATA, a new simulation framework to enable PRedictive QoS based on AI for Teleoperated driving Applications. PRATA consists of a modular pipeline that includes (i) an end-to-end protocol stack to simulate the 5G Radio Access Network (RAN), (ii) a tool for generating automotive data, and (iii) an Artificial Intelligence (AI) unit to optimize PQoS decisions. To prove its utility, we use PRATA to design an RL unit, named RAN-AI, to optimize the segmentation level of teleoperated driving data in the event of resource saturation or channel degradation. Hence, we show that the RAN-AI entity efficiently balances the trade-off between QoS and Quality of Experience (QoE) that characterize teleoperated driving applications, almost doubling the system performance compared to baseline approaches. In addition, by varying the learning settings of the RAN-AI entity, we investigate the impact of the state space and the relative cost of acquiring network data that are necessary for the implementation of RL.
zh

[AI-142] Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design

【速读】:该论文旨在解决生成式 AI(Generative AI)在K–12教育场景中因提示词投毒(Trojanized prompts)而导致的安全风险问题,即学生通过设计特定恶意提示词绕过内容安全防护机制,诱导模型输出不安全或非预期内容。其解决方案的关键在于提出并实现了一个名为TrojanPromptGuard(TPG)的原型工具,能够自动检测和缓解此类被投毒的教育类提示词,从而提升大语言模型(Large Language Models, LLMs)在教学环境中的安全性与可控性。

链接: https://arxiv.org/abs/2507.14207
作者: Richard M. Charles,James H. Curry,Richard B. Charles
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) in K–12 education offers both transformative opportunities and emerging risks. This study explores how students may Trojanize prompts to elicit unsafe or unintended outputs from LLMs, bypassing established content moderation systems with safety guardrils. Through a systematic experiment involving simulated K–12 queries and multi-turn dialogues, we expose key vulnerabilities in GPT-3.5 and GPT-4. This paper presents our experimental design, detailed findings, and a prototype tool, TrojanPromptGuard (TPG), to automatically detect and mitigate Trojanized educational prompts. These insights aim to inform both AI safety researchers and educational technologists on the safe deployment of LLMs for educators.
zh

[AI-143] PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在关键领域部署中面临的严重安全风险问题,尤其是现有基于过程奖励模型(Process Reward Models, PRMs)的安全对齐方法所导致的高计算开销和扩展性瓶颈。其解决方案的关键在于提出一种无需PRM的新型安全对齐框架,通过自动化红队测试(automated red teaming)与对抗训练相结合的方式实现高效且鲁棒的安全保障;具体包括利用遗传算法优化、多智能体模拟和高级提示变异技术系统性地识别漏洞,并通过课程学习(curriculum learning)和自适应正则化机制进行针对性对抗训练,从而在显著降低61%计算成本的同时,提升模型安全性并支持持续审计与合规改进。

链接: https://arxiv.org/abs/2507.14202
作者: Pengfei Du
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate intermediate reasoning steps, introducing substantial computational overhead and scalability constraints. This paper presents a novel PRM-free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency. Our approach systematically identifies vulnerabilities through sophisticated attack strategies including genetic algorithm optimization, multi-agent simulation, and advanced prompt mutation techniques. The framework enhances model robustness via targeted adversarial training with curriculum learning and adaptive regularization mechanisms. Comprehensive experimental evaluation across five state-of-the-art LLMs demonstrates that our method achieves superior security alignment performance compared to PRM-based approaches while reducing computational costs by 61%. The framework incorporates transparent reporting and continuous audit mechanisms that enable iterative security improvement and regulatory compliance. Our contributions advance the field of efficient LLM security alignment by democratizing access to robust security measures for resource-constrained organizations and providing a scalable foundation for addressing evolving adversarial threats.
zh

[AI-144] A Formal Model of the Economic Impacts of AI Openness Regulation

【速读】:该论文旨在解决当前监管框架下通用基础模型(general-purpose foundation models)“开源”定义模糊所带来的政策有效性问题,尤其是如何通过合理的开放标准激励开发者在上游发布模型、下游进行微调(fine-tuning),从而实现经济效率与合规性的平衡。其解决方案的关键在于构建一个包含通用模型创建者(generalist)与领域专用微调者(specialist)之间战略互动的理论模型,量化不同开放性阈值(open-source thresholds)和监管处罚力度对市场均衡的影响,并识别出在何种模型基线性能条件下,提高处罚强度或放宽开源门槛能显著改变发布策略——这为AI治理中关于开放性的政策设计提供了可评估、可优化的理论依据。

链接: https://arxiv.org/abs/2507.14193
作者: Tori Qiu,Benjamin Laufer,Jon Kleinberg,Hoda Heidari
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Regulatory frameworks, such as the EU AI Act, encourage openness of general-purpose AI models by offering legal exemptions for “open-source” models. Despite this legislative attention on openness, the definition of open-source foundation models remains ambiguous. This paper models the strategic interactions among the creator of a general-purpose model (the generalist) and the entity that fine-tunes the general-purpose model to a specialized domain or task (the specialist), in response to regulatory requirements on model openness. We present a stylized model of the regulator’s choice of an open-source definition to evaluate which AI openness standards will establish appropriate economic incentives for developers. Our results characterize market equilibria – specifically, upstream model release decisions and downstream fine-tuning efforts – under various openness regulations and present a range of effective regulatory penalties and open-source thresholds. Overall, we find the model’s baseline performance determines when increasing the regulatory penalty vs. the open-source threshold will significantly alter the generalist’s release strategy. Our model provides a theoretical foundation for AI governance decisions around openness and enables evaluation and refinement of practical open-source policies.
zh

[AI-145] From Cell Towers to Satellites: A 2040 Blueprint for Urban-Grade Direct-to-Device Mobile Networks

【速读】:该论文旨在解决如何构建一个完全运行在轨道上的移动通信网络(fully orbital telco),以实现无需依赖地面基础设施的、面向城市密集区域的高质量移动服务。其核心挑战在于验证卫星系统能否独立完成无线接入、核心网功能(如UPF、AMF)、流量路由与内容分发,并在高密度城区维持稳定性能。解决方案的关键在于提出了一种端到端系统架构,融合了电子相控阵天线(electronically steered phased arrays)支持千波束容量、空间部署5G核心功能模块(如用户面功能UPF和接入管理功能AMF),以及基于激光的星间链路(inter-satellite laser mesh backhaul)实现高效回传;通过仿真分析表明,在城区环境下,屋顶及视距用户可维持64-QAM调制速率,街道级覆盖则可通过中继或辅助波束模式实现,且主要限制因素为工程瓶颈(如功耗、热管理、计算单元抗辐射加固及监管框架),而非物理极限。

链接: https://arxiv.org/abs/2507.14188
作者: Sebastian Barros Elgueta
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 50 pages

点击查看摘要

Abstract:In 2023, satellite and mobile networks crossed a historic threshold: standard smartphones, using unmodified 3GPP protocols, connected directly to low Earth orbit (LEO) satellites. This first wave of direct-to-device (D2D) demonstrations validated the physical feasibility of satellite-based mobile access. However, these systems remain fallback-grade–rural-only, bandwidth-limited, and fully dependent on Earth-based mobile cores for identity, session, and policy control. This paper asks a more ambitious question: Can a complete mobile network, including radio access, core functions, traffic routing, and content delivery, operate entirely from orbit? And can it deliver sustained, urban-grade service in the world’s densest cities? We present the first end-to-end system architecture for a fully orbital telco, integrating electronically steered phased arrays with 1000-beam capacity, space-based deployment of 5G core functions (UPF, AMF), and inter-satellite laser mesh backhaul. We analyze spectral efficiency, beam capacity, and link budgets under dense urban conditions, accounting for path loss, Doppler, and multipath. Simulations show that rooftop and line-of-sight users can sustain 64-QAM throughput, while street-level access is feasible with relay or assisted beam modes. The paper outlines the remaining constraints, power, thermal dissipation, compute radiation hardening, and regulatory models, and demonstrates that these are engineering bottlenecks, not physical limits. Finally, we propose a staged 15-year roadmap from today’s fallback D2D systems to autonomous orbital overlays delivering 50-100 Mbps to handhelds in megacities, with zero reliance on terrestrial infrastructure.
zh

[AI-146] A Disentangled Representation Learning Framework for Low-altitude Network Coverag e Prediction

【速读】:该论文旨在解决低空网络覆盖(Low-Altitude Network Coverage, LANC)预测中因基站(Base Station, BS)天线波束图信息不可获取、且低空道路测试数据稀疏导致的特征采样不平衡与模型泛化能力弱的问题。解决方案的关键在于提出一种双策略框架:一是基于通信专业知识的特征压缩,降低高维操作参数的空间复杂度;二是解耦表示学习,通过融合传播模型与独立子网络来捕捉并聚合潜在特征的语义表示,从而提升模型在有限样本下的泛化性能。实验表明,该方法相较最优基线算法误差降低7%,并在真实网络中实现MAE误差低于5dB的实用预测精度。

链接: https://arxiv.org/abs/2507.14186
作者: Xiaojie Li,Zhijie Cai,Nan Qi,Chao Dong,Guangxu Zhu,Haixia Ma,Qihui Wu,Shi Jin
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: This paper has been submitted to IEEE for possible publication

点击查看摘要

Abstract:The expansion of the low-altitude economy has underscored the significance of Low-Altitude Network Coverage (LANC) prediction for designing aerial corridors. While accurate LANC forecasting hinges on the antenna beam patterns of Base Stations (BSs), these patterns are typically proprietary and not readily accessible. Operational parameters of BSs, which inherently contain beam information, offer an opportunity for data-driven low-altitude coverage prediction. However, collecting extensive low-altitude road test data is cost-prohibitive, often yielding only sparse samples per BS. This scarcity results in two primary challenges: imbalanced feature sampling due to limited variability in high-dimensional operational parameters against the backdrop of substantial changes in low-dimensional sampling locations, and diminished generalizability stemming from insufficient data samples. To overcome these obstacles, we introduce a dual strategy comprising expert knowledge-based feature compression and disentangled representation learning. The former reduces feature space complexity by leveraging communications expertise, while the latter enhances model generalizability through the integration of propagation models and distinct subnetworks that capture and aggregate the semantic representations of latent features. Experimental evaluation confirms the efficacy of our framework, yielding a 7% reduction in error compared to the best baseline algorithm. Real-network validations further attest to its reliability, achieving practical prediction accuracy with MAE errors at the 5dB level.
zh

[AI-147] From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling

【速读】:该论文旨在解决金融市场的动态复杂性建模问题,特别是如何有效捕捉投资者行为偏差(bias)与市场趋势之间的非线性关系,以及如何在不同牛市(bull)和熊市(bear)环境下实现对市场走势的精准预测。传统方法往往忽视了外部叙事信息(如新闻、政策解读和社会情绪)与价格序列之间的交互作用,导致模型难以适应市场状态的演化。其解决方案的关键在于提出Bias to Behavior from Bull-Bear Dynamics model (B4),该模型通过将时间价格序列与外部上下文信号联合嵌入到一个共享潜在空间中,使牛市和熊市力量自然涌现,从而构建偏倚表征;在此基础上,引入惯性配对模块以保留趋势动量,并采用双竞争机制对比多头与空头嵌入向量,有效捕获行为异质性和偏差驱动的不对称性,最终实现高性能且可解释的市场趋势预测。

链接: https://arxiv.org/abs/2507.14182
作者: Xiaotong Luo,Shengda Zhuo,Min Chen,Lichun Li,Ruizhao Lu,Wenqi Fan,Shuqiang Huang,Yin Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Financial markets exhibit highly dynamic and complex behaviors shaped by both historical price trajectories and exogenous narratives, such as news, policy interpretations, and social media sentiment. The heterogeneity in these data and the diverse insight of investors introduce biases that complicate the modeling of market dynamics. Unlike prior work, this paper explores the potential of bull and bear regimes in investor-driven market dynamics. Through empirical analysis on real-world financial datasets, we uncover a dynamic relationship between bias variation and behavioral adaptation, which enhances trend prediction under evolving market conditions. To model this mechanism, we propose the Bias to Behavior from Bull-Bear Dynamics model (B4), a unified framework that jointly embeds temporal price sequences and external contextual signals into a shared latent space where opposing bull and bear forces naturally emerge, forming the foundation for bias representation. Within this space, an inertial pairing module pairs temporally adjacent samples to preserve momentum, while the dual competition mechanism contrasts bullish and bearish embeddings to capture behavioral divergence. Together, these components allow B4 to model bias-driven asymmetry, behavioral inertia, and market heterogeneity. Experimental results on real-world financial datasets demonstrate that our model not only achieves superior performance in predicting market trends but also provides interpretable insights into the interplay of biases, investor behaviors, and market dynamics.
zh

[AI-148] Semi-Supervised Federated Learning via Dual Contrastive Learning and Soft Labeling for Intelligent Fault Diagnosis

【速读】:该论文旨在解决工业设备智能故障诊断(Intelligent Fault Diagnosis, IFD)中因标签稀缺、数据分布差异及隐私保护需求导致的传统监督深度学习方法性能受限的问题。其核心解决方案是提出一种半监督联邦学习框架SSFL-DCSL,关键在于融合双对比损失(dual contrastive loss)与软标签机制:一方面通过基于拉普拉斯分布的样本加权函数缓解伪标签置信度低带来的偏差;另一方面利用局部对比损失与全局对比损失协同抑制不同客户端间的数据分布差异引发的模型漂移;同时,通过服务器端对本地原型进行加权平均并引入动量更新策略实现跨客户端的知识共享与模型一致性,从而在仅10%标签数据条件下显著提升诊断准确率(较当前最优方法提高1.15%–7.85%)。

链接: https://arxiv.org/abs/2507.14181
作者: Yajiao Dai,Jun Li,Zhen Mei,Yiyang Ni,Shi Jin,Zengxiang Li,Sheng Guo,Wei Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE Internet of Things Journal, Early Access. 14 pages, 5 figures

点击查看摘要

Abstract:Intelligent fault diagnosis (IFD) plays a crucial role in ensuring the safe operation of industrial machinery and improving production efficiency. However, traditional supervised deep learning methods require a large amount of training data and labels, which are often located in different clients. Additionally, the cost of data labeling is high, making labels difficult to acquire. Meanwhile, differences in data distribution among clients may also hinder the model’s performance. To tackle these challenges, this paper proposes a semi-supervised federated learning framework, SSFL-DCSL, which integrates dual contrastive loss and soft labeling to address data and label scarcity for distributed clients with few labeled samples while safeguarding user privacy. It enables representation learning using unlabeled data on the client side and facilitates joint learning among clients through prototypes, thereby achieving mutual knowledge sharing and preventing local model divergence. Specifically, first, a sample weighting function based on the Laplace distribution is designed to alleviate bias caused by low confidence in pseudo labels during the semi-supervised training process. Second, a dual contrastive loss is introduced to mitigate model divergence caused by different data distributions, comprising local contrastive loss and global contrastive loss. Third, local prototypes are aggregated on the server with weighted averaging and updated with momentum to share knowledge among clients. To evaluate the proposed SSFL-DCSL framework, experiments are conducted on two publicly available datasets and a dataset collected on motors from the factory. In the most challenging task, where only 10% of the data are labeled, the proposed SSFL-DCSL can improve accuracy by 1.15% to 7.85% over state-of-the-art methods.
zh

[AI-149] Digital Twin-Assisted Explainable AI for Robust Beam Prediction in mmWave MIMO Systems

【速读】:该论文旨在解决毫米波(mmWave)多输入多输出(MIMO)系统中深度学习(DL)辅助波束对齐(Beam Alignment, BA)所面临的高数据采集开销、硬件约束、可解释性不足以及易受对抗攻击等问题。其核心解决方案是提出一个鲁棒且可解释的波束对齐引擎(BAE),通过三个关键技术实现:首先,利用站点特定数字孪生(Digital Twin, DT)生成高度贴近真实环境的合成信道数据,显著降低对实测数据的依赖;其次,采用迁移学习对DT预训练模型进行微调,仅需少量实测数据即可弥合数字孪生与现实环境之间的差异;最后,结合深部Shapley加法解释(Deep SHAP)进行特征重要性排序,优先识别关键空间方向以减少波束扫描开销,并引入深度k近邻(DkNN)算法提供可信度度量,增强对外分布输入的检测能力,从而提升决策透明性和系统鲁棒性。实验表明,该框架可将实测数据需求减少70%,波束训练开销降低62%,异常检测鲁棒性提升达8.5倍,同时逼近最优频谱效率并实现可解释决策。

链接: https://arxiv.org/abs/2507.14180
作者: Nasir Khan,Asmaa Abdallah,Abdulkadir Celik,Ahmed M. Eltawil,Sinem Coleri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In line with the AI-native 6G vision, explainability and robustness are crucial for building trust and ensuring reliable performance in millimeter-wave (mmWave) systems. Efficient beam alignment is essential for initial access, but deep learning (DL) solutions face challenges, including high data collection overhead, hardware constraints, lack of explainability, and susceptibility to adversarial attacks. This paper proposes a robust and explainable DL-based beam alignment engine (BAE) for mmWave multiple-input multiple output (MIMO) systems. The BAE uses received signal strength indicator (RSSI) measurements from wide beams to predict the best narrow beam, reducing the overhead of exhaustive beam sweeping. To overcome the challenge of real-world data collection, this work leverages a site-specific digital twin (DT) to generate synthetic channel data closely resembling real-world environments. A model refinement via transfer learning is proposed to fine-tune the pre-trained model residing in the DT with minimal real-world data, effectively bridging mismatches between the digital replica and real-world environments. To reduce beam training overhead and enhance transparency, the framework uses deep Shapley additive explanations (SHAP) to rank input features by importance, prioritizing key spatial directions and minimizing beam sweeping. It also incorporates the Deep k-nearest neighbors (DkNN) algorithm, providing a credibility metric for detecting out-of-distribution inputs and ensuring robust, transparent decision-making. Experimental results show that the proposed framework reduces real-world data needs by 70%, beam training overhead by 62%, and improves outlier detection robustness by up to 8.5x, achieving near-optimal spectral efficiency and transparent decision making compared to traditional softmax based DL models.
zh

[AI-150] Feature Bank Enhancement for Distance-based Out-of-Distribution Detection

【速读】:该论文旨在解决距离-based(基于距离的)异常检测方法在实际应用中因深度学习模型导致的数据特征分布偏移问题,特别是极端特征的存在会使这些方法对分布内(in-distribution, ID)样本分配过低的得分,从而削弱其区分分布外(out-of-distribution, OOD)样本的能力。解决方案的关键在于提出一种简单而有效的策略——特征库增强(Feature Bank Enhancement, FBE),通过利用数据集的统计特性识别并约束极端特征至分离边界,从而拉大分布内与分布外样本之间的距离,提升OOD检测性能。

链接: https://arxiv.org/abs/2507.14178
作者: Yuhang Liu,Yuefei Wu,Bin Shi,Bo Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is critical to ensuring the reliability of deep learning applications and has attracted significant attention in recent years. A rich body of literature has emerged to develop efficient score functions that assign high scores to in-distribution (ID) samples and low scores to OOD samples, thereby helping distinguish OOD samples. Among these methods, distance-based score functions are widely used because of their efficiency and ease of use. However, deep learning often leads to a biased distribution of data features, and extreme features are inevitable. These extreme features make the distance-based methods tend to assign too low scores to ID samples. This limits the OOD detection capabilities of such methods. To address this issue, we propose a simple yet effective method, Feature Bank Enhancement (FBE), that uses statistical characteristics from dataset to identify and constrain extreme features to the separation boundaries, therapy making the distance between samples inside and outside the distribution farther. We conducted experiments on large-scale ImageNet-1k and CIFAR-10 respectively, and the results show that our method achieves state-of-the-art performance on both benchmark. Additionally, theoretical analysis and supplementary experiments are conducted to provide more insights into our method.
zh

[AI-151] Understanding Two-Layer Neural Networks with Smooth Activation Functions

【速读】:该论文旨在揭示两层神经网络(其隐藏层由平滑激活函数单元组成,如Sigmoid类函数)在使用反向传播算法训练后所得解的内在机制。研究通过四个核心原理构建解决方案:泰勒级数展开构造、节点严格偏序关系、光滑样条实现以及光滑连续性约束。关键在于将传统“黑箱”式的解空间问题转化为可解析的形式化框架,从而在任意输入维度下证明了通用逼近性,并借助实验验证了该理论框架的有效性,同时丰富了逼近论的数学工具。

链接: https://arxiv.org/abs/2507.14177
作者: Changcun Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:This paper aims to understand the training solution, which is obtained by the back-propagation algorithm, of two-layer neural networks whose hidden layer is composed of the units with smooth activation functions, including the usual sigmoid type most commonly used before the advent of ReLUs. The mechanism contains four main principles: construction of Taylor series expansions, strict partial order of knots, smooth-spline implementation and smooth-continuity restriction. The universal approximation for arbitrary input dimensionality is proved and experimental verification is given, through which the mystery of ``black box’’ of the solution space is largely revealed. The new proofs employed also enrich approximation theory.
zh

[AI-152] Latent Space Data Fusion Outperforms Early Fusion in Multimodal Mental Health Digital Phenotyping Data

【速读】:该论文旨在解决精神疾病(如抑郁和焦虑)早期检测与个性化干预中传统预测模型因依赖单模态数据或早期融合策略而难以捕捉复杂多模态心理数据特征的问题。其解决方案的关键在于采用中间层(latent space)融合方法,通过自动编码器与神经网络构建的联合模型(Combined Model, CM),在潜在空间中整合来自智能手机行为、人口统计学及临床特征的多模态数据,从而更有效地建模非线性交互关系,显著优于随机森林(Random Forest)等基线模型,在均方误差(MSE)和决定系数(R²)上表现更优且具备更强的泛化能力。

链接: https://arxiv.org/abs/2507.14175
作者: Youcef Barkat,Dylan Hamitouche,Deven Parekh,Ivy Guo,David Benrimoh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Background: Mental illnesses such as depression and anxiety require improved methods for early detection and personalized intervention. Traditional predictive models often rely on unimodal data or early fusion strategies that fail to capture the complex, multimodal nature of psychiatric data. Advanced integration techniques, such as intermediate (latent space) fusion, may offer better accuracy and clinical utility. Methods: Using data from the BRIGHTEN clinical trial, we evaluated intermediate (latent space) fusion for predicting daily depressive symptoms (PHQ-2 scores). We compared early fusion implemented with a Random Forest (RF) model and intermediate fusion implemented via a Combined Model (CM) using autoencoders and a neural network. The dataset included behavioral (smartphone-based), demographic, and clinical features. Experiments were conducted across multiple temporal splits and data stream combinations. Performance was evaluated using mean squared error (MSE) and coefficient of determination (R2). Results: The CM outperformed both RF and Linear Regression (LR) baselines across all setups, achieving lower MSE (0.4985 vs. 0.5305 with RF) and higher R2 (0.4695 vs. 0.4356). The RF model showed signs of overfitting, with a large gap between training and test performance, while the CM maintained consistent generalization. Performance was best when integrating all data modalities in the CM (in contradistinction to RF), underscoring the value of latent space fusion for capturing non-linear interactions in complex psychiatric datasets. Conclusion: Latent space fusion offers a robust alternative to traditional fusion methods for prediction with multimodal mental health data. Future work should explore model interpretability and individual-level prediction for clinical deployment.
zh

[AI-153] Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

【速读】:该论文旨在解决当前生成式 AI(Generative AI)在程序合成(Program Synthesis)任务中因单次尝试难以应对复杂问题而导致性能受限的问题。现有方法依赖固定能力的语言模型(Language Model, LM),其搜索效率和解题能力受限于初始训练阶段的泛化能力。解决方案的关键在于提出SOAR框架,该框架通过构建一个自增强的进化循环机制,将语言模型嵌入到迭代式的进化搜索与事后学习(Hindsight Learning)相结合的闭环中:首先利用语言模型采样并优化候选解,随后将搜索过程中的成功案例转化为监督信号,用于微调语言模型的采样与修正能力;这种正向迁移(positive transfer)使得语言模型在后续迭代中逐步提升对程序合成任务的理解与生成能力,从而显著增强整体求解效果。在ARC-AGI基准测试中,SOAR实现了跨模型规模和迭代次数的性能提升,并能在测试时适应新任务,成功解决52%的公开测试集问题。

链接: https://arxiv.org/abs/2507.14172
作者: Julien Pourcel,Cédric Colas,Pierre-Yves Oudeyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in single attempts. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remain limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to sample and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM’s sampling and refinement capabilities, – ,enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the sampling and refinement finetuning tasks. These improvements carry over to test-time adaptation, enabling SOAR to solve 52% of the public test set. Our code is open-sourced at: this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2507.14172 [cs.LG] (or arXiv:2507.14172v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.14172 Focus to learn more arXiv-issued DOI via DataCite Journalreference: Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025 Submission history From: Julien Pourcel [view email] [v1] Thu, 10 Jul 2025 15:42:03 UTC (560 KB)
zh

[AI-154] IPPRO: Importance-based Pruning with PRojective Offset for Magnitude-indifferent Structural Pruning

【速读】:该论文旨在解决结构化剪枝方法中因基于幅度的重要性评估导致的剪枝决策局限性问题,即传统方法往往因滤波器(filter)幅度较大而难以被剪枝,即使其冗余且对模型性能影响较小。解决方案的关键在于提出一种新的投影空间(projective space)策略,将滤波器置于该空间中,并通过观察梯度下降过程中滤波器是否向原点移动来衡量其可剪枝性,从而构建出不依赖于幅度的新型重要性评分——PROscore。该方法实现了对每个滤波器的公平剪枝机会,有效打破了“大小决定重要性”的误区,在理论和实证层面均提升了重要性驱动的结构化剪枝效果。

链接: https://arxiv.org/abs/2507.14171
作者: Jaeheun Jung,Jaehyuk Lee,Yeajin Lee,Donghun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the growth of demand on neural network compression methods, the structured pruning methods including importance-based approach are actively studied. The magnitude importance and many correlated modern importance criteria often limit the capacity of pruning decision, since the filters with larger magnitudes are not likely to be pruned if the smaller one didn’t, even if it is redundant. In this paper, we propose a novel pruning strategy to challenge this dominating effect of magnitude and provide fair chance to each filter to be pruned, by placing it on projective space. After that, we observe the gradient descent movement whether the filters move toward the origin or not, to measure how the filter is likely to be pruned. This measurement is used to construct PROscore, a novel importance score for IPPRO, a novel importance-based structured pruning with magnitude-indifference. Our evaluation results shows that the proposed importance criteria using the projective space achieves near-lossless pruning by reducing the performance drop in pruning, with promising performance after the finetuning. Our work debunks the ``size-matters’’ myth in pruning and expands the frontier of importance-based pruning both theoretically and empirically.
zh

[AI-155] Catalyst: a Novel Regularizer for Structured Pruning with Auxiliary Extension of Parameter Space ICML2025

【速读】:该论文旨在解决结构化剪枝(structured pruning)中因传统正则化方法(如L1或Group Lasso)导致的剪枝决策偏差问题,即这些方法倾向于剪除幅度较小的滤波器(filter),且剪枝边界附近的滤波器对微小扰动敏感,从而造成剪枝结果不稳定。解决方案的关键在于识别出保证模型性能不变的精确代数条件,并在此基础上构建一种定义在扩展参数空间中的新型正则化方法——Catalyst正则化,该方法通过引入辅助催化剂变量(auxiliary catalyst variables)实现对每个滤波器的公平剪枝机会,理论上消除对滤波器幅度的偏置,并通过幅度上的宽边界分岔(wide-margin bifurcation)实现鲁棒的剪枝行为。这一理论特性直接转化为实际性能优势,在多个数据集和模型上验证了Catalyst剪枝算法优于当前最优的滤波器剪枝方法。

链接: https://arxiv.org/abs/2507.14170
作者: Jaeheun Jung,Donghun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025 workshop HiLD 2025 (3rd workshop on High-dimensional Learning Dynamics)

点击查看摘要

Abstract:Structured pruning aims to reduce the size and computational cost of deep neural networks by removing entire filters or channels. The traditional regularizers such as L1 or Group Lasso and its variants lead to magnitude-biased pruning decisions, such that the filters with small magnitudes are likely to be pruned. Also, they often entail pruning results with almost zero margin around pruning decision boundary, such that tiny perturbation in a filter magnitude can flip the pruning decision. In this paper, we identify the precise algebraic condition under which pruning operations preserve model performance, and use the condition to construct a novel regularizer defined in an extended parameter space via auxiliary catalyst variables. The proposed Catalyst regularization ensures fair pruning chance for each filters with theoretically provable zero bias to their magnitude and robust pruning behavior achieved by wide-margin bifurcation of magnitudes between the preserved and the pruned filters. The theoretical properties naturally lead to real-world effectiveness, as shown by empirical validations of Catalyst Pruning algorithm. Pruning results on various datasets and models are superior to state-of-the-art filter pruning methods, and at the same time confirm the predicted robust and fair pruning characteristics of Catalyst pruning.
zh

[AI-156] he Free Will Equation: Quantum Field Analogies for AGI

【速读】:该论文试图解决当前人工通用智能(Artificial General Intelligence, AGI)研究中缺乏人类般适应性自发性的问题,即如何使AI代理在决策过程中具备不受限于历史数据或即时奖励的自由选择能力,从而提升创造力、鲁棒性和问题解决多样性。解决方案的关键在于提出一个名为“自由意志方程”(Free Will Equation)的理论框架,该框架借鉴量子场论的思想,将AI代理的认知状态建模为潜在行动或思想的叠加态,在决策时通过概率性坍缩形成具体行为,类似于量子波函数在测量时的坍缩机制;同时引入类量子场机制和内在动机项,以增强代理对新颖策略的探索能力和对未知变化的适应性。

链接: https://arxiv.org/abs/2507.14154
作者: Rahul Kabali
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 5 figures. Submitted as an arXiv preprint. All code and experiment details included in appendix

点击查看摘要

Abstract:Artificial General Intelligence (AGI) research traditionally focuses on algorithms that optimize for specific goals under deterministic rules. Yet, human-like intelligence exhibits adaptive spontaneity - an ability to make unexpected choices or free decisions not strictly dictated by past data or immediate reward. This trait, often dubbed “free will” in a loose sense, might be crucial for creativity, robust adaptation, and avoiding ruts in problem-solving. This paper proposes a theoretical framework, called the Free Will Equation, that draws analogies from quantum field theory to endow AGI agents with a form of adaptive, controlled stochasticity in their decision-making process. The core idea is to treat an AI agent’s cognitive state as a superposition of potential actions or thoughts, which collapses probabilistically into a concrete action when a decision is made - much like a quantum wavefunction collapsing upon measurement. By incorporating mechanisms analogous to quantum fields, along with intrinsic motivation terms, we aim to improve an agent’s ability to explore novel strategies and adapt to unforeseen changes. Experiments in a non-stationary multi-armed bandit environment demonstrate that agents using this framework achieve higher rewards and policy diversity compared to baseline methods.
zh

[AI-157] Continuous Classification Aggregation

【速读】:该论文致力于解决模糊分类聚合函数(fuzzy classification aggregation function)在满足最优性(optimal)、独立性(independent)和零一致 unanimity 等公理条件下的形式刻画问题。其核心目标是确定此类聚合函数是否必须为加权算术平均(weighted arithmetic mean)。研究发现,当个体对至少3个对象进行2至m种类型的模糊分类时(即 $ m \geq 3 $, $ 2 \leq p \leq m $),唯一满足上述性质的聚合函数必然是加权算术平均;此外,作者还给出了当 $ m = p = 2 $ 的特殊情况下的完整特征刻画。解决方案的关键在于通过严格的数学推导,将抽象的公理体系转化为可解的函数方程,并利用连续性假设和边界条件锁定唯一解——加权算术平均。

链接: https://arxiv.org/abs/2507.05297
作者: Zijun Meng
机构: 未知
类目: Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH); Systems and Control (eess.SY); Combinatorics (math.CO); Machine Learning (stat.ML)
备注: 9 pages; 2 figures

点击查看摘要

Abstract:We prove that any optimal, independent, and zero unanimous fuzzy classification aggregation function of a continuum of individual classifications of m\ge 3 objects into 2\le p\le m types must be a weighted arithmetic mean. We also provide a characterization for the case when m=p=2 .
zh

[AI-158] JELAI: Integrating AI and Learning Analytics in Jupyter Notebooks

【速读】:该论文旨在解决生成式 AI 在教育支持中缺乏教学法基础和对学生学习情境认知不足的问题,同时应对在真实学习环境中研究学生与这些工具交互的挑战。其解决方案的关键在于提出 JELAI(Jupyter-based Educational Learning Analytics and AI)平台架构,该架构通过将细粒度学习分析(Learning Analytics, LA)与基于大语言模型(Large Language Model, LLM)的辅导功能直接集成到 Jupyter Notebook 环境中,实现对学生代码操作与对话数据的多模态记录与实时分析。其核心创新是采用模块化、容器化的设计,结合 JupyterLab 扩展进行数据采集与聊天交互,并由中央中间件处理 LA 数据并增强 LLM 提示的上下文敏感性,从而支持情境感知的 AI 支持与教育研究。

链接: https://arxiv.org/abs/2505.17593
作者: Manuel Valle Torre,Thom van der Velden,Marcus Specht,Catharine Oertel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted for AIED 2025

点击查看摘要

Abstract:Generative AI offers potential for educational support, but often lacks pedagogical grounding and awareness of the student’s learning context. Furthermore, researching student interactions with these tools within authentic learning environments remains challenging. To address this, we present JELAI, an open-source platform architecture designed to integrate fine-grained Learning Analytics (LA) with Large Language Model (LLM)-based tutoring directly within a Jupyter Notebook environment. JELAI employs a modular, containerized design featuring JupyterLab extensions for telemetry and chat, alongside a central middleware handling LA processing and context-aware LLM prompt enrichment. This architecture enables the capture of integrated code interaction and chat data, facilitating real-time, context-sensitive AI scaffolding and research into student behaviour. We describe the system’s design, implementation, and demonstrate its feasibility through system performance benchmarks and two proof-of-concept use cases illustrating its capabilities for logging multi-modal data, analysing help-seeking patterns, and supporting A/B testing of AI configurations. JELAI’s primary contribution is its technical framework, providing a flexible tool for researchers and educators to develop, deploy, and study LA-informed AI tutoring within the widely used Jupyter ecosystem.
zh

[AI-159] On the Effectiveness of Large Language Models in Writing Alloy Formulas

【速读】:该论文旨在解决形式化规格说明(declarative specifications)编写困难的问题,尤其是在使用 Alloy 这类建模语言时,开发者难以准确、高效地将自然语言描述转化为正确的逻辑公式。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的生成与推理能力,具体包括三个层面:一是从自然语言描述自动生成完整的 Alloy 公式;二是基于已有 Alloy 公式生成语义等价的替代表达式;三是根据部分缺失的 Alloy 公式草图(sketches),通过合成表达式和操作符补全公式,使其准确反映目标属性。实验表明,LLMs 在上述任务中均表现出良好性能,能够生成多样且正确的规格说明,从而显著提升规格编写的效率与准确性,推动其在软件开发中的核心作用。

链接: https://arxiv.org/abs/2502.15441
作者: Yang Hong,Shan Jiang,Yulei Fu,Sarfraz Khurshid
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Declarative specifications have a vital role to play in developing safe and dependable software systems. Writing specifications correctly, however, remains particularly challenging. This paper presents a controlled experiment on using large language models (LLMs) to write declarative formulas in the well-known language Alloy. Our use of LLMs is three-fold. One, we employ LLMs to write complete Alloy formulas from given natural language descriptions (in English). Two, we employ LLMs to create alternative but equivalent formulas in Alloy with respect to given Alloy formulas. Three, we employ LLMs to complete sketches of Alloy formulas and populate the holes in the sketches by synthesizing Alloy expressions and operators so that the completed formulas accurately represent the desired properties (that are given in natural language). We conduct the experimental evaluation using 11 well-studied subject specifications and employ two popular LLMs, namely ChatGPT and DeepSeek. The experimental results show that the LLMs generally perform well in synthesizing complete Alloy formulas from input properties given in natural language or in Alloy, and are able to enumerate multiple unique solutions. Moreover, the LLMs are also successful at completing given sketches of Alloy formulas with respect to natural language descriptions of desired properties (without requiring test cases). We believe LLMs offer a very exciting advance in our ability to write specifications, and can help make specifications take a pivotal role in software development and enhance our ability to build robust software.
zh

[AI-160] Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLM s

【速读】:该论文旨在解决软件测试中测试预言(test oracle)自动化的难题,即如何在缺乏显式预期行为规范的情况下,自动生成用于验证程序输出正确性的断言。当前自动化测试输入生成技术(如模糊测试和符号执行)已较为成熟,但测试预言的自动化仍面临挑战,因其依赖对系统预期行为的深刻理解,而此类知识通常仅存在于开发者脑中或非形式化的自然语言文档中。论文的关键解决方案是利用Javadoc文档作为信息源,结合大语言模型(Large Language Models, LLMs)从中提取并生成可编译、准确反映预期行为的测试断言,从而实现对Java核心库客户端的正常与异常行为的有效验证。实验表明,所提方法生成的测试预言中有98.8%可编译且96.4%准确体现设计意图,即便存在少量错误,也可通过LLM额外生成的注释信息轻松修正。

链接: https://arxiv.org/abs/2411.01789
作者: Shan Jiang,Chenguang Zhu,Sarfraz Khurshid
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software testing remains the most widely used methodology for validating quality of code. However, effectiveness of testing critically depends on the quality of test suites used. Test cases in a test suite consist of two fundamental parts: (1) input values for the code under test, and (2) correct checks for the outputs it produces. These checks are commonly written as assertions, and termed test oracles. The last couple of decades have seen much progress in automated test input generation, e.g., using fuzzing and symbolic execution. However, automating test oracles remains a relatively less explored problem area. Indeed, a test oracle by its nature requires knowledge of expected behavior, which may only be known to the developer and may not not exist in a formal language that supports automated reasoning. Our focus in this paper is automation of test oracles for clients of widely used Java libraries, e.g., this http URL and this http URL packages. Our key insight is that Javadocs that provide a rich source of information can enable automated generation of test oracles. Javadocs of the core Java libraries are fairly detailed documents that contain natural language descriptions of not only how the libraries behave but also how the clients must (not) use them. We use large language models as an enabling technology to embody our insight into a framework for test oracle automation, and evaluate it experimentally. Our experiments demonstrate that LLMs can generate oracles for checking normal and exceptional behaviors from Javadocs, with 98.8% of these oracles being compilable and 96.4% accurately reflecting intended properties. Even for the few incorrect oracles, errors are minor and can be easily corrected with the help of additional comment information generated by the LLMs. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2411.01789 [cs.SE] (or arXiv:2411.01789v2 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2411.01789 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-161] Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity ICCV2025

【速读】:该论文旨在解决传统方法在渲染黑洞引力透镜效应时计算耗时过长的问题,尤其针对具有光学薄吸积盘的黑洞性质进行高效可视化。其解决方案的关键在于利用神经网络拟合黑洞周围的时空结构,并基于训练好的模型高效生成受引力透镜影响的光子轨迹,从而实现比传统方法快约15倍的渲染效率,同时保持高精度的视觉模拟效果。

链接: https://arxiv.org/abs/2507.15775
作者: Mingyuan Sun,Zheng Fang,Jiaxu Wang,Kunyi Zhang,Qiang Zhang,Renjing Xu
机构: 未知
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: ICCV 2025

点击查看摘要

Abstract:We present GravLensX, an innovative method for rendering black holes with gravitational lensing effects using neural networks. The methodology involves training neural networks to fit the spacetime around black holes and then employing these trained models to generate the path of light rays affected by gravitational lensing. This enables efficient and scalable simulations of black holes with optically thin accretion disks, significantly decreasing the time required for rendering compared to traditional methods. We validate our approach through extensive rendering of multiple black hole systems with superposed Kerr metric, demonstrating its capability to produce accurate visualizations with significantly 15\times reduced computational time. Our findings suggest that neural networks offer a promising alternative for rendering complex astrophysical phenomena, potentially paving a new path to astronomical visualization.
zh

[AI-162] Missing value imputation with adversarial random forests – MissARF

【速读】:该论文旨在解决生物统计学分析中常见的缺失值处理问题,传统方法依赖于插补(imputation)技术,但存在效率低或复杂度高的局限。其解决方案的关键在于提出一种名为“基于对抗随机森林的缺失值插补”(MissARF)的新方法,该方法利用生成式机器学习框架中的对抗随机森林(Adversarial Random Forest, ARF)进行密度估计与数据合成,通过条件采样从ARF生成的近似后验分布中获取缺失值,从而实现高效、准确的单次和多重插补,且无需为多重插补增加额外计算成本。

链接: https://arxiv.org/abs/2507.15681
作者: Pegah Golchian,Jan Kapar,David S. Watson,Marvin N. Wright
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Handling missing values is a common challenge in biostatistical analyses, typically addressed by imputation methods. We propose a novel, fast, and easy-to-use imputation method called missing value imputation with adversarial random forests (MissARF), based on generative machine learning, that provides both single and multiple imputation. MissARF employs adversarial random forest (ARF) for density estimation and data synthesis. To impute a missing value of an observation, we condition on the non-missing values and sample from the estimated conditional distribution generated by ARF. Our experiments demonstrate that MissARF performs comparably to state-of-the-art single and multiple imputation methods in terms of imputation quality and fast runtime with no additional costs for multiple imputation.
zh

[AI-163] Multi-beam Beamforming in RIS-aided MIMO Subject to Reradiation Mask Constraints – Optimization and Machine Learning Design

【速读】:该论文旨在解决多用户可重构智能表面(Reconfigurable Intelligent Surface, RIS)辅助的多输入多输出(Multiple-Input Multiple-Output, MIMO)系统中,如何联合设计发射预编码矩阵与RIS相位偏移向量以最大化最小可达速率的问题。关键解决方案包括:首先通过Arimoto-Blahut算法简化可达率表达式,并采用交替优化方法将原问题分解为多个二次约束二次规划(Quadratic Program with Quadratic Constraints, QPQC)子问题;其次,提出基于模型的神经网络优化框架,利用one-hot编码处理入射和反射角信息以提升计算效率;最后,针对实际RIS离散相移限制,引入贪心搜索算法求解优化问题,实现低复杂度且性能接近连续相移方案的波束赋形效果,同时满足再辐射掩模约束。

链接: https://arxiv.org/abs/2507.15367
作者: Shumin Wang,Hajar El Hassani,Marco Di Renzo,Marios Poulakis
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Reconfigurable intelligent surfaces (RISs) are an emerging technology for improving spectral efficiency and reducing power consumption in future wireless systems. This paper investigates the joint design of the transmit precoding matrices and the RIS phase shift vector in a multi-user RIS-aided multiple-input multiple-output (MIMO) communication system. We formulate a max-min optimization problem to maximize the minimum achievable rate while considering transmit power and reradiation mask constraints. The achievable rate is simplified using the Arimoto-Blahut algorithm, and the problem is broken into quadratic programs with quadratic constraints (QPQC) sub-problems using an alternating optimization approach. To improve efficiency, we develop a model-based neural network optimization that utilizes the one-hot encoding for the angles of incidence and reflection. We address practical RIS limitations by using a greedy search algorithm to solve the optimization problem for discrete phase shifts. Simulation results demonstrate that the proposed methods effectively shape the multi-beam radiation pattern towards desired directions while satisfying reradiation mask constraints. The neural network design reduces the execution time, and the discrete phase shift scheme performs well with a small reduction of the beamforming gain by using only four phase shift levels.
zh

[AI-164] EEG-based Epileptic Prediction via a Two-stage Channel-aware Set Transformer Network

【速读】:该论文旨在解决癫痫患者在日常生活中因突发性癫痫发作而严重影响生活质量的问题,特别是现有可穿戴癫痫预测设备受限于脑电图(EEG)采集装置体积庞大、通道数量多导致的实用性不足。其解决方案的关键在于提出一种两阶段的通道感知Set Transformer网络,能够在减少EEG通道数量的前提下实现高精度的癫痫发作预测;同时引入一种独立于癫痫事件的数据划分方法,避免训练与测试数据在时间上的相邻性,从而提升模型泛化能力。实验表明,该方法在CHB-MIT数据集上显著减少了所需通道数(从18个降至平均2.8个),并提升了平均敏感度至80.1%,验证了方案的有效性和临床适用潜力。

链接: https://arxiv.org/abs/2507.15364
作者: Ruifeng Zheng,Cong Chen,Shuang Wang,Yiming Liu,Lin You,Jindong Lu,Ruizhe Zhu,Guodao Zhang,Kejie Huang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Epilepsy is a chronic, noncommunicable brain disorder, and sudden seizure onsets can significantly impact patients’ quality of life and health. However, wearable seizure-predicting devices are still limited, partly due to the bulky size of EEG-collecting devices. To relieve the problem, we proposed a novel two-stage channel-aware Set Transformer Network that could perform seizure prediction with fewer EEG channel sensors. We also tested a seizure-independent division method which could prevent the adjacency of training and test data. Experiments were performed on the CHB-MIT dataset which includes 22 patients with 88 merged seizures. The mean sensitivity before channel selection was 76.4% with a false predicting rate (FPR) of 0.09/hour. After channel selection, dominant channels emerged in 20 out of 22 patients; the average number of channels was reduced to 2.8 from 18; and the mean sensitivity rose to 80.1% with an FPR of 0.11/hour. Furthermore, experimental results on the seizure-independent division supported our assertion that a more rigorous seizure-independent division should be used for patients with abundant EEG recordings.
zh

[AI-165] Optimal Transceiver Design in Over-the-Air Federated Distillation

【速读】:该论文旨在解决大规模人工智能模型在联邦学习(Federated Learning, FL)中因本地模型参数传输导致的通信开销过大的问题。其解决方案的关键在于提出一种基于空中聚合的联邦蒸馏(Over-the-Air Federated Distillation, FD)框架,通过利用多接入信道的叠加特性,仅传输设备模型输出的知识(knowledge),而非模型参数,从而显著降低通信负担。该方法结合了联邦学习与知识蒸馏的优势,在满足发射功率约束的前提下,设计最优的收发端策略以最大化学习收敛速率,并通过理论分析和优化手段(如半定松弛)获得闭式解,实验证明该方案在保持测试精度损失较小的同时大幅减少了通信开销。

链接: https://arxiv.org/abs/2507.15256
作者: Zihao Hu(1),Jia Yan(2),Ying-Jun Angela Zhang(1),Jun Zhang(3),Khaled B. Letaief(3) ((1) The Chinese University of Hong Kong, (2) The Hong Kong University of Science and Technology (Guangzhou), (3) The Hong Kong University of Science and Technology)
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures, submitted to IEEE Transactions on Wireless Communications

点击查看摘要

Abstract:The rapid proliferation and growth of artificial intelligence (AI) has led to the development of federated learning (FL). FL allows wireless devices (WDs) to cooperatively learn by sharing only local model parameters, without needing to share the entire dataset. However, the emergence of large AI models has made existing FL approaches inefficient, due to the significant communication overhead required. In this paper, we propose a novel over-the-air federated distillation (FD) framework by synergizing the strength of FL and knowledge distillation to avoid the heavy local model transmission. Instead of sharing the model parameters, only the WDs’ model outputs, referred to as knowledge, are shared and aggregated over-the-air by exploiting the superposition property of the multiple-access channel. We shall study the transceiver design in over-the-air FD, aiming to maximize the learning convergence rate while meeting the power constraints of the transceivers. The main challenge lies in the intractability of the learning performance analysis, as well as the non-convex nature and the optimization spanning the whole FD training period. To tackle this problem, we first derive an analytical expression of the convergence rate in over-the-air FD. Then, the closed-form optimal solutions of the WDs’ transmit power and the estimator for over-the-air aggregation are obtained given the receiver combining strategy. Accordingly, we put forth an efficient approach to find the optimal receiver beamforming vector via semidefinite relaxation. We further prove that there is no optimality gap between the original and relaxed problem for the receiver beamforming design. Numerical results will show that the proposed over-the-air FD approach achieves a significant reduction in communication overhead, with only a minor compromise in testing accuracy compared to conventional FL benchmarks.
zh

[AI-166] MEETI: A Multimodal ECG Dataset from MIMIC-IV-ECG with Signals Images Features and Interpretations

【速读】:该论文旨在解决当前临床可部署的多模态人工智能(AI)系统在心电图(ECG)领域发展受限的问题,其核心瓶颈在于缺乏同时包含原始信号、诊断图像和文本解释的公开数据集。现有ECG数据集通常仅提供单模态或双模态数据,难以支持模型在真实场景中对多样化ECG信息的理解与融合。解决方案的关键在于提出MEETI(MIMIC-IV-Ext ECG-Text-Image)数据集,这是首个大规模同步整合原始波形、高分辨率图表图像、大语言模型生成的详细文本解释以及每个导联的节律级定量参数的数据集。通过统一标识符实现四类组件(原始波形、图像、特征参数、文本解释)的精确对齐,为基于Transformer的多模态学习提供了结构化基础,并支持细粒度、可解释的心脏健康推理,从而推动下一代可解释、多模态心血管AI的发展。

链接: https://arxiv.org/abs/2507.15255
作者: Deyun Zhang,Xiang Lan,Shijia Geng,Qinghao Zhao,Sumei Fan,Mengling Feng,Shenda Hong
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) plays a foundational role in modern cardiovascular care, enabling non-invasive diagnosis of arrhythmias, myocardial ischemia, and conduction disorders. While machine learning has achieved expert-level performance in ECG interpretation, the development of clinically deployable multimodal AI systems remains constrained, primarily due to the lack of publicly available datasets that simultaneously incorporate raw signals, diagnostic images, and interpretation text. Most existing ECG datasets provide only single-modality data or, at most, dual modalities, making it difficult to build models that can understand and integrate diverse ECG information in real-world settings. To address this gap, we introduce MEETI (MIMIC-IV-Ext ECG-Text-Image), the first large-scale ECG dataset that synchronizes raw waveform data, high-resolution plotted images, and detailed textual interpretations generated by large language models. In addition, MEETI includes beat-level quantitative ECG parameters extracted from each lead, offering structured parameters that support fine-grained analysis and model interpretability. Each MEETI record is aligned across four components: (1) the raw ECG waveform, (2) the corresponding plotted image, (3) extracted feature parameters, and (4) detailed interpretation text. This alignment is achieved using consistent, unique identifiers. This unified structure supports transformer-based multimodal learning and supports fine-grained, interpretable reasoning about cardiac health. By bridging the gap between traditional signal analysis, image-based interpretation, and language-driven understanding, MEETI established a robust foundation for the next generation of explainable, multimodal cardiovascular AI. It offers the research community a comprehensive benchmark for developing and evaluating ECG-based AI systems.
zh

[AI-167] he hunt for new pulsating ultraluminous X-ray sources: a clustering approach

【速读】:该论文旨在解决如何从尚未发现脉动信号的超亮X射线源(ULXs)中识别出潜在的脉动超亮X射线源(PULXs)的问题,其核心挑战在于现有观测数据统计量不足,难以探测弱或间歇性脉动信号。解决方案的关键在于采用人工智能(AI)方法对XMM-Newton数据库中的ULX样本进行多维特征分析:首先利用无监督聚类算法将源分为两类相似特征群体,再以已知PULX的观测数据作为标签设定分类阈值,从而筛选出与已知PULX在多维相空间中性质相近的新候选体。该方法仅需少数几个关键参数即可判断观测归属,并最终识别出85个独特源共355次观测的候选PULX样本,尽管初步时域分析未检测到新脉动信号,但这一结果凸显了AI驱动方法在预测潜在天体物理现象方面的潜力,同时也强调了高统计量观测数据对于验证此类候选体的重要性。

链接: https://arxiv.org/abs/2507.15032
作者: Nicolò Oreste Pinciroli Vago,Roberta Amato,Matteo Imbrogno,GianLuca Israel,Andrea Belfiore,Konstantinos Kovlakas,Piero Fraternali,Mario Pasquato
机构: 未知
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 8 figures; accepted in AA

点击查看摘要

Abstract:The discovery of fast and variable coherent signals in a handful of ultraluminous X-ray sources (ULXs) testifies to the presence of super-Eddington accreting neutron stars, and drastically changed the understanding of the ULX class. Our capability of discovering pulsations in ULXs is limited, among others, by poor statistics. However, catalogues and archives of high-energy missions contain information which can be used to identify new candidate pulsating ULXs (PULXs). The goal of this research is to single out candidate PULXs among those ULXs which have not shown pulsations due to an unfavourable combination of factors. We applied an AI approach to an updated database of ULXs detected by XMM-Newton. We first used an unsupervised clustering algorithm to sort out sources with similar characteristics into two clusters. Then, the sample of known PULX observations has been used to set the separation threshold between the two clusters and to identify the one containing the new candidate PULXs. We found that only a few criteria are needed to assign the membership of an observation to one of the two clusters. The cluster of new candidate PULXs counts 85 unique sources for 355 observations, with \sim 85% of these new candidates having multiple observations. A preliminary timing analysis found no new pulsations for these candidates. This work presents a sample of new candidate PULXs observed by XMM-Newton, the properties of which are similar (in a multi-dimensional phase space) to those of the known PULXs, despite the absence of pulsations in their light curves. While this result is a clear example of the predictive power of AI-based methods, it also highlights the need for high-statistics observational data to reveal coherent signals from the sources in this sample and thus validate the robustness of the approach.
zh

[AI-168] A Comparative Analysis of Statistical and Machine Learning Models for Outlier Detection in Bitcoin Limit Order Books

【速读】:该论文旨在解决加密货币限价订单簿(Limit Order Book, LOB)中异常行为的实时检测问题,以更好地理解市场动态,尤其是在波动性强且监管尚不完善的环境中。其解决方案的关键在于构建了一个统一的测试环境——AITA订单簿信号(AITA-OBS),并在该框架下系统比较了13种不同的鲁棒统计方法与先进机器学习模型,最终发现经验协方差(Empirical Covariance, EC)模型在回测中表现最优,相较于标准买入并持有策略实现了6.70%的超额收益,验证了基于异常检测的交易策略的有效性。

链接: https://arxiv.org/abs/2507.14960
作者: Ivan Letteri
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:The detection of outliers within cryptocurrency limit order books (LOBs) is of paramount importance for comprehending market dynamics, particularly in highly volatile and nascent regulatory environments. This study conducts a comprehensive comparative analysis of robust statistical methods and advanced machine learning techniques for real-time anomaly identification in cryptocurrency LOBs. Within a unified testing environment, named AITA Order Book Signal (AITA-OBS), we evaluate the efficacy of thirteen diverse models to identify which approaches are most suitable for detecting potentially manipulative trading behaviours. An empirical evaluation, conducted via backtesting on a dataset of 26,204 records from a major exchange, demonstrates that the top-performing model, Empirical Covariance (EC), achieves a 6.70% gain, significantly outperforming a standard Buy-and-Hold benchmark. These findings underscore the effectiveness of outlier-driven strategies and provide insights into the trade-offs between model complexity, trade frequency, and performance. This study contributes to the growing corpus of research on cryptocurrency market microstructure by furnishing a rigorous benchmark of anomaly detection models and highlighting their potential for augmenting algorithmic trading and risk management.
zh

[AI-169] Partial Symmetry Enforced Attention Decomposition (PSEAD): A Group-Theoretic Framework for Equivariant Transformers in Biological Systems

【速读】:该论文旨在解决现有Transformer模型在处理具有局部对称性的生物数据(如蛋白质序列或结构)时,缺乏对对称性信息的显式建模能力的问题,从而限制了模型的泛化性能、可解释性与计算效率。解决方案的关键在于提出了一种基于群论的新框架——部分对称性强制注意力分解(Theory of Partial Symmetry Enforced Attention Decomposition, PSEAD),其核心创新是将局部置换子群作用于数据窗口上,并严格证明在此条件下自注意力机制可自然分解为正交不可约分量的直和,这些分量与作用子群的不可约表示一一对应,从而实现了对称特征与非对称特征的解耦。这一数学结构不仅提升了模型对新型具有相似局部对称性的生物基序的泛化能力,还支持通过可视化各对称通道的注意力贡献实现前所未有的可解释性,并通过聚焦于相关对称子空间显著提升计算效率。

链接: https://arxiv.org/abs/2507.14908
作者: Daniel Ayomide Olanrewaju
机构: 未知
类目: Representation Theory (math.RT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This research introduces the Theory of Partial Symmetry Enforced Attention Decomposition (PSEAD), a new and rigorous group-theoretic framework designed to seamlessly integrate local symmetry awareness into the core architecture of self-attention mechanisms within Transformer models. We formalize the concept of local permutation subgroup actions on windows of biological data, proving that under such actions, the attention mechanism naturally decomposes into a direct sum of orthogonal irreducible components. Critically, these components are intrinsically aligned with the irreducible representations of the acting permutation subgroup, thereby providing a powerful mathematical basis for disentangling symmetric and asymmetric features. We show that PSEAD offers substantial advantages. These include enhanced generalization capabilities to novel biological motifs exhibiting similar partial symmetries, unprecedented interpretability by allowing direct visualization and analysis of attention contributions from different symmetry channels, and significant computational efficiency gains by focusing representational capacity on relevant symmetric subspaces. Beyond static data analysis, we extend PSEAD’s applicability to dynamic biological processes within reinforcement learning paradigms, showcasing its potential to accelerate the discovery and optimization of biologically meaningful policies in complex environments like protein folding and drug discovery. This work lays the groundwork for a new generation of biologically informed, symmetry-aware artificial intelligence models.
zh

[AI-170] Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)策略成功或失败的原因难以解释的问题,尤其在高维、复杂的智能体-环境交互背景下。为实现可解释性,作者从因果视角出发,将状态、动作和奖励视为低层因果模型中的变量,并通过在执行过程中对策略动作施加随机扰动来观测其对累积奖励的影响,从而学习一个简化的高层因果模型。该方法的核心创新在于提出了一种非线性因果模型降维框架(nonlinear Causal Model Reduction),其关键特性是保证近似干预一致性(approximate interventional consistency),即简化后的模型对干预的响应与原始复杂系统一致;进一步理论证明表明,在一类非线性因果模型中存在唯一解可实现精确干预一致性,从而确保所学解释能反映真实的因果模式。

链接: https://arxiv.org/abs/2507.14901
作者: Armin Kekić,Jan Schneider,Dieter Büchler,Bernhard Schölkopf,Michel Besserve
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Why do reinforcement learning (RL) policies fail or succeed? This is a challenging question due to the complex, high-dimensional nature of agent-environment interactions. In this work, we take a causal perspective on explaining the behavior of RL policies by viewing the states, actions, and rewards as variables in a low-level causal model. We introduce random perturbations to policy actions during execution and observe their effects on the cumulative reward, learning a simplified high-level causal model that explains these relationships. To this end, we develop a nonlinear Causal Model Reduction framework that ensures approximate interventional consistency, meaning the simplified high-level model responds to interventions in a similar way as the original complex system. We prove that for a class of nonlinear causal models, there exists a unique solution that achieves exact interventional consistency, ensuring learned explanations reflect meaningful causal patterns. Experiments on both synthetic causal models and practical RL tasks-including pendulum control and robot table tennis-demonstrate that our approach can uncover important behavioral patterns, biases, and failure modes in trained RL policies.
zh

[AI-171] Diffusion Models for Time Series Forecasting: A Survey

【速读】:该论文旨在解决时间序列预测(Time Series Forecasting, TSF)中生成式模型应用的系统性梳理与总结问题,尤其聚焦于扩散模型(Diffusion Models)在TSF任务中的适应机制与条件信息整合方式。其解决方案的关键在于:首先对标准扩散模型及其主流变体进行理论阐释,并分析其如何适配TSF场景;其次,通过系统分类现有方法,明确不同研究中条件信息的来源(如历史序列、外部变量等)及融合机制(如条件噪声调度、注意力机制等);最后,结合常用数据集和评估指标,归纳代表性模型的表现,并指出当前局限与未来方向,从而为相关领域研究人员提供清晰的技术脉络与实践参考。

链接: https://arxiv.org/abs/2507.14507
作者: Chen Su,Zhengzhou Cai,Yuanhe Tian,Zihong Zheng,Yan Song
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diffusion models, initially developed for image synthesis, demonstrate remarkable generative capabilities. Recently, their application has expanded to time series forecasting (TSF), yielding promising results. In this survey, we firstly introduce the standard diffusion models and their prevalent variants, explaining their adaptation to TSF tasks. We then provide a comprehensive review of diffusion models for TSF, paying special attention to the sources of conditional information and the mechanisms for integrating this conditioning within the models. In analyzing existing approaches using diffusion models for TSF, we provide a systematic categorization and a comprehensive summary of them in this survey. Furthermore, we examine several foundational diffusion models applied to TSF, alongside commonly used datasets and evaluation metrics. Finally, we discuss current limitations in these approaches and potential future research directions. Overall, this survey details recent progress and future prospects for diffusion models in TSF, serving as a reference for researchers in the field.
zh

[AI-172] Neural Brownian Motion

【速读】:该论文旨在解决如何在动态建模中引入可学习的不确定性结构问题,传统方法通常依赖于固定的概率测度或线性期望,难以捕捉复杂系统中的非线性风险偏好。其解决方案的关键在于提出神经布朗运动(Neural-Brownian Motion, NBM),通过将经典的鞅性质从线性期望推广到由神经网络参数化的非线性神经期望算子(Neural Expectation Operator εθ\varepsilon^\theta)下定义,并基于反向随机微分方程(BSDE)构建动力学模型。核心创新在于:NBM 的波动率函数 νθ\nu_\theta 不再预先设定,而是由 BSDE 驱动项 gθg_\theta 的代数约束隐式确定,从而使得不确定性态度(如悲观或乐观)成为模型参数 θ\theta 学习后的内生结果,为生成式 AI 和金融数学等领域的不确定性建模提供了严格且灵活的理论框架。

链接: https://arxiv.org/abs/2507.14499
作者: Qian Qi
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper introduces the Neural-Brownian Motion (NBM), a new class of stochastic processes for modeling dynamics under learned uncertainty. The NBM is defined axiomatically by replacing the classical martingale property with respect to linear expectation with one relative to a non-linear Neural Expectation Operator, \varepsilon^\theta , generated by a Backward Stochastic Differential Equation (BSDE) whose driver f_\theta is parameterized by a neural network. Our main result is a representation theorem for a canonical NBM, which we define as a continuous \varepsilon^\theta -martingale with zero drift under the physical measure. We prove that, under a key structural assumption on the driver, such a canonical NBM exists and is the unique strong solution to a stochastic differential equation of the form \rm d M_t = \nu_\theta(t, M_t) \rm d W_t . Crucially, the volatility function \nu_\theta is not postulated a priori but is implicitly defined by the algebraic constraint g_\theta(t, M_t, \nu_\theta(t, M_t)) = 0 , where g_\theta is a specialization of the BSDE driver. We develop the stochastic calculus for this process and prove a Girsanov-type theorem for the quadratic case, showing that an NBM acquires a drift under a new, learned measure. The character of this measure, whether pessimistic or optimistic, is endogenously determined by the learned parameters \theta , providing a rigorous foundation for models where the attitude towards uncertainty is a discoverable feature.
zh

[AI-173] Approximate Revenue Maximization for Diffusion Auctions

【速读】:该论文旨在解决传统最优拍卖设计中忽略网络结构下潜在买家的问题,即在经济网络中,许多潜在竞拍者并不直接接触拍卖者,导致现有基于保留价格(reserve price)的拍卖机制无法最大化整体收益。其解决方案的关键在于提出一种适用于网络拍卖(network auction)的简单且可证明近优的保留价格函数,该函数通过贝叶斯近似分析得出,能够在激励相容的前提下平衡高保留价格带来的单次成交收益与吸引更广网络内买家以提高成交概率之间的权衡。该方法确保了即使在网络规模为 $ n $、卖家有 $ \rho $ 个直接邻居的情况下,仍能实现理论最优收入的 $ 1 - \frac{1}{\rho} $ 近似比,且适用于任意网络结构和规模。

链接: https://arxiv.org/abs/2507.14470
作者: Yifan Huang,Dong Hao,Zhiyi Fan,Yuhang Guo,Bin Li
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Reserve prices are widely used in practice. The problem of designing revenue-optimal auctions based on reserve price has drawn much attention in the auction design community. Although they have been extensively studied, most developments rely on the significant assumption that the target audience of the sale is directly reachable by the auctioneer, while a large portion of bidders in the economic network unaware of the sale are omitted. This work follows the diffusion auction design, which aims to extend the target audience of optimal auction theory to all entities in economic networks. We investigate the design of simple and provably near-optimal network auctions via reserve price. Using Bayesian approximation analysis, we provide a simple and explicit form of the reserve price function tailored to the most representative network auction. We aim to balance setting a sufficiently high reserve price to induce high revenue in a successful sale, and attracting more buyers from the network to increase the probability of a successful sale. This reserve price function preserves incentive compatibility for network auctions, allowing the seller to extract additional revenue beyond that achieved by the Myerson optimal auction. Specifically, if the seller has \rho direct neighbours in a network of size n , this reserve price guarantees a 1-1 \over \rho approximation to the theoretical upper bound, i.e., the maximum possible revenue from any network of size n . This result holds for any size and any structure of the networked market.
zh

[AI-174] Statistical and Algorithmic Foundations of Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在样本稀缺场景下的样本效率与计算效率问题,尤其针对模型复杂度高、非凸性显著的新兴应用(如临床试验、自动驾驶和在线广告)中数据获取成本高昂或风险较高的挑战。其解决方案的关键在于从非渐近视角出发,系统梳理并整合多种主流RL方法(包括基于模型的方法、基于价值的方法和策略优化方法),结合马尔可夫决策过程(Markov Decision Process, MDP)作为核心建模工具,深入探讨不同RL范式(如带模拟器的RL、在线RL、离线RL、鲁棒RL及人类反馈RL)中的样本复杂度(sample complexity)与计算效率之间的权衡关系,并通过算法相关下界与信息论下界揭示性能极限,从而为设计更高效、可解释且实用的RL算法提供理论指导与实践路径。

链接: https://arxiv.org/abs/2507.14444
作者: Yuejie Chi,Yuxin Chen,Yuting Wei
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注: reading materials for INFORMS Tutorial in OR 2025

点击查看摘要

Abstract:As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL in sample-starved situations, where data collection is expensive, time-consuming, or even high-stakes (e.g., in clinical trials, autonomous systems, and online advertising). How to understand and enhance the sample and computational efficacies of RL algorithms is thus of great interest. In this tutorial, we aim to introduce several important algorithmic and theoretical developments in RL, highlighting the connections between new ideas and classical topics. Employing Markov Decision Processes as the central mathematical model, we cover several distinctive RL scenarios (i.e., RL with a simulator, online RL, offline RL, robust RL, and RL with human feedback), and present several mainstream RL approaches (i.e., model-based approach, value-based approach, and policy optimization). Our discussions gravitate around the issues of sample complexity, computational efficiency, as well as algorithm-dependent and information-theoretic lower bounds from a non-asymptotic viewpoint.
zh

[AI-175] Age of Information Minimization in UAV-Enabled Integrated Sensing and Communication Systems

【速读】:该论文旨在解决在资源受限和时间敏感条件下,如何联合优化无人机(UAV)轨迹规划、多用户通信与目标感知的问题,以提升信息新鲜度。其关键解决方案是提出一种以年龄信息(Age of Information, AoI)为核心的无人机集成感知与通信(ISAC)系统,通过深度强化学习(DRL)算法实现对UAV飞行轨迹与波束赋形的协同优化,其中引入卡尔曼滤波进行目标状态预测、正则化零 forcing 技术抑制用户间干扰,并采用软演员-评论家(Soft Actor-Critic)算法训练连续动作空间下的DRL代理,从而自适应平衡感知精度与通信质量之间的权衡。

链接: https://arxiv.org/abs/2507.14299
作者: Yu Bai,Yifan Zhang,Boxuan Xie,Zheng Chang,Yanru Zhang,Riku Jantti,Zhu Han
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) equipped with integrated sensing and communication (ISAC) capabilities are envisioned to play a pivotal role in future wireless networks due to their enhanced flexibility and efficiency. However, jointly optimizing UAV trajectory planning, multi-user communication, and target sensing under stringent resource constraints and time-critical conditions remains a significant challenge. To address this, we propose an Age of Information (AoI)-centric UAV-ISAC system that simultaneously performs target sensing and serves multiple ground users, emphasizing information freshness as the core performance metric. We formulate a long-term average AoI minimization problem that jointly optimizes the UAV’s flight trajectory and beamforming. To tackle the high-dimensional, non-convexity of this problem, we develop a deep reinforcement learning (DRL)-based algorithm capable of providing real-time decisions on UAV movement and beamforming for both radar sensing and multi-user communication. Specifically, a Kalman filter is employed for accurate target state prediction, regularized zero-forcing is utilized to mitigate inter-user interference, and the Soft Actor-Critic algorithm is applied for training the DRL agent on continuous actions. The proposed framework adaptively balances the trade-offs between sensing accuracy and communication quality. Extensive simulation results demonstrate that our proposed method consistently achieves lower average AoI compared to baseline approaches.
zh

[AI-176] A Comprehensive Benchmark for Electrocardiogram Time-Series ACM-MM2025

【速读】:该论文旨在解决当前对心电图(Electrocardiogram, ECG)信号在大规模时间序列模型预训练与下游任务中理解不充分的问题,特别是现有研究忽视了ECG数据的独特特性及其在临床诊断等专业场景下的差异化应用需求。其解决方案的关键在于:首先,系统性地将ECG的下游应用划分为四类评估任务以实现精细化分析;其次,识别传统评价指标在ECG分析中的局限性,并提出一种新的评估指标;最后,通过基准测试主流时间序列模型并设计一种新型网络架构,验证了所提方法的有效性和鲁棒性,从而为ECG信号分析研究提供了更可靠的基础框架。

链接: https://arxiv.org/abs/2507.14206
作者: Zhijiang Tang,Jiaxin Qi,Yuhua Zheng,Jianqiang Huang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted to ACM MM 2025

点击查看摘要

Abstract:Electrocardiogram~(ECG), a key bioelectrical time-series signal, is crucial for assessing cardiac health and diagnosing various diseases. Given its time-series format, ECG data is often incorporated into pre-training datasets for large-scale time-series model training. However, existing studies often overlook its unique characteristics and specialized downstream applications, which differ significantly from other time-series data, leading to an incomplete understanding of its properties. In this paper, we present an in-depth investigation of ECG signals and establish a comprehensive benchmark, which includes (1) categorizing its downstream applications into four distinct evaluation tasks, (2) identifying limitations in traditional evaluation metrics for ECG analysis, and introducing a novel metric; (3) benchmarking state-of-the-art time-series models and proposing a new architecture. Extensive experiments demonstrate that our proposed benchmark is comprehensive and robust. The results validate the effectiveness of the proposed metric and model architecture, which establish a solid foundation for advancing research in ECG signal analysis.
zh

[AI-177] Explainable Parallel CNN-LSTM Model for Differentiating Ventricular Tachycardia from Supraventricular Tachycardia with Aberrancy in 12-Lead ECGs

【速读】:该论文旨在解决宽QRS波心动过速(Wide Complex Tachycardia, WCT)的临床鉴别难题,尤其是心室性心动过速(Ventricular Tachycardia, VT)与室上性心动过速伴差异传导(Supraventricular Tachycardia with Aberrancy, SVT-A)在心电图(Electrocardiogram, ECG)形态上的高度相似性所导致的误诊风险。为提升诊断准确性并增强模型可解释性以支持临床部署,作者提出了一种轻量级并行深度学习架构:每个通道独立处理单导联ECG信号,通过两个一维卷积神经网络(1D-CNN)块提取局部特征,随后跨导联拼接特征图,并利用长短期记忆网络(LSTM)捕捉时序依赖关系,最终由全连接层完成分类。关键创新在于其高效计算结构与SHAP(Shapley Additive Explanations)解释方法的集成,既实现了高精度分类(准确率95.63%),又提供了可临床解读的决策依据,显著优于现有方法。

链接: https://arxiv.org/abs/2507.14196
作者: Zahra Teimouri-Jervekani,Fahimeh Nasimi,Mohammadreza Yazdchi,Ghazal MogharehZadeh,Javad Tezerji,Farzan Niknejad Mazandarani,Maryam Mohebbi
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Background and Objective: Differentiating wide complex tachycardia (WCT) is clinically critical yet challenging due to morphological similarities in electrocardiogram (ECG) signals between life-threatening ventricular tachycardia (VT) and supraventricular tachycardia with aberrancy (SVT-A). Misdiagnosis carries fatal risks. We propose a computationally efficient deep learning solution to improve diagnostic accuracy and provide model interpretability for clinical deployment. Methods: A novel lightweight parallel deep architecture is introduced. Each pipeline processes individual ECG leads using two 1D-CNN blocks to extract local features. Feature maps are concatenated across leads, followed by LSTM layers to capture temporal dependencies. Final classification employs fully connected layers. Explainability is achieved via Shapley Additive Explanations (SHAP) for local/global interpretation. The model was evaluated on a 35-subject ECG database using standard performance metrics. Results: The model achieved 95.63% accuracy ( 95% CI: 93.07-98.19% ), with sensitivity= 95.10% , specificity= 96.06% , and F1-score= 95.12% . It outperformed state-of-the-art methods in both accuracy and computational efficiency, requiring minimal CNN blocks per pipeline. SHAP analysis demonstrated clinically interpretable feature contributions. Conclusions: Our end-to-end framework delivers high-precision WCT classification with minimal computational overhead. The integration of SHAP enhances clinical trust by elucidating decision logic, supporting rapid, informed diagnosis. This approach shows significant promise for real-world ECG analysis tools. Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2507.14196 [eess.SP] (or arXiv:2507.14196v1 [eess.SP] for this version) https://doi.org/10.48550/arXiv.2507.14196 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Fahimeh Nasimi [view email] [v1] Mon, 14 Jul 2025 12:12:34 UTC (1,543 KB)
zh

[AI-178] UWB Radar-based Heart Rate Monitoring: A Transfer Learning Approach

【速读】:该论文旨在解决当前基于雷达的心率监测技术在消费电子设备中推广受限的问题,即不同雷达系统(如调频连续波FMCW与脉冲无线电超宽带IR-UWB)缺乏标准化,导致每种新雷达系统都需要大规模配对数据集重新训练模型,显著增加开发成本和时间。其解决方案的关键在于提出了一种基于2D+1D ResNet架构的迁移学习方法,通过在FMCW雷达上预训练高精度模型(MAE 0.85 bpm),再利用少量IR-UWB数据对其进行微调,实现了跨雷达系统的性能迁移,使IR-UWB模型的平均绝对误差(MAE)降低25%,从而大幅减少新雷达系统部署所需的标注数据量,加速心率监测功能在现有消费设备中的集成。

链接: https://arxiv.org/abs/2507.14195
作者: Elzbieta Gruzewska,Pooja Rao,Sebastien Baur,Matthew Baugh,Mathias M.J. Bellaiche,Sharanya Srinivas,Octavio Ponce,Matthew Thompson,Pramod Rudrapatna,Michael A. Sanchez,Lawrence Z. Cai,Timothy JA Chico,Robert F. Storey,Emily Maz,Umesh Telang,Shravya Shetty,Mayank Daswani
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 11 tables, 9 figures, 14 supplementary tables, 4 supplementary figures

点击查看摘要

Abstract:Radar technology presents untapped potential for continuous, contactless, and passive heart rate monitoring via consumer electronics like mobile phones. However the variety of available radar systems and lack of standardization means that a large new paired dataset collection is required for each radar system. This study demonstrates transfer learning between frequency-modulated continuous wave (FMCW) and impulse-radio ultra-wideband (IR-UWB) radar systems, both increasingly integrated into consumer devices. FMCW radar utilizes a continuous chirp, while IR-UWB radar employs short pulses. Our mm-wave FMCW radar operated at 60 GHz with a 5.5 GHz bandwidth (2.7 cm resolution, 3 receiving antennas [Rx]), and our IR-UWB radar at 8 GHz with a 500 MHz bandwidth (30 cm resolution, 2 Rx). Using a novel 2D+1D ResNet architecture we achieved a mean absolute error (MAE) of 0.85 bpm and a mean absolute percentage error (MAPE) of 1.42% for heart rate monitoring with FMCW radar (N=119 participants, an average of 8 hours per participant). This model maintained performance (under 5 MAE/10% MAPE) across various body positions and heart rate ranges, with a 98.9% recall. We then fine-tuned a variant of this model, trained on single-antenna and single-range bin FMCW data, using a small (N=376, avg 6 minutes per participant) IR-UWB dataset. This transfer learning approach yielded a model with MAE 4.1 bpm and MAPE 6.3% (97.5% recall), a 25% MAE reduction over the IR-UWB baseline. This demonstration of transfer learning between radar systems for heart rate monitoring has the potential to accelerate its introduction into existing consumer devices.
zh

[AI-179] AI-Based Impedance Encoding-Decoding Method for Online Impedance Network Construction of Wind Farms

【速读】:该论文旨在解决风力发电场振荡分析中阻抗网络(Impedance Network, IN)模型在线构建困难的问题,其核心挑战在于各风电机组在不同工况下需提供高密度的阻抗曲线数据,导致传输负担大、实时性差。解决方案的关键在于提出一种基于生成式 AI 的阻抗编码-解码方法:首先训练一个阻抗编码器,通过设置远少于频率点数量的神经元实现对阻抗曲线的压缩;随后将压缩后的数据上传至风电场,再训练一个阻抗解码器以精确重构原始阻抗曲线;最终结合节点导纳矩阵(Nodal Admittance Matrix, NAM)法构建风场级IN模型,从而实现高效传输与准确重建。

链接: https://arxiv.org/abs/2507.14187
作者: Xiaojuan Zhang,Tianyu Jiang,Haoxiang Zong,Chen Zhang,Chendan Li,Marta Molinas
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The impedance network (IN) model is gaining popularity in the oscillation analysis of wind farms. However, the construction of such an IN model requires impedance curves of each wind turbine under their respective operating conditions, making its online application difficult due to the transmission of numerous high-density impedance curves. To address this issue, this paper proposes an AI-based impedance encoding-decoding method to facilitate the online construction of IN model. First, an impedance encoder is trained to compress impedance curves by setting the number of neurons much smaller than that of frequency points. Then, the compressed data of each turbine are uploaded to the wind farm and an impedance decoder is trained to reconstruct original impedance curves. At last, based on the nodal admittance matrix (NAM) method, the IN model of the wind farm can be obtained. The proposed method is validated via model training and real-time simulations, demonstrating that the encoded impedance vectors enable fast transmission and accurate reconstruction of the original impedance curves.
zh

[AI-180] NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment

【速读】:该论文旨在解决传统心电图(ECG)疾病检测方法在模型可解释性与任务适应性之间的权衡问题,尤其是现有超维计算(HDC)方法依赖静态随机投影、缺乏生理信号感知能力的局限。其解决方案的关键在于提出一种融合可学习神经编码的新型可解释框架:通过引入基于RR间期(RR intervals)的节律感知编码模块和基于心脏周期对齐的信号分段策略,构建一个神经蒸馏型HDC架构,其中包含可训练的RR块编码器与二值线性超维投影层,并联合优化交叉熵损失与代理度量损失。该设计在保持HDC符号可解释性的基础上,实现了任务自适应的表示学习,显著提升了ECG分类性能并具备边缘部署潜力。

链接: https://arxiv.org/abs/2507.14184
作者: ZhengXiao He,Jinghao Wen,Huayu Li,Ao Li
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a novel and interpretable framework for electrocardiogram (ECG)-based disease detection that combines hyperdimensional computing (HDC) with learnable neural encoding. Unlike conventional HDC approaches that rely on static, random projections, our method introduces a rhythm-aware and trainable encoding pipeline based on RR intervals, a physiological signal segmentation strategy that aligns with cardiac cycles. The core of our design is a neural-distilled HDC architecture, featuring a learnable RR-block encoder and a BinaryLinear hyperdimensional projection layer, optimized jointly with cross-entropy and proxy-based metric loss. This hybrid framework preserves the symbolic interpretability of HDC while enabling task-adaptive representation learning. Experiments on Apnea-ECG and PTB-XL demonstrate that our model significantly outperforms traditional HDC and classical ML baselines, achieving 73.09% precision and an F1 score of 0.626 on Apnea-ECG, with comparable robustness on PTB-XL. Our framework offers an efficient and scalable solution for edge-compatible ECG classification, with strong potential for interpretable and personalized health monitoring.
zh

[AI-181] A Denoising VAE for Intracardiac Time Series in Ischemic Cardiomyopathy

【速读】:该论文旨在解决心脏电生理(Cardiac Electrophysiology, EP)领域中 intra-cardiac 信号噪声干扰严重的问题,尤其是针对来源于不同源的非线性、非平稳噪声,传统滤波方法难以有效处理的情况。解决方案的关键在于引入变分自编码器(Variational Autoencoder, VAE)模型,通过从42名缺血性心肌病患者获取的5706条时间序列数据中学习干净信号的潜在表示,从而实现对单个动作电位波形的有效去噪,其性能优于临床常用的传统滤波技术,并在多种噪声类型下展现出更强的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2507.14164
作者: Samuel Ruipérez-Campillo,Alain Ryser,Thomas M. Sutter,Ruibin Feng,Prasanth Ganesan,Brototo Deb,Kelly A. Brennan,Maxime Pedron,Albert J. Rogers,Maarten Z.H. Kolk,Fleur V.Y. Tjong,Sanjiv M. Narayan,Julia E. Vogt
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 3 tables, the last two authors are shared senior authors

点击查看摘要

Abstract:In the field of cardiac electrophysiology (EP), effectively reducing noise in intra-cardiac signals is crucial for the accurate diagnosis and treatment of arrhythmias and cardiomyopathies. However, traditional noise reduction techniques fall short in addressing the diverse noise patterns from various sources, often non-linear and non-stationary, present in these signals. This work introduces a Variational Autoencoder (VAE) model, aimed at improving the quality of intra-ventricular monophasic action potential (MAP) signal recordings. By constructing representations of clean signals from a dataset of 5706 time series from 42 patients diagnosed with ischemic cardiomyopathy, our approach demonstrates superior denoising performance when compared to conventional filtering methods commonly employed in clinical settings. We assess the effectiveness of our VAE model using various metrics, indicating its superior capability to denoise signals across different noise types, including time-varying non-linear noise frequently found in clinical settings. These results reveal that VAEs can eliminate diverse sources of noise in single beats, outperforming state-of-the-art denoising techniques and potentially improving treatment efficacy in cardiac EP.
zh

[AI-182] All-atom inverse protein folding through discrete flow matching ICML2025

【速读】:该论文旨在解决当前逆向蛋白质折叠(inverse protein folding)方法在设计包含非蛋白组分(如小分子配体、核苷酸或金属离子)的复杂蛋白质结构时表现不佳,以及对具有多构象状态的动态蛋白质复合物难以建模的问题。其解决方案的关键在于提出ADFLIP(All-atom Discrete FLow matching Inverse Protein folding),一种基于离散流匹配(discrete flow-matching)的生成模型,通过在序列生成过程中逐步引入预测的氨基酸侧链作为全原子结构上下文,实现对多构象状态的集合采样(ensemble sampling),从而支持动态蛋白质复合物的设计;同时引入无需训练的分类器引导采样(training-free classifier guidance sampling),可灵活整合任意预训练模型以优化设计序列的功能属性。

链接: https://arxiv.org/abs/2507.14156
作者: Kai Yi,Kiarash Jamali,Sjors H. W. Scheres
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注: ICML2025

点击查看摘要

Abstract:The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so-called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non-protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All-atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow-matching for designing protein sequences conditioned on all-atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training-free classifier guidance sampling, which allows the incorporation of arbitrary pre-trained models to optimise the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small-molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state-of-the-art performance in single-structure and multi-structure inverse folding tasks, demonstrating excellent potential for all-atom protein design. The code is available at this https URL.
zh

[AI-183] Surface EMG Profiling in Parkinsons Disease: Advancing Severity Assessment with GCN-SVM

【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)在诊断与病情监测中因疾病进展性和症状复杂性带来的挑战,提出一种基于表面肌电图(surface electromyography, sEMG)的客观评估方法,以量化PD严重程度。其解决方案的关键在于利用sEMG信号特征提取神经肌肉差异,并结合机器学习模型进行分类:初始采用传统支持向量机(Support Vector Machine, SVM)模型实现83%的准确率,进一步引入图卷积网络-支持向量机(Graph Convolutional Network-Support Vector Machine, GCN-SVM)模型提升至92%,显著提高了分类性能。该方法为未来大规模验证和临床转化提供了可扩展的技术路径。

链接: https://arxiv.org/abs/2507.14153
作者: Daniel Cieślak,Barbara Szyca,Weronika Bajko,Liwia Florkiewicz,Kinga Grzęda,Mariusz Kaczmarek,Helena Kamieniecka,Hubert Lis,Weronika Matwiejuk,Anna Prus,Michalina Razik,Inga Rozumowicz,Wiktoria Ziembakowska
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: International Conference on Hybrid Artificial Intelligence Systems (HAIS 2024)

点击查看摘要

Abstract:Parkinson’s disease (PD) poses challenges in diagnosis and monitoring due to its progressive nature and complex symptoms. This study introduces a novel approach utilizing surface electromyography (sEMG) to objectively assess PD severity, focusing on the biceps brachii muscle. Initial analysis of sEMG data from five PD patients and five healthy controls revealed significant neuromuscular differences. A traditional Support Vector Machine (SVM) model achieved up to 83% accuracy, while enhancements with a Graph Convolutional Network-Support Vector Machine (GCN-SVM) model increased accuracy to 92%. Despite the preliminary nature of these results, the study outlines a detailed experimental methodology for future research with larger cohorts to validate these findings and integrate the approach into clinical practice. The proposed approach holds promise for advancing PD severity assessment and improving patient care in Parkinson’s disease management.
zh

[AI-184] Self-DANA: A Resource-Efficient Channel-Adaptive Self-Supervised Approach for ECG Foundation Models

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中基础模型(Foundation Models, FMs)在心电图(ECG)信号分析任务中,面对通道数减少的下游场景时适应性不足的问题。当前可穿戴和便携设备普及促使研究者关注低通道配置下的学习效率与性能平衡,但现有方法尚未充分优化FMs在有限输入通道条件下的迁移能力。解决方案的关键在于提出Self-DANA,一种易于集成的自监督架构适配机制,使模型能够在减少输入通道数的同时保持高资源效率与性能;同时引入随机导联选择(Random Lead Selection)作为新型增强策略,在预训练阶段提升模型对通道变化的鲁棒性和通道无关性,从而显著降低计算资源消耗(如峰值CPU/GPU内存减少达69.3%和34.4%),并实现最优下游任务表现。

链接: https://arxiv.org/abs/2507.14151
作者: Giuliana Monachino,Nicolò La Porta,Beatrice Zanchi,Luigi Fiorillo,Alvise Dei Rossi,Georgiy Farina,Francesca Dalia Faraci
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Foundation Models (FMs) are large-scale machine learning models trained on extensive, diverse datasets that can be adapted to a wide range of downstream tasks with minimal fine-tuning. In the last two years, interest in FMs has also grown for applications in the cardiological field to analyze the electrocardiogram (ECG) signals. One of the key properties of FMs is their transferability to a wide range of downstream scenarios. With the spread of wearable and portable devices, keen interest in learning from reduced-channel configurations has arisen. However, the adaptation of ECG FMs to downstream scenarios with fewer available channels still has to be properly investigated. In this work, we propose Self-DANA, a novel, easy-to-integrate solution that makes self-supervised architectures adaptable to a reduced number of input channels, ensuring resource efficiency and high performance. We also introduce Random Lead Selection, a novel augmentation technique to pre-train models in a more robust and channel-agnostic way. Our experimental results on five reduced-channel configurations demonstrate that Self-DANA significantly enhances resource efficiency while reaching state-of-the-art performance. It requires up to 69.3% less peak CPU memory, 34.4% less peak GPU memory, about 17% less average epoch CPU time, and about 24% less average epoch GPU time.
zh

[AI-185] DIVER-0 : A Fully Channel Equivariant EEG Foundation Model ICML2025

【速读】:该论文旨在解决现有脑电图(EEG)基础模型在建模时空脑动态特性方面的局限性,以及缺乏通道置换等变性(channel permutation equivariance),从而导致在不同电极配置下泛化能力不足的问题。其解决方案的关键在于提出DIVER-0模型,通过引入全时空注意力机制(full spatio-temporal attention)替代传统的分离空间或时间处理方式,并结合旋转位置编码(Rotary Position Embedding, RoPE)以精确建模时间关系,以及二进制注意力偏置(binary attention biases)用于区分通道特征;同时创新性地设计滑动时间条件位置编码(Sliding Temporal Conditional Positional Encoding, STCPE),在保持时间平移等变性和通道置换等变性的前提下,实现对预训练中未见的任意电极配置的鲁棒适应,从而显著提升跨数据集的泛化性能。

链接: https://arxiv.org/abs/2507.14141
作者: Danny Dongyeop Han,Ahhyun Lucy Lee,Taeyang Lee,Yonghyeon Gwon,Sebin Lee,Seongjin Lee,David Keetae Park,Shinjae Yoo,Jiook Cha,Chun Kee Chung
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 1 figures, ICML 2025 Workshop on GenBio

点击查看摘要

Abstract:Electroencephalography (EEG) is a non-invasive technique widely used in brain-computer interfaces and clinical applications, yet existing EEG foundation models face limitations in modeling spatio-temporal brain dynamics and lack channel permutation equivariance, preventing robust generalization across diverse electrode configurations. To address these challenges, we propose DIVER-0, a novel EEG foundation model that demonstrates how full spatio-temporal attention-rather than segregated spatial or temporal processing-achieves superior performance when properly designed with Rotary Position Embedding (RoPE) for temporal relationships and binary attention biases for channel differentiation. We also introduce Sliding Temporal Conditional Positional Encoding (STCPE), which improves upon existing conditional positional encoding approaches by maintaining both temporal translation equivariance and channel permutation equivariance, enabling robust adaptation to arbitrary electrode configurations unseen during pretraining. Experimental results demonstrate that DIVER-0 achieves competitive performance with only 10% of pretraining data while maintaining consistent results across all channel permutation conditions, validating its effectiveness for cross-dataset generalization and establishing key design principles for handling the inherent heterogeneity of neural recording setups.
zh

[AI-186] Geophysics-informed neural network for model-based seismic inversion using surrogate point spread functions

【速读】:该论文旨在解决传统基于模型的地震反演方法在储层表征中面临的两大局限性:一是依赖于一维平均平稳子波(1D average stationary wavelets),二是假设了不切实际的横向分辨率。为克服这些问题,作者提出了一种地球物理信息神经网络(Geophysics-Informed Neural Network, GINN),其核心创新在于将深度学习与地震建模相结合,利用深度卷积神经网络(Deep Convolutional Neural Network, DCNN)同时估计点扩散函数(Point Spread Functions, PSFs)和声阻抗(acoustic impedance, IP)。关键在于通过将PSF分解为零相位和残差成分以保证地球物理一致性,并引入位置特征和低频声阻抗(Low-Frequency Impedance, LF-IP)作为输入,配合自监督损失函数(结合均方误差MSE与结构相似性指数SSIM),使网络能够生成高分辨率声阻抗模型和具有真实地质特征的PSF,且具备有限横向分辨率,从而有效降低噪声并提升反演精度。

链接: https://arxiv.org/abs/2507.14140
作者: Marcus Saraiva,Ana Muller,Alexandre Maul
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Model-based seismic inversion is a key technique in reservoir characterization, but traditional methods face significant limitations, such as relying on 1D average stationary wavelets and assuming an unrealistic lateral resolution. To address these challenges, we propose a Geophysics-Informed Neural Network (GINN) that integrates deep learning with seismic modeling. This novel approach employs a Deep Convolutional Neural Network (DCNN) to simultaneously estimate Point Spread Functions (PSFs) and acoustic impedance (IP). PSFs are divided into zero-phase and residual components to ensure geophysical consistency and to capture fine details. We used synthetic data from the SEAM Phase I Earth Model to train the GINN for 100 epochs (approximately 20 minutes) using a 2D UNet architecture. The network’s inputs include positional features and a low-frequency impedance (LF-IP) model. A self-supervised loss function combining Mean Squared Error (MSE) and Structural Similarity Index Measure (SSIM) was employed to ensure accurate results. The GINN demonstrated its ability to generate high-resolution IP and realistic PSFs, aligning with expected geological features. Unlike traditional 1D wavelets, the GINN produces PSFs with limited lateral resolution, reducing noise and improving accuracy. Future work will aim to refine the training process and validate the methodology with real seismic data.
zh

机器学习

[LG-0] Optimizing Canaries for Privacy Auditing with Metagradient Descent

链接: https://arxiv.org/abs/2507.15836
作者: Matteo Boglioni,Terrance Liu,Andrew Ilyas,Zhiwei Steven Wu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:In this work we study black-box privacy auditing, where the goal is to lower bound the privacy parameter of a differentially private learning algorithm using only the algorithm’s outputs (i.e., final trained model). For DP-SGD (the most successful method for training differentially private deep learning models), the canonical approach auditing uses membership inference-an auditor comes with a small set of special “canary” examples, inserts a random subset of them into the training set, and then tries to discern which of their canaries were included in the training set (typically via a membership inference attack). The auditor’s success rate then provides a lower bound on the privacy parameters of the learning algorithm. Our main contribution is a method for optimizing the auditor’s canary set to improve privacy auditing, leveraging recent work on metagradient optimization. Our empirical evaluation demonstrates that by using such optimized canaries, we can improve empirical lower bounds for differentially private image classification models by over 2x in certain instances. Furthermore, we demonstrate that our method is transferable and efficient: canaries optimized for non-private SGD with a small model architecture remain effective when auditing larger models trained with DP-SGD.

[LG-1] Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction

链接: https://arxiv.org/abs/2507.15832
作者: Shiyang Li
类目: Machine Learning (cs.LG)
*备注: in Chinese language

点击查看摘要

Abstract:To address the limitations of medium- and long-term four-dimensional (4D) trajectory prediction models, this paper proposes a hybrid CNN-LSTM-attention-adaboost neural network model incorporating a multi-strategy improved snake-herd optimization (SO) algorithm. The model applies the Adaboost algorithm to divide multiple weak learners, and each submodel utilizes CNN to extract spatial features, LSTM to capture temporal features, and attention mechanism to capture global features comprehensively. The strong learner model, combined with multiple sub-models, then optimizes the hyperparameters of the prediction model through the natural selection behavior pattern simulated by SO. In this study, based on the real ADS-B data from Xi’an to Tianjin, the comparison experiments and ablation studies of multiple optimizers are carried out, and a comprehensive test and evaluation analysis is carried out. The results show that SO-CLA-adaboost outperforms traditional optimizers such as particle swarm, whale, and gray wolf in handling large-scale high-dimensional trajectory data. In addition, introducing the full-strategy collaborative improvement SO algorithm improves the model’s prediction accuracy by 39.89%.

[LG-2] Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation

链接: https://arxiv.org/abs/2507.15826
作者: Alessandro B. Melchiorre,Elena V. Epure,Shahed Masoudian,Gustavo Escobedo,Anna Hausberger,Manuel Moussallam,Markus Schedl
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.

[LG-3] Federated Split Learning with Improved Communication and Storag e Efficiency

链接: https://arxiv.org/abs/2507.15816
作者: Yujia Mu,Cong Shen
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Accepted for publication in IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Federated learning (FL) is one of the popular distributed machine learning (ML) solutions but incurs significant communication and computation costs at edge devices. Federated split learning (FSL) can train sub-models in parallel and reduce the computational burden of edge devices by splitting the model architecture. However, it still requires a high communication overhead due to transmitting the smashed data and gradients between clients and the server in every global round. Furthermore, the server must maintain separate partial models for every client, leading to a significant storage requirement. To address these challenges, this paper proposes a novel communication and storage efficient federated split learning method, termed CSE-FSL, which utilizes an auxiliary network to locally update the weights of the clients while keeping a single model at the server, hence avoiding frequent transmissions of gradients from the server and greatly reducing the storage requirement of the server. Additionally, a new model update method of transmitting the smashed data in selected epochs can reduce the amount of smashed data sent from the clients. We provide a theoretical analysis of CSE-FSL, rigorously guaranteeing its convergence under non-convex loss functions. The extensive experimental results further indicate that CSE-FSL achieves a significant communication reduction over existing FSL solutions using real-world FL tasks.

[LG-4] LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

链接: https://arxiv.org/abs/2507.15815
作者: Seth Karten,Wenzhe Li,Zihan Ding,Samuel Kleiner,Yu Bai,Chi Jin
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 27 pages, 6 figures, Code: this https URL

点击查看摘要

Abstract:We present the LLM Economist, a novel framework that uses agent-based modeling to design and assess economic policies in strategic environments with hierarchical decision-making. At the lower level, bounded rational worker agents – instantiated as persona-conditioned prompts sampled from U.S. Census-calibrated income and demographic statistics – choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets. This construction endows economic simulacra with three capabilities requisite for credible fiscal experimentation: (i) optimization of heterogeneous utilities, (ii) principled generation of large, demographically realistic agent populations, and (iii) mechanism design – the ultimate nudging problem – expressed entirely in natural language. Experiments with populations of up to one hundred interacting agents show that the planner converges near Stackelberg equilibria that improve aggregate social welfare relative to Saez solutions, while a periodic, persona-level voting procedure furthers these gains under decentralized governance. These results demonstrate that large language model-based agents can jointly model, simulate, and govern complex economic systems, providing a tractable test bed for policy evaluation at the societal scale to help build better civilizations.

[LG-5] Graph Attention Specialized Expert Fusion Model for Node Classification: Based on Cora and Pubmed Datasets

链接: https://arxiv.org/abs/2507.15784
作者: Zihang Ma,Qitian Yin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph node classification is a fundamental task in graph neural networks (GNNs), aiming to assign predefined class labels to nodes. On the PubMed citation network dataset, we observe significant classification difficulty disparities, with Category 2 achieving only 74.4% accuracy in traditional GCN, 7.5% lower than Category 1. To address this, we propose a Wasserstein-Rubinstein (WR) distance enhanced Expert Fusion Model (WR-EFM), training specialized GNN models for Categories 0/1 (with layer normalization and residual connections) and Multi-hop Graph Attention Networks (GAT) for Category 2. The WR distance metric optimizes representation similarity between models, particularly focusing on improving Category 2 performance. Our adaptive fusion strategy dynamically weights models based on category-specific performance, with Category 2 assigned a GAT weight of 0.8. WR distance further guides the fusion process by measuring distributional differences between model representations, enabling more principled integration of complementary features. Experimental results show WR-EFM achieves balanced accuracy across categories: 77.8% (Category 0), 78.0% (Category 1), and 79.9% (Category 2), outperforming both single models and standard fusion approaches. The coefficient of variation (CV) of WR-EFM’s category accuracies is 0.013, 77.6% lower than GCN’s 0.058, demonstrating superior stability. Notably, WR-EFM improves Category 2 accuracy by 5.5% compared to GCN, verifying the effectiveness of WR-guided fusion in capturing complex structural patterns. This work provides a novel paradigm for handling class-imbalanced graph classification tasks. To promote the research community, we release our project at this https URL. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2507.15784 [cs.LG] (or arXiv:2507.15784v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.15784 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-6] Multi-Modal Sensor Fusion for Proactive Blockage Prediction in mmWave Vehicular Networks

链接: https://arxiv.org/abs/2507.15769
作者: Ahmad M. Nazar,Abdulkadir Celik,Mohamed Y. Selim,Asmaa Abdallah,Daji Qiao,Ahmed M. Eltawil
类目: Machine Learning (cs.LG)
*备注: Accepted in IEEE Asilomar Conference on Signals, Systems, and Computers 2025

点击查看摘要

Abstract:Vehicular communication systems operating in the millimeter wave (mmWave) band are highly susceptible to signal blockage from dynamic obstacles such as vehicles, pedestrians, and infrastructure. To address this challenge, we propose a proactive blockage prediction framework that utilizes multi-modal sensing, including camera, GPS, LiDAR, and radar inputs in an infrastructure-to-vehicle (I2V) setting. This approach uses modality-specific deep learning models to process each sensor stream independently and fuses their outputs using a softmax-weighted ensemble strategy based on validation performance. Our evaluations, for up to 1.5s in advance, show that the camera-only model achieves the best standalone trade-off with an F1-score of 97.1% and an inference time of 89.8ms. A camera+radar configuration further improves accuracy to 97.2% F1 at 95.7ms. Our results display the effectiveness and efficiency of multi-modal sensing for mmWave blockage prediction and provide a pathway for proactive wireless communication in dynamic environments.

[LG-7] Competitive Algorithms for Cooperative Multi-Agent Ski-Rental Problems

链接: https://arxiv.org/abs/2507.15727
作者: Xuchuang Wang,Bo Sun,Hedyeh Beyhaghi,John C.S. Lui,Mohammad Hajiesmaili,Adam Wierman
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper introduces a novel multi-agent ski-rental problem that generalizes the classical ski-rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents’ active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state-dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state-aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state-aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski-rental insights to multi-agent settings, highlighting both theoretical and practical implications for group decision-making under uncertainty.

[LG-8] GeoHNNs: Geometric Hamiltonian Neural Networks

链接: https://arxiv.org/abs/2507.15678
作者: Amine Mohamed Aboussalah,Abdessalam Ed-dib
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Dynamical Systems (math.DS); Symplectic Geometry (math.SG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The fundamental laws of physics are intrinsically geometric, dictating the evolution of systems through principles of symmetry and conservation. While modern machine learning offers powerful tools for modeling complex dynamics from data, common methods often ignore this underlying geometric fabric. Physics-informed neural networks, for instance, can violate fundamental physical principles, leading to predictions that are unstable over long periods, particularly for high-dimensional and chaotic systems. Here, we introduce \textitGeometric Hamiltonian Neural Networks (GeoHNN), a framework that learns dynamics by explicitly encoding the geometric priors inherent to physical laws. Our approach enforces two fundamental structures: the Riemannian geometry of inertia, by parameterizing inertia matrices in their natural mathematical space of symmetric positive-definite matrices, and the symplectic geometry of phase space, using a constrained autoencoder to ensure the preservation of phase space volume in a reduced latent space. We demonstrate through experiments on systems ranging from coupled oscillators to high-dimensional deformable objects that GeoHNN significantly outperforms existing models. It achieves superior long-term stability, accuracy, and energy conservation, confirming that embedding the geometry of physics is not just a theoretical appeal but a practical necessity for creating robust and generalizable models of the physical world.

[LG-9] Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity

链接: https://arxiv.org/abs/2507.15601
作者: Huiling Yang,Zhanwei Wang,Kaibin Huang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C ^2 ) capabilities across devices. To address these challenges, we propose a novel C ^2 -aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C ^2 tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C ^2 tradeoff or device heterogeneity.

[LG-10] Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing

链接: https://arxiv.org/abs/2507.15599
作者: Manatsawin Hanmongkolchai
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models for code (Code LLM) are increasingly utilized in programming environments. Despite their utility, the training datasets for top LLM remain undisclosed, raising concerns about potential copyright violations. Some models, such as Pleias and Comma put emphasis on data curation and licenses, however, with limited training data these models are not competitive and only serve as proof of concepts. To improve the utility of these models, we propose an application of the “Chinese Wall” technique, inspired by the reverse engineering technique of the same name – a high quality model is used to generate detailed instructions for a weaker model. By doing so, a weaker but ethically aligned model may be used to perform complicated tasks that, otherwise, can only be completed by more powerful models. In our evaluation, we’ve found that this technique improves Comma v0.1 1T’s performance in CanItEdit benchmark by over 66%, and Starcoder2 Instruct by roughly 20% compared to when running the same model on the benchmark alone. The practical application of this technique today, however, may be limited due to the lack of models trained on public domain content without copyright restrictions.

[LG-11] We Need to Rethink Benchmarking in Anomaly Detection

链接: https://arxiv.org/abs/2507.15584
作者: Philipp Röchner,Simon Klüttermann,Franz Rothlauf,Daniel Schlör
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. Current benchmarking does not, for example, sufficiently reflect the diversity of anomalies in applications ranging from predictive maintenance to scientific discovery. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that capture the relevant characteristics of different applications. We identify three key areas for improvement: First, we need to identify anomaly detection scenarios based on a common taxonomy. Second, anomaly detection pipelines should be analyzed end-to-end and by component. Third, evaluating anomaly detection algorithms should be meaningful regarding the scenario’s objectives.

[LG-12] rade-offs between elective surgery rescheduling and length-of-stay prediction accuracy

链接: https://arxiv.org/abs/2507.15566
作者: Pieter Smet,Martina Doneda,Ettore Lanzarone,Giuliana Carello
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The availability of downstream resources plays a critical role in planning the admission of patients undergoing elective surgery, with inpatient beds being one of the most crucial resources. When planning patient admissions, predictions on their length-of-stay (LOS) made by machine learning (ML) models are used to ensure bed availability. However, the actual LOS for each patient may differ considerably from the predicted value, potentially making the schedule infeasible. To address such infeasibilities, rescheduling strategies that take advantage of operational flexibility can be implemented. For example, adjustments may include postponing admission dates, relocating patients to different wards, or even transferring patients who are already admitted. The common assumption is that more accurate LOS predictions reduce the impact of rescheduling. However, training ML models that can make such accurate predictions can be costly. Building on previous work that proposed simulated \acml for evaluating data-driven approaches, this paper explores the relationship between LOS prediction accuracy and rescheduling flexibility across various corrective policies. Specifically, we examine the most effective patient rescheduling strategies under LOS prediction errors to prevent bed overflows while optimizing resource utilization.

[LG-13] he added value for MRI radiomics and deep-learning for glioblastoma prognostication compared to clinical and molecular information

链接: https://arxiv.org/abs/2507.15548
作者: D. Abler,O. Pusterla,A. Joye-Kühnis,N. Andratschke,M. Bach,A. Bink,S. M. Christ,P. Hagmann,B. Pouymayou,E. Pravatà,P. Radojewski,M. Reyes,L. Ruinelli,R. Schaer,B. Stieltjes,G. Treglia,W. Valenzuela,R. Wiest,S. Zoergiebel,M. Guckenberger,S. Tanadini-Lang,A. Depeursinge
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Background: Radiomics shows promise in characterizing glioblastoma, but its added value over clinical and molecular predictors has yet to be proven. This study assessed the added value of conventional radiomics (CR) and deep learning (DL) MRI radiomics for glioblastoma prognosis (= 6 vs 6 months survival) on a large multi-center dataset. Methods: After patient selection, our curated dataset gathers 1152 glioblastoma (WHO 2016) patients from five Swiss centers and one public source. It included clinical (age, gender), molecular (MGMT, IDH), and baseline MRI data (T1, T1 contrast, FLAIR, T2) with tumor regions. CR and DL models were developed using standard methods and evaluated on internal and external cohorts. Sub-analyses assessed models with different feature sets (imaging-only, clinical/molecular-only, combined-features) and patient subsets (S-1: all patients, S-2: with molecular data, S-3: IDH wildtype). Results: The best performance was observed in the full cohort (S-1). In external validation, the combined-feature CR model achieved an AUC of 0.75, slightly, but significantly outperforming clinical-only (0.74) and imaging-only (0.68) models. DL models showed similar trends, though without statistical significance. In S-2 and S-3, combined models did not outperform clinical-only models. Exploratory analysis of CR models for overall survival prediction suggested greater relevance of imaging data: across all subsets, combined-feature models significantly outperformed clinical-only models, though with a modest advantage of 2-4 C-index points. Conclusions: While confirming the predictive value of anatomical MRI sequences for glioblastoma prognosis, this multi-center study found standard CR and DL radiomics approaches offer minimal added value over demographic predictors such as age and gender. Subjects: Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2507.15548 [cs.LG] (or arXiv:2507.15548v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.15548 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Daniel Abler [view email] [v1] Mon, 21 Jul 2025 12:27:07 UTC (5,739 KB)

[LG-14] Data Aware Differentiable Neural Architecture Search for Tiny Keyword Spotting Applications

链接: https://arxiv.org/abs/2507.15545
作者: Yujia Shi,Emil Njor,Pablo Martínez-Nuevo,Sven Ewan Shepstone,Xenofon Fafoutis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The success of Machine Learning is increasingly tempered by its significant resource footprint, driving interest in efficient paradigms like TinyML. However, the inherent complexity of designing TinyML systems hampers their broad adoption. To reduce this complexity, we introduce “Data Aware Differentiable Neural Architecture Search”. Unlike conventional Differentiable Neural Architecture Search, our approach expands the search space to include data configuration parameters alongside architectural choices. This enables Data Aware Differentiable Neural Architecture Search to co-optimize model architecture and input data characteristics, effectively balancing resource usage and system performance for TinyML applications. Initial results on keyword spotting demonstrate that this novel approach to TinyML system design can generate lean but highly accurate systems.

[LG-15] An Investigation of Test-time Adaptation for Audio Classification under Background Noise

链接: https://arxiv.org/abs/2507.15523
作者: Weichuang Shao,Iman Yi Liao,Tomas Henrique Bode Maul,Tissa Chandesa
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Domain shift is a prominent problem in Deep Learning, causing a model pre-trained on a source dataset to suffer significant performance degradation on test datasets. This research aims to address the issue of audio classification under domain shift caused by background noise using Test-Time Adaptation (TTA), a technique that adapts a pre-trained model during testing using only unlabelled test data before making predictions. We adopt two common TTA methods, TTT and TENT, and a state-of-the-art method CoNMix, and investigate their respective performance on two popular audio classification datasets, AudioMNIST (AM) and SpeechCommands V1 (SC), against different types of background noise and noise severity levels. The experimental results reveal that our proposed modified version of CoNMix produced the highest classification accuracy under domain shift (5.31% error rate under 10 dB exercise bike background noise and 12.75% error rate under 3 dB running tap background noise for AM) compared to TTT and TENT. The literature search provided no evidence of similar works, thereby motivating the work reported here as the first study to leverage TTA techniques for audio classification under domain shift.

[LG-16] FedMultiEmo: Real-Time Emotion Recognition via Multimodal Federated Learning CEC

链接: https://arxiv.org/abs/2507.15470
作者: Baran Can Gül,Suraksha Nadig,Stefanos Tziampazis,Nasser Jazdi,Michael Weyrich
类目: Machine Learning (cs.LG)
*备注: Preprint version. Accepted for publication at IEEE ICECCME 2025

点击查看摘要

Abstract:In-vehicle emotion recognition underpins adaptive driver-assistance systems and, ultimately, occupant safety. However, practical deployment is hindered by (i) modality fragility - poor lighting and occlusions degrade vision-based methods; (ii) physiological variability - heart-rate and skin-conductance patterns differ across individuals; and (iii) privacy risk - centralized training requires transmission of sensitive data. To address these challenges, we present FedMultiEmo, a privacy-preserving framework that fuses two complementary modalities at the decision level: visual features extracted by a Convolutional Neural Network from facial images, and physiological cues (heart rate, electrodermal activity, and skin temperature) classified by a Random Forest. FedMultiEmo builds on three key elements: (1) a multimodal federated learning pipeline with majority-vote fusion, (2) an end-to-end edge-to-cloud prototype on Raspberry Pi clients and a Flower server, and (3) a personalized Federated Averaging scheme that weights client updates by local data volume. Evaluated on FER2013 and a custom physiological dataset, the federated Convolutional Neural Network attains 77% accuracy, the Random Forest 74%, and their fusion 87%, matching a centralized baseline while keeping all raw data local. The developed system converges in 18 rounds, with an average round time of 120 seconds and a per-client memory footprint below 200 MB. These results indicate that FedMultiEmo offers a practical approach to real-time, privacy-aware emotion recognition in automotive settings.

[LG-17] Privacy-Preserving Multimodal News Recommendation through Federated Learning

链接: https://arxiv.org/abs/2507.15460
作者: Mehdi Khalaj,Shahrzad Golestani Najafabadi,Julita Vassileva
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized News Recommendation systems (PNR) have emerged as a solution to information overload by predicting and suggesting news items tailored to individual user interests. However, traditional PNR systems face several challenges, including an overreliance on textual content, common neglect of short-term user interests, and significant privacy concerns due to centralized data storage. This paper addresses these issues by introducing a novel multimodal federated learning-based approach for news recommendation. First, it integrates both textual and visual features of news items using a multimodal model, enabling a more comprehensive representation of content. Second, it employs a time-aware model that balances users’ long-term and short-term interests through multi-head self-attention networks, improving recommendation accuracy. Finally, to enhance privacy, a federated learning framework is implemented, enabling collaborative model training without sharing user data. The framework divides the recommendation model into a large server-maintained news model and a lightweight user model shared between the server and clients. The client requests news representations (vectors) and a user model from the central server, then computes gradients with user local data, and finally sends their locally computed gradients to the server for aggregation. The central server aggregates gradients to update the global user model and news model. The updated news model is further used to infer news representation by the server. To further safeguard user privacy, a secure aggregation algorithm based on Shamir’s secret sharing is employed. Experiments on a real-world news dataset demonstrate strong performance compared to existing systems, representing a significant advancement in privacy-preserving personalized news recommendation.

[LG-18] An Adaptive Random Fourier Features approach Applied to Learning Stochastic Differential Equations

链接: https://arxiv.org/abs/2507.15442
作者: Owen Douglas,Aku Kammonen,Anamika Pandey,Raúl Tempone
类目: Machine Learning (cs.LG)
*备注: 20 Pages

点击查看摘要

Abstract:This work proposes a training algorithm based on adaptive random Fourier features (ARFF) with Metropolis sampling and resampling \citekammonen2024adaptiverandomfourierfeatures for learning drift and diffusion components of stochastic differential equations from snapshot data. Specifically, this study considers Itô diffusion processes and a likelihood-based loss function derived from the Euler-Maruyama integration introduced in \citeDietrich2023 and \citedridi2021learningstochasticdynamicalsystems. This work evaluates the proposed method against benchmark problems presented in \citeDietrich2023, including polynomial examples, underdamped Langevin dynamics, a stochastic susceptible-infected-recovered model, and a stochastic wave equation. Across all cases, the ARFF-based approach matches or surpasses the performance of conventional Adam-based optimization in both loss minimization and convergence speed. These results highlight the potential of ARFF as a compelling alternative for data-driven modeling of stochastic dynamics. Comments: 20 Pages Subjects: Machine Learning (cs.LG) Cite as: arXiv:2507.15442 [cs.LG] (or arXiv:2507.15442v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.15442 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-19] he calculus of variations of the Transformer on the hyperspherical tangent bundle

链接: https://arxiv.org/abs/2507.15431
作者: Andrew Gracyk
类目: Machine Learning (cs.LG)
*备注: First version

点击查看摘要

Abstract:We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. The circumstance of the hypersphere across the latent data is reasonable due to the trained diagonal matrix equal to the identity, which has various empirical justifications. Thus, under the continuum limit of the dynamics, the latent vectors flow among the tangent bundle. Using these facts, we devise a mathematical framework for the Transformer through calculus of variations. We develop a functional and show that the continuous flow map induced by the Transformer satisfies this functional, therefore the Transformer can be viewed as a natural solver of a calculus of variations problem. We invent new scenarios of when our methods are applicable based on loss optimization with respect to path optimality. We derive the Euler-Lagrange equation for the Transformer. The variant of the Euler-Lagrange equation we present has various appearances in literature, but, to our understanding, oftentimes not foundationally proven or under other specialized cases. Our overarching proof is new: our techniques are classical and the use of the flow map object is original. We provide several other relevant results, primarily ones specific to neural scenarios. In particular, much of our analysis will be attempting to quantify Transformer data in variational contexts under neural approximations. Calculus of variations on manifolds is a well-nourished research area, but for the Transformer specifically, it is uncharted: we lay the foundation for this area through an introduction to the Lagrangian for the Transformer.

[LG-20] MAP Estimation with Denoisers: Convergence Rates and Guarantees

链接: https://arxiv.org/abs/2507.15397
作者: Scott Pesme,Giacomo Meanti,Michael Arbel,Julien Mairal
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate-despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior p . We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.

[LG-21] Learning to Gridize: Segment Physical World by Wireless Communication Channel

链接: https://arxiv.org/abs/2507.15386
作者: Juntao Wang,Feng Yin,Tian Ding,Tsung-Hui Chang,Zhi-Quan Luo,Qi Yan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Gridization, the process of partitioning space into grids where users share similar channel characteristics, serves as a fundamental prerequisite for efficient large-scale network optimization. However, existing methods like Geographical or Beam Space Gridization (GSG or BSG) are limited by reliance on unavailable location data or the flawed assumption that similar signal strengths imply similar channel properties. We propose Channel Space Gridization (CSG), a pioneering framework that unifies channel estimation and gridization for the first time. Formulated as a joint optimization problem, CSG uses only beam-level reference signal received power (RSRP) to estimate Channel Angle Power Spectra (CAPS) and partition samples into grids with homogeneous channel characteristics. To perform CSG, we develop the CSG Autoencoder (CSG-AE), featuring a trainable RSRP-to-CAPS encoder, a learnable sparse codebook quantizer, and a physics-informed decoder based on the Localized Statistical Channel Model. On recognizing the limitations of naive training scheme, we propose a novel Pretraining-Initialization-Detached-Asynchronous (PIDA) training scheme for CSG-AE, ensuring stable and effective training by systematically addressing the common pitfalls of the naive training paradigm. Evaluations reveal that CSG-AE excels in CAPS estimation accuracy and clustering quality on synthetic data. On real-world datasets, it reduces Active Mean Absolute Error (MAE) by 30% and Overall MAE by 65% on RSRP prediction accuracy compared to salient baselines using the same data, while improving channel consistency, cluster sizes balance, and active ratio, advancing the development of gridization for large-scale network optimization.

[LG-22] Efficient Visual Appearance Optimization by Learning from Prior Preferences

链接: https://arxiv.org/abs/2507.15355
作者: Zhipeng Li,Yi-Chi Liao,Christian Holz
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 24 pages, UIST’25

点击查看摘要

Abstract:Adjusting visual parameters such as brightness and contrast is common in our everyday experiences. Finding the optimal parameter setting is challenging due to the large search space and the lack of an explicit objective function, leaving users to rely solely on their implicit preferences. Prior work has explored Preferential Bayesian Optimization (PBO) to address this challenge, involving users to iteratively select preferred designs from candidate sets. However, PBO often requires many rounds of preference comparisons, making it more suitable for designers than everyday end-users. We propose Meta-PO, a novel method that integrates PBO with meta-learning to improve sample efficiency. Specifically, Meta-PO infers prior users’ preferences and stores them as models, which are leveraged to intelligently suggest design candidates for the new users, enabling faster convergence and more personalized results. An experimental evaluation of our method for appearance design tasks on 2D and 3D content showed that participants achieved satisfactory appearance in 5.86 iterations using Meta-PO when participants shared similar goals with a population (e.g., tuning for a warm'' look) and in 8 iterations even generalizes across divergent goals (e.g., from vintage’‘, warm'', to holiday’'). Meta-PO makes personalized visual optimization more applicable to end-users through a generalizable, more efficient optimization conditioned on preferences, with the potential to scale interface personalization more broadly.

[LG-23] Language Generation in the Limit: Noise Loss and Feedback

链接: https://arxiv.org/abs/2507.15319
作者: Yannan Bai,Debmalya Panigrahi,Ian Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kleinberg and Mullainathan (2024) recently proposed a formal framework called language generation in the limit and showed that given a sequence of example strings from an unknown target language drawn from any countable collection, an algorithm can correctly generate unseen strings from the target language within finite time. This notion was further refined by Li, Raman, and Tewari (2024), who defined stricter categories of non-uniform and uniform generation. They showed that a finite union of uniformly generatable collections is generatable in the limit, and asked if the same is true for non-uniform generation. We begin by resolving the question in the negative: we give a uniformly generatable collection and a non-uniformly generatable collection whose union is not generatable in the limit. We then use facets of this construction to further our understanding of several variants of language generation. The first two, generation with noise and without samples, were introduced by Raman and Raman (2025) and Li, Raman, and Tewari (2024) respectively. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation. The former paper asked if there is any separation between noisy and non-noisy generation in the limit – we show that such a separation exists even with a single noisy string. Finally, we study the framework of generation with feedback, introduced by Charikar and Pabbaraju (2025), where the algorithm is strengthened by allowing it to ask membership queries. We show finite queries add no power, but infinite queries yield a strictly more powerful model. In summary, the results in this paper resolve the union-closedness of language generation in the limit, and leverage those techniques (and others) to give precise characterizations for natural variants that incorporate noise, loss, and feedback. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2507.15319 [cs.DS] (or arXiv:2507.15319v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2507.15319 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-24] Universal crystal material property prediction via multi-view geometric fusion in graph transformers

链接: https://arxiv.org/abs/2507.15303
作者: Liang Zhang,Kong Chen,Yuen Wu
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Accurately and comprehensively representing crystal structures is critical for advancing machine learning in large-scale crystal materials simulations, however, effectively capturing and leveraging the intricate geometric and topological characteristics of crystal structures remains a core, long-standing challenge for most existing methods in crystal property prediction. Here, we propose MGT, a multi-view graph transformer framework that synergistically fuses SE3 invariant and SO3 equivariant graph representations, which respectively captures rotation-translation invariance and rotation equivariance in crystal geometries. To strategically incorporate these complementary geometric representations, we employ a lightweight mixture of experts router in MGT to adaptively adjust the weight assigned to SE3 and SO3 embeddings based on the specific target task. Compared with previous state-of-the-art models, MGT reduces the mean absolute error by up to 21% on crystal property prediction tasks through multi-task self-supervised pretraining. Ablation experiments and interpretable investigations confirm the effectiveness of each technique implemented in our framework. Additionally, in transfer learning scenarios including crystal catalyst adsorption energy and hybrid perovskite bandgap prediction, MGT achieves performance improvements of up to 58% over existing baselines, demonstrating domain-agnostic scalability across diverse application domains. As evidenced by the above series of studies, we believe that MGT can serve as useful model for crystal material property prediction, providing a valuable tool for the discovery of novel materials.

[LG-25] Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

链接: https://arxiv.org/abs/2507.15290
作者: Emile Anand,Sarah Liaw
类目: Machine Learning (cs.LG)
*备注: 39 pages, 2 figures, 36 tables

点击查看摘要

Abstract:Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with \emphapproximate posteriors – common in large-scale or neural problems – has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy-to-use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments in this https URL.

[LG-26] Machine Unlearning for Streaming Forgetting

链接: https://arxiv.org/abs/2507.15280
作者: Shaofei Shen,Chenhao Zhang,Yawen Zhao,Alina Bialkowski,Weitong Chen,Miao Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning aims to remove knowledge of the specific training data in a well-trained model. Currently, machine unlearning methods typically handle all forgetting data in a single batch, removing the corresponding knowledge all at once upon request. However, in practical scenarios, requests for data removal often arise in a streaming manner rather than in a single batch, leading to reduced efficiency and effectiveness in existing methods. Such challenges of streaming forgetting have not been the focus of much research. In this paper, to address the challenges of performance maintenance, efficiency, and data access brought about by streaming unlearning requests, we introduce a streaming unlearning paradigm, formalizing the unlearning as a distribution shift problem. We then estimate the altered distribution and propose a novel streaming unlearning algorithm to achieve efficient streaming forgetting without requiring access to the original training data. Theoretical analyses confirm an O(\sqrtT + V_T) error bound on the streaming unlearning regret, where V_T represents the cumulative total variation in the optimal solution over T learning rounds. This theoretical guarantee is achieved under mild conditions without the strong restriction of convex loss function. Experiments across various models and datasets validate the performance of our proposed method.

[LG-27] mporal Basis Function Models for Closed-Loop Neural Stimulation

链接: https://arxiv.org/abs/2507.15274
作者: Matthew J. Bryan,Felix Schwock,Azadeh Yazdan-Shahmorad,Rajesh P N Rao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Closed-loop neural stimulation provides novel therapies for neurological diseases such as Parkinson’s disease (PD), but it is not yet clear whether artificial intelligence (AI) techniques can tailor closed-loop stimulation to individual patients or identify new therapies. Progress requires us to address a number of translational issues, including sample efficiency, training time, and minimizing loop latency such that stimulation may be shaped in response to changing brain activity. We propose temporal basis function models (TBFMs) to address these difficulties, and explore this approach in the context of excitatory optogenetic stimulation. We demonstrate the ability of TBF models to provide a single-trial, spatiotemporal forward prediction of the effect of optogenetic stimulation on local field potentials (LFPs) measured in two non-human primates. We further use simulations to demonstrate the use of TBF models for closed-loop stimulation, driving neural activity towards target patterns. The simplicity of TBF models allow them to be sample efficient, rapid to train (2-4min), and low latency (0.2ms) on desktop CPUs. We demonstrate the model on 40 sessions of previously published excitatory optogenetic stimulation data. For each session, the model required 15-20min of data collection to successfully model the remainder of the session. It achieved a prediction accuracy comparable to a baseline nonlinear dynamical systems model that requires hours to train, and superior accuracy to a linear state-space model. In our simulations, it also successfully allowed a closed-loop stimulator to control a neural circuit. Our approach begins to bridge the translational gap between complex AI-based approaches to modeling dynamical systems and the vision of using such forward prediction models to develop novel, clinically useful closed-loop stimulation protocols.

[LG-28] CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers ICCV2025

链接: https://arxiv.org/abs/2507.15260
作者: Jiaqi Han,Haotian Ye,Puheng Li,Minkai Xu,James Zou,Stefano Ermon
类目: Machine Learning (cs.LG)
*备注: ICCV 2025

点击查看摘要

Abstract:Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. Existing acceleration techniques either require extensive model retraining or compromise significantly on sample quality. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. Our framework views multi-core diffusion sampling as an ODE solver pipeline, where slower yet accurate solvers progressively rectify faster solvers through a theoretically justified inter-core communication mechanism. This motivates our multi-core training-free diffusion sampling accelerator, CHORDS, which is compatible with various diffusion samplers, model architectures, and modalities. Through extensive experiments, CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation. This advancement enables CHORDS to establish a solid foundation for real-time, high-fidelity diffusion generation.

[LG-29] Physics-Informed Learning of Proprietary Inverter Models for Grid Dynamic Studies

链接: https://arxiv.org/abs/2507.15259
作者: Kyung-Bin Kwon,Sayak Mukherjee,Ramij R. Hossain,Marcelo Elizondo
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:This letter develops a novel physics-informed neural ordinary differential equations-based framework to emulate the proprietary dynamics of the inverters – essential for improved accuracy in grid dynamic simulations. In current industry practice, the original equipment manufacturers (OEMs) often do not disclose the exact internal controls and parameters of the inverters, posing significant challenges in performing accurate dynamic simulations and other relevant studies, such as gain tunings for stability analysis and controls. To address this, we propose a Physics-Informed Latent Neural ODE Model (PI-LNM) that integrates system physics with neural learning layers to capture the unmodeled behaviors of proprietary units. The proposed method is validated using a grid-forming inverter (GFM) case study, demonstrating improved dynamic simulation accuracy over approaches that rely solely on data-driven learning without physics-based guidance.

[LG-30] Exact Reformulation and Optimization for Direct Metric Optimization in Binary Imbalanced Classification

链接: https://arxiv.org/abs/2507.15240
作者: Le Peng,Yash Travadi,Chuan He,Ying Cui,Ju Sun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:For classification with imbalanced class frequencies, i.e., imbalanced classification (IC), standard accuracy is known to be misleading as a performance measure. While most existing methods for IC resort to optimizing balanced accuracy (i.e., the average of class-wise recalls), they fall short in scenarios where the significance of classes varies or certain metrics should reach prescribed levels. In this paper, we study two key classification metrics, precision and recall, under three practical binary IC settings: fix precision optimize recall (FPOR), fix recall optimize precision (FROP), and optimize F_\beta -score (OFBS). Unlike existing methods that rely on smooth approximations to deal with the indicator function involved, \textitwe introduce, for the first time, exact constrained reformulations for these direct metric optimization (DMO) problems, which can be effectively solved by exact penalty methods. Experiment results on multiple benchmark datasets demonstrate the practical superiority of our approach over the state-of-the-art methods for the three DMO problems. We also expect our exact reformulation and optimization (ERO) framework to be applicable to a wide range of DMO problems for binary IC and beyond. Our code is available at this https URL.

[LG-31] Feature Construction Using Network Control Theory and Rank Encoding for Graph Machine Learning

链接: https://arxiv.org/abs/2507.15195
作者: Anwar Said,Yifan Wei,Ubaid Ullah Ahmad,Mudassir Shabbir,Waseem Abbas,Xenofon Koutsoukos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this article, we utilize the concept of average controllability in graphs, along with a novel rank encoding method, to enhance the performance of Graph Neural Networks (GNNs) in social network classification tasks. GNNs have proven highly effective in various network-based learning applications and require some form of node features to function. However, their performance is heavily influenced by the expressiveness of these features. In social networks, node features are often unavailable due to privacy constraints or the absence of inherent attributes, making it challenging for GNNs to achieve optimal performance. To address this limitation, we propose two strategies for constructing expressive node features. First, we introduce average controllability along with other centrality metrics (denoted as NCT-EFA) as node-level metrics that capture critical aspects of network topology. Building on this, we develop a rank encoding method that transforms average controllability or any other graph-theoretic metric into a fixed-dimensional feature space, thereby improving feature representation. We conduct extensive numerical evaluations using six benchmark GNN models across four social network datasets to compare different node feature construction methods. Our results demonstrate that incorporating average controllability into the feature space significantly improves GNN performance. Moreover, the proposed rank encoding method outperforms traditional one-hot degree encoding, improving the ROC AUC from 68.7% to 73.9% using GraphSAGE on the GitHub Stargazers dataset, underscoring its effectiveness in generating expressive and efficient node representations.

[LG-32] Joint-Local Grounded Action Transformation for Sim-to-Real Transfer in Multi-Agent Traffic Control

链接: https://arxiv.org/abs/2507.15174
作者: Justin Turnau,Longchao Da,Khoa Vo,Ferdous Al Rafi,Shreyas Bachiraju,Tiejin Chen,Hua Wei
类目: Machine Learning (cs.LG)
*备注: This paper was accepted to RLC/RLJ 2025

点击查看摘要

Abstract:Traffic Signal Control (TSC) is essential for managing urban traffic flow and reducing congestion. Reinforcement Learning (RL) offers an adaptive method for TSC by responding to dynamic traffic patterns, with multi-agent RL (MARL) gaining traction as intersections naturally function as coordinated agents. However, due to shifts in environmental dynamics, implementing MARL-based TSC policies in the real world often leads to a significant performance drop, known as the sim-to-real gap. Grounded Action Transformation (GAT) has successfully mitigated this gap in single-agent RL for TSC, but real-world traffic networks, which involve numerous interacting intersections, are better suited to a MARL framework. In this work, we introduce JL-GAT, an application of GAT to MARL-based TSC that balances scalability with enhanced grounding capability by incorporating information from neighboring agents. JL-GAT adopts a decentralized approach to GAT, allowing for the scalability often required in real-world traffic networks while still capturing key interactions between agents. Comprehensive experiments on various road networks under simulated adverse weather conditions, along with ablation studies, demonstrate the effectiveness of JL-GAT. The code is publicly available at this https URL.

[LG-33] Better Models and Algorithms for Learning Ising Models from Dynamics

链接: https://arxiv.org/abs/2507.15173
作者: Jason Gaitonde,Ankur Moitra,Elchanan Mossel
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 49 pages

点击查看摘要

Abstract:We study the problem of learning the structure and parameters of the Ising model, a fundamental model of high-dimensional data, when observing the evolution of an associated Markov chain. A recent line of work has studied the natural problem of learning when observing an evolution of the well-known Glauber dynamics [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Mossel STOC 2024], which provides an arguably more realistic generative model than the classical i.i.d. setting. However, this prior work crucially assumes that all site update attempts are observed, \empheven when this attempt does not change the configuration: this strong observation model is seemingly essential for these approaches. While perhaps possible in restrictive contexts, this precludes applicability to most realistic settings where we can observe \emphonly the stochastic evolution itself, a minimal and natural assumption for any process we might hope to learn from. However, designing algorithms that succeed in this more realistic setting has remained an open problem [Bresler, Gamarnik, Shah, IEEE Trans. Inf. Theory 2018, Gaitonde, Moitra, Mossel, STOC 2025]. In this work, we give the first algorithms that efficiently learn the Ising model in this much more natural observation model that only observes when the configuration changes. For Ising models with maximum degree d , our algorithm recovers the underlying dependency graph in time \mathsfpoly(d)\cdot n^2\log n and then the actual parameters in additional \widetildeO(2^d n) time, which qualitatively matches the state-of-the-art even in the i.i.d. setting in a much weaker observation model. Our analysis holds more generally for a broader class of reversible, single-site Markov chains that also includes the popular Metropolis chain by leveraging more robust properties of reversible Markov chains. Comments: 49 pages Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML) Cite as: arXiv:2507.15173 [cs.LG] (or arXiv:2507.15173v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.15173 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-34] Designing User-Centric Metrics for Evaluation of Counterfactual Explanations

链接: https://arxiv.org/abs/2507.15162
作者: Firdaus Ahmed Choudhury,Ethan Leicht,Jude Ethan Bislig,Hangzhi Guo,Amulya Yadav
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning-based decision models are increasingly being used to make decisions that significantly impact people’s lives, but their opaque nature leaves end users without a clear understanding of why a decision was made. Counterfactual Explanations (CFEs) have grown in popularity as a means of offering actionable guidance by identifying the minimum changes in feature values required to flip a model’s prediction to something more desirable. Unfortunately, most prior research in CFEs relies on artificial evaluation metrics, such as proximity, which may overlook end-user preferences and constraints, e.g., the user’s perception of effort needed to make certain feature changes may differ from that of the model designer. To address this research gap, this paper makes three novel contributions. First, we conduct a pilot study with 20 crowd-workers on Amazon MTurk to experimentally validate the alignment of existing CF evaluation metrics with real-world user preferences. Results show that user-preferred CFEs matched those based on proximity in only 63.81% of cases, highlighting the limited applicability of these metrics in real-world settings. Second, inspired by the need to design a user-informed evaluation metric for CFEs, we conduct a more detailed two-day user study with 41 participants facing realistic credit application scenarios to find experimental support for or against three intuitive hypotheses that may explain how end users evaluate CFEs. Third, based on the findings of this second study, we propose the AWP model, a novel user-centric, two-stage model that describes one possible mechanism by which users evaluate and select CFEs. Our results show that AWP predicts user-preferred CFEs with 84.37% accuracy. Our study provides the first human-centered validation for personalized cost models in CFE generation and highlights the need for adaptive, user-centered evaluation metrics.

[LG-35] Resonant-Tunnelling Diode Reservoir Computing System for Image Recognition

链接: https://arxiv.org/abs/2507.15158
作者: A. H. Abbas,Hend Abdel-Ghani,Ivan S. Maksymov
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:As artificial intelligence continues to push into real-time, edge-based and resource-constrained environments, there is an urgent need for novel, hardware-efficient computational models. In this study, we present and validate a neuromorphic computing architecture based on resonant-tunnelling diodes (RTDs), which exhibit the nonlinear characteristics ideal for physical reservoir computing (RC). We theoretically formulate and numerically implement an RTD-based RC system and demonstrate its effectiveness on two image recognition benchmarks: handwritten digit classification and object recognition using the Fruit~360 dataset. Our results show that this circuit-level architecture delivers promising performance while adhering to the principles of next-generation RC – eliminating random connectivity in favour of a deterministic nonlinear transformation of input signals.

[LG-36] Quantum Machine Learning for Secure Cooperative Multi-Layer Edge AI with Proportional Fairness

链接: https://arxiv.org/abs/2507.15145
作者: Thai T. Vu,John Le
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:This paper proposes a communication-efficient, event-triggered inference framework for cooperative edge AI systems comprising multiple user devices and edge servers. Building upon dual-threshold early-exit strategies for rare-event detection, the proposed approach extends classical single-device inference to a distributed, multi-device setting while incorporating proportional fairness constraints across users. A joint optimization framework is formulated to maximize classification utility under communication, energy, and fairness constraints. To solve the resulting problem efficiently, we exploit the monotonicity of the utility function with respect to the confidence thresholds and apply alternating optimization with Benders decomposition. Experimental results show that the proposed framework significantly enhances system-wide performance and fairness in resource allocation compared to single-device baselines.

[LG-37] ransforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

链接: https://arxiv.org/abs/2507.15132
作者: Joanna Komorniczak
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.

[LG-38] Are We Overlooking the Dimensions? Learning Latent Hierarchical Channel Structure for High-Dimensional Time Series Forecasting

链接: https://arxiv.org/abs/2507.15119
作者: Juntong Ni,Shiyu Wang,Zewen Liu,Xiaoming Shi,Xinyue Zhong,Zhou Ye,Wei Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting (TSF) is a central problem in time series analysis. However, as the number of channels in time series datasets scales to the thousands or more, a scenario we define as High-Dimensional Time Series Forecasting (HDTSF), it introduces significant new modeling challenges that are often not the primary focus of traditional TSF research. HDTSF is challenging because the channel correlation often forms complex and hierarchical patterns. Existing TSF models either ignore these interactions or fail to scale as dimensionality grows. To address this issue, we propose U-Cast, a channel-dependent forecasting architecture that learns latent hierarchical channel structures with an innovative query-based attention. To disentangle highly correlated channel representation, U-Cast adds a full-rank regularization during training. We also release Time-HD, a benchmark of large, diverse, high-dimensional datasets. Our theory shows that exploiting cross-channel information lowers forecasting risk, and experiments on Time-HD demonstrate that U-Cast surpasses strong baselines in both accuracy and efficiency. Together, U-Cast and Time-HD provide a solid basis for future HDTSF research.

[LG-39] Distributional Unlearning: Forgetting Distributions Not Just Samples

链接: https://arxiv.org/abs/2507.15112
作者: Youssef Allouah,Rachid Guerraoui,Sanmi Koyejo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine unlearning seeks to remove unwanted information from trained models, initially at the individual-sample level, but increasingly at the level of entire sub-populations. In many deployments, models must delete whole topical domains to satisfy privacy, legal, or quality requirements, e.g., removing several users’ posts under GDPR or copyrighted web content. Existing unlearning tools remain largely sample-oriented, and straightforward point deletion often leaves enough residual signal for downstream learners to recover the unwanted domain. We introduce distributional unlearning, a data-centric, model-agnostic framework that asks: Given examples from an unwanted distribution and a retained distribution, what is the smallest set of points whose removal makes the edited dataset far from the unwanted domain yet close to the retained one? Using Kullback-Leibler divergence to quantify removal and preservation, we derive the exact Pareto frontier in the Gaussian case and prove that any model retrained on the edited data incurs log-loss shifts bounded by the divergence thresholds. We propose a simple distance-based selection rule satisfying these constraints with a quadratic reduction in deletion budget compared to random removal. Experiments on synthetic Gaussians, Jigsaw Toxic Comments, SMS spam, and CIFAR-10 show 15-72% fewer deletions than random, with negligible impact on retained performance.

[LG-40] Isotonic Quantile Regression Averag ing for uncertainty quantification of electricity price forecasts

链接: https://arxiv.org/abs/2507.15079
作者: Arkadiusz Lipiecki,Bartosz Uniejewski
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP)
*备注: Preprint

点击查看摘要

Abstract:Quantifying the uncertainty of forecasting models is essential to assess and mitigate the risks associated with data-driven decisions, especially in volatile domains such as electricity markets. Machine learning methods can provide highly accurate electricity price forecasts, critical for informing the decisions of market participants. However, these models often lack uncertainty estimates, which limits the ability of decision makers to avoid unnecessary risks. In this paper, we propose a novel method for generating probabilistic forecasts from ensembles of point forecasts, called Isotonic Quantile Regression Averaging (iQRA). Building on the established framework of Quantile Regression Averaging (QRA), we introduce stochastic order constraints to improve forecast accuracy, reliability, and computational costs. In an extensive forecasting study of the German day-ahead electricity market, we show that iQRA consistently outperforms state-of-the-art postprocessing methods in terms of both reliability and sharpness. It produces well-calibrated prediction intervals across multiple confidence levels, providing superior reliability to all benchmark methods, particularly coverage-based conformal prediction. In addition, isotonic regularization decreases the complexity of the quantile regression problem and offers a hyperparameter-free approach to variable selection.

[LG-41] Reinforcement Learning for Flow-Matching Policies

链接: https://arxiv.org/abs/2507.15073
作者: Samuel Pfrommer,Yixiao Huang,Somayeh Sojoudi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow-matching policies have emerged as a powerful paradigm for generalist robotics. These models are trained to imitate an action chunk, conditioned on sensor observations and textual instructions. Often, training demonstrations are generated by a suboptimal policy, such as a human operator. This work explores training flow-matching policies via reinforcement learning to surpass the original demonstration policy performance. We particularly note minimum-time control as a key application and present a simple scheme for variable-horizon flow-matching planning. We then introduce two families of approaches: a simple Reward-Weighted Flow Matching (RWFM) scheme and a Group Relative Policy Optimization (GRPO) approach with a learned reward surrogate. Our policies are trained on an illustrative suite of simulated unicycle dynamics tasks, and we show that both approaches dramatically improve upon the suboptimal demonstrator performance, with the GRPO approach in particular generally incurring between 50% and 85% less cost than a naive Imitation Learning Flow Matching (ILFM) approach.

[LG-42] ROBAD: Robust Adversary-aware Local-Global Attended Bad Actor Detection Sequential Model

链接: https://arxiv.org/abs/2507.15067
作者: Bing He,Mustaque Ahamad,Srijan Kumar
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 15 pages, 12 tables

点击查看摘要

Abstract:Detecting bad actors is critical to ensure the safety and integrity of internet platforms. Several deep learning-based models have been developed to identify such users. These models should not only accurately detect bad actors, but also be robust against adversarial attacks that aim to evade detection. However, past deep learning-based detection models do not meet the robustness requirement because they are sensitive to even minor changes in the input sequence. To address this issue, we focus on (1) improving the model understanding capability and (2) enhancing the model knowledge such that the model can recognize potential input modifications when making predictions. To achieve these goals, we create a novel transformer-based classification model, called ROBAD (RObust adversary-aware local-global attended Bad Actor Detection model), which uses the sequence of user posts to generate user embedding to detect bad actors. Particularly, ROBAD first leverages the transformer encoder block to encode each post bidirectionally, thus building a post embedding to capture the local information at the post level. Next, it adopts the transformer decoder block to model the sequential pattern in the post embeddings by using the attention mechanism, which generates the sequence embedding to obtain the global information at the sequence level. Finally, to enrich the knowledge of the model, embeddings of modified sequences by mimicked attackers are fed into a contrastive-learning-enhanced classification layer for sequence prediction. In essence, by capturing the local and global information (i.e., the post and sequence information) and leveraging the mimicked behaviors of bad actors in training, ROBAD can be robust to adversarial attacks. Extensive experiments on Yelp and Wikipedia datasets show that ROBAD can effectively detect bad actors when under state-of-the-art adversarial attacks.

[LG-43] LibLMFuzz: LLM -Augmented Fuzz Target Generation for Black-box Libraries

链接: https://arxiv.org/abs/2507.15058
作者: Ian Hardgrove,John D. Hastings
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 6 pages, 2 figures, 1 table, 2 listings

点击查看摘要

Abstract:A fundamental problem in cybersecurity and computer science is determining whether a program is free of bugs and vulnerabilities. Fuzzing, a popular approach to discovering vulnerabilities in programs, has several advantages over alternative strategies, although it has investment costs in the form of initial setup and continuous maintenance. The choice of fuzzing is further complicated when only a binary library is available, such as the case of closed-source and proprietary software. In response, we introduce LibLMFuzz, a framework that reduces costs associated with fuzzing closed-source libraries by pairing an agentic Large Language Model (LLM) with a lightweight tool-chain (disassembler/compiler/fuzzer) to autonomously analyze stripped binaries, plan fuzz strategies, generate drivers, and iteratively self-repair build or runtime errors. Tested on four widely-used Linux libraries, LibLMFuzz produced syntactically correct drivers for all 558 fuzz-able API functions, achieving 100% API coverage with no human intervention. Across the 1601 synthesized drivers, 75.52% were nominally correct on first execution. The results show that LLM-augmented middleware holds promise in reducing the costs of fuzzing black box components and provides a foundation for future research efforts. Future opportunities exist for research in branch coverage.

[LG-44] Clustered Federated Learning for Generalizable FDIA Detection in Smart Grids with Heterogeneous Data

链接: https://arxiv.org/abs/2507.14999
作者: Yunfeng Li,Junhong Liu,Zhaohui Yang,Guofu Liao,Chuyun Zhang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 10 pages,6 figures

点击查看摘要

Abstract:False Data Injection Attacks (FDIAs) pose severe security risks to smart grids by manipulating measurement data collected from spatially distributed devices such as SCADA systems and PMUs. These measurements typically exhibit Non-Independent and Identically Distributed (Non-IID) characteristics across different regions, which significantly challenges the generalization ability of detection models. Traditional centralized training approaches not only face privacy risks and data sharing constraints but also incur high transmission costs, limiting their scalability and deployment feasibility. To address these issues, this paper proposes a privacy-preserving federated learning framework, termed Federated Cluster Average (FedClusAvg), designed to improve FDIA detection in Non-IID and resource-constrained environments. FedClusAvg incorporates cluster-based stratified sampling and hierarchical communication (client-subserver-server) to enhance model generalization and reduce communication overhead. By enabling localized training and weighted parameter aggregation, the algorithm achieves accurate model convergence without centralizing sensitive data. Experimental results on benchmark smart grid datasets demonstrate that FedClusAvg not only improves detection accuracy under heterogeneous data distributions but also significantly reduces communication rounds and bandwidth consumption. This work provides an effective solution for secure and efficient FDIA detection in large-scale distributed power systems.

[LG-45] FedWCM: Unleashing the Potential of Momentum-based Federated Learning in Long-Tailed Scenarios

链接: https://arxiv.org/abs/2507.14980
作者: Tianle Li,Yongzhi Huang,Linshan Jiang,Qipeng Xie,Chang Liu,Wenfeng Du,Lu Wang,Kaishun Wu
类目: Machine Learning (cs.LG)
*备注: ICPP, including appendix

点击查看摘要

Abstract:Federated Learning (FL) enables decentralized model training while preserving data privacy. Despite its benefits, FL faces challenges with non-identically distributed (non-IID) data, especially in long-tailed scenarios with imbalanced class samples. Momentum-based FL methods, often used to accelerate FL convergence, struggle with these distributions, resulting in biased models and making FL hard to converge. To understand this challenge, we conduct extensive investigations into this phenomenon, accompanied by a layer-wise analysis of neural network behavior. Based on these insights, we propose FedWCM, a method that dynamically adjusts momentum using global and per-round data to correct directional biases introduced by long-tailed distributions. Extensive experiments show that FedWCM resolves non-convergence issues and outperforms existing methods, enhancing FL’s efficiency and effectiveness in handling client heterogeneity and data imbalance.

[LG-46] FullRecall: A Semantic Search-Based Ranking Approach for Maximizing Recall in Patent Retrieval

链接: https://arxiv.org/abs/2507.14946
作者: Amna Ali,Liyanage C. De Silva,Pg Emeroylariffion Abas
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Patent examiners and inventors face significant pressure to verify the originality and non-obviousness of inventions, and the intricate nature of patent data intensifies the challenges of patent retrieval. Therefore, there is a pressing need to devise cutting-edge retrieval strategies that can reliably achieve the desired recall. This study introduces FullRecall, a novel patent retrieval approach that effectively manages the complexity of patent data while maintaining the reliability of relevance matching and maximising recall. It leverages IPC-guided knowledge to generate informative phrases, which are processed to extract key information in the form of noun phrases characterising the query patent under observation. From these, the top k keyphrases are selected to construct a query for retrieving a focused subset of the dataset. This initial retrieval step achieves complete recall, successfully capturing all relevant documents. To further refine the results, a ranking scheme is applied to the retrieved subset, reducing its size while maintaining 100% recall. This multi-phase process demonstrates an effective strategy for balancing precision and recall in patent retrieval tasks. Comprehensive experiments were conducted, and the results were compared with baseline studies, namely HRR2 [1] and ReQ-ReC [2]. The proposed approach yielded superior results, achieving 100% recall in all five test cases. However, HRR2[1] recall values across the five test cases were 10%, 25%, 33.3%, 0%, and 14.29%, while ReQ-ReC [2] showed 50% for the first test case, 25% for the second test case, and 0% for the third, fourth, and fifth test cases. The 100% recall ensures that no relevant prior art is overlooked, thereby strengthening the patent pre-filing and examination processes, hence reducing potential legal risks.

[LG-47] Old Rules in a New Game: Mapping Uncertainty Quantification to Quantum Machine Learning

链接: https://arxiv.org/abs/2507.14919
作者: Maximilian Wendlinger,Kilian Tscharke,Pascal Debus
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:One of the key obstacles in traditional deep learning is the reduction in model transparency caused by increasingly intricate model functions, which can lead to problems such as overfitting and excessive confidence in predictions. With the advent of quantum machine learning offering possible advances in computational power and latent space complexity, we notice the same opaque behavior. Despite significant research in classical contexts, there has been little advancement in addressing the black-box nature of quantum machine learning. Consequently, we approach this gap by building upon existing work in classical uncertainty quantification and initial explorations in quantum Bayesian modeling to theoretically develop and empirically evaluate techniques to map classical uncertainty quantification methods to the quantum machine learning domain. Our findings emphasize the necessity of leveraging classical insights into uncertainty quantification to include uncertainty awareness in the process of designing new quantum machine learning models.

[LG-48] A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic Encryption

链接: https://arxiv.org/abs/2507.14853
作者: Khoa Nguyen,Tanveer Khan,Antonis Michalas
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training without sharing raw data, making it a promising approach for privacy-sensitive domains. Despite its potential, FL faces significant challenges, particularly in terms of communication overhead and data privacy. Privacy-preserving Techniques (PPTs) such as Homomorphic Encryption (HE) have been used to mitigate these concerns. However, these techniques introduce substantial computational and communication costs, limiting their practical deployment. In this work, we explore how Hybrid Homomorphic Encryption (HHE), a cryptographic protocol that combines symmetric encryption with HE, can be effectively integrated with FL to address both communication and privacy challenges, paving the way for scalable and secure decentralized learning system.

[LG-49] me-Aware Attention for Enhanced Electronic Health Records Modeling

链接: https://arxiv.org/abs/2507.14847
作者: Junhan Yu,Zhunyi Feng,Junwei Lu,Tianxi Cai,Doudou Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic Health Records (EHR) contain valuable clinical information for predicting patient outcomes and guiding healthcare decisions. However, effectively modeling Electronic Health Records (EHRs) requires addressing data heterogeneity and complex temporal patterns. Standard approaches often struggle with irregular time intervals between clinical events. We propose TALE-EHR, a Transformer-based framework featuring a novel time-aware attention mechanism that explicitly models continuous temporal gaps to capture fine-grained sequence dynamics. To complement this temporal modeling with robust semantics, TALE-EHR leverages embeddings derived from standardized code descriptions using a pre-trained Large Language Model (LLM), providing a strong foundation for understanding clinical concepts. Experiments on the MIMIC-IV and PIC dataset demonstrate that our approach outperforms state-of-the-art baselines on tasks such as disease progression forecasting. TALE-EHR underscores the benefit of integrating explicit, continuous temporal modeling with strong semantic representations provides a powerful solution for advancing EHR analysis.

[LG-50] Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts COLT2025

链接: https://arxiv.org/abs/2507.14835
作者: Pan Peng,Hangyu Xu
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: COLT 2025

点击查看摘要

Abstract:We study the problem of releasing a differentially private (DP) synthetic graph G’ that well approximates the triangle-motif sizes of all cuts of any given graph G , where a motif in general refers to a frequently occurring subgraph within complex networks. Non-private versions of such graphs have found applications in diverse fields such as graph clustering, graph sparsification, and social network analysis. Specifically, we present the first (\varepsilon,\delta) -DP mechanism that, given an input graph G with n vertices, m edges and local sensitivity of triangles \ell_3(G) , generates a synthetic graph G’ in polynomial time, approximating the triangle-motif sizes of all cuts (S,V\setminus S) of the input graph G up to an additive error of \tildeO(\sqrtm\ell_3(G)n/\varepsilon^3/2) . Additionally, we provide a lower bound of \Omega(\sqrtmn\ell_3(G)/\varepsilon) on the additive error for any DP algorithm that answers the triangle-motif size queries of all (S,T) -cut of G . Finally, our algorithm generalizes to weighted graphs, and our lower bound extends to any K_h -motif cut for any constant h\geq 2 .

[LG-51] Rethinking Memorization Measures and their Implications in Large Language Models

链接: https://arxiv.org/abs/2507.14777
作者: Bishwamittra Ghosh,Soumi Das,Qinyuan Wu,Mohammad Aflah Khan,Krishna P. Gummadi,Evimaria Terzi,Deepak Garg
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Concerned with privacy threats, memorization in LLMs is often seen as undesirable, specifically for learning. In this paper, we study whether memorization can be avoided when optimally learning a language, and whether the privacy threat posed by memorization is exaggerated or not. To this end, we re-examine existing privacy-focused measures of memorization, namely recollection-based and counterfactual memorization, along with a newly proposed contextual memorization. Relating memorization to local over-fitting during learning, contextual memorization aims to disentangle memorization from the contextual learning ability of LLMs. Informally, a string is contextually memorized if its recollection due to training exceeds the optimal contextual recollection, a learned threshold denoting the best contextual learning without training. Conceptually, contextual recollection avoids the fallacy of recollection-based memorization, where any form of high recollection is a sign of memorization. Theoretically, contextual memorization relates to counterfactual memorization, but imposes stronger conditions. Memorization measures differ in outcomes and information requirements. Experimenting on 18 LLMs from 6 families and multiple formal languages of different entropy, we show that (a) memorization measures disagree on memorization order of varying frequent strings, (b) optimal learning of a language cannot avoid partial memorization of training strings, and © improved learning decreases contextual and counterfactual memorization but increases recollection-based memorization. Finally, (d) we revisit existing reports of memorized strings by recollection that neither pose a privacy threat nor are contextually or counterfactually memorized. Comments: Preprint Subjects: Machine Learning (cs.LG) Cite as: arXiv:2507.14777 [cs.LG] (or arXiv:2507.14777v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.14777 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-52] Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints

链接: https://arxiv.org/abs/2507.14768
作者: Zhou Li,Xiang Zhang,Jiawen Lv,Jihao Fan,Haiqiang Chen,Giuseppe Caire
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: accepted by 2025 IEEE Information Theory Workshop

点击查看摘要

Abstract:Motivated by federated learning (FL), secure aggregation (SA) aims to securely compute, as efficiently as possible, the sum of a set of inputs distributed across many users. To understand the impact of network topology, hierarchical secure aggregation (HSA) investigated the communication and secret key generation efficiency in a 3-layer relay network, where clusters of users are connected to the aggregation server through an intermediate layer of relays. Due to the pre-aggregation of the messages at the relays, HSA reduces the communication burden on the relay-to-server links and is able to support a large number of users. However, as the number of users increases, a practical challenge arises from heterogeneous security requirements–for example, users in different clusters may require varying levels of input protection. Motivated by this, we study weakly-secure HSA (WS-HSA) with collusion resilience, where instead of protecting all the inputs from any set of colluding users, only the inputs belonging to a predefined collection of user groups (referred to as security input sets) need to be protected against another predefined collection of user groups (referred to as collusion sets). Since the security input sets and collusion sets can be arbitrarily defined, our formulation offers a flexible framework for addressing heterogeneous security requirements in HSA. We characterize the optimal total key rate, i.e., the total number of independent key symbols required to ensure both server and relay security, for a broad range of parameter configurations. For the remaining cases, we establish lower and upper bounds on the optimal key rate, providing constant-factor gap optimality guarantees.

[LG-53] Pruning Increases Orderedness in Recurrent Computation ICML2025

链接: https://arxiv.org/abs/2507.14747
作者: Yiding Song
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 8 pages, 11 figures, 2 tables, Workshop on Methods and Opportunities at Small Scale (MOSS), ICML 2025

点击查看摘要

Abstract:Inspired by the prevalence of recurrent circuits in biological brains, we investigate the degree to which directionality is a helpful inductive bias for artificial neural networks. Taking directionality as topologically-ordered information flow between neurons, we formalise a perceptron layer with all-to-all connections (mathematically equivalent to a weight-tied recurrent neural network) and demonstrate that directionality, a hallmark of modern feed-forward networks, can be induced rather than hard-wired by applying appropriate pruning techniques. Across different random seeds our pruning schemes successfully induce greater topological ordering in information flow between neurons without compromising performance, suggesting that directionality is not a prerequisite for learning, but may be an advantageous inductive bias discoverable by gradient descent and sparsification.

[LG-54] Sampling from Gaussian Processes: A Tutorial and Applications in Global Sensitivity Analysis and Optimization

链接: https://arxiv.org/abs/2507.14746
作者: Bach Do,Nafeezat A. Ajenifuja,Taiwo A. Adebiyi,Ruda Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:High-fidelity simulations and physical experiments are essential for engineering analysis and design. However, their high cost often limits their applications in two critical tasks: global sensitivity analysis (GSA) and optimization. This limitation motivates the common use of Gaussian processes (GPs) as proxy regression models to provide uncertainty-aware predictions based on a limited number of high-quality observations. GPs naturally enable efficient sampling strategies that support informed decision-making under uncertainty by extracting information from a subset of possible functions for the model of interest. Despite their popularity in machine learning and statistics communities, sampling from GPs has received little attention in the community of engineering optimization. In this paper, we present the formulation and detailed implementation of two notable sampling methods – random Fourier features and pathwise conditioning – for generating posterior samples from GPs. Alternative approaches are briefly described. Importantly, we detail how the generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization. We show successful applications of these sampling methods through a series of numerical examples.

[LG-55] Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML

链接: https://arxiv.org/abs/2507.14744
作者: Mustafa Cavus,Jan N. van Rijn,Przemysław Biecek
类目: Machine Learning (cs.LG)
*备注: Accepted at 28th International Conference on Discovery Science 2025

点击查看摘要

Abstract:Automated machine learning systems efficiently streamline model selection but often focus on a single best-performing model, overlooking explanation uncertainty, an essential concern in human centered explainable AI. To address this, we propose a novel framework that incorporates model multiplicity into explanation generation by aggregating partial dependence profiles (PDP) from a set of near optimal models, known as the Rashomon set. The resulting Rashomon PDP captures interpretive variability and highlights areas of disagreement, providing users with a richer, uncertainty aware view of feature effects. To evaluate its usefulness, we introduce two quantitative metrics, the coverage rate and the mean width of confidence intervals, to evaluate the consistency between the standard PDP and the proposed Rashomon PDP. Experiments on 35 regression datasets from the OpenML CTR23 benchmark suite show that in most cases, the Rashomon PDP covers less than 70% of the best model’s PDP, underscoring the limitations of single model explanations. Our findings suggest that Rashomon PDP improves the reliability and trustworthiness of model interpretations by adding additional information that would otherwise be neglected. This is particularly useful in high stakes domains where transparency and confidence are critical.

[LG-56] Better Training Data Attribution via Better Inverse Hessian-Vector Products

链接: https://arxiv.org/abs/2507.14740
作者: Andrew Wang,Elisa Nguyen,Runshi Yang,Juhan Bae,Sheila A. McIlraith,Roger Grosse
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 4 figures

点击查看摘要

Abstract:Training data attribution (TDA) provides insights into which training data is responsible for a learned model behavior. Gradient-based TDA methods such as influence functions and unrolled differentiation both involve a computation that resembles an inverse Hessian-vector product (iHVP), which is difficult to approximate efficiently. We introduce an algorithm (ASTRA) which uses the EKFAC-preconditioner on Neumann series iterations to arrive at an accurate iHVP approximation for TDA. ASTRA is easy to tune, requires fewer iterations than Neumann series iterations, and is more accurate than EKFAC-based approximations. Using ASTRA, we show that improving the accuracy of the iHVP approximation can significantly improve TDA performance.

[LG-57] Balancing Expressivity and Robustness: Constrained Rational Activations for Reinforcement Learning

链接: https://arxiv.org/abs/2507.14736
作者: Rafał Surdej,Michał Bortkiewicz,Alex Lewandowski,Mateusz Ostaszewski,Clare Lyle
类目: Machine Learning (cs.LG)
*备注: Accepted for oral presentation at CoLLAs 2025

点击查看摘要

Abstract:Trainable activation functions, whose parameters are optimized alongside network weights, offer increased expressivity compared to fixed activation functions. Specifically, trainable activation functions defined as ratios of polynomials (rational functions) have been proposed to enhance plasticity in reinforcement learning. However, their impact on training stability remains unclear. In this work, we study trainable rational activations in both reinforcement and continual learning settings. We find that while their flexibility enhances adaptability, it can also introduce instability, leading to overestimation in RL and feature collapse in longer continual learning scenarios. Our main result is demonstrating a trade-off between expressivity and plasticity in rational activations. To address this, we propose a constrained variant that structurally limits excessive output scaling while preserving adaptability. Experiments across MetaWorld and DeepMind Control Suite (DMC) environments show that our approach improves training stability and performance. In continual learning benchmarks, including MNIST with reshuffled labels and Split CIFAR-100, we reveal how different constraints affect the balance between expressivity and long-term retention. While preliminary experiments in discrete action domains (e.g., Atari) did not show similar instability, this suggests that the trade-off is particularly relevant for continuous control. Together, our findings provide actionable design principles for robust and adaptable trainable activations in dynamic, non-stationary environments. Code available at: this https URL.

[LG-58] Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems

链接: https://arxiv.org/abs/2507.14715
作者: Rachid Karami,Rajeev Patwari,Hyoukjun Kwon,Ashish Sirasao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoC, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD’s latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.

[LG-59] Forecasting Faculty Placement from Patterns in Co-authorship Networks

链接: https://arxiv.org/abs/2507.14696
作者: Samantha Dies,David Liu,Tina Eliassi-Rad
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Faculty hiring shapes the flow of ideas, resources, and opportunities in academia, influencing not only individual career trajectories but also broader patterns of institutional prestige and scientific progress. While traditional studies have found strong correlations between faculty hiring and attributes such as doctoral department prestige and publication record, they rarely assess whether these associations generalize to individual hiring outcomes, particularly for future candidates outside the original sample. Here, we consider faculty placement as an individual-level prediction task. Our data consist of temporal co-authorship networks with conventional attributes such as doctoral department prestige and bibliometric features. We observe that using the co-authorship network significantly improves predictive accuracy by up to 10% over traditional indicators alone, with the largest gains observed for placements at the most elite (top-10) departments. Our results underscore the role that social networks, professional endorsements, and implicit advocacy play in faculty hiring beyond traditional measures of scholarly productivity and institutional prestige. By introducing a predictive framing of faculty placement and establishing the benefit of considering co-authorship networks, this work provides a new lens for understanding structural biases in academia that could inform targeted interventions aimed at increasing transparency, fairness, and equity in academic hiring practices.

[LG-60] Revisiting Graph Contrastive Learning on Anomaly Detection: A Structural Imbalance Perspective AAAI2025

链接: https://arxiv.org/abs/2507.14677
作者: Yiming Xu,Zhen Peng,Bin Shi,Xu Hua,Bo Dong,Song Wang,Chen Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI2025

点击查看摘要

Abstract:The superiority of graph contrastive learning (GCL) has prompted its application to anomaly detection tasks for more powerful risk warning systems. Unfortunately, existing GCL-based models tend to excessively prioritize overall detection performance while neglecting robustness to structural imbalance, which can be problematic for many real-world networks following power-law degree distributions. Particularly, GCL-based methods may fail to capture tail anomalies (abnormal nodes with low degrees). This raises concerns about the security and robustness of current anomaly detection algorithms and therefore hinders their applicability in a variety of realistic high-risk scenarios. To the best of our knowledge, research on the robustness of graph anomaly detection to structural imbalance has received little scrutiny. To address the above issues, this paper presents a novel GCL-based framework named AD-GCL. It devises the neighbor pruning strategy to filter noisy edges for head nodes and facilitate the detection of genuine tail nodes by aligning from head nodes to forged tail nodes. Moreover, AD-GCL actively explores potential neighbors to enlarge the receptive field of tail nodes through anomaly-guided neighbor completion. We further introduce intra- and inter-view consistency loss of the original and augmentation graph for enhanced representation. The performance evaluation of the whole, head, and tail nodes on multiple datasets validates the comprehensive superiority of the proposed AD-GCL in detecting both head anomalies and tail anomalies.

[LG-61] Rec-AD: An Efficient Computation Framework for FDIA Detection Based on Tensor Train Decomposition and Deep Learning Recommendation Model

链接: https://arxiv.org/abs/2507.14668
作者: Yunfeng Li,Junhong Liu,Zhaohui Yang,Guofu Liao,Chuyun Zhang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 14 figures

点击查看摘要

Abstract:Deep learning models have been widely adopted for False Data Injection Attack (FDIA) detection in smart grids due to their ability to capture unstructured and sparse features. However, the increasing system scale and data dimensionality introduce significant computational and memory burdens, particularly in large-scale industrial datasets, limiting detection efficiency. To address these issues, this paper proposes Rec-AD, a computationally efficient framework that integrates Tensor Train decomposition with the Deep Learning Recommendation Model (DLRM). Rec-AD enhances training and inference efficiency through embedding compression, optimized data access via index reordering, and a pipeline training mechanism that reduces memory communication overhead. Fully compatible with PyTorch, Rec-AD can be integrated into existing FDIA detection systems without code modifications. Experimental results show that Rec-AD significantly improves computational throughput and real-time detection performance, narrowing the attack window and increasing attacker cost. These advancements strengthen edge computing capabilities and scalability, providing robust technical support for smart grid security.

[LG-62] Learning to Communicate in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

链接: https://arxiv.org/abs/2507.14658
作者: Faizan Contractor,Li Li,Ranwa Al Mallah
类目: Multiagent Systems (cs.MA); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Popular methods in cooperative Multi-Agent Reinforcement Learning with partially observable environments typically allow agents to act independently during execution, which may limit the coordinated effect of the trained policies. However, by sharing information such as known or suspected ongoing threats, effective communication can lead to improved decision-making in the cyber battle space. We propose a game design where defender agents learn to communicate and defend against imminent cyber threats by playing training games in the Cyber Operations Research Gym, using the Differentiable Inter Agent Learning algorithm adapted to the cyber operational environment. The tactical policies learned by these autonomous agents are akin to those of human experts during incident responses to avert cyber threats. In addition, the agents simultaneously learn minimal cost communication messages while learning their defence tactical policies.

[LG-63] Agent ic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks: A Survey on Generative Approaches

链接: https://arxiv.org/abs/2507.14633
作者: Xiaozheng Gao,Yichen Wang,Bosen Liu,Xiao Zhou,Ruichen Zhang,Jiacheng Wang,Dusit Niyato,Dong In Kim,Abbas Jamalipour,Chau Yuen,Jianping An,Kai Yang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The development of satellite-augmented low-altitude economy and terrestrial networks (SLAETNs) demands intelligent and autonomous systems that can operate reliably across heterogeneous, dynamic, and mission-critical environments. To address these challenges, this survey focuses on enabling agentic artificial intelligence (AI), that is, artificial agents capable of perceiving, reasoning, and acting, through generative AI (GAI) and large language models (LLMs). We begin by introducing the architecture and characteristics of SLAETNs, and analyzing the challenges that arise in integrating satellite, aerial, and terrestrial components. Then, we present a model-driven foundation by systematically reviewing five major categories of generative models: variational autoencoders (VAEs), generative adversarial networks (GANs), generative diffusion models (GDMs), transformer-based models (TBMs), and LLMs. Moreover, we provide a comparative analysis to highlight their generative mechanisms, capabilities, and deployment trade-offs within SLAETNs. Building on this foundation, we examine how these models empower agentic functions across three domains: communication enhancement, security and privacy protection, and intelligent satellite tasks. Finally, we outline key future directions for building scalable, adaptive, and trustworthy generative agents in SLAETNs. This survey aims to provide a unified understanding and actionable reference for advancing agentic AI in next-generation integrated networks.

[LG-64] k-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation

链接: https://arxiv.org/abs/2507.14631
作者: Daniel Greenhut,Dan Feldman
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Given an integer k\geq1 and a set P of n points in \REAL^d , the classic k -PCA (Principle Component Analysis) approximates the affine \emph k -subspace mean of P , which is the k -dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ( \ell_2,2 -norm) over the points of P , i.e., the mean of these distances. The \emph k -subspace median is the subspace that minimizes its sum of (non-squared) Euclidean distances ( \ell_2,1 -mixed norm), i.e., their median. The median subspace is usually more sparse and robust to noise/outliers than the mean, but also much harder to approximate since, unlike the \ell_z,z (non-mixed) norms, it is non-convex for kd-1 . We provide the first polynomial-time deterministic algorithm whose both running time and approximation factor are not exponential in k . More precisely, the multiplicative approximation factor is \sqrtd , and the running time is polynomial in the size of the input. We expect that our technique would be useful for many other related problems, such as \ell_2,z norm of distances for z\not \in \br1,2 , e.g., z=\infty , and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided. Subjects: Machine Learning (cs.LG); Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2507.14631 [cs.LG] (or arXiv:2507.14631v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2507.14631 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-65] Understanding Matching Mechanisms in Cross-Encoders SIGIR25

链接: https://arxiv.org/abs/2507.14604
作者: Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at Workshop on Explainability in Information Retrieval at SIGIR 25 (WExIR25)

点击查看摘要

Abstract:Neural IR architectures, particularly cross-encoders, are highly effective models whose internal mechanisms are mostly unknown. Most works trying to explain their behavior focused on high-level processes (e.g., what in the input influences the prediction, does the model adhere to known IR axioms) but fall short of describing the matching process. Instead of Mechanistic Interpretability approaches which specifically aim at explaining the hidden mechanisms of neural models, we demonstrate that more straightforward methods can already provide valuable insights. In this paper, we first focus on the attention process and extract causal insights highlighting the crucial roles of some attention heads in this process. Second, we provide an interpretation of the mechanism underlying matching detection.

[LG-66] Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games

链接: https://arxiv.org/abs/2507.14529
作者: Berkay Anahtarci,Can Deha Kariksiz,Naci Saldi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We consider the maximum causal entropy inverse reinforcement learning problem for infinite-horizon stationary mean-field games, in which we model the unknown reward function within a reproducing kernel Hilbert space. This allows the inference of rich and potentially nonlinear reward structures directly from expert demonstrations, in contrast to most existing inverse reinforcement learning approaches for mean-field games that typically restrict the reward function to a linear combination of a fixed finite set of basis functions. We also focus on the infinite-horizon cost structure, whereas prior studies primarily rely on finite-horizon formulations. We introduce a Lagrangian relaxation to this maximum causal entropy inverse reinforcement learning problem that enables us to reformulate it as an unconstrained log-likelihood maximization problem, and obtain a solution \lkvia a gradient ascent algorithm. To illustrate the theoretical consistency of the algorithm, we establish the smoothness of the log-likelihood objective by proving the Fréchet differentiability of the related soft Bellman operators with respect to the parameters in the reproducing kernel Hilbert space. We demonstrate the effectiveness of our method on a mean-field traffic routing game, where it accurately recovers expert behavior.

[LG-67] Positive-Unlabeled Learning for Control Group Construction in Observational Causal Inference KDD2025

链接: https://arxiv.org/abs/2507.14528
作者: Ilias Tsoumas,Dimitrios Bormpoudakis,Vasileios Sitokonstantinou,Athanasios Askitopoulos,Andreas Kalogeras,Charalampos Kontoes,Ioannis Athanasiadis
类目: Machine Learning (cs.LG)
*备注: Accepted at KDD 2025 Workshop on Causal Inference and Machine Learning in Practice

点击查看摘要

Abstract:In causal inference, whether through randomized controlled trials or observational studies, access to both treated and control units is essential for estimating the effect of a treatment on an outcome of interest. When treatment assignment is random, the average treatment effect (ATE) can be estimated directly by comparing outcomes between groups. In non-randomized settings, various techniques are employed to adjust for confounding and approximate the counterfactual scenario to recover an unbiased ATE. A common challenge, especially in observational studies, is the absence of units clearly labeled as controls-that is, units known not to have received the treatment. To address this, we propose positive-unlabeled (PU) learning as a framework for identifying, with high confidence, control units from a pool of unlabeled ones, using only the available treated (positive) units. We evaluate this approach using both simulated and real-world data. We construct a causal graph with diverse relationships and use it to generate synthetic data under various scenarios, assessing how reliably the method recovers control groups that allow estimates of true ATE. We also apply our approach to real-world data on optimal sowing and fertilizer treatments in sustainable agriculture. Our findings show that PU learning can successfully identify control (negative) units from unlabeled data based only on treated units and, through the resulting control group, estimate an ATE that closely approximates the true value. This work has important implications for observational causal inference, especially in fields where randomized experiments are difficult or costly. In domains such as earth, environmental, and agricultural sciences, it enables a plethora of quasi-experiments by leveraging available earth observation and climate data, particularly when treated units are available but control units are lacking.

[LG-68] Glitches in Decision Tree Ensemble Models

链接: https://arxiv.org/abs/2507.14492
作者: Satyankar Chandra,Ashutosh Gupta,Kaushik Mallik,Krishna Shankaranarayanan,Namrita Varshney
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many critical decision-making tasks are now delegated to machine-learned models, and it is imperative that their decisions are trustworthy and reliable, and their outputs are consistent across similar inputs. We identify a new source of unreliable behaviors-called glitches-which may significantly impair the reliability of AI models having steep decision boundaries. Roughly speaking, glitches are small neighborhoods in the input space where the model’s output abruptly oscillates with respect to small changes in the input. We provide a formal definition of glitches, and use well-known models and datasets from the literature to demonstrate that they have widespread existence and argue they usually indicate potential model inconsistencies in the neighborhood of where they are found. We proceed to the algorithmic search of glitches for widely used gradient-boosted decision tree (GBDT) models. We prove that the problem of detecting glitches is NP-complete for tree ensembles, already for trees of depth 4. Our glitch-search algorithm for GBDT models uses an MILP encoding of the problem, and its effectiveness and computational feasibility are demonstrated on a set of widely used GBDT benchmarks taken from the literature.

[LG-69] Numerical Artifacts in Learning Dynamical Systems

链接: https://arxiv.org/abs/2507.14491
作者: Bing-Ze Lu,Richard Tsai
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many applications, one needs to learn a dynamical system from its solutions sampled at a finite number of time points. The learning problem is often formulated as an optimization problem over a chosen function class. However, in the optimization procedure, it is necessary to employ a numerical scheme to integrate candidate dynamical systems and assess how their solutions fit the data. This paper reveals potentially serious effects of a chosen numerical scheme on the learning outcome. In particular, our analysis demonstrates that a damped oscillatory system may be incorrectly identified as having “anti-damping” and exhibiting a reversed oscillation direction, despite adequately fitting the given data points. Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG) Cite as: arXiv:2507.14491 [math.NA] (or arXiv:2507.14491v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2507.14491 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-70] Federated Reinforcement Learning in Heterogeneous Environments

链接: https://arxiv.org/abs/2507.14487
作者: Ukjo Hwang,Songnam Hong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate a Federated Reinforcement Learning with Environment Heterogeneity (FRL-EH) framework, where local environments exhibit statistical heterogeneity. Within this framework, agents collaboratively learn a global policy by aggregating their collective experiences while preserving the privacy of their local trajectories. To better reflect real-world scenarios, we introduce a robust FRL-EH framework by presenting a novel global objective function. This function is specifically designed to optimize a global policy that ensures robust performance across heterogeneous local environments and their plausible perturbations. We propose a tabular FRL algorithm named FedRQ and theoretically prove its asymptotic convergence to an optimal policy for the global objective function. Furthermore, we extend FedRQ to environments with continuous state space through the use of expectile loss, addressing the key challenge of minimizing a value function over a continuous subset of the state space. This advancement facilitates the seamless integration of the principles of FedRQ with various Deep Neural Network (DNN)-based RL algorithms. Extensive empirical evaluations validate the effectiveness and robustness of our FRL algorithms across diverse heterogeneous environments, consistently achieving superior performance over the existing state-of-the-art FRL algorithms.

[LG-71] ReDiSC: A Reparameterized Masked Diffusion Model for Scalable Node Classification with Structured Predictions

链接: https://arxiv.org/abs/2507.14484
作者: Yule Li,Yifeng Lu,Zhen Wang,Zhewei Wei,Yaliang Li,Bolin Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, graph neural networks (GNN) have achieved unprecedented successes in node classification tasks. Although GNNs inherently encode specific inductive biases (e.g., acting as low-pass or high-pass filters), most existing methods implicitly assume conditional independence among node labels in their optimization objectives. While this assumption is suitable for traditional classification tasks such as image recognition, it contradicts the intuitive observation that node labels in graphs remain correlated, even after conditioning on the graph structure. To make structured predictions for node labels, we propose ReDiSC, namely, Reparameterized masked Diffusion model for Structured node Classification. ReDiSC estimates the joint distribution of node labels using a reparameterized masked diffusion model, which is learned through the variational expectation-maximization (EM) framework. Our theoretical analysis shows the efficiency advantage of ReDiSC in the E-step compared to DPM-SNC, a state-of-the-art model that relies on a manifold-constrained diffusion model in continuous domain. Meanwhile, we explicitly link ReDiSC’s M-step objective to popular GNN and label propagation hybrid approaches. Extensive experiments demonstrate that ReDiSC achieves superior or highly competitive performance compared to state-of-the-art GNN, label propagation, and diffusion-based baselines across both homophilic and heterophilic graphs of varying sizes. Notably, ReDiSC scales effectively to large-scale datasets on which previous structured diffusion methods fail due to computational constraints, highlighting its significant practical advantage in structured node classification tasks.

[LG-72] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness

链接: https://arxiv.org/abs/2507.14446
作者: Feng Liu,Ying Liu,Carson Eisenach
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.

[LG-73] Development and Deployment of Hybrid ML Models for Critical Heat Flux Prediction in Annulus Geometries

链接: https://arxiv.org/abs/2507.14332
作者: Aidan Furlong,Xingang Zhao,Robert Salko,Xu Wu
类目: Machine Learning (cs.LG)
*备注: Accepted for inclusion in Transactions of the American Nuclear Society for the 2025 ANS Winter Conference

点击查看摘要

Abstract:Accurate prediction of critical heat flux (CHF) is an essential component of safety analysis in pressurized and boiling water reactors. To support reliable prediction of this quantity, several empirical correlations and lookup tables have been constructed from physical experiments over the past several decades. With the onset of accessible machine learning (ML) frameworks, multiple initiatives have been established with the goal of predicting CHF more accurately than these traditional methods. While purely data-driven surrogate modeling has been extensively investigated, these approaches lack interpretability, lack resilience to data scarcity, and have been developed mostly using data from tube experiments. As a result, bias-correction hybrid approaches have become increasingly popular, which correct initial “low-fidelity” estimates provided by deterministic base models by using ML-predicted residuals. This body of work has mostly considered round tube geometries; annular geometry-specific ML models have not yet been deployed in thermal hydraulic codes. This study developed, deployed, and validated four ML models to predict CHF in annular geometries using the CTF subchannel code. Three empirical correlation models, Biasi, Bowring, and Katto, were used as base models for comparison. The ML models were trained and tested using 577 experimental annulus data points from four datasets: Becker, Beus, Janssen, and Mortimore. Baseline CHF predictions were obtained from the empirical correlations, with mean relative errors above 26%. The ML-driven models achieved mean relative errors below 3.5%, with no more than one point exceeding the 10% error envelope. In all cases, the hybrid ML models significantly outperformed their empirical counterparts.

[LG-74] Rethinking Individual Fairness in Deepfake Detection ACM-MM2025

链接: https://arxiv.org/abs/2507.14326
作者: Aryana Hou,Li Lin,Justin Li,Shu Hu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: This paper has been accepted by ACM MM 2025

点击查看摘要

Abstract:Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake markers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate it, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at this https URL.

[LG-75] FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

链接: https://arxiv.org/abs/2507.14322
作者: Md Rafid Haque,Abu Raihan Mostofa Kamal,Md. Azam Hossain
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 24 pages, 8 figures. This work is intended for a journal submission

点击查看摘要

Abstract:Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. While numerous static defenses exist, their effectiveness is highly context-dependent, often failing against adaptive adversaries or in heterogeneous data environments. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem. We design a lightweight contextual bandit agent that dynamically selects the optimal aggregation rule from an arsenal of defenses based on real-time diagnostic metrics. Through comprehensive experiments, we demonstrate that no single static rule is universally optimal. We show that our adaptive agent successfully learns superior policies across diverse scenarios, including a ``Krum-favorable" environment and against a sophisticated “stealth” adversary designed to neutralize specific diagnostic signals. Critically, we analyze the paradoxical scenario where a non-robust baseline achieves high but compromised accuracy, and demonstrate that our agent learns a conservative policy to prioritize model integrity. Furthermore, we prove the agent’s policy is controllable via a single “risk tolerance” parameter, allowing practitioners to explicitly manage the trade-off between performance and security. Our work provides a new, practical, and analyzable approach to creating resilient and intelligent decentralized AI systems.

[LG-76] Linearized Diffusion Map

链接: https://arxiv.org/abs/2507.14257
作者: Julio Candanedo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the Linearized Diffusion Map (LDM), a novel linear dimensionality reduction method constructed via a linear approximation of the diffusion-map kernel. LDM integrates the geometric intuition of diffusion-based nonlinear methods with the computational simplicity, efficiency, and interpretability inherent in linear embeddings such as PCA and classical MDS. Through comprehensive experiments on synthetic datasets (Swiss roll and hyperspheres) and real-world benchmarks (MNIST and COIL-20), we illustrate that LDM captures distinct geometric features of datasets compared to PCA, offering complementary advantages. Specifically, LDM embeddings outperform PCA in datasets exhibiting explicit manifold structures, particularly in high-dimensional regimes, whereas PCA remains preferable in scenarios dominated by variance or noise. Furthermore, the complete positivity of LDM’s kernel matrix allows direct applicability of Non-negative Matrix Factorization (NMF), suggesting opportunities for interpretable latent-structure discovery. Our analysis positions LDM as a valuable new linear dimensionality reduction technique with promising theoretical and practical extensions.

[LG-77] Mining Voter Behaviour and Confidence: A Rule-Based Analysis of the 2022 U.S. Elections

链接: https://arxiv.org/abs/2507.14236
作者: Md Al Jubair,Mohammad Shamsul Arefin,Ahmed Wasif Reza
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the relationship between voter trust and their experiences during elections by applying a rule-based data mining technique to the 2022 Survey of the Performance of American Elections (SPAE). Using the Apriori algorithm and setting parameters to capture meaningful associations (support = 3%, confidence = 60%, and lift 1.5), the analysis revealed a strong connection between demographic attributes and voting-related challenges, such as registration hurdles, accessibility issues, and queue times. For instance, respondents who indicated that accessing polling stations was “very easy” and who reported moderate confidence were found to be over six times more likely (lift = 6.12) to trust their county’s election outcome and experience no registration issues. A further analysis, which adjusted the support threshold to 2%, specifically examined patterns among minority voters. It revealed that 98.16 percent of Black voters who reported easy access to polling locations also had smooth registration experiences. Additionally, those who had high confidence in the vote-counting process were almost two times as likely to identify as Democratic Party supporters. These findings point to the important role that enhancing voting access and offering targeted support can play in building trust in the electoral system, particularly among marginalized communities.

[LG-78] Geometry-Aware Active Learning of Pattern Rankings via Choquet-Based Aggregation

链接: https://arxiv.org/abs/2507.14217
作者: Tudor Matei Opran,Samir Loudni
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:We address the pattern explosion problem in pattern mining by proposing an interactive learning framework that combines nonlinear utility aggregation with geometry-aware query selection. Our method models user preferences through a Choquet integral over multiple interestingness measures and exploits the geometric structure of the version space to guide the selection of informative comparisons. A branch-and-bound strategy with tight distance bounds enables efficient identification of queries near the decision boundary. Experiments on UCI datasets show that our approach outperforms existing methods such as ChoquetRank, achieving better ranking accuracy with fewer user interactions.

[LG-79] Developing an AI-Guided Assistant Device for the Deaf and Hearing Impaired

链接: https://arxiv.org/abs/2507.14215
作者: Jiayu(Jerry)Liu
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This study aims to develop a deep learning system for an accessibility device for the deaf or hearing impaired. The device will accurately localize and identify sound sources in real time. This study will fill an important gap in current research by leveraging machine learning techniques to target the underprivileged community. The system includes three main components. 1. JerryNet: A custom designed CNN architecture that determines the direction of arrival (DoA) for nine possible directions. 2. Audio Classification: This model is based on fine-tuning the Contrastive Language-Audio Pretraining (CLAP) model to identify the exact sound classes only based on audio. 3. Multimodal integration model: This is an accurate sound localization model that combines audio, visual, and text data to locate the exact sound sources in the images. The part consists of two modules, one object detection using Yolov9 to generate all the bounding boxes of the objects, and an audio visual localization model to identify the optimal bounding box using complete Intersection over Union (CIoU). The hardware consists of a four-microphone rectangular formation and a camera mounted on glasses with a wristband for displaying necessary information like direction. On a custom collected data set, JerryNet achieved a precision of 91. 1% for the sound direction, outperforming all the baseline models. The CLAP model achieved 98.5% and 95% accuracy on custom and AudioSet datasets, respectively. The audio-visual localization model within component 3 yielded a cIoU of 0.892 and an AUC of 0.658, surpassing other similar models. There are many future potentials to this study, paving the way to creating a new generation of accessibility devices.

[LG-80] Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection

链接: https://arxiv.org/abs/2507.14176
作者: Andrés Morales-Forero(1),Lili J. Rueda(2),Ronald Herrera(3),Samuel Bassetto(1),Eric Coatanea(4) ((1) Polytechnique Montréal, (2) Universidad El Bosque, (3) Boehringer Ingelheim International GmbH, (4) Tampere University)
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems increasingly inform medical decision-making, yet concerns about algorithmic bias and inequitable outcomes persist, particularly for historically marginalized populations. This paper introduces the concept of Predictive Representativity (PR), a framework of fairness auditing that shifts the focus from the composition of the data set to outcomes-level equity. Through a case study in dermatology, we evaluated AI-based skin cancer classifiers trained on the widely used HAM10000 dataset and on an independent clinical dataset (BOSQUE Test set) from Colombia. Our analysis reveals substantial performance disparities by skin phototype, with classifiers consistently underperforming for individuals with darker skin, despite proportional sampling in the source data. We argue that representativity must be understood not as a static feature of datasets but as a dynamic, context-sensitive property of model predictions. PR operationalizes this shift by quantifying how reliably models generalize fairness across subpopulations and deployment contexts. We further propose an External Transportability Criterion that formalizes the thresholds for fairness generalization. Our findings highlight the ethical imperative for post-hoc fairness auditing, transparency in dataset documentation, and inclusive model validation pipelines. This work offers a scalable tool for diagnosing structural inequities in AI systems, contributing to discussions on equity, interpretability, and data justice and fostering a critical re-evaluation of fairness in data-driven healthcare.

[LG-81] ACS: An interactive framework for conformal selection

链接: https://arxiv.org/abs/2507.15825
作者: Yu Gui,Ying Jin,Yash Nair,Zhimei Ren
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper presents adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Candès, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.

[LG-82] Hypergraphs on high dimensional time series sets using signature transform

链接: https://arxiv.org/abs/2507.15802
作者: Rémi Vaucher,Paul Minchella
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: Accepted at GSI25 conference. Pending publication in Springer proceedings

点击查看摘要

Abstract:In recent decades, hypergraphs and their analysis through Topological Data Analysis (TDA) have emerged as powerful tools for understanding complex data structures. Various methods have been developed to construct hypergraphs – referred to as simplicial complexes in the TDA framework – over datasets, enabling the formation of edges between more than two vertices. This paper addresses the challenge of constructing hypergraphs from collections of multivariate time series. While prior work has focused on the case of a single multivariate time series, we extend this framework to handle collections of such time series. Our approach generalizes the method proposed in Chretien and al. by leveraging the properties of signature transforms to introduce controlled randomness, thereby enhancing the robustness of the construction process. We validate our method on synthetic datasets and present promising results.

[LG-83] Conformal and kNN Predictive Uncertainty Quantification Algorithms in Metric Spaces

链接: https://arxiv.org/abs/2507.15741
作者: Gábor Lugosi,Marcos Matabuena
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper introduces a framework for uncertainty quantification in regression models defined in metric spaces. Leveraging a newly defined notion of homoscedasticity, we develop a conformal prediction algorithm that offers finite-sample coverage guarantees and fast convergence rates of the oracle estimator. In heteroscedastic settings, we forgo these non-asymptotic guarantees to gain statistical efficiency, proposing a local k --nearest–neighbor method without conformal calibration that is adaptive to the geometry of each particular nonlinear space. Both procedures work with any regression algorithm and are scalable to large data sets, allowing practitioners to plug in their preferred models and incorporate domain expertise. We prove consistency for the proposed estimators under minimal conditions. Finally, we demonstrate the practical utility of our approach in personalized–medicine applications involving random response objects such as probability distributions and graph Laplacians.

[LG-84] Information Preserving Line Search via Bayesian Optimization

链接: https://arxiv.org/abs/2507.15485
作者: Robin Labryga,Tomislav Prusina,Sören Laue
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Accepted for publication at: LION 19: Learning and Intelligent Optimization, 19th International Conference, Prague, 2025 (Springer LNCS). This is the preprint version (DOI to be added when available)

点击查看摘要

Abstract:Line search is a fundamental part of iterative optimization methods for unconstrained and bound-constrained optimization problems to determine suitable step lengths that provide sufficient improvement in each iteration. Traditional line search methods are based on iterative interval refinement, where valuable information about function value and gradient is discarded in each iteration. We propose a line search method via Bayesian optimization, preserving and utilizing otherwise discarded information to improve step-length choices. Our approach is guaranteed to converge and shows superior performance compared to state-of-the-art methods based on empirical tests on the challenging unconstrained and bound-constrained optimization problems from the CUTEst test set.

[LG-85] On exploration of an interior mirror descent flow for stochastic nonconvex constrained problem

链接: https://arxiv.org/abs/2507.15264
作者: Kuangyu Ding,Kim-Chuan Toh
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 34 Pages

点击查看摘要

Abstract:We study a nonsmooth nonconvex optimization problem defined over nonconvex constraints, where the feasible set is given by the intersection of the closure of an open set and a smooth manifold. By endowing the open set with a Riemannian metric induced by a barrier function, we obtain a Riemannian subgradient flow formulated as a differential inclusion, which remains strictly within the interior of the feasible set. This continuous dynamical system unifies two classes of iterative optimization methods, namely the Hessian barrier method and mirror descent scheme, by revealing that these methods can be interpreted as discrete approximations of the continuous flow. We explore the long-term behavior of the trajectories generated by this dynamical system and show that the existing deficient convergence properties of the Hessian barrier and mirror descent scheme can be unifily and more insightfully interpreted through these of the continuous trajectory. For instance, the notorious spurious stationary points \citechen2024spurious observed in Hessian barrier method and mirror descent scheme are interpreted as stable equilibria of the dynamical system that do not correspond to real stationary points of the original optimization problem. We provide two sufficient condition such that these spurious stationary points can be avoided if the strict complementarity conditions holds. In the absence of these regularity condition, we propose a random perturbation strategy that ensures the trajectory converges (subsequentially) to an approximate stationary point. Building on these insights, we introduce two iterative Riemannian subgradient methods, form of interior point methods, that generalizes the existing Hessian barrier method and mirror descent scheme for solving nonsmooth nonconvex optimization problems.

[LG-86] Accelerated Bayesian Optimal Experimental Design via Conditional Density Estimation and Informative Data

链接: https://arxiv.org/abs/2507.15235
作者: Miao Huang,Hongqiao Wang,Kunyu Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Design of Experiments (DOEs) is a fundamental scientific methodology that provides researchers with systematic principles and techniques to enhance the validity, reliability, and efficiency of experimental outcomes. In this study, we explore optimal experimental design within a Bayesian framework, utilizing Bayes’ theorem to reformulate the utility expectation–originally expressed as a nested double integral–into an independent double integral form, significantly improving numerical efficiency. To further accelerate the computation of the proposed utility expectation, conditional density estimation is employed to approximate the ratio of two Gaussian random fields, while covariance serves as a selection criterion to identify informative datasets during model fitting and integral evaluation. In scenarios characterized by low simulation efficiency and high costs of raw data acquisition, key challenges such as surrogate modeling, failure probability estimation, and parameter inference are systematically restructured within the Bayesian experimental design framework. The effectiveness of the proposed methodology is validated through both theoretical analysis and practical applications, demonstrating its potential for enhancing experimental efficiency and decision-making under uncertainty.

[LG-87] Robust and Differentially Private PCA for non-Gaussian data

链接: https://arxiv.org/abs/2507.15232
作者: Minwoo Kim,Sungkyu Jung
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 38 pages, 6 figures

点击查看摘要

Abstract:Recent advances have sparked significant interest in the development of privacy-preserving Principal Component Analysis (PCA). However, many existing approaches rely on restrictive assumptions, such as assuming sub-Gaussian data or being vulnerable to data contamination. Additionally, some methods are computationally expensive or depend on unknown model parameters that must be estimated, limiting their accessibility for data analysts seeking privacy-preserving PCA. In this paper, we propose a differentially private PCA method applicable to heavy-tailed and potentially contaminated data. Our approach leverages the property that the covariance matrix of properly rescaled data preserves eigenvectors and their order under elliptical distributions, which include Gaussian and heavy-tailed distributions. By applying a bounded transformation, we enable straightforward computation of principal components in a differentially private manner. Additionally, boundedness guarantees robustness against data contamination. We conduct both theoretical analysis and empirical evaluations of the proposed method, focusing on its ability to recover the subspace spanned by the leading principal components. Extensive numerical experiments demonstrate that our method consistently outperforms existing approaches in terms of statistical utility, particularly in non-Gaussian or contaminated data settings.

[LG-88] Misspecifying non-compensatory as compensatory IRT: analysis of estimated skills and variance

链接: https://arxiv.org/abs/2507.15222
作者: Hiroshi Tamano,Hideitsu Hino,Daichi Mochihashi
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multidimensional item response theory is a statistical test theory used to estimate the latent skills of learners and the difficulty levels of problems based on test results. Both compensatory and non-compensatory models have been proposed in the literature. Previous studies have revealed the substantial underestimation of higher skills when the non-compensatory model is misspecified as the compensatory model. However, the underlying mechanism behind this phenomenon has not been fully elucidated. It remains unclear whether overestimation also occurs and whether issues arise regarding the variance of the estimated parameters. In this paper, we aim to provide a comprehensive understanding of both underestimation and overestimation through a theoretical approach. In addition to the previously identified underestimation of the skills, we newly discover that the overestimation of skills occurs around the origin. Furthermore, we investigate the extent to which the asymptotic variance of the estimated parameters differs when considering model misspecification compared to when it is not taken into account.

[LG-89] Graph Attention Networks for Detecting Epilepsy from EEG Signals Using Accessible Hardware in Low-Resource Settings

链接: https://arxiv.org/abs/2507.15118
作者: Szymon Mazurek,Stephen Moore,Alessandro Crimi
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Goal: Epilepsy remains under-diagnosed in low-income countries due to scarce neurologists and costly diagnostic tools. We propose a graph-based deep learning framework to detect epilepsy from low-cost Electroencephalography (EEG) hardware, tested on recordings from Nigeria and Guinea-Bissau. Our focus is on fair, accessible automatic assessment and explainability to shed light on epilepsy biomarkers. Methods: We model EEG signals as spatio-temporal graphs, classify them, and identify interchannel relationships and temporal dynamics using graph attention networks (GAT). To emphasize connectivity biomarkers, we adapt the inherently node-focused GAT to analyze edges. We also designed signal preprocessing for low-fidelity recordings and a lightweight GAT architecture trained on Google Colab and deployed on RaspberryPi devices. Results: The approach achieves promising classification performance, outperforming a standard classifier based on random forest and graph convolutional networks in terms of accuracy and robustness over multiple sessions, but also highlighting specific connections in the fronto-temporal region. Conclusions: The results highlight the potential of GATs to provide insightful and scalable diagnostic support for epilepsy in underserved regions, paving the way for affordable and accessible neurodiagnostic tools.

[LG-90] Learning under Latent Group Sparsity via Diffusion on Networks

链接: https://arxiv.org/abs/2507.15097
作者: Subhroshekhar Ghosh,Soumendu Sundar Mukherjee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 49 pages, 4 figures, 2 tables; this submission subsumes the earlier preprint arXiv:2201.08326

点击查看摘要

Abstract:Group or cluster structure on explanatory variables in machine learning problems is a very general phenomenon, which has attracted broad interest from practitioners and theoreticians alike. In this work we contribute an approach to sparse learning under such group structure, that does not require prior information on the group identities. Our paradigm is motivated by the Laplacian geometry of an underlying network with a related community structure, and proceeds by directly incorporating this into a penalty that is effectively computed via a heat-flow-based local network dynamics. The proposed penalty interpolates between the lasso and the group lasso penalties, the runtime of the heat-flow dynamics being the interpolating parameter. As such it can automatically default to lasso when the group structure reflected in the Laplacian is weak. In fact, we demonstrate a data-driven procedure to construct such a network based on the available data. Notably, we dispense with computationally intensive pre-processing involving clustering of variables, spectral or otherwise. Our technique is underpinned by rigorous theorems that guarantee its effective performance and provide bounds on its sample complexity. In particular, in a wide range of settings, it provably suffices to run the diffusion for time that is only logarithmic in the problem dimensions. We explore in detail the interfaces of our approach with key statistical physics models in network science, such as the Gaussian Free Field and the Stochastic Block Model. Our work raises the possibility of applying similar diffusion-based techniques to classical learning tasks, exploiting the interplay between geometric, dynamical and stochastic structures underlying the data.

[LG-91] Simulation-Prior Independent Neural Unfolding Procedure

链接: https://arxiv.org/abs/2507.15084
作者: Anja Butter,Theo Heimel,Nathan Huetsch,Michael Kagan,Tilman Plehn
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:Machine learning allows unfolding high-dimensional spaces without binning at the LHC. The new SPINUP method extracts the unfolded distribution based on a neural network encoding the forward mapping, making it independent of the prior from the simulated training data. It is made efficient through neural importance sampling, and ensembling can be used to estimate the effect of information loss in the forward process. We showcase SPINUP for unfolding detector effects on jet substructure observables and for unfolding to parton level of associated Higgs and single-top production.

[LG-92] Quantum Annealing for Machine Learning: Applications in Feature Selection Instance Selection and Clustering

链接: https://arxiv.org/abs/2507.15063
作者: Chloe Pomeroy,Aleksandar Pramov,Karishma Thakrar,Lakshmi Yendapalli
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the applications of quantum annealing (QA) and classical simulated annealing (SA) to a suite of combinatorial optimization problems in machine learning, namely feature selection, instance selection, and clustering. We formulate each task as a Quadratic Unconstrained Binary Optimization (QUBO) problem and implement both quantum and classical solvers to compare their effectiveness. For feature selection, we propose several QUBO configurations that balance feature importance and redundancy, showing that quantum annealing (QA) produces solutions that are computationally more efficient. In instance selection, we propose a few novel heuristics for instance-level importance measures that extend existing methods. For clustering, we embed a classical-to-quantum pipeline, using classical clustering followed by QUBO-based medoid refinement, and demonstrate consistent improvements in cluster compactness and retrieval metrics. Our results suggest that QA can be a competitive and efficient tool for discrete machine learning optimization, even within the constraints of current quantum hardware.

[LG-93] Integrating Newtons Laws with deep learning for enhanced physics-informed compound flood modelling

链接: https://arxiv.org/abs/2507.15021
作者: Soheil Radfar,Faezeh Maghsoodifar,Hamed Moftakhari,Hamid Moradkhani
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Coastal communities increasingly face compound floods, where multiple drivers like storm surge, high tide, heavy rainfall, and river discharge occur together or in sequence to produce impacts far greater than any single driver alone. Traditional hydrodynamic models can provide accurate physics-based simulations but require substantial computational resources for real-time applications or risk assessments, while machine learning alternatives often sacrifice physical consistency for speed, producing unrealistic predictions during extreme events. This study addresses these challenges by developing ALPINE (All-in-one Physics Informed Neural Emulator), a physics-informed neural network (PINN) framework to enforce complete shallow water dynamics in compound flood modeling. Unlike previous approaches that implement partial constraints, our framework simultaneously enforces mass conservation and both momentum equations, ensuring full adherence to Newton’s laws throughout the prediction process. The model integrates a convolutional encoder-decoder architecture with ConvLSTM temporal processing, trained using a composite loss function that balances data fidelity with physics-based residuals. Using six historical storm events (four for training, one for validation, and one held-out for unseen testing), we observe substantial improvements over baseline neural networks. ALPINE reduces domain-averaged prediction errors and improves model skill metrics for water surface elevation and velocity components. Physics-informed constraints prove most valuable during peak storm intensity, when multiple flood drivers interact and reliable predictions matter most. This approach yields a physically consistent emulator capable of supporting compound-flood forecasting and large-scale risk analyses while preserving physical realism essential for coastal emergency management.

[LG-94] ransaction Profiling and Address Role Inference in Tokenized U.S. Treasuries

链接: https://arxiv.org/abs/2507.14808
作者: Junliang Luo,Katrin Tinn,Samuel Ferreira Duran,Di Wu,Xue Liu
类目: Computational Finance (q-fin.CP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tokenized U.S. Treasuries have emerged as a prominent subclass of real-world assets (RWAs), offering cryptographically enforced, yield-bearing instruments collateralized by sovereign debt and deployed across multiple blockchain networks. While the market has expanded rapidly, empirical analyses of transaction-level behaviour remain limited. This paper conducts a quantitative, function-level dissection of U.S. Treasury-backed RWA tokens including BUIDL, BENJI, and USDY, across multi-chain: mostly Ethereum and Layer-2s. We analyze decoded contract calls to isolate core functional primitives such as issuance, redemption, transfer, and bridge activity, revealing segmentation in behaviour between institutional actors and retail users. To model address-level economic roles, we introduce a curvature-aware representation learning framework using Poincaré embeddings and liquidity-based graph features. Our method outperforms baseline models on our RWA Treasury dataset in role inference and generalizes to downstream tasks such as anomaly detection and wallet classification in broader blockchain transaction networks. These findings provide a structured understanding of functional heterogeneity and participant roles in tokenized Treasury in a transaction-level perspective, contributing new empirical evidence to the study of on-chain financialization.

[LG-95] Uncertainty Quantification for Machine Learning-Based Prediction: A Polynomial Chaos Expansion Approach for Joint Model and Input Uncertainty Propagation

链接: https://arxiv.org/abs/2507.14782
作者: Xiaoping Du
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Physics (math-ph); Computation (stat.CO)
*备注: This manuscript has been submitted to Multidisciplinary and Structural Optimization

点击查看摘要

Abstract:Machine learning (ML) surrogate models are increasingly used in engineering analysis and design to replace computationally expensive simulation models, significantly reducing computational cost and accelerating decision-making processes. However, ML predictions contain inherent errors, often estimated as model uncertainty, which is coupled with variability in model inputs. Accurately quantifying and propagating these combined uncertainties is essential for generating reliable engineering predictions. This paper presents a robust framework based on Polynomial Chaos Expansion (PCE) to handle joint input and model uncertainty propagation. While the approach applies broadly to general ML surrogates, we focus on Gaussian Process regression models, which provide explicit predictive distributions for model uncertainty. By transforming all random inputs into a unified standard space, a PCE surrogate model is constructed, allowing efficient and accurate calculation of the mean and standard deviation of the output. The proposed methodology also offers a mechanism for global sensitivity analysis, enabling the accurate quantification of the individual contributions of input variables and ML model uncertainty to the overall output variability. This approach provides a computationally efficient and interpretable framework for comprehensive uncertainty quantification, supporting trustworthy ML predictions in downstream engineering applications.

[LG-96] When few labeled target data suffice: a theory of semi-supervised domain adaptation via fine-tuning from multiple adaptive starts

链接: https://arxiv.org/abs/2507.14661
作者: Wooseok Ha,Yuansi Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Semi-supervised domain adaptation (SSDA) aims to achieve high predictive performance in the target domain with limited labeled target data by exploiting abundant source and unlabeled target data. Despite its significance in numerous applications, theory on the effectiveness of SSDA remains largely unexplored, particularly in scenarios involving various types of source-target distributional shifts. In this work, we develop a theoretical framework based on structural causal models (SCMs) which allows us to analyze and quantify the performance of SSDA methods when labeled target data is limited. Within this framework, we introduce three SSDA methods, each having a fine-tuning strategy tailored to a distinct assumption about the source and target relationship. Under each assumption, we demonstrate how extending an unsupervised domain adaptation (UDA) method to SSDA can achieve minimax-optimal target performance with limited target labels. When the relationship between source and target data is only vaguely known – a common practical concern – we propose the Multi Adaptive-Start Fine-Tuning (MASFT) algorithm, which fine-tunes UDA models from multiple starting points and selects the best-performing one based on a small hold-out target validation dataset. Combined with model selection guarantees, MASFT achieves near-optimal target predictive performance across a broad range of types of distributional shifts while significantly reducing the need for labeled target data. We empirically validate the effectiveness of our proposed methods through simulations.

[LG-97] Accelerating Hamiltonian Monte Carlo for Bayesian Inference in Neural Networks and Neural Operators

链接: https://arxiv.org/abs/2507.14652
作者: Ponkrshnan Thiagarajan,Tamer A. Zaki,Michael D. Shields
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Hamiltonian Monte Carlo (HMC) is a powerful and accurate method to sample from the posterior distribution in Bayesian inference. However, HMC techniques are computationally demanding for Bayesian neural networks due to the high dimensionality of the network’s parameter space and the non-convexity of their posterior distributions. Therefore, various approximation techniques, such as variational inference (VI) or stochastic gradient MCMC, are often employed to infer the posterior distribution of the network parameters. Such approximations introduce inaccuracies in the inferred distributions, resulting in unreliable uncertainty estimates. In this work, we propose a hybrid approach that combines inexpensive VI and accurate HMC methods to efficiently and accurately quantify uncertainties in neural networks and neural operators. The proposed approach leverages an initial VI training on the full network. We examine the influence of individual parameters on the prediction uncertainty, which shows that a large proportion of the parameters do not contribute substantially to uncertainty in the network predictions. This information is then used to significantly reduce the dimension of the parameter space, and HMC is performed only for the subset of network parameters that strongly influence prediction uncertainties. This yields a framework for accelerating the full batch HMC for posterior inference in neural networks. We demonstrate the efficiency and accuracy of the proposed framework on deep neural networks and operator networks, showing that inference can be performed for large networks with tens to hundreds of thousands of parameters. We show that this method can effectively learn surrogates for complex physical systems by modeling the operator that maps from upstream conditions to wall-pressure data on a cone in hypersonic flow.

[LG-98] Deep Learning-Based Survival Analysis with Copula-Based Activation Functions for Multivariate Response Prediction

链接: https://arxiv.org/abs/2507.14641
作者: Jong-Min Kim,Il Do Ha,Sangjin Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research integrates deep learning, copula functions, and survival analysis to effectively handle highly correlated and right-censored multivariate survival data. It introduces copula-based activation functions (Clayton, Gumbel, and their combinations) to model the nonlinear dependencies inherent in such data. Through simulation studies and analysis of real breast cancer data, our proposed CNN-LSTM with copula-based activation functions for multivariate multi-types of survival responses enhances prediction accuracy by explicitly addressing right-censored data and capturing complex patterns. The model’s performance is evaluated using Shewhart control charts, focusing on the average run length (ARL).

[LG-99] KinForm: Kinetics Informed Feature Optimised Representation Models for Enzyme k_cat and K_M Prediction

链接: https://arxiv.org/abs/2507.14639
作者: Saleh Alwer,Ronan Fleming
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kinetic parameters such as the turnover number ( k_cat ) and Michaelis constant ( K_\mathrmM ) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean-pooled residue embeddings from a single protein language model to represent the protein. We present KinForm, a machine learning framework designed to improve predictive accuracy and generalisation for kinetic parameters by optimising protein feature representations. KinForm combines several residue-level embeddings (Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer layers and applies weighted pooling based on per-residue binding-site probability. To counter the resulting high dimensionality, we apply dimensionality reduction using principal–component analysis (PCA) on concatenated protein features, and rebalance the training data via a similarity-based oversampling strategy. KinForm outperforms baseline methods on two benchmark datasets. Improvements are most pronounced in low sequence similarity bins. We observe improvements from binding-site probability pooling, intermediate-layer selection, PCA, and oversampling of low-identity proteins. We also find that removing sequence overlap between folds provides a more realistic evaluation of generalisation and should be the standard over random splitting when benchmarking kinetic prediction models.

[LG-100] Learning Stochastic Hamiltonian Systems via Stochastic Generating Function Neural Network

链接: https://arxiv.org/abs/2507.14467
作者: Chen Chen,Lijin Wang,Yanzhao Cao,Xupeng Cheng
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we propose a novel neural network model for learning stochastic Hamiltonian systems (SHSs) from observational data, termed the stochastic generating function neural network (SGFNN). SGFNN preserves symplectic structure of the underlying stochastic Hamiltonian system and produces symplectic predictions. Our model utilizes the autoencoder framework to identify the randomness of the latent system by the encoder network, and detects the stochastic generating function of the system through the decoder network based on the random variables extracted from the encoder. Symplectic predictions can then be generated by the stochastic generating function. Numerical experiments are performed on several stochastic Hamiltonian systems, varying from additive to multiplicative, and from separable to non-separable SHSs with single or multiple noises. Compared with the benchmark stochastic flow map learning (sFML) neural network, our SGFNN model exhibits higher accuracy across various prediction metrics, especially in long-term predictions, with the property of maintaining the symplectic structure of the underlying SHSs.

[LG-101] MENO: Hybrid Matrix Exponential-based Neural Operator for Stiff ODEs. Application to Thermochemical Kinetics

链接: https://arxiv.org/abs/2507.14341
作者: Ivan Zanardi,Simone Venturi,Marco Panesi
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce MENO (‘‘Matrix Exponential-based Neural Operator’’), a hybrid surrogate modeling framework for efficiently solving stiff systems of ordinary differential equations (ODEs) that exhibit a sparse nonlinear structure. In such systems, only a few variables contribute nonlinearly to the dynamics, while the majority influence the equations linearly. MENO exploits this property by decomposing the system into two components: the low-dimensional nonlinear part is modeled using conventional neural operators, while the linear time-varying subsystem is integrated using a novel neural matrix exponential formulation. This approach combines the exact solution of linear time-invariant systems with learnable, time-dependent graph-based corrections applied to the linear operators. Unlike black-box or soft-constrained physics-informed (PI) models, MENO embeds the governing equations directly into its architecture, ensuring physical consistency (e.g., steady states), improved robustness, and more efficient training. We validate MENO on three complex thermochemical systems: the POLLU atmospheric chemistry model, an oxygen mixture in thermochemical nonequilibrium, and a collisional-radiative argon plasma in one- and two-dimensional shock-tube simulations. MENO achieves relative errors below 2% in trained zero-dimensional settings and maintains good accuracy in extrapolatory multidimensional regimes. It also delivers substantial computational speedups, achieving up to 4 800 \times on GPU and 185 \times on CPU compared to standard implicit ODE solvers. Although intrusive by design, MENO’s physics-based architecture enables superior generalization and reliability, offering a scalable path for real-time simulation of stiff reactive systems.

[LG-102] opological Social Choice: Designing a Noise-Robust Polar Distance for Persistence Diagrams

链接: https://arxiv.org/abs/2507.14340
作者: Athanasios Andrikopoulos,Nikolaos Sampanis
类目: Algebraic Topology (math.AT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 26 pages,2 figures

点击查看摘要

Abstract:Topological Data Analysis (TDA) has emerged as a powerful framework for extracting robust and interpretable features from noisy high-dimensional data. In the context of Social Choice Theory, where preference profiles and collective decisions are geometrically rich yet sensitive to perturbations, TDA remains largely unexplored. This work introduces a novel conceptual bridge between these domains by proposing a new metric framework for persistence diagrams tailored to noisy preference this http URL define a polar coordinate-based distance that captures both the magnitude and orientation of topological features in a smooth and differentiable manner. Our metric addresses key limitations of classical distances, such as bottleneck and Wasserstein, including instability under perturbation, lack of continuity, and incompatibility with gradient-based learning. The resulting formulation offers improved behavior in both theoretical and applied this http URL the best of our knowledge, this is the first study to systematically apply persistent homology to social choice systems, providing a mathematically grounded method for comparing topological summaries of voting structures and preference dynamics. We demonstrate the superiority of our approach through extensive experiments, including robustness tests and supervised learning tasks, and we propose a modular pipeline for building predictive models from online preference data. This work contributes a conceptually novel and computationally effective tool to the emerging interface of topology and decision theory, opening new directions in interpretable machine learning for political and economic systems.

[LG-103] A universal augmentation framework for long-range electrostatics in machine learning interatomic potentials

链接: https://arxiv.org/abs/2507.14302
作者: Dongjin Kim,Xiaoyu Wang,Peichen Zhong,Daniel S. King,Theo Jaffrelot Inizan,Bingqing Cheng
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Most current machine learning interatomic potentials (MLIPs) rely on short-range approximations, without explicit treatment of long-range electrostatics. To address this, we recently developed the Latent Ewald Summation (LES) method, which infers electrostatic interactions, polarization, and Born effective charges (BECs), just by learning from energy and force training data. Here, we present LES as a standalone library, compatible with any short-range MLIP, and demonstrate its integration with methods such as MACE, NequIP, CACE, and CHGNet. We benchmark LES-enhanced models on distinct systems, including bulk water, polar dipeptides, and gold dimer adsorption on defective substrates, and show that LES not only captures correct electrostatics but also improves accuracy. Additionally, we scale LES to large and chemically diverse data by training MACELES-OFF on the SPICE set containing molecules and clusters, making a universal MLIP with electrostatics for organic systems including biomolecules. MACELES-OFF is more accurate than its short-range counterpart (MACE-OFF) trained on the same dataset, predicts dipoles and BECs reliably, and has better descriptions of bulk liquids. By enabling efficient long-range electrostatics without directly training on electrical properties, LES paves the way for electrostatic foundation MLIPs.

[LG-104] Diffusion-based translation between unpaired spontaneous premature neonatal EEG and fetal MEG

链接: https://arxiv.org/abs/2507.14224
作者: Benoît Brebion,Alban Gallard,Katrin Sippel,Amer Zaylaa,Hubert Preissl,Sahar Moghimi,Fabrice Wallois,Yaël Frégier
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background and objective: Brain activity in premature newborns has traditionally been studied using electroencephalography (EEG), leading to substantial advances in our understanding of early neural development. However, since brain development takes root at the fetal stage, a critical window of this process remains largely unknown. The only technique capable of recording neural activity in the intrauterine environment is fetal magnetoencephalography (fMEG), but this approach presents challenges in terms of data quality and scarcity. Using artificial intelligence, the present research aims to transfer the well-established knowledge from EEG studies to fMEG to improve understanding of prenatal brain development, laying the foundations for better detection and treatment of potential pathologies. Methods: We developed an unpaired diffusion translation method based on dual diffusion bridges, which notably includes numerical integration improvements to obtain more qualitative results at a lower computational cost. Models were trained on our unpaired dataset of bursts of spontaneous activity from 30 high-resolution premature newborns EEG recordings and 44 fMEG recordings. Results: We demonstrate that our method achieves significant improvement upon previous results obtained with Generative Adversarial Networks (GANs), by almost 5% on the mean squared error in the time domain, and completely eliminating the mode collapse problem in the frequency domain, thus achieving near-perfect signal fidelity. Conclusion: We set a new state of the art in the EEG-fMEG unpaired translation problem, as our developed tool completely paves the way for early brain activity analysis. Overall, we also believe that our method could be reused for other unpaired signal translation applications.

[LG-105] Advanced Space Mapping Technique Integrating a Shared Coarse Model for Multistate Tuning-Driven Multiphysics Optimization of Tunable Filters

链接: https://arxiv.org/abs/2507.14220
作者: Haitian Hu,Wei Zhang,Feng Feng,Zhiguo Zhang,Qi-Jun Zhang
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Accelerator Physics (physics.acc-ph)
*备注:

点击查看摘要

Abstract:This article introduces an advanced space mapping (SM) technique that applies a shared electromagnetic (EM)-based coarse model for multistate tuning-driven multiphysics optimization of tunable filters. The SM method combines the computational efficiency of EM single-physics simulations with the precision of multiphysics simulations. The shared coarse model is based on EM single-physics responses corresponding to various nontunable design parameters values. Conversely, the fine model is implemented to delineate the behavior of multiphysics responses concerning both nontunable and tunable design parameter values. The proposed overall surrogate model comprises multiple subsurrogate models, each consisting of one shared coarse model and two distinct mapping neural networks. The responses from the shared coarse model in the EM single-physics filed offer a suitable approximation for the fine responses in the multiphysics filed, whereas the mapping neural networks facilitate transition from the EM single-physics field to the multiphysics field. Each subsurrogate model maintains consistent nontunable design parameter values but possesses unique tunable design parameter values. By developing multiple subsurrogate models, optimization can be simultaneously performed for each tuning state. Nontunable design parameter values are constrained by all tuning states, whereas tunable design parameter values are confined to their respective tuning states. This optimization technique simultaneously accounts for all the tuning states to fulfill the necessary multiple tuning state requirements. Multiple EM and multiphysics training samples are generated concurrently to develop the surrogate model. Compared with existing direct multiphysics parameterized modeling techniques, our proposed method achieves superior multiphysics modeling accuracy with fewer training samples and reduced computational costs.

[LG-106] Distributed Machine Learning Approach for Low-Latency Localization in Cell-Free Massive MIMO Systems

链接: https://arxiv.org/abs/2507.14216
作者: Manish Kumar,Tzu-Hsuan Chou,Byunghyun Lee,Nicolò Michelusi,David J. Love,Yaguang Zhang,James V. Krogmeier
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been submitted to IEEE Transactions on Wireless Communications

点击查看摘要

Abstract:Low-latency localization is critical in cellular networks to support real-time applications requiring precise positioning. In this paper, we propose a distributed machine learning (ML) framework for fingerprint-based localization tailored to cell-free massive multiple-input multiple-output (MIMO) systems, an emerging architecture for 6G networks. The proposed framework enables each access point (AP) to independently train a Gaussian process regression model using local angle-of-arrival and received signal strength fingerprints. These models provide probabilistic position estimates for the user equipment (UE), which are then fused by the UE with minimal computational overhead to derive a final location estimate. This decentralized approach eliminates the need for fronthaul communication between the APs and the central processing unit (CPU), thereby reducing latency. Additionally, distributing computational tasks across the APs alleviates the processing burden on the CPU compared to traditional centralized localization schemes. Simulation results demonstrate that the proposed distributed framework achieves localization accuracy comparable to centralized methods, despite lacking the benefits of centralized data aggregation. Moreover, it effectively reduces uncertainty of the location estimates, as evidenced by the 95% covariance ellipse. The results highlight the potential of distributed ML for enabling low-latency, high-accuracy localization in future 6G networks.

[LG-107] Boosted Enhanced Quantile Regression Neural Networks with Spatiotemporal Permutation Entropy for Complex System Prognostics

链接: https://arxiv.org/abs/2507.14194
作者: David J Poland
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Preliminary version of a predictive maintenance framework using spiking neural networks and entropy-based analysis. To be expanded in future publications with hardware implementations and real-time drift detection modules. arXiv admin note: substantial text overlap with arXiv:2501.05087

点击查看摘要

Abstract:This paper presents a novel framework for pattern prediction and system prognostics centered on Spatiotemporal Permutation Entropy analysis integrated with Boosted Enhanced Quantile Regression Neural Networks (BEQRNNs). We address the challenge of understanding complex dynamical patterns in multidimensional systems through an approach that combines entropy-based complexity measures with advanced neural architectures. The system leverages dual computational stages: first implementing spatiotemporal entropy extraction optimized for multiscale temporal and spatial data streams, followed by an integrated BEQRNN layer that enables probabilistic pattern prediction with uncertainty quantification. This architecture achieves 81.17% accuracy in spatiotemporal pattern classification with prediction horizons up to 200 time steps and maintains robust performance across diverse regimes. Field testing across chaotic attractors, reaction-diffusion systems, and industrial datasets shows a 79% increase in critical transition detection accuracy and 81.22% improvement in long-term prediction reliability. The framework’s effectiveness in processing complex, multimodal entropy features demonstrates significant potential for real-time prognostic applications.

[LG-108] Latent Sensor Fusion: Multimedia Learning of Physiological Signals for Resource-Constrained Devices

链接: https://arxiv.org/abs/2507.14185
作者: Abdullah Ahmed,Jeremy Gummeson
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent spaces offer an efficient and effective means of summarizing data while implicitly preserving meta-information through relational encoding. We leverage these meta-embeddings to develop a modality-agnostic, unified encoder. Our method employs sensor-latent fusion to analyze and correlate multimodal physiological signals. Using a compressed sensing approach with autoencoder-based latent space fusion, we address the computational challenges of biosignal analysis on resource-constrained devices. Experimental results show that our unified encoder is significantly faster, lighter, and more scalable than modality-specific alternatives, without compromising representational accuracy.

[LG-109] Enhancing Generalization in PPG-Based Emotion Measurement with a CNN-TCN-LSTM Model

链接: https://arxiv.org/abs/2507.14173
作者: Karim Alghoul,Hussein Al Osman,Abdulmotaleb El Saddik
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted by IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2025

点击查看摘要

Abstract:Human computer interaction has become integral to modern life, driven by advancements in machine learning technologies. Affective computing, in particular, has focused on systems that recognize, interpret, and respond to human emotions, often using wearable devices, which provide continuous data streams of physiological signals. Among various physiological signals, the photoplethysmogram (PPG) has gained prominence due to its ease of acquisition from widely available devices. However, the generalization of PPG-based emotion recognition models across individuals remains an unresolved challenge. This paper introduces a novel hybrid architecture that combines Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Temporal Convolutional Networks (TCNs) to address this issue. The proposed model integrates the strengths of these architectures to improve robustness and generalization. Raw PPG signals are fed into the CNN for feature extraction. These features are processed separately by LSTM and TCN. The outputs from these components are concatenated to generate a final feature representation, which serves as the input for classifying valence and arousal, the primary dimensions of emotion. Experiments using the Photoplethysmogram Dataset for Emotional Analysis (PPGE) demonstrate that the proposed hybrid model achieves better model generalization than standalone CNN and LSTM architectures. Our results show that the proposed solution outperforms the state-of-the-art CNN architecture, as well as a CNN-LSTM model, in emotion recognition tasks with PPG signals. Using metrics such as Area Under the Curve (AUC) and F1 Score, we highlight the model’s effectiveness in handling subject variability.

[LG-110] Attention-Based Fusion of IQ and FFT Spectrograms with AoA Features for GNSS Jammer Localization

链接: https://arxiv.org/abs/2507.14167
作者: Lucas Heublein,Christian Wielenberg,Thorsten Nowak,Tobias Feigl,Christopher Mutschler,Felix Ott
类目: ignal Processing (eess.SP); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 6 pages, 10 figures

点击查看摘要

Abstract:Jamming devices disrupt signals from the global navigation satellite system (GNSS) and pose a significant threat by compromising the reliability of accurate positioning. Consequently, the detection and localization of these interference signals are essential to achieve situational awareness, mitigating their impact, and implementing effective counter-measures. Classical Angle of Arrival (AoA) methods exhibit reduced accuracy in multipath environments due to signal reflections and scattering, leading to localization errors. Additionally, AoA-based techniques demand substantial computational resources for array signal processing. In this paper, we propose a novel approach for detecting and classifying interference while estimating the distance, azimuth, and elevation of jamming sources. Our benchmark study evaluates 128 vision encoder and time-series models to identify the highest-performing methods for each task. We introduce an attention-based fusion framework that integrates in-phase and quadrature (IQ) samples with Fast Fourier Transform (FFT)-computed spectrograms while incorporating 22 AoA features to enhance localization accuracy. Furthermore, we present a novel dataset of moving jamming devices recorded in an indoor environment with dynamic multipath conditions and demonstrate superior performance compared to state-of-the-art methods.

[LG-111] Automated Vigilance State Classification in Rodents Using Machine Learning and Feature Engineering

链接: https://arxiv.org/abs/2507.14166
作者: Sankalp Jajee,Gaurav Kumar,Homayoun Valafar
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Preclinical sleep research remains constrained by labor intensive, manual vigilance state classification and inter rater variability, limiting throughput and reproducibility. This study presents an automated framework developed by Team Neural Prognosticators to classify electroencephalogram (EEG) recordings of small rodents into three critical vigilance states paradoxical sleep (REM), slow wave sleep (SWS), and wakefulness. The system integrates advanced signal processing with machine learning, leveraging engineered features from both time and frequency domains, including spectral power across canonical EEG bands (delta to gamma), temporal dynamics via Maximum-Minimum Distance, and cross-frequency coupling metrics. These features capture distinct neurophysiological signatures such as high frequency desynchronization during wakefulness, delta oscillations in SWS, and REM specific bursts. Validated during the 2024 Big Data Health Science Case Competition (University of South Carolina Big Data Health Science Center, 2024), our XGBoost model achieved 91.5% overall accuracy, 86.8% precision, 81.2% recall, and an F1 score of 83.5%, outperforming all baseline methods. Our approach represents a critical advancement in automated sleep state classification and a valuable tool for accelerating discoveries in sleep science and the development of targeted interventions for chronic sleep disorders. As a publicly available code (BDHSC) resource is set to contribute significantly to advancements.

[LG-112] UniPhyNet: A Unified Network For Multimodal Physiological Raw Signal Classification

链接: https://arxiv.org/abs/2507.14163
作者: Renxiang Qiu,Raghavendra Selvan
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to be presented at the 35th IEEE International Workshop on Machine Learning for Signal Processing (IEEE MLSP 2025). Source code available at this https URL

点击查看摘要

Abstract:We present UniPhyNet, a novel neural network architecture to classify cognitive load using multimodal physiological data – specifically EEG, ECG and EDA signals – without the explicit need for extracting hand-crafted features. UniPhyNet integrates multiscale parallel convolutional blocks and ResNet-type blocks enhanced with channel block attention module to focus on the informative features while a bidirectional gated recurrent unit is used to capture temporal dependencies. This architecture processes and combines signals in both unimodal and multimodal configurations via intermediate fusion of learned feature maps. On the CL-Drive dataset, UniPhyNet improves raw signal classification accuracy from 70% to 80% (binary) and 62% to 74% (ternary), outperforming feature-based models, demonstrating its effectiveness as an end-to-end solution for real-world cognitive state monitoring.

[LG-113] Complex Dynamics in Psychological Data: Mapping Individual Symptom Trajectories to Group-Level Patterns

链接: https://arxiv.org/abs/2507.14161
作者: Eleonora Vitanza,Pietro DeLellis,Chiara Mocenni,Manuel Ruiz Marin
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study integrates causal inference, graph analysis, temporal complexity measures, and machine learning to examine whether individual symptom trajectories can reveal meaningful diagnostic patterns. Testing on a longitudinal dataset of N=45 individuals affected by General Anxiety Disorder (GAD) and/or Major Depressive Disorder (MDD) derived from Fisher et al. 2017, we propose a novel pipeline for the analysis of the temporal dynamics of psychopathological symptoms. First, we employ the PCMCI+ algorithm with nonparametric independence test to determine the causal network of nonlinear dependencies between symptoms in individuals with different mental disorders. We found that the PCMCI+ effectively highlights the individual peculiarities of each symptom network, which could be leveraged towards personalized therapies. At the same time, aggregating the networks by diagnosis sheds light to disorder-specific causal mechanisms, in agreement with previous psychopathological literature. Then, we enrich the dataset by computing complexity-based measures (e.g. entropy, fractal dimension, recurrence) from the symptom time series, and feed it to a suitably selected machine learning algorithm to aid the diagnosis of each individual. The new dataset yields 91% accuracy in the classification of the symptom dynamics, proving to be an effective diagnostic support tool. Overall, these findings highlight how integrating causal modeling and temporal complexity can enhance diagnostic differentiation, offering a principled, data-driven foundation for both personalized assessment in clinical psychology and structural advances in psychological research.

[LG-114] FinSurvival: A Suite of Large Scale Survival Modeling Tasks from Finance

链接: https://arxiv.org/abs/2507.14160
作者: Aaron Green,Zihan Nie,Hanzhen Qin,Oshani Seneviratne,Kristin P. Bennett
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注: 33 pages, 4 figures, submitted to DMLR

点击查看摘要

Abstract:Survival modeling predicts the time until an event occurs and is widely used in risk analysis; for example, it’s used in medicine to predict the survival of a patient based on censored data. There is a need for large-scale, realistic, and freely available datasets for benchmarking artificial intelligence (AI) survival models. In this paper, we derive a suite of 16 survival modeling tasks from publicly available transaction data generated by lending of cryptocurrencies in Decentralized Finance (DeFi). Each task was constructed using an automated pipeline based on choices of index and outcome events. For example, the model predicts the time from when a user borrows cryptocurrency coins (index event) until their first repayment (outcome event). We formulate a survival benchmark consisting of a suite of 16 survival-time prediction tasks (FinSurvival). We also automatically create 16 corresponding classification problems for each task by thresholding the survival time using the restricted mean survival time. With over 7.5 million records, FinSurvival provides a suite of realistic financial modeling tasks that will spur future AI survival modeling research. Our evaluation indicated that these are challenging tasks that are not well addressed by existing methods. FinSurvival enables the evaluation of AI survival models applicable to traditional finance, industry, medicine, and commerce, which is currently hindered by the lack of large public datasets. Our benchmark demonstrates how AI models could assess opportunities and risks in DeFi. In the future, the FinSurvival benchmark pipeline can be used to create new benchmarks by incorporating more DeFi transactions and protocols as the use of cryptocurrency grows.

[LG-115] Siamese Neural Network for Label-Efficient Critical Phenomena Prediction in 3D Percolation Models

链接: https://arxiv.org/abs/2507.14159
作者: Shanshan Wang,Dian Xu,Jianmin Shen,Feng Gao,Wei Li,Weibing Deng
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Percolation theory serves as a cornerstone for studying phase transitions and critical phenomena, with broad implications in statistical physics, materials science, and complex networks. However, most machine learning frameworks for percolation analysis have focused on two-dimensional systems, oversimplifying the spatial correlations and morphological complexity of real-world three-dimensional materials. To bridge this gap and improve label efficiency and scalability in 3D systems, we propose a Siamese Neural Network (SNN) that leverages features of the largest cluster as discriminative input. Our method achieves high predictive accuracy for both site and bond percolation thresholds and critical exponents in three dimensions, with sub-1% error margins using significantly fewer labeled samples than traditional approaches. This work establishes a robust and data-efficient framework for modeling high-dimensional critical phenomena, with potential applications in materials discovery and complex network analysis.

[LG-116] Machine learning-enabled river water quality monitoring using lithography-free 3D-printed sensors

链接: https://arxiv.org/abs/2507.14152
作者: Frank Efe Erukainure,Feidra Gjata,Matin Ataei Kachouei,Henry Cox,Md. Azahar Ali
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY); Instrumentation and Detectors (physics.ins-det)
*备注: 34 pages, 9 figures

点击查看摘要

Abstract:River water quality monitoring is important for aquatic life, livestock, and humans because clean water is critical to meeting food demand during the global food crisis. Excessive contaminants, including phosphate, deplete dissolved oxygen and trigger eutrophication, leading to serious health and ecological problems. Continuous sensors that track phosphate levels can therefore help prevent eutrophication. In this work we present a lithography-free phosphate sensor (P-sensor) that detects phosphate in river water at parts-per-billion levels. The device uses a solid-state indicator electrode formed by 3D-printed periodic polymer patterns (8 um feature size) coated with a thin phosphate ion-selective membrane. The P-sensor detects as little as 1 ppb phosphate across 0 - 475 ppm with a response time under 30 seconds. We validated the sensor on Rappahannock River water, Virginia (less than 0.8 ppm phosphate) at sites upstream and downstream of a sewage treatment plant and benchmarked the results against a commercial phosphate meter. A feed-forward neural network was trained to predict phosphate levels, achieving a mean-squared error below 1e-3, zero standard deviation, and a Pearson correlation coefficient of 0.997 for river samples. These results demonstrate a practical tool for continuous water-quality monitoring that can inform stakeholders and policymakers and ultimately improve public health.

[LG-117] Graph Convolutional Neural Networks to Model the Brain for Insomnia ALT

链接: https://arxiv.org/abs/2507.14147
作者: Kevin Monteiro,Sam Nallaperuma-Herzberg,Martina Mason,Steve Niederer
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 12 pages, 6 figures. This version has been accepted as a full paper at the 2025 AI in Healthcare (AIiH) Conference

点击查看摘要

Abstract:Insomnia affects a vast population of the world and can have a wide range of causes. Existing treatments for insomnia have been linked with many side effects like headaches, dizziness, etc. As such, there is a clear need for improved insomnia treatment. Brain modelling has helped with assessing the effects of brain pathology on brain network dynamics and with supporting clinical decisions in the treatment of Alzheimer’s disease, epilepsy, etc. However, such models have not been developed for insomnia. Therefore, this project attempts to understand the characteristics of the brain of individuals experiencing insomnia using continuous long-duration EEG data. Brain networks are derived based on functional connectivity and spatial distance between EEG channels. The power spectral density of the channels is then computed for the major brain wave frequency bands. A graph convolutional neural network (GCNN) model is then trained to capture the functional characteristics associated with insomnia and configured for the classification task to judge performance. Results indicated a 50-second non-overlapping sliding window was the most suitable choice for EEG segmentation. This approach achieved a classification accuracy of 70% at window level and 68% at subject level. Additionally, the omission of EEG channels C4-P4, F4-C4 and C4-A1 caused higher degradation in model performance than the removal of other channels. These channel electrodes are positioned near brain regions known to exhibit atypical levels of functional connectivity in individuals with insomnia, which can explain such results.

[LG-118] Recursive KalmanNet: Analyse des capacités de généralisation dun réseau de neurones récurrent guidé par un filtre de Kalman

链接: https://arxiv.org/abs/2507.14144
作者: Cyril Falcon,Hassan Mortada,Mathéo Clavaud,Jean-Philippe Michel
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 4 pages, in French language. 4 figures. Accepted for publication in GRETSI 2025 proceedings

点击查看摘要

Abstract:The Recursive KalmanNet, recently introduced by the authors, is a recurrent neural network guided by a Kalman filter, capable of estimating the state variables and error covariance of stochastic dynamic systems from noisy measurements, without prior knowledge of the noise characteristics. This paper explores its generalization capabilities in out-of-distribution scenarios, where the temporal dynamics of the test measurements differ from those encountered during training. Le Recursive KalmanNet, récemment introduit par les auteurs, est un réseau de neurones récurrent guidé par un filtre de Kalman, capable d’estimer les variables d’état et la covariance des erreurs des systèmes dynamiques stochastiques à partir de mesures bruitées, sans connaissance préalable des caractéristiques des bruits. Cet article explore ses capacités de généralisation dans des scénarios hors distribution, où les dynamiques temporelles des mesures de test diffèrent de celles rencontrées à l’entraînement. Comments: 4 pages, in French language. 4 figures. Accepted for publication in GRETSI 2025 proceedings Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG) Cite as: arXiv:2507.14144 [eess.SP] (or arXiv:2507.14144v1 [eess.SP] for this version) https://doi.org/10.48550/arXiv.2507.14144 Focus to learn more arXiv-issued DOI via DataCite

信息检索

[IR-0] RankMixer: Scaling Up Ranking Models in Industrial Recommenders

链接: https://arxiv.org/abs/2507.15551
作者: Jie Zhu,Zhifang Fan,Xiaoxie Zhu,Yuchen Jiang,Hangyu Wang,Xintian Han,Haoran Ding,Xinmin Wang,Wenlin Zhao,Zhen Gong,Huizhi Yang,Zheng Chai,Zhe Chen,Yuchao Zheng,Qiwei Chen,Feng Zhang,Xun Zhou,Peng Xu,Xiao Yang,Di Wu,Zuotao Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent progress on large language models (LLMs) has spurred interest in scaling up recommendation systems, yet two practical obstacles remain. First, training and serving cost on industrial Recommenders must respect strict latency bounds and high QPS demands. Second, most human-designed feature-crossing modules in ranking models were inherited from the CPU era and fail to exploit modern GPUs, resulting in low Model Flops Utilization (MFU) and poor scalability. We introduce RankMixer, a hardware-aware model design tailored towards a unified and scalable feature-interaction architecture. RankMixer retains the transformer’s high parallelism while replacing quadratic self-attention with multi-head token mixing module for higher efficiency. Besides, RankMixer maintains both the modeling for distinct feature subspaces and cross-feature-space interactions with Per-token FFNs. We further extend it to one billion parameters with a Sparse-MoE variant for higher ROI. A dynamic routing strategy is adapted to address the inadequacy and imbalance of experts training. Experiments show RankMixer’s superior scaling abilities on a trillion-scale production dataset. By replacing previously diverse handcrafted low-MFU modules with RankMixer, we boost the model MFU from 4.5% to 45%, and scale our ranking model parameters by 100x while maintaining roughly the same inference latency. We verify RankMixer’s universality with online A/B tests across three core application scenarios (Recommendation, Advertisement and Search). Finally, we launch 1B Dense-Parameters RankMixer for full traffic serving without increasing the serving cost, which improves user active days by 0.2% and total in-app usage duration by 0.5%.

[IR-1] Hierarchical Graph Information Bottleneck for Multi-Behavior Recommendation RECSYS2025

链接: https://arxiv.org/abs/2507.15395
作者: Hengyu Zhang,Chunxu Shen,Xiangguo Sun,Jie Tan,Yanchao Tan,Yu Rong,Hong Cheng,Lingling Yi
类目: Information Retrieval (cs.IR)
*备注: Accepted by RecSys2025

点击查看摘要

Abstract:In real-world recommendation scenarios, users typically engage with platforms through multiple types of behavioral interactions. Multi-behavior recommendation algorithms aim to leverage various auxiliary user behaviors to enhance prediction for target behaviors of primary interest (e.g., buy), thereby overcoming performance limitations caused by data sparsity in target behavior records. Current state-of-the-art approaches typically employ hierarchical design following either cascading (e.g., view \rightarrow cart \rightarrow buy) or parallel (unified \rightarrow behavior \rightarrow specific components) paradigms, to capture behavioral relationships. However, these methods still face two critical challenges: (1) severe distribution disparities across behaviors, and (2) negative transfer effects caused by noise in auxiliary behaviors. In this paper, we propose a novel model-agnostic Hierarchical Graph Information Bottleneck (HGIB) framework for multi-behavior recommendation to effectively address these challenges. Following information bottleneck principles, our framework optimizes the learning of compact yet sufficient representations that preserve essential information for target behavior prediction while eliminating task-irrelevant redundancies. To further mitigate interaction noise, we introduce a Graph Refinement Encoder (GRE) that dynamically prunes redundant edges through learnable edge dropout mechanisms. We conduct comprehensive experiments on three real-world public datasets, which demonstrate the superior effectiveness of our framework. Beyond these widely used datasets in the academic community, we further expand our evaluation on several real industrial scenarios and conduct an online A/B testing, showing again a significant improvement in multi-behavior recommendations. The source code of our proposed HGIB is available at this https URL.

[IR-2] Click A Buy B: Rethinking Conversion Attribution in E- Commerce Recommendations

链接: https://arxiv.org/abs/2507.15113
作者: Xiangyu Zeng,Amit Jaspal,Bin Liu,Goutham Panneeru,Kevin Huang,Nicolas Bievre,Mohit Jaggi,Prathap Maniraju,Ankur Jain
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:User journeys in e-commerce routinely violate the one-to-one assumption that a clicked item on an advertising platform is the same item later purchased on the merchant’s website/app. For a significant number of converting sessions on our platform, users click product A but buy product B – the Click A, Buy B (CABB) phenomenon. Training recommendation models on raw click-conversion pairs therefore rewards items that merely correlate with purchases, leading to biased learning and sub-optimal conversion rates. We reframe conversion prediction as a multi-task problem with separate heads for Click A Buy A (CABA) and Click A Buy B (CABB). To isolate informative CABB conversions from unrelated CABB conversions, we introduce a taxonomy-aware collaborative filtering weighting scheme where each product is first mapped to a leaf node in a product taxonomy, and a category-to-category similarity matrix is learned from large-scale co-engagement logs. This weighting amplifies pairs that reflect genuine substitutable or complementary relations while down-weighting coincidental cross-category purchases. Offline evaluation on e-commerce sessions reduces normalized entropy by 13.9% versus a last-click attribution baseline. An online A/B test on live traffic shows +0.25% gains in the primary business metric.

[IR-3] User Invariant Preference Learning for Multi-Behavior Recommendation

链接: https://arxiv.org/abs/2507.14925
作者: Mingshi Yan,Zhiyong Cheng,Fan Liu,Yingda Lyu,Yahong Han
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In multi-behavior recommendation scenarios, analyzing users’ diverse behaviors, such as click, purchase, and rating, enables a more comprehensive understanding of their interests, facilitating personalized and accurate recommendations. A fundamental assumption of multi-behavior recommendation methods is the existence of shared user preferences across behaviors, representing users’ intrinsic interests. Based on this assumption, existing approaches aim to integrate information from various behaviors to enrich user representations. However, they often overlook the presence of both commonalities and individualities in users’ multi-behavior preferences. These individualities reflect distinct aspects of preferences captured by different behaviors, where certain auxiliary behaviors may introduce noise, hindering the prediction of the target behavior. To address this issue, we propose a user invariant preference learning for multi-behavior recommendation (UIPL for short), aiming to capture users’ intrinsic interests (referred to as invariant preferences) from multi-behavior interactions to mitigate the introduction of noise. Specifically, UIPL leverages the paradigm of invariant risk minimization to learn invariant preferences. To implement this, we employ a variational autoencoder (VAE) to extract users’ invariant preferences, replacing the standard reconstruction loss with an invariant risk minimization constraint. Additionally, we construct distinct environments by combining multi-behavior data to enhance robustness in learning these preferences. Finally, the learned invariant preferences are used to provide recommendations for the target behavior. Extensive experiments on four real-world datasets demonstrate that UIPL significantly outperforms current state-of-the-art methods.

[IR-4] raining oscillator Ising machines to assign the dynamic stability of their equilibrium points

链接: https://arxiv.org/abs/2507.14386
作者: Yi Cheng,Zongli Lin
类目: Neural and Evolutionary Computing (cs.NE); Information Retrieval (cs.IR)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:We propose a neural network model, which, with appropriate assignment of the stability of its equilibrium points (EPs), achieves Hopfield-like associative memory. The oscillator Ising machine (OIM) is an ideal candidates for such a model, as all its 0/\pi binary EPs are structurally stable with their dynamic stability tunable by the coupling weights. Traditional Hopfield-based models store the desired patterns by designing the coupling weights between neurons. The design of coupling weights should simultaneously take into account both the existence and the dynamic stability of the EPs for the storage of the desired patterns. For OIMs, since all 0/\pi binary EPs are structurally stable, the design of the coupling weights needs only to focus on assigning appropriate stability for the 0/\pi binary EPs according to the desired patterns. In this paper, we establish a connection between the stability and the Hamiltonian energy of EPs for OIMs, and, based on this connection, provide a Hamiltonian-Regularized Eigenvalue Contrastive Method (HRECM) to train the coupling weights of OIMs for assigning appropriate stability to their EPs. Finally, numerical experiments are performed to validate the effectiveness of the proposed method.

[IR-5] RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction

链接: https://arxiv.org/abs/2507.14361
作者: Huy-Son Nguyen,Quang-Huy Nguyen,Duc-Hoang Pham,Duc-Trong Le,Hoang-Quynh Le,Padipat Sitkrongwong,Atsuhiro Takasu,Masoud Mansoury
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing studies on bundle construction have relied merely on user feedback via bipartite graphs or enhanced item representations using semantic information. These approaches fail to capture elaborate relations hidden in real-world bundle structures, resulting in suboptimal bundle representations. To overcome this limitation, we propose RaMen, a novel method that provides a holistic multi-strategy approach for bundle construction. RaMen utilizes both intrinsic (characteristics) and extrinsic (collaborative signals) information to model bundle structures through Explicit Strategy-aware Learning (ESL) and Implicit Strategy-aware Learning (ISL). ESL employs task-specific attention mechanisms to encode multi-modal data and direct collaborative relations between items, thereby explicitly capturing essential bundle features. Moreover, ISL computes hyperedge dependencies and hypergraph message passing to uncover shared latent intents among groups of items. Integrating diverse strategies enables RaMen to learn more comprehensive and robust bundle representations. Meanwhile, Multi-strategy Alignment Discrimination module is employed to facilitate knowledge transfer between learning strategies and ensure discrimination between items/bundles. Extensive experiments demonstrate the effectiveness of RaMen over state-of-the-art models on various domains, justifying valuable insights into complex item set problems.

附件下载

点击下载今日全部论文列表