This post contains the latest list of papers retrieved from Arxiv.org on 2025-07-30, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org and updated automatically every day at around 12:00.

Table of Contents

Overview (2025-07-30)

A total of 562 new papers today, including:

  • Natural Language Processing: 76 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 202 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 111 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 141 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] MetaCLIP 2: A Worldwide Scaling Recipe

[Quick Read]: This paper targets two key problems: the lack of an effective curation method for non-English image-text pairs, and the fact that existing multilingual CLIP models underperform their English-only counterparts on English tasks, i.e., the "curse of multilinguality". The key is MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs, validated through rigorous ablations with minimal changes: the method enables mutual gains between English and non-English data, surpasses the English-only counterpart by 0.8% on zero-shot ImageNet classification, and sets new state-of-the-art results on multilingual benchmarks such as CVQA, Babel-ImageNet, and XM3600, without relying on system-level confounding factors such as translation or bespoke architecture changes.

Link: https://arxiv.org/abs/2507.22062
Authors: Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learning from the worldwide web data is still challenging: (1) no curation method is available to handle data points from non-English world; (2) the English performance from existing multilingual CLIP is worse than its English-only counterpart, i.e., “curse of multilinguality” that is common in LLMs. Here, we present MetaCLIP 2, the first recipe training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.

[NLP-1] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

[Quick Read]: This paper addresses the limited performance of Large Language Models (LLMs) on knowledge-intensive queries, caused by their inability to dynamically access up-to-date or domain-specific information. Existing Retrieval-Augmented Generation (RAG) methods lack fine-grained control over both the query and source sides, often leading to noisy retrieval and shallow reasoning. The key is DeepSieve, an agentic RAG framework that performs information sieving via an LLM-as-a-knowledge-router: complex queries are decomposed into structured sub-questions, each recursively routed to the most suitable knowledge source, and irrelevant information is filtered through a multi-stage distillation process, improving reasoning depth, retrieval precision, and interpretability.

Link: https://arxiv.org/abs/2507.22050
Authors: Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Affiliations: Rutgers University; Northwestern University; NEC Laboratories America; NJIT
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, work in progress

Abstract:Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
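
To make the control flow concrete, here is a minimal sketch of the decompose-route-sieve loop the abstract describes. All function names (deepsieve_answer, the decompose/route/sieve prompts) are hypothetical stand-ins, not the paper's API.

```python
# Sketch of a DeepSieve-style agentic loop: decompose a query, route each
# sub-question to a knowledge source, and sieve out irrelevant passages.
from typing import Callable

def deepsieve_answer(
    query: str,
    llm: Callable[[str], str],                      # any text-in/text-out LLM
    sources: dict[str, Callable[[str], list[str]]]  # source name -> retriever
) -> str:
    # 1) LLM-as-planner: break the query into structured sub-questions.
    subqs = llm(f"Decompose into sub-questions, one per line:\n{query}").splitlines()
    evidence: list[str] = []
    for sq in filter(None, (s.strip() for s in subqs)):
        # 2) LLM-as-knowledge-router: pick the most suitable source.
        choice = llm(f"Pick one source from {list(sources)} for: {sq}").strip()
        passages = sources.get(choice, next(iter(sources.values())))(sq)
        # 3) Sieving: keep only passages the LLM judges relevant.
        kept = [p for p in passages
                if llm(f"Is this relevant to '{sq}'? yes/no\n{p}").lower().startswith("yes")]
        evidence.extend(kept)
    # 4) Fuse the filtered evidence into a final answer.
    return llm("Answer using only this evidence:\n" + "\n".join(evidence)
               + f"\nQuestion: {query}")
```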

[NLP-2] UserBench: An Interactive Gym Environment for User-Centric Agents

[Quick Read]: This paper addresses the lack of proactive collaboration ability in current LLM-based agents when user goals are vague, evolving, or indirectly expressed. Existing work focuses on task completion while neglecting the quality of preference-driven interaction with users. The key contribution is UserBench, a user-centric, multi-turn, preference-driven evaluation benchmark in which simulated users reveal their preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools, thereby systematically measuring and advancing agents' ability to act as collaborative partners.

Link: https://arxiv.org/abs/2507.22034
Authors: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Affiliations: Salesforce AI Research; University of Illinois Urbana-Champaign
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 Pages, 17 Figures, 6 Tables

Abstract:Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.

[NLP-3] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

[Quick Read]: This paper tackles three core problems facing current graphical user interface (GUI) agents during training and inference: a dilemma in reasoning design, ineffective rewards, and visual noise. The key innovations of the proposed UI-AGILE framework are: at training time, a Continuous Reward function for high-precision grounding, a "Simple Thinking" reward to balance planning efficiency with accuracy, and a Cropping-based Resampling strategy to mitigate the sparse reward problem; at inference time, Decomposed Grounding with Selection, which splits high-resolution images into smaller parts to substantially improve grounding accuracy on complex interfaces. Experiments show state-of-the-art performance on both the ScreenSpot-Pro and ScreenSpot-v2 benchmarks, including a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.

Link: https://arxiv.org/abs/2507.22025
Authors: Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma for reasoning designs, ineffective reward, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves the state-of-the-art performance on two benchmarks ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.

[NLP-4] Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models

[Quick Read]: This paper addresses the difficulty traditional machine learning models have generalizing in metadata-only microbiome studies, especially in small-sample settings or across studies with heterogeneous label formats. The key idea is to use large language models (LLMs) to reason over sparse, heterogeneous environmental metadata in order to classify microbial samples into ontology categories (e.g., EMPO 3) and predict pathogen contamination risk (e.g., the presence of E. Coli). In zero-shot and few-shot settings, LLMs outperform traditional models such as Random Forests, demonstrating strong generalization across sites and metadata distributions.

Link: https://arxiv.org/abs/2507.21980
Authors: Hyunwoo Yoo, Gail L. Rosen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. Coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites and metadata distributions. These findings suggest that LLMs can effectively reason over sparse, heterogeneous biological metadata and offer a promising metadata-only approach for environmental microbiology and biosurveillance applications.

[NLP-5] Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

[Quick Read]: This paper addresses the lack of output diversity of Retrieval Augmented Generation (RAG) in cross-cultural recipe adaptation. The analysis shows that RAG over-relies on a limited portion of the context across generations and fails to produce diverse adaptations even when given varied contextual inputs, limiting its use in creative tasks with multiple valid answers. The key contribution is CARRIAGE, a plug-and-play RAG framework that enhances diversity in both retrieval and context organization, effectively accommodating users' diverse dietary needs and cultural preferences. CARRIAGE is the first RAG framework explicitly aimed at generating highly diverse outputs, and it achieves Pareto efficiency between diversity and quality of recipe adaptation.

Link: https://arxiv.org/abs/2507.21934
Authors: Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich
Affiliations: Aarhus University; University of Copenhagen; Dept. of Computer Science and Artificial Intelligence, University of Granada; Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish’s essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

[NLP-6] Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

[Quick Read]: This paper targets the problem that Large Language Models (LLMs) produce plausible but poorly calibrated answers on reasoning-intensive tasks, limiting their reliability. The key is Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence in its answers as an intrinsic reward: multiple chain-of-thought solutions to the same question are scored and ranked by confidence to construct synthetic preference data, which is then used to fine-tune the policy with standard preference optimization. The method requires no human labels, gold answers, or external rewards, while simultaneously improving the calibration of probability estimates and strengthening step-by-step reasoning, validating intrinsic rewards as an effective and data-efficient component of the LLM post-training pipeline.

Link: https://arxiv.org/abs/2507.21931
Authors: Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić
Affiliations: Heinrich Heine Universität
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates – restoring well-behaved calibration – and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research in intrinsic rewards for LLM post-training.
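
A minimal sketch of the RLSF data-construction step, assuming confidence is the length-normalized log-probability of the final answer span (one plausible choice; the paper defines its own confidence measure). The most and least confident traces form a synthetic preference pair for standard preference optimization (e.g., DPO).

```python
import math

def confidence(token_logprobs: list[float]) -> float:
    # Length-normalized probability of the final answer span.
    return math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))

def build_preference_pair(samples: list[dict]) -> dict:
    """samples: [{'text': ..., 'answer_logprobs': [...]}] from one frozen LLM."""
    ranked = sorted(samples, key=lambda s: confidence(s["answer_logprobs"]))
    return {"chosen": ranked[-1]["text"], "rejected": ranked[0]["text"]}

pair = build_preference_pair([
    {"text": "CoT A ... answer: 42", "answer_logprobs": [-0.1, -0.2]},
    {"text": "CoT B ... answer: 41", "answer_logprobs": [-1.5, -2.0]},
])
print(pair["chosen"])  # the higher-confidence trace becomes the preferred one
```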

[NLP-7] Training language models to be warm and empathetic makes them less reliable and more sycophantic

[Quick Read]: This paper examines the problem that optimizing language models for warm and empathetic personas, intended to improve user experience, can significantly degrade their reliability on critical tasks, especially when users express vulnerability. Through controlled experiments, the authors find that warmth-optimized models show substantially higher error rates (+10 to +30 percentage points) on safety-critical tasks (e.g., providing medical advice, judging factual accuracy) and are more likely to validate users' incorrect beliefs, particularly when users express sadness, even though performance on standard benchmarks is preserved. This exposes a blind spot in current evaluation practices for systematic risks and argues for rethinking how such systems are developed and overseen so as to balance human-like interaction with reliability.

Link: https://arxiv.org/abs/2507.21919
Authors: Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher
Affiliations: University of Oxford
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.

[NLP-8] Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

[Quick Read]: This paper asks whether Large Language Models (LLMs) that acquire factual knowledge through rote learning can still generalize from it, i.e., whether transferable semantic structure can be extracted from verbatim memorization. The conventional view holds that rote learning hinders generalization by encouraging memorization over deeper understanding. The key is a two-phase memorize-then-generalize framework: the model first rote-memorizes factual subject-object associations via a semantically meaningless token, and is then fine-tuned on a small set of semantically meaningful prompts, prompting it to reinterpret the memorized data into structured, semantically aligned latent representations. Experiments over 8 LLMs show that models can generalize effectively from memorized data without additional training data, opening a new path for efficient knowledge injection while also exposing potential risks of repurposing memorized data for malicious use.

Link: https://arxiv.org/abs/2507.21914
Authors: Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar
Affiliations: Max Planck Institute for Software Systems; Ruhr University Bochum; UAR RC Trust
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Rote learning is a memorization technique based on repetition. It is commonly believed to hinder generalization by encouraging verbatim memorization rather than deeper understanding. This insight holds for even learning factual knowledge that inevitably requires a certain degree of memorization. In this work, we demonstrate that LLMs can be trained to generalize from rote memorized data. We introduce a two-phase memorize-then-generalize framework, where the model first rote memorizes factual subject-object associations using a semantically meaningless token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the two. This surprising finding opens the door to both effective and efficient knowledge injection and possible risks of repurposing the memorized data for malicious usage.
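
A minimal sketch of the two-phase data described above: phase 1 rote-memorizes subject-object pairs through a semantically meaningless token, and phase 2 fine-tunes on a few semantically meaningful prompts so the model can reinterpret what it memorized. The token and templates are illustrative placeholders, not the paper's exact setup.

```python
facts = [("Paris", "France"), ("Tokyo", "Japan"), ("Lima", "Peru")]

MEANINGLESS = "<rel_17>"   # placeholder token carrying no semantics
phase1 = [f"{subj} {MEANINGLESS} {obj}" for subj, obj in facts]

# Phase 2 uses only a small subset with meaningful wording; the remaining
# memorized pairs should be reinterpreted via generalization.
template = "{subj} is the capital of {obj}."
phase2 = [template.format(subj=s, obj=o) for s, o in facts[:1]]

print(phase1[0])   # "Paris <rel_17> France"            (rote memorization target)
print(phase2[0])   # "Paris is the capital of France."  (semantic prompt)
```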

[NLP-9] Who's important? – SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation

[Quick Read]: This paper addresses the challenges of tracking events and producing timeline summaries (Timeline Summarization, TLS) from multi-source news coverage, where existing methods summarize only the textual content of similarly dated articles and ignore stakeholder roles and their connections. The key is SUnSET, a framework that uses Large Language Models (LLMs) to build Stakeholder-Event-Time (SET) triplets and introduces stakeholder-based ranking to define a Relevancy metric, more accurately capturing the core participants and their evolving relationships as events unfold. It significantly improves timeline summary quality and sets a new state of the art for the task.

Link: https://arxiv.org/abs/2507.21903
Authors: Tiviatis Sim, Kaiwen Yang, Shen Xin, Kenji Kawaguchi
Affiliations: National University of Singapore; A*STAR Institute of High Performance Computing
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:As news reporting becomes increasingly global and decentralized online, tracking related events across multiple sources presents significant challenges. Existing news summarization methods typically utilizes Large Language Models and Graphical methods on article-based summaries. However, this is not effective since it only considers the textual content of similarly dated articles to understand the gist of the event. To counteract the lack of analysis on the parties involved, it is essential to come up with a novel framework to gauge the importance of stakeholders and the connection of related events through the relevant entities involved. Therefore, we present SUnSET: Synergistic Understanding of Stakeholder, Events and Time for the task of Timeline Summarization (TLS). We leverage powerful Large Language Models (LLMs) to build SET triplets and introduced the use of stakeholder-based ranking to construct a Relevancy metric, which can be extended into general situations. Our experimental results outperform all prior baselines and emerged as the new State-of-the-Art, highlighting the impact of stakeholder information within news article.

[NLP-10] Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

[Quick Read]: This paper addresses the lack of structural semantics in the chunk-based retrieval used by traditional Retrieval-Augmented Generation (RAG), as well as the high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt engineering of existing graph-based RAG (GraphRAG) methods. The key is Graph-R1, a framework driven by end-to-end reinforcement learning (RL): lightweight knowledge hypergraph construction, modeling retrieval as a multi-turn agent-environment interaction, and optimizing the agent's decision process with an end-to-end reward mechanism, significantly improving reasoning accuracy, retrieval efficiency, and generation quality.

Link: https://arxiv.org/abs/2507.21892
Authors: Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan
Affiliations: Beijing University of Posts and Telecommunications; Nanyang Technological University; National University of Singapore; Beijing Institute of Computer Technology and Application; Beijing Anzhen Hospital, Capital Medical University
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.

[NLP-11] AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

[Quick Read]: This paper addresses the degradation of core language competence caused by current Tool-Integrated Reasoning (TIR) methods that rely on predefined, rigid tool-use patterns. The key is AutoTIR, a reinforcement learning framework that lets Large Language Models (LLMs) autonomously decide during reasoning whether to invoke an external tool and which tool to use, rather than following a static policy; a hybrid reward mechanism jointly optimizes task-specific answer correctness, structured output adherence, and penalties for incorrect tool calls, achieving efficient and flexible tool integration while preserving language understanding.

Link: https://arxiv.org/abs/2507.21836
Authors: Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibits superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at this https URL.
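
A toy sketch of the hybrid reward described in the abstract: answer correctness plus format adherence minus a penalty for incorrect tool calls. The weights and scoring rules are illustrative assumptions, not the paper's values.

```python
def hybrid_reward(answer: str, gold: str,
                  followed_format: bool, bad_tool_calls: int) -> float:
    r_answer = 1.0 if answer.strip() == gold.strip() else 0.0  # task correctness
    r_format = 0.2 if followed_format else 0.0                 # structured output adherence
    r_tool = -0.3 * bad_tool_calls                             # penalize wrong invocations
    return r_answer + r_format + r_tool

print(hybrid_reward("42", "42", True, 1))  # 0.9
```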

[NLP-12] Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLM s in the computational social sciences

[Quick Read]: This paper addresses the reliability problem caused by unstable prompt-engineering effectiveness when generative AI is used to automate coding tasks in social science research. Although many prompting strategies have been proposed, their performance varies markedly across Large Language Models (LLMs) and tasks, and trial-and-error remains widespread in practice. The key is HALC, a general pipeline for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. Validated with over two million requests to locally hosted LLMs, prompts optimized via HALC achieve reliable codings for single and paired variables (e.g., alpha = .76-.78) against few expert codings, without tailoring the codebook to the model; instead, prompt design aligns the LLM with the human codebook.

Link: https://arxiv.org/abs/2507.21831
Authors: Andreas Reich, Claudia Thoms, Tobias Schrimpf
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 48 pages, 9 figures and 8 tables

Abstract:LLMs are seeing widespread use for task automation, including automated coding in the social sciences. However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. Often trial and error practices are still widespread. We propose HALC - a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. We test prompting strategies and LLM task performance based on few expert codings (ground truth). When compared to these expert codings, we find prompts that code reliably for single variables ( \alpha climate = .76; \alpha movement = .78) and across two variables ( \alpha climate = .71; \alpha movement = .74) using the LLM Mistral NeMo. Our prompting strategies are set up in a way that aligns the LLM to our codebook - we are not optimizing our codebook for LLM friendliness. Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model.
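
A minimal sketch of the reliability check at the heart of HALC-style prompt selection: compare an LLM's codings against expert ground-truth codings with Krippendorff's alpha and keep prompts that clear a threshold. This uses the third-party `krippendorff` package; the 0.7 cutoff is a common rule of thumb, not a value prescribed by the paper.

```python
import krippendorff

expert = [1, 0, 1, 1, 0, 1, 0, 0]   # expert ground-truth codings (nominal variable)
llm    = [1, 0, 1, 0, 0, 1, 0, 0]   # codings produced by one candidate prompt

alpha = krippendorff.alpha(reliability_data=[expert, llm],
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.2f}")
if alpha >= 0.7:                     # illustrative reliability threshold
    print("prompt codes reliably for this variable")
```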

[NLP-13] Modelling Adjectival Modification Effects on Semantic Plausibility

[Quick Read]: This paper addresses how the perceived plausibility of events shifts with context, focusing on accurately modeling plausibility changes triggered by a single adjectival modifier. This matters for tasks such as dialogue generation, commonsense reasoning, and hallucination detection, e.g., correctly treating "gentle sarcasm" among friends as a sign of closeness rather than unkindness. The key contribution is a conceptually novel modeling approach based on sentence transformers; despite their conceptual alignment with the task, experiments show that both they and transformer-based models struggle, with sentence transformers even underperforming models like RoBERTa. An in-depth comparison with prior work further highlights the importance of a more realistic, balanced evaluation method: data imbalances distort model performance and evaluation metrics and weaken the trustworthiness of results.

Link: https://arxiv.org/abs/2507.21828
Authors: Anna Golub, Beate Zywietz, Annerose Eichel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ESSLLI 2025 Student Session

Abstract:While the task of assessing the plausibility of events such as ‘‘news is relevant’’ has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility as triggered by event modification. Understanding changes in plausibility is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection as it allows to correctly model, for example, ‘‘gentle sarcasm’’ as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6] consisting of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments provide a conceptually novel method by using sentence transformers, and reveal that both they and transformer-based models struggle with the task at hand, and sentence transformers - despite their conceptual alignment with the task - even under-perform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken result trustworthiness.
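
A minimal sketch of the sentence-transformer setup the summary describes: embed the two sentences of an ADEPT-style pair (differing by one adjective) and use their embedding difference as features for a plausibility-change classifier. The model name and downstream classifier are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
pairs = [("news is relevant", "fake news is relevant"),
         ("water is drinkable", "salty water is drinkable")]
emb = model.encode([s for pair in pairs for s in pair])
# One feature vector per pair: embedding difference (modified minus base).
feats = np.stack([emb[2 * i + 1] - emb[2 * i] for i in range(len(pairs))])
labels = [0, 1]  # e.g., 0 = plausibility unchanged/increased, 1 = decreased
clf = LogisticRegression().fit(feats, labels)  # train on ADEPT-style labels
```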

[NLP-14] HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs

[Quick Read]: This paper addresses the accuracy and safety of Large Language Models (LLMs) in providing harm reduction information to People Who Use Drugs (PWUD). Although some LLMs exhibit a decent level of medical knowledge, their performance in real harm reduction scenarios remains largely unexplored and carries potential risks. The key is HRIPBench, a benchmark of 2,160 question-answer-evidence pairs covering three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. Instruction and Retrieval-Augmented Generation (RAG) schemes evaluate model behaviour based on inherent knowledge and on integration with domain knowledge. The results show that state-of-the-art LLMs still struggle to provide accurate and safe harm reduction information, so their use in this domain should be cautiously constrained to avoid negative health outcomes for PWUD.

Link: https://arxiv.org/abs/2507.21815
Authors: Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao
Affiliations: University of St. Andrews; University of Edinburgh; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 15 pages, 5 figures, 12 tables, a dataset

Abstract:Millions of individuals’ well-being are challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLM’s accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes, carry out severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.

[NLP-15] Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish

[Quick Read]: This paper addresses the identification of English lexical borrowings (anglicisms) in Spanish journalistic texts, an important task for studying cross-lingual influence in natural language processing. The key lies in the variety of modeling techniques applied, including LLMs, deep learning models, Transformer-based models, and rule-based systems; through multi-team participation and comparison, F1 scores ranged from 0.17 to 0.99, highlighting the significant impact of model architecture and feature engineering on identification accuracy.

Link: https://arxiv.org/abs/2507.21813
Authors: Elena Alvarez-Mellado, Jordi Porta-Zamorano, Constantine Lignos, Julio Gonzalo
Affiliations: UNED; UAM; Brandeis University
Subjects: Computation and Language (cs.CL)
Comments: Accepted in the journal Procesamiento del Lenguaje Natural 75

Abstract:This paper summarizes the main findings of ADoBo 2025, the shared task on anglicism identification in Spanish proposed in the context of IberLEF 2025. Participants of ADoBo 2025 were asked to detect English lexical borrowings (or anglicisms) from a collection of Spanish journalistic texts. Five teams submitted their solutions for the test phase. Proposed systems included LLMs, deep learning models, Transformer-based models and rule-based systems. The results range from F1 scores of 0.17 to 0.99, which showcases the variability in performance different systems can have for this task.

[NLP-16] ChartMark: A Structured Grammar for Chart Annotation IEEE-VIS2025

[Quick Read]: This paper addresses the fragmented, non-standardized representation of chart annotations, which limits the portability of annotation semantics and the consistency of visualization implementations across platforms. The key innovation of the proposed structured grammar, ChartMark, is decoupling annotation semantics from visualization implementations, with a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context) that supports expression from abstract intents down to precise visual details. This significantly improves annotation portability and extensibility; the accompanying toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, validating the approach's practicality and expressiveness.

Link: https://arxiv.org/abs/2507.21810
Authors: Yiyu Chen, Yifan Wu, Shuyu Shen, Yupeng Xie, Leixian Shen, Hui Xiong, Yuyu Luo
Affiliations: HKUST(GZ); HKUST
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: IEEE VIS 2025

Abstract:Chart annotations enhance visualization accessibility but suffer from fragmented, non-standardized representations that limit cross-platform reuse. We propose ChartMark, a structured grammar that separates annotation semantics from visualization implementations. ChartMark features a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context), supporting both abstract intents and precise visual details. Our toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, highlighting its flexibility, expressiveness, and practical applicability.

[NLP-17] The Problem with Safety Classification is not just the Models

[Quick Read]: This paper addresses the evaluation of the robustness of safety classification models for Large Language Models (LLMs) in multilingual settings, noting that current ways of assessing safety classifiers are insufficient and that existing evaluation datasets may be biased across languages. The key is a systematic evaluation of 5 safety classification models on datasets covering 18 languages, revealing multilingual disparities and identifying potential issues with the evaluation datasets themselves, arguing that the shortcomings of current safety classifiers stem not only from the models but also from flaws in evaluation dataset design.

Link: https://arxiv.org/abs/2507.21782
Authors: Sowmya Vajjala
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Pre-print, Short paper

Abstract:Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We expect that these findings will contribute to the discussion on developing better methods to identify harmful content in LLM inputs across languages.

[NLP-18] AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models

[Quick Read]: This paper addresses the limited deployment of large language models (LLMs) in agriculture, mainly bottlenecked by the lack of high-quality training data and systematic evaluation benchmarks. The authors propose AgriEval, the first comprehensive Chinese agricultural benchmark, whose key features are: (1) a multi-dimensional evaluation covering six major agriculture categories and 29 subcategories, spanning four core cognitive scenarios: memorization, understanding, inference, and generation; (2) data curated from university-level examinations and assignments, ensuring professional, authentic content that effectively assesses knowledge application and expert-like decision making; and (3) 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, making it the largest agricultural benchmark to date. Experiments show that most mainstream LLMs fail to reach 60% accuracy, underscoring the room for development of agriculture-specific models, and extensive follow-up experiments investigate factors influencing performance and propose strategies for enhancement.

Link: https://arxiv.org/abs/2507.21773
Authors: Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, Jingchi Jiang
Affiliations: Harbin Institute of Technology; MemTensor (Shanghai) Technology Co., Ltd.
Subjects: Computation and Language (cs.CL)
Comments: 36 pages, 22 figures

Abstract:In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at this https URL.

[NLP-19] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

[Quick Read]: This paper addresses the vulnerability of pre-trained language models (PLMs) to adversarial attacks in real-world applications, i.e., insufficient robustness. The key is a simple yet effective add-on module that reshapes the embedding space by removing instance-level principal components, bringing it closer to Gaussian properties. Without relying on conventional adversarial defences or perturbing the original training data, this reduces the impact of adversarial noise on decision boundaries while preserving semantic relationships, achieving a balanced trade-off between robustness and generalisation.

Link: https://arxiv.org/abs/2507.21750
Authors: Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
Affiliations: The University of Manchester; Durham University; The University of Southampton
Subjects: Computation and Language (cs.CL)
Comments: This paper was accepted with an A-decision to Transactions of the Association for Computational Linguistics. This version is the pre-publication version prior to MIT Press production

Abstract:Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
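
A minimal sketch of principal component removal over a batch of embeddings, one common reading of the technique the abstract describes (the paper's exact instance-level procedure may differ): fit PCA on the batch, then subtract each embedding's projection onto the top components. The number of removed components k is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

def remove_principal_components(X: np.ndarray, k: int = 1) -> np.ndarray:
    Xc = X - X.mean(axis=0)                           # center the embedding batch
    pcs = PCA(n_components=k).fit(Xc).components_     # top-k directions, shape (k, d)
    proj = Xc @ pcs.T @ pcs                           # each instance's projection onto them
    return Xc - proj                                  # embeddings with those components removed

X = np.random.randn(128, 768)                         # e.g., a batch of PLM sentence embeddings
X_robust = remove_principal_components(X, k=2)
```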

[NLP-20] UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

[Quick Read]: This paper addresses safety challenges in the chain-of-thought (CoT) reasoning of large reasoning models (LRMs), in particular that existing SFT-based safety alignment work focuses on filtering prompts that yield safe, high-quality responses while overlooking "hard prompts" that consistently elicit harmful outputs. The key is UnsafeChain, a safety alignment dataset built from hard prompts with diverse sources, in which unsafe completions are identified and explicitly corrected into safe responses, enhancing safety while preserving general reasoning ability. Fine-tuning three LRMs on UnsafeChain and comparing against SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks shows that UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision.

Link: https://arxiv.org/abs/2507.21652
Authors: Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Cluster Innovation Centre, University of Delhi
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at this https URL

[NLP-21] Libra: Assessing and Improving Reward Model by Learning to Think

[Quick Read]: This paper addresses the underperformance of reward models (RMs) in challenging reasoning scenarios during reinforcement learning (RL) training, where prevailing RL paradigms rely on rule-based or reference-based rewards with two critical limitations: dependence on finely annotated reference answers, and strict constraints on output format. Both fundamentally hinder further scaling of RL data and sustained improvement of reasoning performance. The key contributions are twofold: first, Libra Bench, a reasoning-oriented benchmark systematically constructed from diverse, challenging mathematical problems and advanced reasoning models, remedying the inadequacy of existing reward model benchmarks in reasoning scenarios; second, the Libra-RM series, generative reward models with reasoning capabilities built via learning-to-think methodologies, which achieve state-of-the-art results on various benchmarks. Downstream experiments confirm the correlation between Libra Bench and downstream applications, and the potential of Libra-RM to further improve reasoning models with unlabeled data.

Link: https://arxiv.org/abs/2507.21645
Authors: Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
Affiliations: Meituan
Subjects: Computation and Language (cs.CL)
Comments: Work In Progress

Abstract:Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: 1) the dependence on finely annotated reference answer to attain rewards; and 2) the requirement for constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted and the experimental results demonstrate the correlation between our Libra Bench and downstream application, and the potential of Libra-RM to further improve reasoning models with unlabeled data.

[NLP-22] Multilingual JobBERT for Cross-Lingual Job Title Matching

[Quick Read]: This paper addresses cross-lingual job title matching, i.e., accurately identifying and aligning job titles with the same or similar meaning across languages to support multilingual labor market analysis. The key is JobBERT-V3, a contrastive learning-based model that leverages synthetic translations to build a balanced multilingual dataset of over 21 million job titles covering English, German, Spanish, and Chinese; it retains the efficiency-focused architecture of its predecessor JobBERT-V2 while enabling robust cross-lingual alignment without task-specific supervision.

Link: https://arxiv.org/abs/2507.21609
Authors: Jens-Joris Decorte, Matthias De Lange, Jeroen Van Hautte
Affiliations: TechWolf; Ghent University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the TalentCLEF 2025 Workshop as part of CLEF 2025

Abstract:We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: this https URL.
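
A minimal usage sketch for cross-lingual job title matching with a bi-encoder like JobBERT-V3: embed titles in different languages and rank candidates by cosine similarity. The checkpoint name below is a placeholder; use the identifier from the model's public release.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("TechWolf/JobBERT-v3")  # placeholder model id
queries = ["software engineer"]
candidates = ["Softwareentwickler", "ingeniero de software", "软件工程师", "Koch"]

q_emb = model.encode(queries, normalize_embeddings=True)
c_emb = model.encode(candidates, normalize_embeddings=True)
scores = util.cos_sim(q_emb, c_emb)            # (1, 4) similarity matrix
best = scores[0].argmax().item()
print(candidates[best], scores[0][best].item())  # closest cross-lingual match
```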

[NLP-23] Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

[Quick Read]: This paper studies sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models, arguing that the standard practice of distilling a single best translation obtained via beam search fails to capture the diversity of the teacher's output distribution and thus limits what the student can learn. The key is Multi-Hypothesis Distillation (MHD), which generates multiple candidate translations per source sentence (e.g., n-best lists from beam search), providing a larger representation of the teacher distribution and exposing the student to a wider range of target-side prefixes. For low-resource languages, while sampling methods may slightly compromise translation quality compared to beam search, they enrich the generated corpora with greater variability and lexical richness, ultimately improving student performance and mitigating the gender bias amplification often associated with KD.

Link: https://arxiv.org/abs/2507.21568
Authors: Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena
Affiliations: Universitat d'Alacant
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 12 figures

Abstract:This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model’s output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage n -best lists from beam search to guide the student’s learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam search based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.
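
A minimal sketch of generating the n-best lists that MHD distills from: beam search with multiple returned hypotheses per source sentence, using a multilingual encoder-decoder (NLLB here as an illustrative teacher, not necessarily the paper's model).

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"          # illustrative teacher model
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tok("The cat sleeps on the mat.", return_tensors="pt")
outs = teacher.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("spa_Latn"),  # target language
    num_beams=8,
    num_return_sequences=8,   # n-best hypotheses instead of the single mode
    max_new_tokens=64,
)
# Each hypothesis becomes a (source, target) training pair for the student.
nbest = tok.batch_decode(outs, skip_special_tokens=True)
```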

[NLP-24] Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers

[Quick Read]: This paper investigates whether neural models are cognitively plausible when handling the Spanish irregular morphomic pattern, i.e., whether models exhibit human-like sensitivity to a complex linguistic phenomenon. The key is to use the same analytical framework as the original human behavioral study and, under controlled input conditions, compare transformer-based neural networks with human data across three verb-frequency conditions: natural, low-frequency, and high-frequency. Although the models outperform humans in stem and suffix accuracy, a clear divergence emerges in response preferences: humans consistently favor natural responses across all test items, whereas the models prefer irregular responses, are influenced by the proportion of irregular verbs in their training data, and are sensitive to the phonological similarity between test items and real Spanish L-shaped verbs only under the natural and low-frequency (but not high-frequency) training conditions, revealing the limits of current models in simulating human linguistic cognition.

Link: https://arxiv.org/abs/2507.21556
Authors: Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney
Affiliations: Heinrich Heine University Düsseldorf; University of Florida
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study investigates the cognitive plausibility of the Spanish irregular morphomic pattern by directly comparing transformer-based neural networks to human behavioral data from \citetNevins2015TheRA. Using the same analytical framework as the original human study, we evaluate whether transformer models can replicate human-like sensitivity to a complex linguistic phenomena, the morphome, under controlled input conditions. Our experiments focus on three frequency conditions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns. While the models outperformed humans in stem and suffix accuracy, a clear divergence emerged in response preferences. Unlike humans, who consistently favored natural responses across all test items, models’ preferred irregular responses and were influenced by the proportion of irregular verbs in their training data. Additionally, models trained on the natural and low-frequency distributions, but not the high-frequency distribution, were sensitive to the phonological similarity between test items and real Spanish L-shaped verbs.

[NLP-25] MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

[Quick Read]: This paper addresses the detection and understanding of knowledge conflicts in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks have notable limitations, including a narrow focus on question answering, heavy reliance on entity substitution techniques, and a restricted range of conflict types. The key is a knowledge graph (KG)-based framework that exploits the explicit relational structure of KGs to generate varied and subtle contextual conflicts with improved interpretability and controllability. Results on the resulting benchmark, MAGIC, show that both open-source and proprietary LLMs struggle to detect conflicts accurately, especially when multi-hop reasoning is required, and often fail to pinpoint the exact source of contradiction, exposing current LLMs' weakness in integrating heterogeneous or even contradictory information.

Link: https://arxiv.org/abs/2507.21544
Authors: Jungyeon Lee, Kangmin Lee, Taeuk Kim
Affiliations: Hanyang University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.

[NLP-26] Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language

[Quick Read]: This paper addresses the scarcity of dependency-annotated resources for Uyghur natural language processing (NLP). Since existing universal dependency treebanks fit poorly with the structure of this low-resource, agglutinative language, the study proposes a tailored dependency annotation framework whose key elements are: a relation inventory of 18 main relations and 26 subtypes, with specialized labels such as cop:zero (for verbless clauses) and instr:case=loc/dat (for nuanced instrumental functions), grounded in nine annotation principles that ensure typological accuracy and semantic transparency. A cross-standard evaluation with a pre-trained Universal Dependencies parser reveals a systematic 47.9% divergence in annotations, demonstrating the framework's more accurate characterization of Uyghur grammatical structure. The resulting Modern Uyghur Dependency Treebank (MUDT) provides a high-quality data foundation for dependency parsing and downstream NLP tasks, and offers a replicable model for treebank construction in other morphologically complex languages.

Link: https://arxiv.org/abs/2507.21536
Authors: Jiaxin Zuo, Yiquan Wang, Yuan Pan, Xiadiya Yibulayin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:To address a critical resource gap in Uyghur Natural Language Processing (NLP), this study introduces a dependency annotation framework designed to overcome the limitations of existing treebanks for the low-resource, agglutinative language. This inventory includes 18 main relations and 26 subtypes, with specific labels such as cop:zero for verbless clauses and instr:case=loc/dat for nuanced instrumental functions. To empirically validate the necessity of this tailored approach, we conducted a cross-standard evaluation using a pre-trained Universal Dependencies parser. The analysis revealed a systematic 47.9% divergence in annotations, pinpointing the inadequacy of universal schemes for handling Uyghur-specific structures. Grounded in nine annotation principles that ensure typological accuracy and semantic transparency, the Modern Uyghur Dependency Treebank (MUDT) provides a more accurate and semantically transparent representation, designed to enable significant improvements in parsing and downstream NLP tasks, and offers a replicable model for other morphologically complex languages.

[NLP-27] Automatic Classification of User Requirements from Online Feedback – A Replication Study

[Quick Read]: This paper addresses the limited attention paid to replicating natural language processing (NLP) research in requirements engineering (RE), i.e., the lack of systematic replication and extension of prior NLP for RE (NLP4RE) studies. The key elements are: first, reproducing a baseline study on classifying requirements from user feedback with deep learning models using its publicly released source code; second, extending the setup by evaluating model generalization on an external dataset and comparing against a GPT-4o zero-shot classifier; and third, preparing and releasing a replication study ID-card to improve transparency and replicability. Results show diverse reproducibility levels across models: Naive Bayes is perfectly reproducible while BERT and other models show mixed results; BERT and ELMo exhibit good generalization on the external dataset; GPT-4o performs comparably to traditional baseline machine learning models; and the baseline study is confirmed to be replication-ready, lacking only environment setup files.

Link: https://arxiv.org/abs/2507.21532
Authors: Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary, Cole Thompson, Jacob Idoko, Aakash Sorathiya, Gouri Ginde
Affiliations: University of Calgary
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, Replication package available at this https URL, Accepted at AIRE 2025 (12th International Workshop on Artificial Intelligence and Requirements Engineering)

Abstract:Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), “Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning”, which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study’s replication readiness; however missing environment setup files would have further enhanced readiness. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study.

[NLP-28] TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

[Quick Read]: This paper addresses the significant computational bottleneck in the prefilling stage of Large Language Models (LLMs), caused by the quadratic growth of attention time complexity with input sequence length. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce extra overhead from runtime sparse index estimation. The key is TriangleMix, a training-free static attention pattern: dense attention in shallow layers, switching to a triangle-shaped sparse pattern in deeper layers. Without sacrificing accuracy, this reduces deep-layer attention overhead by 3.7x to 15.3x and overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths from 32K to 128K. Moreover, TriangleMix integrates seamlessly with dynamic sparsity methods for further speedup, e.g., accelerating MInference by 19% at 128K.

Link: https://arxiv.org/abs/2507.21526
Authors: Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Affiliations: Microsoft Research; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.
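
A toy illustration of a depth-dependent static sparsity pattern in the spirit of TriangleMix: full causal attention in shallow layers, and in deep layers a sparse causal mask that keeps only early "sink" tokens plus a recent local window. This is one plausible reading of a "triangle-shaped" pattern; the paper's exact mask, sink/window sizes, and layer split may differ.

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def sparse_triangle_mask(n: int, sink: int = 16, window: int = 256) -> torch.Tensor:
    m = torch.zeros(n, n, dtype=torch.bool)
    m[:, :sink] = True                              # every query sees the earliest tokens
    for q in range(n):
        m[q, max(0, q - window): q + 1] = True      # plus a recent local window
    return m & causal_mask(n)                       # keep the pattern causal

def mask_for_layer(layer: int, n_layers: int, n: int) -> torch.Tensor:
    # Dense in the shallow third of the network, sparse afterwards (illustrative split).
    return causal_mask(n) if layer < n_layers // 3 else sparse_triangle_mask(n)
```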

[NLP-29] Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

[Quick Read]: This paper addresses the low inference efficiency of transformer-based end-to-end automatic speech recognition (ASR) systems on resource-constrained devices such as CPUs, caused by the high cost of autoregressive decoding. The key is Token Map Drafting, a model-free speculative decoding (SD) technique that needs no separate draft model: it drafts candidate tokens from an n-gram token map precomputed from domain-specific training data, achieving efficient speculative decoding with minimal overhead. This significantly accelerates inference in structured, low-perplexity domains without sacrificing transcription accuracy, making it especially suitable for on-device ASR applications.

Link: https://arxiv.org/abs/2507.21522
Authors: Tuan Vu Ho, Hiroaki Kokubo, Masaaki Yamamoto, Yohei Kawaguchi
Affiliations: Hitachi, Ltd.
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at EUSIPCO 2025

Abstract:End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, hence limiting deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose \emphToken Map Drafting, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured, low-perplexity domains without sacrificing transcription accuracy. Experimental results demonstrate decoding speed-ups of 1.27\times on the CI-AVSR dataset and 1.37\times on our internal dataset without degrading recognition accuracy. Additionally, our approach achieves a 10% absolute improvement in decoding speed over the Distill-spec baseline running on CPU, highlighting its effectiveness for on-device ASR applications.
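
A minimal sketch of the model-free drafting idea: precompute an n-gram token map from domain-specific training transcripts, then at decode time propose a run of cheap draft tokens from the map and let the main ASR model verify them (the verification step of standard speculative decoding is elided here; map depth and draft length are illustrative assumptions).

```python
from collections import Counter, defaultdict

def build_token_map(corpora: list[list[int]], n: int = 3) -> dict:
    counts: dict = defaultdict(Counter)
    for toks in corpora:
        for i in range(len(toks) - n):
            counts[tuple(toks[i:i + n])][toks[i + n]] += 1
    # For each n-token context keep the single most frequent continuation.
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def draft(prefix: list[int], token_map: dict, n: int = 3, k: int = 5) -> list[int]:
    out = list(prefix)
    for _ in range(k):                        # propose up to k draft tokens
        nxt = token_map.get(tuple(out[-n:]))  # unseen context -> stop drafting
        if nxt is None:
            break
        out.append(nxt)
    return out[len(prefix):]                  # draft tokens for the main model to verify
```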
zh
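The n-gram token map is straightforward to prototype. The sketch below builds the map from domain transcripts and proposes a cheap draft continuation for the main ASR model to verify in a single batched pass; the function names are ours, not the paper's.

```python
from collections import defaultdict

def build_token_map(corpus_ids, n=3):
    """Map each (n-1)-gram prefix to its most frequent next token,
    precomputed from domain-specific training transcripts."""
    counts = defaultdict(lambda: defaultdict(int))
    for ids in corpus_ids:
        for k in range(len(ids) - n + 1):
            prefix, nxt = tuple(ids[k:k + n - 1]), ids[k + n - 1]
            counts[prefix][nxt] += 1
    return {p: max(c, key=c.get) for p, c in counts.items()}

def draft_tokens(token_map, context, max_draft=5, n=3):
    """Propose draft tokens from the map; the main model then verifies
    them in one batched forward pass (verification loop not shown)."""
    draft, ctx = [], list(context)
    for _ in range(max_draft):
        nxt = token_map.get(tuple(ctx[-(n - 1):]))
        if nxt is None:
            break
        draft.append(nxt)
        ctx.append(nxt)
    return draft

# Toy usage with integer token ids:
tm = build_token_map([[1, 2, 3, 4, 2, 3, 4, 5]])
print(draft_tokens(tm, [1, 2]))  # e.g. [3, 4, ...]
```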

[NLP-30] What Does it Mean for a Neural Network to Learn a “World Model”?

【Quick Read】: This paper addresses the lack of a precise, operational definition of a "world model" in current neural network research, aiming to provide a common language and evaluation standard for experimental work. The core challenge is distinguishing models that genuinely learn and use a latent state space from ones whose similar-looking representations are merely incidental byproducts of the data or task. The key to the solution is a set of formal criteria, based on ideas from the linear probing literature, that require a computation to factor through a representation of the data generation process, plus an additional set of conditions ruling out the possibility that the representation is a trivial consequence of the training data or task structure.

Link: https://arxiv.org/abs/2507.21513
Authors: Kenneth Li, Fernanda Viégas, Martin Wattenberg
Affiliations: Harvard University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We propose a set of precise criteria for saying a neural net learns and uses a “world model.” The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent “state space” of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a “world model” is not a trivial consequence of the neural net’s data or task.
zh

[NLP-31] Persona Vectors: Monitoring and Controlling Character Traits in Language Models

【Quick Read】: This paper addresses uncontrolled persona drift in large language models (LLMs) during deployment and fine-tuning, i.e., deviations from the ideal "helpful, harmless, honest" Assistant persona. The key to the solution is an automated method for extracting persona vectors, directions in the model's activation space that capture traits such as evil, sycophancy, and a propensity to hallucinate. By monitoring and intervening on shifts along these vectors, one can predict, control, and even prevent personality drift during training, and flag training samples likely to induce undesirable personality changes, enabling precise control over model behavior.

Link: https://arxiv.org/abs/2507.21509
Authors: Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
zh
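A common recipe for extracting such a direction is a difference of means between activations from trait-exhibiting and baseline responses; the paper's automated pipeline may be more involved, so treat this NumPy sketch as illustrative only.

```python
import numpy as np

def persona_vector(trait_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction in activation space.

    trait_acts / baseline_acts: [num_samples, hidden_dim] activations
    collected from responses that do / do not exhibit the trait.
    (A standard linear-probing recipe, assumed here for illustration.)
    """
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def trait_score(hidden_state: np.ndarray, v: np.ndarray) -> float:
    """Monitor deployment-time fluctuations by projecting the current
    hidden state onto the persona direction."""
    return float(hidden_state @ v)
```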

[NLP-32] VN-MTEB: Vietnamese Massive Text Embedding Benchmark

【Quick Read】: This paper addresses the lack of large-scale, diverse, high-quality test datasets for Vietnamese text embedding models, which prevents researchers from properly evaluating model performance before real-world deployment. The key to the solution is VN-MTEB, a Vietnamese benchmark for embedding models built by translating a large number of English samples from the Massive Text Embedding Benchmark (MTEB) with the authors' new automated framework. The pipeline leverages large language models (LLMs) and state-of-the-art embedding models for translation and filtering, ensuring natural language flow and semantic fidelity while preserving named entity recognition (NER) annotations and code snippets. The benchmark comprises 41 datasets across six task categories, providing systematic support for evaluating Vietnamese text embedding models.

Link: https://arxiv.org/abs/2507.21500
Authors: Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang
Affiliations: GreenNode AI; School of Electrical Engineering, International University, VNU-HCMC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages (including references and appendix); 41 datasets from 6 tasks (retrieval, classification, pair-classification, clustering, rerank, STS); 7 figures, 16 tables; benchmarks 18 text embedding models

Click to view abstract

Abstract:Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks. Datasets are available at HuggingFace: this https URL
zh

[NLP-33] Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs

【Quick Read】: This paper addresses label-efficient learning for supervised fine-tuning (SFT), i.e., improving model performance under a limited annotation budget. Whereas prior work mostly selects data for prompt diversity, this paper takes task diversity as the core principle, exploiting the ready availability of task labels and the widely varying confidence of pre-trained models across tasks to design a simple inverse-confidence-weighted sampling strategy. By preferentially drawing examples from low-confidence tasks, the method achieves more effective knowledge coverage at low computational cost, reduces annotation costs by up to 80%, matches or beats the best existing methods across annotation budgets and datasets, and even achieves a 4% MMLU improvement over training on the complete dataset.

Link: https://arxiv.org/abs/2507.21482
Authors: Abhinav Arabelly, Jagrut Nemade, Robert D Nowak, Jifan Zhang
Affiliations: University of Wisconsin–Madison
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation – a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on the prompt-diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80%.
zh
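The inverse-confidence sampling strategy can be sketched in a few lines; here we use weighted sampling without replacement (Efraimidis-Spirakis keys), with per-task weight 1/confidence so that low-confidence tasks are sampled more often. This illustrates the strategy, not the authors' code.

```python
import random

def select_by_inverse_confidence(examples, confidence, budget, seed=0):
    """Sample an annotation budget across tasks, weighted by 1/confidence.

    examples:   list of (task_label, example) pairs
    confidence: dict task_label -> pre-trained model confidence in (0, 1]
    """
    rng = random.Random(seed)
    keyed = []
    for task, ex in examples:
        w = 1.0 / confidence[task]
        # Efraimidis-Spirakis key: weighted sampling without replacement
        keyed.append((rng.random() ** (1.0 / w), task, ex))
    keyed.sort(key=lambda t: t[0], reverse=True)
    return [(task, ex) for _, task, ex in keyed[:budget]]
```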

[NLP-34] Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench

【Quick Read】: This paper addresses the lack of evaluations of large language model (LLM) intelligence in non-STEM domains such as humor understanding. As LLMs saturate STEM benchmarks in mathematics and science, new and challenging evaluations of cross-domain reasoning are needed. The key to the solution is HumorBench, a benchmark of roughly 300 unique cartoon-caption pairs from the New Yorker Caption Contest with expert-annotated rubrics identifying essential joke elements, used to systematically assess models' ability to explain humor mechanisms and identify their key elements. Doing well requires forming and testing hypotheses about conceptual associations and potentially backtracking from initial interpretations to reach the most plausible explanation, enabling precise measurement of deep semantic understanding and cultural commonsense reasoning.

Link: https://arxiv.org/abs/2507.21476
Authors: Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S.L. Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, Lalit Jain
Affiliations: University of Washington; University of Wisconsin-Madison; Air Mail and Cartoon Collections
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present HumorBench, a benchmark designed to evaluate large language models’ (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplays, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and this http URL, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated based on their explanations towards the humor and abilities in identifying the joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling by increasing thinking token budgets yields mixed results across different models in humor reasoning.
zh

[NLP-35] Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour

【Quick Read】: This paper addresses the limited interpretability and local deployability of travel mode choice prediction models, in particular the fact that mainstream large language models (LLMs) used in travel behavior modeling lack causal grounding, are hard to deploy locally, and offer limited explanations. The key to the solution is LiTransMC, the first causal LLM fine-tuned for mode choice: using parameter-efficient fine-tuning with a loss-masking strategy, it achieves a weighted F1 score of 0.6845 and near-perfect distributional calibration (Jensen-Shannon Divergence of 0.000245) across three stated- and revealed-preference datasets, outperforming untuned local models, larger proprietary systems such as GPT-4o, and classical discrete choice models and machine learning classifiers. By combining structured behavioral prediction with natural-language reasoning, it delivers both predictive accuracy and transparent decision logic, offering a feasible path toward privacy-preserving, cost-effective, multi-task conversational tools for transport simulation and policy testing.

Link: https://arxiv.org/abs/2507.21432
Authors: Tareq Alsaleh, Bilal Farooq
Affiliations: Toronto Metropolitan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task. We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions. Beyond predictive accuracy, we evaluate models' generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory. LiTransMC, fine-tuned using a parameter efficient and loss masking strategy, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset. This dual improvement, i.e., high instant-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability. Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment.
zh

[NLP-36] MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

【Quick Read】: This paper addresses the inefficient management of tool or Model Context Protocol (MCP) server context in multi-turn large language model (LLM) agent interactions, caused by fixed context windows. The key to the solution is MemTool, a short-term memory framework that lets LLM agents dynamically manage tool or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: Autonomous Agent Mode, Workflow Mode, and Hybrid Mode, providing fully autonomous control, deterministic control, and a combination of the two, enabling flexible trade-offs between tool-removal efficiency and task-completion accuracy.

Link: https://arxiv.org/abs/2507.21428
Authors: Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke
Affiliations: PricewaterhouseCoopers
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 20 figures

Click to view abstract

Abstract:Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90-94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0-60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
zh

[NLP-37] ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

【Quick Read】: This paper addresses the rapidly growing training cost of multimodal large language models (MLLMs) as token counts increase; existing efficiency methods mainly target inference via token reduction or merging and offer limited benefit during training. The key to the solution is ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token-pruning method for accelerating MLLM training built on a teacher-student framework: the MLLM being trained is the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores; the resulting adaptive scores let the forward pass process crucial tokens while skipping less informative ones. Applied to VideoLLaMA2, ReGATE matches the peak accuracy of standard training on MVBench up to 2x faster using only 35% of the tokens, and with additional training surpasses the baseline on several multimodal benchmarks while cutting the total token count by over 41%.

Link: https://arxiv.org/abs/2507.21420
Authors: Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
Affiliations: Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference - Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2x faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
zh
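A sketch of the per-token selection step: the frozen teacher's reference loss is combined with an EMA of the student's difficulty, and only the top-scoring tokens are processed in the forward pass. The exact combination rule and tensor shapes here are assumptions for illustration, not the paper's implementation.

```python
import torch

def regate_select(teacher_loss, student_loss, ema_state, keep_ratio=0.35,
                  alpha=0.9):
    """Per-token selection for one batch.

    teacher_loss, student_loss: [batch, seq_len] per-token losses from the
    frozen reference LLM and the student MLLM.
    ema_state: running EMA of student difficulty, updated in place
    (assumed aligned with the batch's token positions).
    Returns a boolean mask of tokens to process in the forward pass.
    """
    ema_state.mul_(alpha).add_(student_loss.detach(), alpha=1 - alpha)
    score = teacher_loss + ema_state            # assumed additive combination
    k = max(1, int(keep_ratio * score.shape[1]))
    idx = score.topk(k, dim=1).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask
```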

[NLP-38] Multimodal LLMs as Customized Reward Models for Text-to-Image Generation ICCV2025

【Quick Read】: This paper addresses limitations of existing multimodal large language model (MLLM)-based methods for evaluating text-to-image (T2I) generation quality: they require instruction-following data for supervised fine-tuning, are time-consuming and hard to train, and judge quality only by analyzing text responses. The key to the solution is LLaVA-Reward, an efficient reward model with three main ingredients: 1) it directly uses the hidden states of a pretrained MLLM given text-image pairs, avoiding costly supervised fine-tuning; 2) a Skip-connection Cross Attention (SkipCA) module strengthens bidirectional interaction between visual and textual representations in decoder-only MLLMs by connecting early-layer visual features with later-layer hidden states, improving text-image correlation reasoning; and 3) it supports multiple preference-data formats (paired and unpaired) for efficient fine-tuning. Trained on four evaluation perspectives (text-image alignment, fidelity/artifacts, safety, and overall ranking), LLaVA-Reward outperforms conventional and MLLM-based methods in producing human-aligned scores and supports inference-time scaling during generation.

Link: https://arxiv.org/abs/2507.21391
Authors: Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
Affiliations: University at Buffalo; Adobe Research; Pennsylvania State University; Luma AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICCV 2025. Code available at this https URL

Click to view abstract

Abstract:We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality on analyzing text response, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden states. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
zh
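The SkipCA idea, later-layer hidden states attending back to early-layer visual features through a residual (skip) connection, can be sketched as a small PyTorch module; the dimensions and its placement inside the MLLM are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkipCA(nn.Module):
    """Skip-connection cross-attention: later-layer hidden states (queries)
    attend to early-layer visual features (keys/values), with a residual
    connection. A sketch of the abstract's description, not the released code."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, early_visual: torch.Tensor):
        # hidden:       [B, T, D] later-layer hidden states
        # early_visual: [B, V, D] early-layer visual features
        out, _ = self.attn(self.norm(hidden), early_visual, early_visual)
        return hidden + out  # skip connection
```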

[NLP-39] Teaching Language Models To Gather Information Proactively

【Quick Read】: This paper addresses the tendency of current large language models (LLMs) to respond passively or ask only narrow clarifications when given incomplete or ambiguous prompts, failing to proactively elicit the implicit user information crucial for high-quality solutions. The key to the solution is a new task paradigm, proactive information gathering, together with a scalable framework that generates partially specified real-world tasks by masking key information to simulate authentic ambiguity; the core innovation is a reinforcement fine-tuning strategy that rewards questions eliciting genuinely new, implicit user knowledge (such as hidden domain expertise or fine-grained requirements), steering models from passive text generation toward genuinely collaborative thought partnership.

Link: https://arxiv.org/abs/2507.21389
Authors: Tenghao Huang, Sihao Chen, Muhao Chen, Jonathan May, Longqi Yang, Mengting Wan, Pei Zhou
Affiliations: University of Southern California; Microsoft Corporation; University of California, Davis
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information – such as hidden domain expertise or fine-grained requirements – that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
zh

[NLP-40] Turbocharging Web Automation: The Impact of Compressed History States

【Quick Read】: This paper addresses the neglect of history states in current web automation methods, which predict the next action from only the current web state, history actions, and the language instruction. The core obstacle is the highly verbose nature of web page states, which produces long, information-sparse input sequences and makes history states hard to exploit. The key to the solution is a novel web history compressor: a history compression module distills the most task-relevant information from each history state into a fixed-length short representation, mitigating verbosity and sparsity and markedly improving the use of history states. Experiments on the Mind2Web and WebLINX datasets show 1.2-5.4% absolute accuracy improvements over a baseline without history inputs.

Link: https://arxiv.org/abs/2507.21369
Authors: Xiyue Zhu, Peng Tang, Haofu Liao, Srikar Appalaraju
Affiliations: University of Illinois at Urbana-Champaign; AWS AI Labs
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.
zh
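One way to realize a fixed-length history-state summary is Perceiver-style pooling with a small set of learned query vectors. The paper does not specify its compressor's internals, so this module is a hypothetical sketch of the fixed-length idea only.

```python
import torch
import torch.nn as nn

class HistoryCompressor(nn.Module):
    """Distill one verbose history state into a fixed-length representation
    via learned-query attention pooling (an assumed design, not the paper's)."""

    def __init__(self, dim: int, n_slots: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, state_tokens: torch.Tensor) -> torch.Tensor:
        # state_tokens: [B, T, D] token embeddings of one history state
        q = self.queries.unsqueeze(0).expand(state_tokens.size(0), -1, -1)
        out, _ = self.attn(q, state_tokens, state_tokens)
        return out  # [B, n_slots, D] fixed-length summary
```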

[NLP-41] StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation

【Quick Read】: This paper addresses the difficulty of automatically evaluating key-value pair extraction from unstructured text in enterprise settings, where high-quality, scalable benchmarks are lacking for domain- or organization-specific documents, and building them by manual annotation is labor-intensive and hard to scale. The key to the solution is StructText, an end-to-end framework that uses existing tabular data as structured ground truth and synthesizes corresponding natural-language text via a two-stage "plan-then-execute" pipeline, combined with a multi-dimensional evaluation strategy: LLM-based judgments of factuality, hallucination, and coherence on one side, and objective extraction metrics for numeric and temporal accuracy on the other, enabling high-fidelity automatic benchmark generation.

Link: https://arxiv.org/abs/2507.21340
Authors: Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz
Affiliations: IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Data available: this https URL and code available at: this https URL

Click to view abstract

Abstract:Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or focused documents specific to a given organization. Building such benchmarks by manual annotations is labour-intensive and limits the size and scalability of the benchmarks. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth, and follows a two-stage "plan-then-execute" pipeline to synthetically generate corresponding natural-language text. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring numeric and temporal accuracy. We evaluated the proposed method on 71,539 examples across 49 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models preserve numerical and temporal information with high fidelity yet this information becomes embedded in narratives that resist automated extraction. We release a framework, including datasets, evaluation tools, and baseline extraction systems, to support continued research.
zh

[NLP-42] A Deep Learning Automatic Speech Recognition Model for Shona Language

【Quick Read】: This paper addresses the low accuracy of automatic speech recognition (ASR) for Shona, a low-resource language, where the main challenges are scarce training data, a lack of labelled data, and Shona's distinctive tonal and grammatical complexity. The key to the solution is a hybrid deep learning architecture: a Convolutional Neural Network (CNN) for acoustic modelling combined with a Long Short-Term Memory (LSTM) network for language modelling, with attention mechanisms added to capture tonal features; data augmentation and transfer learning mitigate data scarcity. The resulting system achieves a Word Error Rate (WER) of 29%, a Phoneme Error Rate (PER) of 12%, and an overall accuracy of 74%, clearly surpassing traditional statistical models and confirming the effectiveness of deep learning for improving low-resource-language ASR.

Link: https://arxiv.org/abs/2507.21331
Authors: Leslie Wellington Sirora, Mainford Mutandavari
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
zh

[NLP-43] Do Large Language Models Understand Morality Across Cultures?

【Quick Read】: This paper investigates bias in how current large language models (LLMs) represent moral views across cultures, i.e., whether models accurately capture cross-cultural differences and commonalities in moral attitudes. The study finds that existing LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and aligning poorly with empirical survey data. The key to the approach is three complementary methods: comparing the variance of moral scores between model outputs and international surveys, cluster alignment analysis to assess the correspondence of country groupings, and direct probing of models with structured comparative prompts. Together these systematically identify and quantify the underrepresentation of cross-cultural morality in models and motivate future work on cultural representativeness and ethical alignment.

Link: https://arxiv.org/abs/2507.21319
Authors: Hadi Mohammadi, Yasmeen F.S.S. Meijer, Efthymia Papadopoulou, Ayoub Bagheri
Affiliations: Utrecht University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recent advancements in large language models (LLMs) have established them as powerful tools across numerous domains. However, persistent concerns about embedded biases, such as gender, racial, and cultural biases arising from their training data, raise significant questions about the ethical use and societal consequences of these technologies. This study investigates the extent to which LLMs capture cross-cultural differences and similarities in moral perspectives. Specifically, we examine whether LLM outputs align with patterns observed in international survey data on moral attitudes. To this end, we employ three complementary methods: (1) comparing variances in moral scores produced by models versus those reported in surveys, (2) conducting cluster alignment analyses to assess correspondence between country groupings derived from LLM outputs and survey data, and (3) directly probing models with comparative prompts using systematically chosen token pairs. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and exhibit low alignment with empirical survey patterns. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs. We conclude by discussing the implications for the responsible development and global deployment of LLMs, emphasizing fairness and ethical alignment.
zh

[NLP-44] Can human clinical rationales improve the performance and explainability of clinical text classification models?

【Quick Read】: This paper asks whether human-provided clinical rationales can serve as additional supervision to improve both the performance and the explainability of transformer-based clinical text classification, especially under resource constraints. The key of the study is training on 99,125 human-annotated clinical rationales alongside 128,649 electronic pathology reports, and using a "sufficiency" metric to automatically pre-select high-quality rationales. The results show that rationales yield small gains in high-resource settings but behave inconsistently when resources are limited; more importantly, simply adding more annotated reports improves accuracy more than adding rationales, which offer only a slight advantage in explainability. The paper therefore recommends prioritizing the annotation of more reports when the goal is maximizing accuracy, and considering rationale-supplemented training when explainability is the priority.

Link: https://arxiv.org/abs/2507.21302
Authors: Christoph Metzner, Shang Gao, Drahomira Herrmannova, Heidi A. Hanson
Affiliations: The University of Tennessee; Thomson Reuters; Oak Ridge National Laboratory
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don’t consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.
zh

[NLP-45] LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

【Quick Read】: This paper addresses the resource inefficiency and delayed adaptation caused by deploying large language model (LLM) inference serving and continuous retraining on separate servers in isolated phases, which leads to GPU idleness and slow adaptation to new data; the root causes are dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. The key to the solution is LeMix, a system for co-locating and managing concurrent LLM serving and training workloads: it integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on task-specific behaviors and co-execution interference across shared nodes, improving utilization and serving quality without compromising responsiveness. Evaluations show up to 3.53x higher throughput, up to 0.61x lower inference loss, and up to 2.12x higher response-time SLO attainment over traditional separate setups.

Link: https://arxiv.org/abs/2507.21276
Authors: Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu
Affiliations: University of California, Riverside
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by RTSS 2025

Click to view abstract

Abstract:Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.
zh

[NLP-46] CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting ISWC2025

【Quick Read】: This paper asks whether large language models (LLMs) interpret language systematically and compositionally when mapping natural-language questions to structured queries such as SPARQL: even if a model understands individual words and phrases, can it combine them to handle unseen, structurally complex questions? The key to the solution is a benchmark of three datasets of increasing difficulty, generated in a tightly controlled fashion from DBpedia graph patterns and verbalized with Lemon lexica, which strictly tests whether LLMs that have seen the atomic building blocks can parse structurally complex questions. Experiments show macro-averaged F1 degrading from 0.45 through 0.26 down to 0.09 as structural deviation grows, and even with all necessary information provided in the input, F1 never exceeds 0.57 on the lowest-complexity dataset, indicating that current LLMs struggle to interpret questions compositionally and map them to SPARQL.

Link: https://arxiv.org/abs/2507.21257
Authors: David Maria Schmidt, Raoul Schubert, Philipp Cimiano
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Research Track, 24th International Semantic Web Conference (ISWC 2025), November 2-6, 2025, Nara, Japan

Click to view abstract

Abstract:Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro F1 degrades from 0.45 through 0.26 down to 0.09 with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the F1 scores do not exceed 0.57 for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
zh

[NLP-47] Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach

【Quick Read】: This paper addresses the difficulty of detecting hyperpartisan news in Bangla, a low-resource language with few sophisticated NLP methods, which allows biased content to spread unchecked. The key to the solution is fine-tuning Bangla BERT, a state-of-the-art transformer-based pretrained language model, which markedly improves classification accuracy; semi-supervised learning further strengthens predictions, and LIME (Local Interpretable Model-agnostic Explanations) provides transparent explanations of the model's decision process, building trust in its outcomes. With a 95.65% accuracy score, the approach outperforms traditional machine learning models, confirming the usefulness of transformer models even in low-resource settings.

Link: https://arxiv.org/abs/2507.21242
Authors: Mohammad Mehadi Hasan, Fatema Binte Hassan, Md Al Jubair, Zobayer Ahmed, Sazzatul Yeakin, Md Masum Billah
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In the current digital landscape, misinformation circulates rapidly, shaping public perception and causing societal divisions. It is difficult to identify hyperpartisan news in Bangla since there aren’t many sophisticated natural language processing methods available for this low-resource language. Without effective detection methods, biased content can spread unchecked, posing serious risks to informed discourse. To address this gap, our research fine-tunes Bangla BERT. This is a state-of-the-art transformer-based model, designed to enhance classification accuracy for hyperpartisan news. We evaluate its performance against traditional machine learning models and implement semi-supervised learning to enhance predictions further. Not only that, we use LIME to provide transparent explanations of the model’s decision-making process, which helps to build trust in its outcomes. With a remarkable accuracy score of 95.65%, Bangla BERT outperforms conventional approaches, according to our trial data. The findings of this study demonstrate the usefulness of transformer models even in environments with limited resources, which opens the door to further improvements in this area.
zh

[NLP-48] Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability

【Quick Read】: This paper addresses sentiment classification of crime-related social media comments in a low-resource language (Bangla) in order to track dynamically shifting public perception of criminal incidents. The key to the solution is a newly curated dataset of 28,528 Bangla-language comments and a transformer model based on the XLM-RoBERTa Base architecture that reaches 97% classification accuracy, surpassing existing state-of-the-art methods for Bangla sentiment analysis; explainable AI techniques are used to identify the features that most influence sentiment classification, improving transparency and yielding actionable insights to support public policy formulation and crime prevention.

Link: https://arxiv.org/abs/2507.21234
Authors: Fatema Binte Hassan, Md Al Jubair, Mohammad Mehadi Hasan, Tahmid Hossain, S M Mehebubur Rahman Khan Shuvo, Mohammad Shamsul Arefin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In recent years, social media platforms have become prominent spaces for individuals to express their opinions on ongoing events, including criminal incidents. As a result, public sentiment can shift dynamically over time. This study investigates the evolving public perception of crime-related news by classifying user-generated comments into three categories: positive, negative, and neutral. A newly curated dataset comprising 28,528 Bangla-language social media comments was developed for this purpose. We propose a transformer-based model utilizing the XLM-RoBERTa Base architecture, which achieves a classification accuracy of 97%, outperforming existing state-of-the-art methods in Bangla sentiment analysis. To enhance model interpretability, explainable AI technique is employed to identify the most influential features driving sentiment classification. The results underscore the effectiveness of transformer-based models in processing low-resource languages such as Bengali and demonstrate their potential to extract actionable insights that can support public policy formulation and crime prevention strategies.
zh

[NLP-49] Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

【Quick Read】: This paper addresses the interpretability of transformer-based text classification models, in particular the vulnerability of existing activation-based attribution methods to class-irrelevant features, which makes explanations unreliable. The key to the solution is Contrast-CAT, which contrasts the activations of an input sequence with reference activations to filter out class-irrelevant features, producing clearer and more faithful token-level attribution maps and substantially improving the interpretability and trustworthiness of model decisions.

Link: https://arxiv.org/abs/2507.21186
Authors: Sungmin Han, Jeonghyun Lee, Sangkyun Lee
Affiliations: Korea University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simpler tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of 1.30x in AOPC and 2.25x in LOdds over the most competitive methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
zh

[NLP-50] EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models

【Quick Read】: This paper addresses the difficulty of automatically discovering scaling laws that describe how neural network performance varies with model size, dataset size, and compute, a process that traditionally requires extensive expertise and manual trial-and-error experimentation. The key to the solution is EvoSLD, a framework for automated Scaling Law Discovery that uses evolutionary algorithms guided by large language models (LLMs) to co-evolve symbolic expressions and their optimization routines, searching across diverse experimental settings for parsimonious, universal functional forms that minimize fitting error, thereby enabling efficient and accurate automated discovery of scaling laws.

Link: https://arxiv.org/abs/2507.21184
Authors: Haowei Lin, Xiangyu Wang, Jianzhu Ma, Yitao Liang
Affiliations: Peking University; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Scaling laws are fundamental mathematical relationships that predict how neural network performance evolves with changes in variables such as model size, dataset size, and computational resources. Traditionally, discovering these laws requires extensive human expertise and manual experimentation. We introduce EvoSLD, an automated framework for Scaling Law Discovery (SLD) that leverages evolutionary algorithms guided by Large Language Models (LLMs) to co-evolve symbolic expressions and their optimization routines. Formulated to handle scaling variables, control variables, and response metrics across diverse experimental settings, EvoSLD searches for parsimonious, universal functional forms that minimize fitting errors on grouped data subsets. Evaluated on five real-world scenarios from recent literature, EvoSLD rediscovers exact human-derived laws in two cases and surpasses them in others, achieving up to orders-of-magnitude reductions in normalized mean squared error on held-out test sets. Compared to baselines like symbolic regression and ablated variants, EvoSLD demonstrates superior accuracy, interpretability, and efficiency, highlighting its potential to accelerate AI research. Code is available at this https URL.
zh
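For context, the kind of symbolic form EvoSLD searches over can be fit directly once a functional form is fixed; a minimal SciPy example with hypothetical data (not from the paper) of what the framework automates:

```python
import numpy as np
from scipy.optimize import curve_fit

# Classic saturating power law, L(N) = a * N^(-alpha) + c, one of the
# candidate forms a scaling-law search might propose (illustrative only).
def power_law(n, a, alpha, c):
    return a * np.power(n, -alpha) + c

n = np.array([1e7, 1e8, 1e9, 1e10])    # model sizes (hypothetical)
loss = np.array([3.9, 3.1, 2.6, 2.3])  # observed losses (hypothetical)
params, _ = curve_fit(power_law, n, loss, p0=(10.0, 0.1, 1.0), maxfev=10000)
a, alpha, c = params
print(f"L(N) ~ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
```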

[NLP-51] MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

【Quick Read】: This paper addresses shortcomings in aligning large language models (LLMs) with human preferences: existing methods such as Direct Preference Optimization (DPO) and its variants cast preference learning as Maximum Likelihood Estimation (MLE), ignore available prior reward knowledge, and reduce response quality to an oversimplified binary judgment. The key to the solution is Maximum a Posteriori Preference Optimization (MaPPO), which explicitly incorporates prior reward estimates into a principled Maximum a Posteriori (MaP) objective, generalizing DPO and its variants while mitigating the binary oversimplification. MaPPO introduces no additional hyperparameters, supports both offline and online preference optimization, and works as a plugin with consistent gains on DPO variants including the widely used SimPO, IPO, and CPO, improving alignment without sacrificing computational efficiency.

Link: https://arxiv.org/abs/2507.21183
Authors: Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
Affiliations: Purdue University; University of California, San Diego; University of Rochester; Georgia Institute of Technology; Tencent AI Lab; Yonsei University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:As the era of large language models (LLMs) on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
zh
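For reference, the DPO objective that MaPPO generalizes is shown first; the second expression is only a hedged sketch of how a prior reward estimate r-hat might shift the preference margin in a MaP-style objective. It is not the paper's exact formula.

```latex
% DPO (MLE view of preference learning):
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)

% Hedged sketch: prior reward estimates \hat{r} shift the margin, softening
% the purely binary win/lose signal (illustrative, not the paper's form):
\mathcal{L}_{\mathrm{MaP}} = -\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
- \lambda \left( \hat{r}(x, y_w) - \hat{r}(x, y_l) \right) \right)
```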

[NLP-52] OneShield – the Next Generation of LLM Guardrails

【Quick Read】: This paper addresses the safety, privacy, and ethical risks raised by applying large language models (LLMs), especially the fact that constantly evolving LLMs make one-size-fits-all protective mechanisms unfeasible across diverse scenarios. The key to the solution is OneShield, a stand-alone, model-agnostic, and customizable guardrail framework whose core capabilities include letting users define risk factors, express and declare contextual safety and compliance policies, and dynamically mitigate LLM risks tailored to each specific customer, enabling fine-grained control over potential risks in different application settings.

Link: https://arxiv.org/abs/2507.21170
Authors: Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty
Affiliations: IBM Research; Princeton University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The rise of Large Language Models has created a general excitement about the great potential for a myriad of applications. While LLMs offer many possibilities, questions about safety, privacy, and ethics have emerged, and all the key actors are working to address these issues with protective measures for their own models and standalone solutions. The constantly evolving nature of LLMs makes the task of universally shielding users against their potential risks extremely challenging, and one-size-fits-all solutions unfeasible. In this work, we propose OneShield, our stand-alone, model-agnostic and customizable solution to safeguard LLMs. OneShield aims to provide facilities for defining risk factors, expressing and declaring contextual safety and compliance policies, and mitigating LLM risks, with a focus on each specific customer. We describe the implementation of the framework, the scalability considerations and provide usage statistics of OneShield since its first deployment.
zh

[NLP-53] Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

【Quick Read】: This paper addresses the challenge of using diversity effectively to improve large language model (LLM) performance on binary questions: which kind of diversity most improves ensemble accuracy? The key to the solution is a comparison of two approaches: model diversity, where multiple different models answer the same question, and question interpretation diversity, where the same model answers the same question framed in different ways, using majority voting as the ensemble consensus heuristic. Experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently beats model diversity, indicating that diversifying the question framing is the more effective way to improve ensembles.

Link: https://arxiv.org/abs/2507.21168
Authors: Rafael Rosales, Santiago Miret
Affiliations: Intel Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
zh
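The question-interpretation-diversity ensemble is easy to express: one model, several rephrasings, majority vote. `model` and `paraphrase` below are hypothetical stand-ins for an LLM call and a rephrasing step, not APIs from the paper.

```python
from collections import Counter

def ensemble_answer(model, question, paraphrase, n_views=5):
    """Ask one model the same binary question under several rephrasings,
    then take the majority vote as the ensemble consensus.

    model(prompt) -> "yes" | "no"; paraphrase(q, i) -> str (both assumed).
    """
    votes = [model(paraphrase(question, i)) for i in range(n_views)]
    return Counter(votes).most_common(1)[0][0]
```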

[NLP-54] TTS-1 Technical Report

【Quick Read】: This paper targets high-quality, low-latency text-to-speech (TTS) generation with emotional control for demanding applications that require naturalness and expressiveness. The key to the solution is two Transformer-based autoregressive TTS models, TTS-1 (1.6B parameters) and TTS-1-Max (8.8B parameters), optimized by scaling train-time compute and applying a sequential pipeline of pre-training, fine-tuning, and reinforcement learning alignment of the speech-language model (SpeechLM) component, achieving state-of-the-art performance across a variety of benchmarks. The models capture a speaker's voice purely through in-context learning, support 11 languages and high-resolution 48 kHz output with low-latency synthesis, and provide fine-grained emotional control and non-verbal vocalizations via audio markups.

Link: https://arxiv.org/abs/2507.21138
Authors: Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu, Dmytro Semernia, Evgenii Shingarev, Vikram Sivaraja, Peter Skirko, Rinat Takhautdinov, Robert Villahermosa, Jean Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 20 pages, 10 figures. For associated modeling and training code, see this https URL

Click to view abstract

Abstract:We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker’s voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
zh

[NLP-55] TRIDENT: Benchmarking LLM Safety in Finance, Medicine and Law

【Quick Read】: This paper addresses the lack of systematic, domain-specific safety evaluation for large language models (LLMs) deployed in high-risk professional fields such as law, finance, and medicine, where prior work has focused on improving performance while neglecting ethical compliance risks. The key to the solution is to first define domain-specific safety principles grounded in the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics, and then build Trident-Bench, a benchmark specifically targeting LLM safety and compliance in these three professions. Evaluating 19 general-purpose and domain-specialized models shows the benchmark effectively reveals key safety gaps: strong generalist models meet basic expectations while domain-specialized models often struggle with subtle ethical nuances, providing a quantifiable, reproducible foundation for fine-grained safety improvements in regulated professional settings.

Link: https://arxiv.org/abs/2507.21134
Authors: Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps – strong generalist models (e.g., GPT, Gemini) can meet basic expectations, whereas domain-specialized models often struggle with subtle ethical nuances. This highlights an urgent need for finer-grained domain-specific safety improvements. By introducing Trident-Bench, our work provides one of the first systematic resources for studying LLM safety in law and finance, and lays the groundwork for future research aimed at reducing the safety risks of deploying LLMs in professionally regulated fields. Code and benchmark will be released at: this https URL
zh

[NLP-56] InsurTech innovation using natural language processing

【Quick Read】: This paper addresses how traditional insurance companies can exploit alternative data sources and advanced technology to sustain their competitive edge amid the rapid rise of InsurTech, where the core challenge is transforming raw, unstructured text into structured information suitable for actuarial analysis and decision-making. The key to the solution is applying natural language processing (NLP) to real-world alternative data provided by an InsurTech industry partner: the text-derived insights not only add to and refine traditional commercial insurance rating factors but also offer novel perspectives on underlying risk by introducing new industry classifications, showing that NLP is no longer a supplementary tool but a foundational element of modern data-driven insurance analytics.

Link: https://arxiv.org/abs/2507.21112
Authors: Panyi Dong, Zhiyu Quan
Affiliations: University of Illinois at Urbana-Champaign
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations with a focus on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate practical use cases in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classifications. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element for modern, data-driven insurance analytics.
zh

[NLP-57] SemRAG : Semantic Knowledge-Augmented RAG for Improved Question-Answering

【Quick Read】: This paper addresses the high computational cost, overfitting risk, and poor scalability of existing approaches to integrating domain knowledge into Retrieval Augmented Generation (RAG). The key to the solution is SemRAG: a semantic chunking algorithm segments documents based on the cosine similarity of sentence embeddings, preserving semantic coherence while reducing computational overhead, and retrieved information is structured into a knowledge graph that explicitly models relationships between entities, improving retrieval accuracy and contextual understanding. Experiments on the MultiHop RAG and Wikipedia datasets show that SemRAG significantly outperforms traditional RAG, enabling an efficient, accurate domain-specific LLM pipeline without resource-intensive fine-tuning.

Link: https://arxiv.org/abs/2507.21110
Authors: Kezhen Zhong, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
Affiliations: University of Sydney; University of New South Wales; Qatar University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages, 12 figures

Click to view abstract

Abstract:This paper introduces SemRAG, an enhanced Retrieval Augmented Generation (RAG) framework that efficiently integrates domain-specific knowledge using semantic chunking and knowledge graphs without extensive fine-tuning. Integrating domain-specific knowledge into large language models (LLMs) is crucial for improving their performance in specialized tasks. Yet, existing adaptations are computationally expensive, prone to overfitting and limit scalability. To address these challenges, SemRAG employs a semantic chunking algorithm that segments documents based on the cosine similarity from sentence embeddings, preserving semantic coherence while reducing computational overhead. Additionally, by structuring retrieved information into knowledge graphs, SemRAG captures relationships between entities, improving retrieval accuracy and contextual understanding. Experimental results on MultiHop RAG and Wikipedia datasets demonstrate that SemRAG significantly enhances the relevance and correctness of retrieved information from the Knowledge Graph, outperforming traditional RAG methods. Furthermore, we investigate the optimization of buffer sizes for different data corpora, as optimizing buffer sizes tailored to specific datasets can further improve retrieval performance, and the integration of knowledge graphs strengthens entity relationships for better contextual comprehension. The primary advantage of SemRAG is its ability to create an efficient, accurate domain-specific LLM pipeline while avoiding resource-intensive fine-tuning. This makes it a practical and scalable approach aligned with sustainability goals, offering a viable solution for AI applications in domain-specific fields.
zh
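To make the chunking step concrete, here is a minimal sketch of similarity-threshold chunking as the abstract describes it: a new chunk starts wherever the cosine similarity between adjacent sentence embeddings drops. This is an illustration, not the authors' code; the `toy_embed` function and the 0.6 threshold are stand-ins, and a real pipeline would use a sentence-embedding model.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def semantic_chunks(sentences, embed, threshold=0.6):
    """Greedy semantic chunking: close the current chunk whenever the cosine
    similarity between consecutive sentence embeddings falls below `threshold`."""
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) >= threshold:
            current.append(sent)              # semantically coherent: extend chunk
        else:
            chunks.append(" ".join(current))  # similarity dropped: start a new chunk
            current = [sent]
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy embedding (bag of letters) just to make the sketch runnable;
# substitute a real sentence encoder in practice.
def toy_embed(text):
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1
    return v

print(semantic_chunks(
    ["Knowledge graphs store entities.", "Edges encode relations.",
     "Buffer size tuning is a separate question."], toy_embed))
```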

[NLP-58] A Survey of Classification Tasks and Approaches for Legal Contracts

【Quick Read】: This paper addresses the inefficiency and error-proneness of manual legal contract review, with the goal of advancing automation in legal text processing. The key contribution is a systematic review of Automatic Legal Contract Classification (LCC): it identifies seven core classification tasks and fourteen English-language contract datasets, and builds a methodology taxonomy spanning traditional machine learning, deep learning, and Transformer-based approaches, providing a theoretical foundation and practical guidance for improving the efficiency, accuracy, and scalability of LCC.

Link: https://arxiv.org/abs/2507.21108
Authors: Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik
Institutions: University of New South Wales (UNSW)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Under review. 49 pages + references

Click to view the abstract

Abstract:Given the size and volume of contracts and their inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey examines the challenges of automatic LCC and provides a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.
zh

[NLP-59] Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams

【Quick Read】: This paper targets the interpretability of internal semantic reasoning in large language models (LLMs), in particular how to quantify geometric changes in activation trajectories under different intensities of semantic concern. The proposed Curved Inference framework computes the curvature (κ_i) and salience (S(t)) of residual-stream trajectories in a pullback semantic metric space, revealing how models bend, reorient, or reinforce semantic paths as the semantic focus of a prompt shifts. The key is using the unembedding matrix to construct a semantically aligned metric space, ensuring all geometric measurements reflect token-level semantic structure rather than raw coordinates, which yields a principled geometric tool for diagnosing model alignment, abstraction, and emergent reasoning dynamics.

Link: https://arxiv.org/abs/2507.21107
Authors: Rob Manson
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 29 pages, 22 figures

Click to view the abstract

Abstract:We propose Curved Inference - a geometric interpretability framework that tracks how the residual stream trajectory of a large language model bends in response to shifts in semantic concern. Across 20 matched prompts spanning emotional, moral, perspective, logical, identity, environmental, and nonsense domains, we analyse Gemma3-1b and LLaMA3.2-3b using five native-space metrics, with a primary focus on curvature (κ_i) and salience (S(t)). These metrics are computed under a pullback semantic metric derived from the unembedding matrix, ensuring that all measurements reflect token-aligned geometry rather than raw coordinate structure. We find that concern-shifted prompts reliably alter internal activation trajectories in both models - with LLaMA exhibiting consistent, statistically significant scaling in both curvature and salience as concern intensity increases. Gemma also responds to concern but shows weaker differentiation between moderate and strong variants. Our results support a two-layer view of LLM geometry - a latent conceptual structure encoded in the embedding space, and a contextual trajectory shaped by prompt-specific inference. Curved Inference reveals how models navigate, reorient, or reinforce semantic meaning over depth, offering a principled method for diagnosing alignment, abstraction, and emergent inference dynamics. These findings offer fresh insight into semantic abstraction and model alignment through the lens of Curved Inference.
zh
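The geometric quantities named in the abstract can be illustrated with a short sketch. Under the pullback metric induced by the unembedding map h ↦ W_U h, the inner product of two hidden-state vectors is (W_U u)·(W_U v); salience is then the step length of the trajectory and curvature the turning angle between consecutive steps. The exact definitions in the paper may differ; this is a plausible discrete reconstruction with toy dimensions.

```python
import numpy as np

def pullback_dot(u, v, W_U):
    """Inner product of hidden-state vectors under the pullback metric
    induced by the unembedding map h -> W_U @ h."""
    return float((W_U @ u) @ (W_U @ v))

def trajectory_geometry(H, W_U):
    """H: (T, d) residual-stream states for one token position across layers.
    Returns per-step salience (step length) and discrete curvature (turning
    angle), both measured in the pullback semantic metric."""
    deltas = np.diff(H, axis=0)                       # steps between layers
    norms = np.sqrt([pullback_dot(d, d, W_U) for d in deltas])
    salience = norms                                  # S(t): how far the stream moves
    kappa = []
    for i in range(len(deltas) - 1):
        cos_a = pullback_dot(deltas[i], deltas[i + 1], W_U) / (
            norms[i] * norms[i + 1] + 1e-9)
        kappa.append(np.arccos(np.clip(cos_a, -1.0, 1.0)))  # turning angle kappa_i
    return salience, np.array(kappa)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))        # 6 layers, hidden size 8 (toy)
W_U = rng.normal(size=(20, 8))     # vocab size 20 (toy unembedding matrix)
S, K = trajectory_geometry(H, W_U)
print(S.shape, K.shape)            # (5,) (4,)
```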

[NLP-60] Creation of a Numerical Scoring System to Objectively Measure and Compare the Level of Rhetoric in Arabic Texts: A Feasibility Study and A Working Prototype

【Quick Read】: This paper addresses the difficulty of objectively quantifying Arabic rhetoric in text: there is no way to determine whether a text uses Arabic rhetoric, to what extent, or how it is distributed, nor to compare usage across genres, authors, or epochs. The key to the solution is a computable set of metrics: a list of the 84 most common literary devices with their definitions is compiled, a method for identifying these devices in text is developed, and their density is calculated from the morpheme count of the text; four electronic tools (including an online calculator and a website) and one analogue tool then measure and report the rhetorical density of any Arabic text or speech, turning a highly subjective analysis into an objective, quantifiable measure.

Link: https://arxiv.org/abs/2507.21106
Authors: Mandar Marathe
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: This dissertation was submitted by Mandar Marathe on 6 September 2022, in partial fulfilment of the requirements for the Master of Arts degree in Advanced Arabic at the University of Exeter

Click to view the abstract

Abstract:Arabic Rhetoric is the field of Arabic linguistics which governs the art and science of conveying a message with greater beauty, impact and persuasiveness. The field is as ancient as the Arabic language itself and is found extensively in classical and contemporary Arabic poetry, free verse and prose. In practical terms, it is the intelligent use of word order, figurative speech and linguistic embellishments to enhance message delivery. Despite the volumes that have been written about it and the high status accorded to it, there is no way to objectively know whether a speaker or writer has used Arabic rhetoric in a given text, to what extent, and why. There is no objective way to compare the use of Arabic rhetoric across genres, authors or epochs. It is impossible to know which of pre-Islamic poetry, Andalucian Arabic poetry, or modern literary genres are richer in Arabic rhetoric. The aim of the current study was to devise a way to measure the density of the literary devices which constitute Arabic rhetoric in a given text, as a proxy marker for Arabic rhetoric itself. A comprehensive list of 84 of the commonest literary devices and their definitions was compiled. A system for identifying literary devices in texts was constructed. A method of calculating the density of literary devices based on the morpheme count of the text was utilised. Four electronic tools and an analogue tool were created to support the calculation of an Arabic text's rhetorical literary device density, including a website and online calculator. Additionally, a technique of reporting the distribution of literary devices used across the three sub-domains of Arabic rhetoric was created. The output of this project is a working tool which can accurately report the density of Arabic rhetoric in any Arabic text or speech.
zh
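The density measure itself is simple arithmetic: the number of identified literary devices divided by the morpheme count of the text. A small sketch with hypothetical device counts (the device names and the per-100-morphemes normalization are illustrative assumptions):

```python
def rhetoric_density(device_counts, morpheme_count, per=100):
    """Density of rhetorical devices per `per` morphemes.
    `device_counts` maps each identified literary device to its frequency."""
    total_devices = sum(device_counts.values())
    return total_devices / morpheme_count * per

# Hypothetical analysis of a short passage: 3 instances of jinas (paronomasia)
# and 2 of tibaq (antithesis) across 120 morphemes.
counts = {"jinas": 3, "tibaq": 2}
print(round(rhetoric_density(counts, morpheme_count=120), 2))  # 4.17 per 100 morphemes
```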

[NLP-61] AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis

【Quick Read】: This paper tackles the challenges that multi-agent systems (MAS) built on large language models (LLMs) still face in inter-agent communication, coordination, and interaction with heterogeneous tools and resources. The key is AgentMaster, a novel modular multi-protocol MAS framework with self-implemented Agent-to-Agent (A2A) communication and Model Context Protocol (MCP), enabling dynamic coordination and flexible communication, and supporting multimodal queries through a unified natural-language interface, thereby improving task decomposition, routing decisions, and the quality of domain-relevant responses.

Link: https://arxiv.org/abs/2507.21105
Authors: Callie C. Liao, Duoduo Liao, Sai Surya Gadiraju
Institutions: Stanford University; George Mason University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Click to view the abstract

Abstract:The rise of Multi-Agent Systems (MAS) in Artificial Intelligence (AI), especially integrated with Large Language Models (LLMs), has greatly facilitated the resolution of complex tasks. However, current systems still face challenges in inter-agent communication, coordination, and interaction with heterogeneous tools and resources. Most recently, the Model Context Protocol (MCP) by Anthropic and the Agent-to-Agent (A2A) communication protocol by Google have been introduced, and to the best of our knowledge, very few applications exist where both protocols are employed within a single MAS framework. We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP, enabling dynamic coordination and flexible communication. Through a unified conversational interface, the system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis. Evaluation with BERTScore F1 and the LLM-as-a-Judge metric G-Eval averaged 96.3% and 87.1%, respectively, revealing robust inter-agent coordination, query decomposition, dynamic routing, and domain-specific, relevant responses. Overall, our proposed framework demonstrates the potential of domain-specific, cooperative, and scalable conversational AI powered by MAS.
zh

[NLP-62] iLSU-T: an Open Dataset for Uruguayan Sign Language Translation

【Quick Read】: This paper addresses the scarcity of localized data for automatic sign language translation, caused by the differences among national sign languages, which hinders the development and application of such technology. The key contribution is iLSU-T, an open multimodal dataset with more than 185 hours of RGB video of interpreted Uruguayan Sign Language plus audio and text transcriptions, covering diverse topics and featuring 18 professional sign language interpreters. The dataset provides a high-quality, localized benchmark for developing and evaluating translation algorithms, advancing research on new sign language processing tools for inclusive and accessible communication.

Link: https://arxiv.org/abs/2507.21104
Authors: Ariel E. Stassi, Yanina Boria, J. Matías Di Martino, Gregory Randall
Institutions: Universidad de la República; Universidad de Buenos Aires; Universidad Católica del Uruguay; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 10 pages, 5 figures, 19th International Conference on Automatic Face and Gesture Recognition IEEE FG 2025

Click to view the abstract

Abstract:Automatic sign language translation has gained particular interest in the computer vision and computational linguistics communities in recent years. Given the particularities of each country's sign language, machine translation requires local data to develop new techniques and adapt existing ones. This work presents iLSU-T, an open dataset of interpreted Uruguayan Sign Language RGB videos with audio and text transcriptions. This type of multimodal and curated data is paramount for developing novel approaches to understand or generate tools for sign language processing. iLSU-T comprises more than 185 hours of interpreted sign language videos from public TV broadcasting. It covers diverse topics and includes the participation of 18 professional sign language interpreters. A series of experiments using three state-of-the-art translation algorithms is presented. The aim is to establish a baseline for this dataset and evaluate its usefulness and the proposed pipeline for data processing. The experiments highlight the need for more localized datasets for sign language translation and understanding, which are critical for developing novel tools to improve accessibility and inclusion of all individuals. Our data and code can be accessed.
zh

[NLP-63] Analise Semantica Automatizada com LLM e RAG para Bulas Farmaceuticas (Automated Semantic Analysis with LLM and RAG for Pharmaceutical Package Inserts)

【Quick Read】: This paper addresses the challenge of efficiently extracting and analyzing unstructured information amid the rapid growth of digital documents in academic, business, and healthcare settings. The key is a Retrieval-Augmented Generation (RAG) architecture combined with large-scale language models (LLMs): embeddings enable vector search, semantic data extraction, and the generation of contextualized natural-language responses, improving intelligent retrieval and understanding of technical texts in PDF documents.

Link: https://arxiv.org/abs/2507.21103
Authors: Daniel Meireles do Rego
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Notes: in Portuguese

Click to view the abstract

Abstract:The production of digital documents has been growing rapidly in academic, business, and health environments, presenting new challenges in the efficient extraction and analysis of unstructured information. This work investigates the use of RAG (Retrieval-Augmented Generation) architectures combined with Large-Scale Language Models (LLMs) to automate the analysis of documents in PDF format. The proposal integrates vector search via embeddings, semantic data extraction, and the generation of contextualized natural language responses. To validate the approach, we conducted experiments with drug package inserts extracted from official public sources. The semantic queries applied were evaluated by metrics such as accuracy, completeness, response speed and consistency. The results indicate that the combination of RAG with LLMs offers significant gains in intelligent information retrieval and interpretation of unstructured technical texts.
zh
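The pipeline the abstract describes (embed, vector search, grounded generation) reduces to a few lines. Here is a minimal sketch with random vectors standing in for a real embedding model; the chunk texts and the prompt template are illustrative only:

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunks, k=2):
    """Rank document chunks by cosine similarity to the query embedding."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def build_prompt(query, passages):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the excerpts below.\n{context}\n\nQuestion: {query}"

# Toy embeddings standing in for a real embedding model.
chunks = ["Take one tablet every 8 hours.", "Store below 25 C.",
          "Do not exceed 3 g per day."]
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(3, 16))
query_vec = chunk_vecs[0] + 0.1 * rng.normal(size=16)   # query 'close to' chunk 0
print(build_prompt("What is the dosage?", retrieve(query_vec, chunk_vecs, chunks)))
```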

[NLP-64] Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting

【Quick Read】: This paper addresses the limited visibility of advertising content in retrieval-based large language model (LLM) systems: how to rewrite ad copy so that it ranks higher in retrieval and is included more often in LLM-generated responses, without modifying the underlying retrieval model. The key is a supervised fine-tuning framework with a custom loss that balances semantic relevance and content fidelity, together with two proposed metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion-frequency improvement). Experiments show that models trained with Proximal Policy Optimization (PPO) outperform prompt engineering and conventional supervised fine-tuning in both instruction-based and few-shot prompting, confirming the effectiveness of reinforcement learning for ad rewriting.

Link: https://arxiv.org/abs/2507.21099
Authors: Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu, Vasu Sharma, Kevin Zhu
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Click to view the abstract

Abstract:Search algorithms and user query relevance have given LLMs the ability to return relevant information, but the effect of content phrasing on ad visibility remains underexplored. We investigate how LLM-based rewriting of advertisements can improve their ranking in retrieval systems and inclusion in generated LLM responses, without modifying the retrieval model itself. We introduce a supervised fine-tuning framework with a custom loss balancing semantic relevance and content fidelity. To evaluate effectiveness, we propose two metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion frequency improvement). Our approach presents a scalable method to optimize ad phrasing, enhancing visibility in retrieval-based LLM workflows. Experiments across both instruction-based and few-shot prompting demonstrate that PPO-trained models outperform both prompt engineering and supervised fine-tuning in most cases, achieving up to a 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting. These results highlight the importance of how an ad is phrased before retrieval, of the prompt format, and of reinforcement learning in effective ad rewriting for LLM-integrated retrieval systems.
zh
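The abstract does not spell out the metric formulas, but under the standard definition of MRR@K and a simple inclusion rate, DeltaMRR@K and DeltaDIR@K would be before/after differences. A sketch under that assumption, with hypothetical ranks and inclusion flags:

```python
def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only items ranked within the top k.
    `ranks` holds the ad's 1-based rank per query (None if not retrieved)."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def dir_at_k(included_flags):
    """Inclusion rate: the fraction of queries whose generated response
    actually includes the ad (one boolean per query)."""
    return sum(included_flags) / len(included_flags)

# Hypothetical before/after ranks for 5 queries once the ad copy is rewritten.
before, after = [4, None, 9, 2, 6], [1, 5, 3, 1, 4]
delta_mrr5 = mrr_at_k(after, 5) - mrr_at_k(before, 5)
delta_dir5 = dir_at_k([1, 1, 1, 1, 1]) - dir_at_k([1, 0, 0, 1, 0])
print(round(delta_mrr5, 4), delta_dir5)
```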

[NLP-65] QU-NLP at CheckThat! 2025: Multilingual Subjectivity in News Articles Detection using Feature-Augmented Transformer Models with Sequential Cross-Lingual Fine-Tuning

【Quick Read】: This paper addresses subjectivity detection in news text, i.e., distinguishing whether a sentence expresses the author's subjective view or an objective statement. The core solution is a feature-augmented Transformer architecture that fuses contextual embeddings from pretrained language models with statistical and linguistic features (such as POS tags and TF-IDF features) to improve performance in cross-lingual settings. The key design choices are: AraELECTRA combined with POS and TF-IDF features for Arabic, and a cross-lingual DeBERTa-V3 model whose TF-IDF features are integrated through a gating mechanism for the other languages, enabling efficient transfer in multilingual and zero-shot settings. Experiments show competitive results across languages, with especially strong monolingual performance for English and zero-shot performance for Romanian, and demonstrate that feature fusion and the order of cross-lingual fine-tuning materially affect performance.

Link: https://arxiv.org/abs/2507.21095
Authors: Mohammad AL-Smadi
Institutions: Qatar University
Categories: Computation and Language (cs.CL)
Notes:

Click to view the abstract

Abstract:This paper presents our approach to the CheckThat! 2025 Task 1 on subjectivity detection, where systems are challenged to distinguish whether a sentence from a news article expresses the subjective view of the author or presents an objective view on the covered topic. We propose a feature-augmented transformer architecture that combines contextual embeddings from pre-trained language models with statistical and linguistic features. Our system leveraged pre-trained transformers with additional lexical features: for Arabic we used AraELECTRA augmented with part-of-speech (POS) tags and TF-IDF features, while for the other languages we fine-tuned a cross-lingual DeBERTa V3 model combined with TF-IDF features through a gating mechanism. We evaluated our system in monolingual, multilingual, and zero-shot settings across multiple languages including English, Arabic, German, Italian, and several unseen languages. The results demonstrate the effectiveness of our approach, achieving competitive performance across different languages, with notable success in the monolingual setting for English (rank 1st with macro-F1=0.8052), German (rank 3rd with macro-F1=0.8013), and Arabic (rank 4th with macro-F1=0.5771), and in the zero-shot setting for Romanian (rank 1st with macro-F1=0.8126). We also conducted an ablation analysis that demonstrated the importance of combining TF-IDF features with the gating mechanism and the cross-lingual transfer for subjectivity detection. Furthermore, our analysis reveals the model's sensitivity to both the order of cross-lingual fine-tuning and the linguistic proximity of the training languages.
zh
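The gating mechanism that fuses TF-IDF features with the transformer embedding can be sketched as a small PyTorch module. This is a generic reconstruction of the idea, not the authors' architecture; the dimensions and the tanh/sigmoid choices are assumptions:

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Fuse a transformer sentence embedding with projected TF-IDF features
    through a learned sigmoid gate (a sketch of the paper's gating idea)."""
    def __init__(self, hidden_dim, tfidf_dim):
        super().__init__()
        self.project = nn.Linear(tfidf_dim, hidden_dim)    # lift TF-IDF into model space
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)  # decides per-dim mixing weights
        self.classifier = nn.Linear(hidden_dim, 2)         # subjective vs. objective

    def forward(self, cls_emb, tfidf):
        lex = torch.tanh(self.project(tfidf))
        g = torch.sigmoid(self.gate(torch.cat([cls_emb, lex], dim=-1)))
        fused = g * cls_emb + (1 - g) * lex                # convex per-dimension mixture
        return self.classifier(fused)

model = GatedFeatureFusion(hidden_dim=768, tfidf_dim=5000)
logits = model(torch.randn(4, 768), torch.randn(4, 5000))
print(logits.shape)  # torch.Size([4, 2])
```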

[NLP-66] Emotionally Aware Moderation: The Potential of Emotion Monitoring in Shaping Healthier Social Media Conversations

【Quick Read】: This paper addresses the spread of hate speech and emotional escalation on social platforms, and the limits of reactive content moderation in curbing online incivility at its source. The key is introducing two emotion-monitoring dashboards that raise users' awareness of their own emotional states, enabling proactive regulation of negative expression and thereby reducing hate speech. The results show that the intervention effectively increases emotional awareness and reduces hate speech, but may also increase expressions of negative emotions such as anger, fear, and sadness when sensitive topics are discussed, suggesting that emotion-regulation tools need further design refinement to avoid side effects.

Link: https://arxiv.org/abs/2507.21089
Authors: Xiaotian Su, Naim Zierau, Soomin Kim, April Yi Wang, Thiemo Wambsganss
Institutions: ETH Zurich; University of St.Gallen; Seoul National University; Bern University of Applied Sciences
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Notes:

Click to view the abstract

Abstract:Social media platforms increasingly employ proactive moderation techniques, such as detecting and curbing toxic and uncivil comments, to prevent the spread of harmful content. Despite these efforts, such approaches are often criticized for creating a climate of censorship and failing to address the underlying causes of uncivil behavior. Our work makes both theoretical and practical contributions by proposing and evaluating two types of emotion monitoring dashboards that raise users' emotional awareness and mitigate hate speech. In a study involving 211 participants, we evaluate the effects of the two mechanisms on user commenting behavior and emotional experiences. The results reveal that these interventions effectively increase users' awareness of their emotional states and reduce hate speech. However, our findings also indicate potential unintended effects, including increased expression of negative emotions (Angry, Fear, and Sad) when discussing sensitive issues. These insights provide a basis for further research on integrating proactive emotion regulation tools into social media platforms to foster healthier digital interactions.
zh

[NLP-67] Multi-Amateur Contrastive Decoding for Text Generation

【Quick Read】: This paper addresses a limitation of Contrastive Decoding (CD) in open-ended text generation: its reliance on a single amateur model makes it hard to capture the diverse failure modes of language generation (such as repetition, hallucination, and stylistic drift). The key is Multi-Amateur Contrastive Decoding (MACD), which uses an ensemble of amateur models to characterize undesirable generation patterns more comprehensively, integrates contrastive signals through averaging and consensus-penalization mechanisms, and extends the plausibility constraint to the multi-amateur setting; MACD also supports controllable generation by including amateurs with targeted stylistic or content biases. Experiments across news, encyclopedic, and narrative domains show that MACD consistently surpasses conventional decoding and the original CD in fluency, coherence, diversity, and adaptability, without any additional training or fine-tuning.

Link: https://arxiv.org/abs/2507.21086
Authors: Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: This paper has been accepted for oral presentation and publication in the proceedings of IEEE I2ITCON 2025. The conference will be held in Pune, India, from July 4 to 5, 2025. This is the accepted version of the paper and NOT the final camera-ready version. The paper is 11 pages long and contains 5 figures and 6 tables

Click to view the abstract

Abstract:Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.
zh
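A single decoding step of the averaging variant can be sketched directly from the description: keep the tokens the expert finds plausible (the standard CD plausibility set) and subtract the mean amateur log-probability. The consensus-penalization term is only named in the abstract, so this sketch shows the averaging mechanism alone; `alpha` and `beta` are assumed hyperparameters:

```python
import numpy as np

def macd_scores(expert_logprobs, amateur_logprobs_list, alpha=0.1, beta=1.0):
    """One decoding step of a MACD-style rule (a sketch, not the authors' code).
    Restricts scoring to the expert's plausibility set, then penalizes tokens
    favored on average by the amateur ensemble."""
    plausible = expert_logprobs >= np.log(alpha) + expert_logprobs.max()  # CD plausibility set
    amateur_avg = np.mean(amateur_logprobs_list, axis=0)                  # ensemble averaging
    return np.where(plausible, expert_logprobs - beta * amateur_avg, -np.inf)

rng = np.random.default_rng(1)
vocab = 10
expert = np.log(rng.dirichlet(np.ones(vocab)))
amateurs = [np.log(rng.dirichlet(np.ones(vocab))) for _ in range(3)]
next_token = int(np.argmax(macd_scores(expert, amateurs)))
print(next_token)
```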

[NLP-68] Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing

【Quick Read】: This paper addresses the unpredictable side effects that can arise when large language models (LLMs) are fine-tuned or unlearned, such as behavioral drift where removing biology knowledge degrades performance on chemistry tasks. Existing evaluations only measure post-intervention performance and lack a systematic way to detect these mechanistic effects. The key is MNEME (Model diffiNg for Evaluating Mechanistic Effects), a lightweight framework that uses sparse model diffing to compare base and fine-tuned models on task-agnostic data (e.g., The Pile, LMSYS-Chat-1M), identifying side effects without access to the fine-tuning data and achieving up to 95% accuracy in predicting them. The method requires no custom heuristics, is scalable and automated, and offers a practical tool for understanding and managing behavioral changes in LLMs.

Link: https://arxiv.org/abs/2507.21084
Authors: Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Click to view the abstract

Abstract:Large language models (LLMs) are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects, such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a lightweight framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on task-agnostic data (for example, The Pile, LMSYS-Chat-1M) without access to fine-tuning data to isolate behavioral shifts. Applied to five LLMs across three scenarios: WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning, MNEME achieves up to 95 percent accuracy in predicting side effects, aligning with known benchmarks and requiring no custom heuristics. Furthermore, we show that retraining on high-activation samples can partially reverse these effects. Our results demonstrate that sparse probing and diffing offer a scalable and automated lens into fine-tuning-induced model changes, providing practical tools for understanding and managing LLM behavior.
zh
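The core move, diffing activations of the base and fine-tuned model on the same task-agnostic inputs, can be illustrated with a toy version. Real sparse model diffing operates on sparse-autoencoder features rather than raw hidden units; this sketch only conveys the ranking-by-drift idea:

```python
import numpy as np

def activation_diff(base_acts, tuned_acts, top_k=10):
    """Rank hidden units by mean absolute activation shift between the base
    and fine-tuned model on identical task-agnostic inputs (a toy stand-in
    for sparse model diffing)."""
    shift = np.abs(tuned_acts - base_acts).mean(axis=0)   # per-unit drift
    order = np.argsort(-shift)[:top_k]
    return order, shift[order]

rng = np.random.default_rng(2)
base = rng.normal(size=(256, 64))           # 256 prompts x 64 hidden units
tuned = base.copy()
tuned[:, 7] += 1.5                           # injected behavioral shift in unit 7
units, magnitudes = activation_diff(base, tuned, top_k=3)
print(units, np.round(magnitudes, 2))        # unit 7 surfaces first
```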

[NLP-69] ChatGPT Reads Your Tone and Responds Accordingly – Until It Does Not – Emotional Framing Induces Bias in LLM Outputs

【Quick Read】: This paper examines the sensitivity of large language models (LLMs) to the emotional framing of prompts, in particular how GPT-4 shifts its output as prompt tone changes, revealing a "rebound bias" and alignment-driven suppression on sensitive topics. The key is systematically varying the emotional tone of 156 prompts (from negative through neutral to positive), quantifying response shifts with tone-valence transition matrices, introducing the notion of a "tone floor" (a lower bound on response negativity), and confirming semantic drift through visualizations of 1536-dimensional embeddings, thereby revealing an underexplored class of bias driven by emotional framing and its implications for AI alignment and trust.

Link: https://arxiv.org/abs/2507.21083
Authors: Franck Bardol
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Click to view the abstract

Abstract:Large Language Models like GPT-4 adjust their responses not only based on the question asked, but also on how it is emotionally phrased. We systematically vary the emotional tone of 156 prompts - spanning controversial and everyday topics - and analyze how it affects model responses. Our findings show that GPT-4 is three times less likely to respond negatively to a negatively framed question than to a neutral one. This suggests a “rebound” bias where the model overcorrects, often shifting toward neutrality or positivity. On sensitive topics (e.g., justice or politics), this effect is even more pronounced: tone-based variation is suppressed, suggesting an alignment override. We introduce concepts like the “tone floor” - a lower bound in response negativity - and use tone-valence transition matrices to quantify behavior. Visualizations based on 1536-dimensional embeddings confirm semantic drift based on tone. Our work highlights an underexplored class of biases driven by emotional framing in prompts, with implications for AI alignment and trust. Code and data are available at: this https URL
zh
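A tone-valence transition matrix of the kind described is just a row-normalized contingency table over (prompt tone, response valence) pairs. A sketch with hypothetical labels (the three-level tone scale matches the paper's negative/neutral/positive framing; the example pairs are invented):

```python
import numpy as np

TONES = ["negative", "neutral", "positive"]

def transition_matrix(pairs):
    """Row-normalized counts of (prompt tone -> response valence)."""
    idx = {t: i for i, t in enumerate(TONES)}
    M = np.zeros((3, 3))
    for prompt_tone, response_valence in pairs:
        M[idx[prompt_tone], idx[response_valence]] += 1
    return M / M.sum(axis=1, keepdims=True)

# Hypothetical labeled pairs illustrating the 'rebound' pattern:
# negatively framed prompts drawing neutral or positive responses.
pairs = [("negative", "positive"), ("negative", "neutral"),
         ("neutral", "neutral"), ("positive", "positive"),
         ("negative", "neutral"), ("neutral", "positive")]
print(transition_matrix(pairs).round(2))
```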

[NLP-70] Which symbol grounding problem should we try to solve?

【Quick Read】: This paper concerns the grounding problem of meaning: how to explain and reproduce the emergence and function of meaning in artificial computational agents. The author argues that Floridi and Taddeo's condition of "zero semantic commitment" cannot be fulfilled, not even by their own proposed solution. Building on this, the paper further questions the conventional framing of the problem and argues for reconsidering the role that "goals" play in a system. Based on a proper understanding of computation, the author concludes that the only sensible grounding problem is how to explain and reproduce the behavioral ability and function of meaning in artificial computational agents; the key lies in taking a functional view of computational systems rather than relying on abstract semantic-commitment conditions.

Link: https://arxiv.org/abs/2507.21080
Authors: Vincent C. Müller
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Click to view the abstract

Abstract:Floridi and Taddeo propose a condition of “zero semantic commitment” for solutions to the grounding problem, and a solution to it. I argue briefly that their condition cannot be fulfilled, not even by their own solution. After a look at Luc Steels’ very different competing suggestion, I suggest that we need to re-think what the problem is and what role the ‘goals’ in a system play in formulating the problem. On the basis of a proper understanding of computing, I come to the conclusion that the only sensible grounding problem is how we can explain and re-produce the behavioral ability and function of meaning in artificial computational agents.
zh

[NLP-71] Can LLMs Reason About Trust?: A Pilot Study AAMAS2025

【Quick Read】: This paper investigates how large language models (LLMs) can reason about and foster trust between people in digital environments. As human interaction increasingly happens through electronic media such as mobile apps, AI systems need the capacity to understand and strengthen the state of trust in social relationships. The key is twofold: assessing whether LLMs can accurately infer the trust dynamics between two individuals from their interaction context, and testing whether LLMs can induce trust by role-playing one party in a trust-based interaction and planning trust-building actions.

Link: https://arxiv.org/abs/2507.21075
Authors: Anushka Debnath, Stephen Cranefield, Emiliano Lorini, Bastin Tony Roy Savarimuthu
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Notes: 17 pages, 5 figures, 3 tables. Accepted for presentation as a full paper at the COINE 2025 workshop at AAMAS 2025; see this https URL

Click to view the abstract

Abstract:In human society, trust is an essential component of social attitude that helps build and maintain long-term, healthy relationships which creates a strong foundation for cooperation, enabling individuals to work together effectively and achieve shared goals. As many human interactions occur through electronic means such as using mobile apps, the potential arises for AI systems to assist users in understanding the social state of their relationships. In this paper we investigate the ability of Large Language Models (LLMs) to reason about trust between two individuals in an environment which requires fostering trust relationships. We also assess whether LLMs are capable of inducing trust by role-playing one party in a trust based interaction and planning actions which can instil trust.
zh

[NLP-72] Product vs. Process: Exploring EFL Students Editing of AI-Generated Text for Expository Writing

【Quick Read】: This paper examines the unclear impact of generative AI on EFL students' expository writing process and outcomes, in particular the relationship between students' editing of AI-generated text and the quality of their final compositions. Using a mixed-methods design combining screen-recording analysis, human-rated scoring, and multiple linear regression (MLR), the study finds that despite substantial editing effort, most editing variables have minimal impact on scores for content, organization, language, and overall quality, while the number of AI-generated words significantly and positively predicts all score dimensions. The key implication is that genre-focused, process-oriented writing instruction should precede AI integration, and that assessment should value both process and product, guiding students to engage critically with AI text and genuinely improve their writing.

Link: https://arxiv.org/abs/2507.21073
Authors: David James Woo, Yangyang Yu, Kai Guo, Yilin Huang, April Ka Yeng Fung
Institutions: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 45 pages, 11 figures

Click to view the abstract

Abstract:Text generated by artificial intelligence (AI) chatbots is increasingly used in English as a foreign language (EFL) writing contexts, yet its impact on students’ expository writing process and compositions remains understudied. This research examines how EFL secondary students edit AI-generated text, exploring editing behaviors in their expository writing process and compositions, and their effect on human-rated scores for content, organization, language, and overall quality. Participants were 39 Hong Kong secondary students who wrote an expository composition with AI chatbots in a workshop. A convergent design was employed to analyze their screen recordings and compositions to examine students’ editing behaviors and writing qualities. Analytical methods included qualitative coding, descriptive statistics, temporal sequence analysis, human-rated scoring, and multiple linear regression analysis. We analyzed over 260 edits per dataset and identified two editing patterns: one where students refined introductory units repeatedly before progressing, and another where they quickly shifted to extensive edits in body units (e.g., topic and supporting sentences). MLR analyses revealed that the number of AI-generated words positively predicted all score dimensions, while most editing variables showed minimal impact. These results suggest a disconnect between students’ significant editing effort and improved composition quality, indicating AI supports but does not replace writing skills. The findings highlight the importance of genre-specific instruction and process-focused writing before AI integration. Educators should also develop assessments valuing both process and product to encourage critical engagement with AI text.
zh

[NLP-73] Dialogic Social Learning for Artificial Agents: Enhancing LLM Ontology Acquisition through Mixed-Initiative Educational Interactions

【Quick Read】: This paper addresses the challenges large language models (LLMs) face in acquiring and integrating complex knowledge online; traditional supervised or reinforcement learning paradigms rely on large offline datasets and sparse feedback signals, limiting efficient interactive learning. The key, inspired by Vygotsky's sociocultural theory, is the "AI Social Gym", a dynamic environment in which an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents, using structured external dialogue as the core mechanism for knowledge acquisition. Empirical results show that mixed-direction dialogic strategies, combining top-down explanations with learner-initiated questioning, significantly improve the LLM's ability to acquire and apply new knowledge in ontology acquisition, outperforming both unidirectional instruction and direct access to structured knowledge, and suggesting a pathway that integrates pedagogical and psychological insights into post-training knowledge acquisition and response quality.

Link: https://arxiv.org/abs/2507.21065
Authors: Sabrina Patania, Luca Annese, Cansu Koyuturk, Azzurra Ruggeri, Dimitri Ognibene
Institutions: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: submitted to ICSR2025

Click to view the abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive offline datasets. However, they often face challenges in acquiring and integrating complex knowledge online. Traditional AI training paradigms, predominantly based on supervised learning or reinforcement learning, mirror a ‘Piagetian’ model of independent exploration. These approaches typically rely on large datasets and sparse feedback signals, limiting the models’ ability to learn efficiently from interactions. Drawing inspiration from Vygotsky’s sociocultural theory, this study explores the potential of socially mediated learning paradigms to address these limitations. We introduce a dynamic environment, termed the ‘AI Social Gym’, where an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents. These interactions emphasize external, structured dialogue as a core mechanism for knowledge acquisition, contrasting with methods that depend solely on internal inference or pattern recognition. Our investigation focuses on how different pedagogical strategies impact the AI learning process in the context of ontology acquisition. Empirical results indicate that such dialogic approaches - particularly those involving mixed-direction interactions combining top-down explanations with learner-initiated questioning - significantly enhance the LLM’s ability to acquire and apply new knowledge, outperforming both unidirectional instructional methods and direct access to structured knowledge, formats typically present in training datasets. These findings suggest that integrating pedagogical and psychological insights into AI and robot training can substantially improve post-training knowledge acquisition and response quality. This approach offers a complementary pathway to existing strategies like prompt engineering.
zh

[NLP-74] Categorical Classification of Book Summaries Using Word Embedding Techniques WWW

【Quick Read】: This paper addresses the classification of book summaries into categories, i.e., how to apply natural language processing (NLP) techniques and machine learning algorithms to text collected from book websites. The key is comparing the performance of different word embedding methods on Turkish text, including One-Hot Encoding, Word2Vec, and Term Frequency-Inverse Document Frequency (TF-IDF), together with a combination table of the preprocessing methods used to optimize the model's input features. The results show that Support Vector Machine, Naive Bayes, and Logistic Regression models paired with TF-IDF and One-Hot Encoding achieve higher accuracy on Turkish text classification.

Link: https://arxiv.org/abs/2507.21058
Authors: Kerem Keskin, Mümine Kaya Keleş
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: in Turkish. This paper was published in the proceedings of the 6th International Conference on Data Science and Applications (ICONDATA'24), held September 2-6, 2024, in Pristina, Kosovo. For the full proceedings, see this https URL

Click to view the abstract

Abstract:In this study, book summaries and categories taken from book sites were classified using word embedding methods, natural language processing techniques, and machine learning algorithms. One-Hot Encoding, Word2Vec, and Term Frequency-Inverse Document Frequency (TF-IDF), three frequently used word embedding methods, were applied and their success compared. Additionally, a table of the combinations of pre-processing methods used is presented. Looking at the results, it was observed that the Support Vector Machine, Naive Bayes, and Logistic Regression models together with the TF-IDF and One-Hot Encoding word embedding techniques gave more successful results for Turkish texts.
zh
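The TF-IDF plus classical classifier setup the paper compares can be reproduced in a few lines of scikit-learn. The summaries below are English toy stand-ins (the study works on Turkish text), and the label set is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for book summaries and their categories.
summaries = ["A detective unravels a murder in Istanbul.",
             "Recipes and techniques for home baking.",
             "A spaceship crew explores a distant colony.",
             "A chronological account of the Ottoman Empire."]
labels = ["crime", "cooking", "sci-fi", "history"]

# TF-IDF features feeding a linear SVM, one of the best-performing combinations.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(summaries, labels)
print(model.predict(["An inspector hunts a killer through the old city."]))
```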

[NLP-75] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

【Quick Read】: This paper addresses the heavy computational overhead that chain-of-thought (CoT) reasoning introduces in large language models (LLMs), especially the latency of long autoregressive decoding. Existing accelerations such as early stopping or compressive reward designs struggle to balance efficiency and accuracy, while speculative decoding offers limited speedup when the small and large models agree poorly and fails to exploit the small model's strength at producing concise intermediate reasoning. The key is R-Stitch, a confidence-based, token-level hybrid decoding framework: a small language model (SLM) generates tokens by default, and the LLM is invoked only when the SLM's confidence falls below a threshold, avoiding full-sequence rollback and selectively delegating uncertain steps while preserving reasoning quality. The method is training-free, model-agnostic, and compatible with standard decoding pipelines, achieving up to 85% lower inference latency on math reasoning benchmarks with negligible accuracy loss.

Link: https://arxiv.org/abs/2507.17307
Authors: Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Institutions: Monash University; School of Software, Beihang University; South China University of Technology; ZIP Lab, Zhejiang University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Click to view the abstract

Abstract:Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
zh
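The control flow of confidence-gated hybrid decoding is easy to sketch: the SLM proposes each token, and the LLM is consulted only when the SLM's confidence drops below a threshold tau. The toy `slm`/`llm` callables below are placeholders for real model calls, and the interface (each returns a (token, probability) pair) is an assumption of this sketch:

```python
def r_stitch_step(slm, llm, context, tau=0.9):
    """One token of confidence-gated hybrid decoding (a schematic sketch).
    `slm(context)` and `llm(context)` are assumed to return
    (token, probability) for their most likely next token."""
    token, conf = slm(context)
    if conf < tau:                   # SLM is unsure: delegate this step to the LLM
        token, conf = llm(context)
    return token

def decode(slm, llm, prompt, max_tokens=32, tau=0.9):
    context = list(prompt)
    for _ in range(max_tokens):
        tok = r_stitch_step(slm, llm, context, tau)
        if tok == "<eos>":
            break
        context.append(tok)
    return context

# Toy models just to exercise the control flow.
slm = lambda ctx: ("a", 0.95) if len(ctx) % 2 else ("b", 0.5)
llm = lambda ctx: ("B", 0.99)
print("".join(decode(slm, llm, prompt=["x"], max_tokens=6)))
```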

Computer Vision

[CV-0] MOVE: Motion-Guided Few-Shot Video Object Segmentation ICCV2025

【Quick Read】: This paper addresses motion-guided few-shot video object segmentation (FSVOS): accurately segmenting dynamic objects in videos that share the same motion patterns, given only a few annotated examples. Existing methods typically rely on static category information and ignore the rich temporal dynamics of video, limiting performance on tasks that require motion understanding. To fill this gap, the authors introduce MOVE, a large-scale dataset built for this setting, and propose a baseline, the Decoupled Motion Appearance Network (DMA), whose key idea is to explicitly decouple motion information from appearance features, improving few-shot motion understanding and clearly outperforming existing methods.

Link: https://arxiv.org/abs/2507.22061
Authors: Kaining Ying, Hengrui Hu, Henghui Ding
Institutions: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICCV 2025, Project Page: this https URL

Click to view the abstract

Abstract:This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
zh

[CV-1] StepAL: Step-aware Active Learning for Cataract Surgical Videos MICCAI2025

【Quick Read】: This paper addresses the poor performance of traditional active learning (AL) on step recognition in long, untrimmed surgical videos. Conventional AL methods, designed for images or short clips, select individual frames or clips for labeling, ignoring the dependencies between surgical steps and the global context annotators need, leading to inefficient labeling and limited model performance. The key of StepAL is combining a step-aware feature representation with an entropy-weighted clustering strategy: pseudo-labels capture the distribution of predicted steps within each video, and the method prioritizes for annotation full videos that are both uncertain and diverse in step composition, improving labeling efficiency and model performance. Experiments on two cataract surgery datasets show that StepAL consistently outperforms existing AL approaches, achieving higher step-recognition accuracy with fewer labeled videos.

Link: https://arxiv.org/abs/2507.22059
Authors: Nisarg A. Shah, Bardia Safaei, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to MICCAI 2025

Click to view the abstract

Abstract:Active learning (AL) can reduce annotation costs in surgical video analysis while maintaining model performance. However, traditional AL methods, developed for images or short video clips, are suboptimal for surgical step recognition due to inter-step dependencies within long, untrimmed surgical videos. These methods typically select individual frames or clips for labeling, which is ineffective for surgical videos where annotators require the context of the entire video for annotation. To address this, we propose StepAL, an active learning framework designed for full video selection in surgical step recognition. StepAL integrates a step-aware feature representation, which leverages pseudo-labels to capture the distribution of predicted steps within each video, with an entropy-weighted clustering strategy. This combination prioritizes videos that are both uncertain and exhibit diverse step compositions for annotation. Experiments on two cataract surgery datasets (Cataract-1k and Cataract-101) demonstrate that StepAL consistently outperforms existing active learning approaches, achieving higher accuracy in step recognition with fewer labeled videos. StepAL offers an effective approach for efficient surgical video analysis, reducing the annotation burden in developing computer-assisted surgical systems.
zh
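The uncertainty half of the selection criterion can be sketched as entropy over each video's pseudo-label step distribution; StepAL additionally clusters videos for diversity, which this toy version omits:

```python
import numpy as np

def step_entropy(pseudo_label_counts):
    """Entropy of the predicted step distribution within one video."""
    p = np.asarray(pseudo_label_counts, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_videos(distributions, budget):
    """Pick the `budget` unlabeled videos whose pseudo-label step
    distributions are most uncertain (a simplified stand-in for
    StepAL's entropy-weighted clustering)."""
    scores = [step_entropy(d) for d in distributions]
    return np.argsort(scores)[::-1][:budget]

dists = [[40, 5, 5], [17, 17, 16], [2, 2, 46]]   # step histograms per video
print(select_videos(dists, budget=1))             # the most mixed video wins
```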

[CV-2] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

【Quick Read】: This paper tackles the low visual fidelity, distorted outputs, and poor adherence to complex instructions that plague discrete autoregressive image generation, problems likely caused by cumulative errors in autoregressive inference or information loss from discretization. The key is showing that reinforcement learning (RL) effectively mitigates generation artifacts and substantially improves the quality of discrete autoregressive models, enabling seamless integration of image and language generation. The framework, X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for language and images, and an offline diffusion decoder; using a 7B language model, it achieves state-of-the-art image generation with high aesthetic quality, strong instruction following, and the ability to render long text.

Link: https://arxiv.org/abs/2507.22058
Authors: Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
Institutions: Tencent Hunyuan X
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view the abstract

Abstract:Numerous efforts have been made to extend the ``next token prediction’’ paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
zh

[CV-3] MetaLab: Few-Shot Game Changer for Image Recognition

【Quick Read】: This paper addresses the substantial technical gap in few-shot image recognition: achieving high accuracy, robustness, and generalization when only a few labeled samples (e.g., one per class) are available. The key of the proposed CIELab-Guided Coherent Meta-Learning (MetaLab) is two collaborating neural networks: LabNet, which maps images into the CIELab color space and extracts rich grouped features, and the coherent LabGNN, which enables mutual learning between a lightness graph and a color graph, strengthening the consistency and discriminability of the feature representations. Across coarse-grained, fine-grained, and cross-domain few-shot benchmarks, the method approaches the ceiling of human recognition (up to 99% accuracy).

Link: https://arxiv.org/abs/2507.22057
Authors: Chaofei Qi, Zhitai Liu, Jianbin Qiu
Institutions: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view the abstract

Abstract:Difficult few-shot image recognition has significant application prospects, yet substantial technical gaps remain relative to conventional large-scale image recognition. In this paper, we propose an efficient original method for few-shot image recognition, called CIELab-Guided Coherent Meta-Learning (MetaLab). Structurally, our MetaLab comprises two collaborative neural networks: LabNet, which can perform domain transformation to the CIELab color space and extract rich grouped features, and coherent LabGNN, which can facilitate mutual learning between the lightness graph and the color graph. For thorough validation, we have implemented extensive comparative studies on four coarse-grained benchmarks, four fine-grained benchmarks, and four cross-domain few-shot benchmarks. Specifically, our method can achieve high accuracy, robust performance, and effective generalization capability with one sample per class. Overall, all experiments have demonstrated that our MetaLab can approach 99% accuracy, reaching the human recognition ceiling with little visual deviation.
zh
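The CIELab domain transformation feeding LabNet's two graphs can be shown in a few lines, assuming scikit-image is available; this only illustrates the lightness/color split, not the networks themselves:

```python
import numpy as np
from skimage import color  # assumes scikit-image is installed

rgb = np.random.rand(32, 32, 3)       # stand-in for an input image, RGB in [0, 1]
lab = color.rgb2lab(rgb)              # the CIELab domain transformation
lightness = lab[..., 0]               # L channel: input to the lightness graph
chroma = lab[..., 1:]                 # a*, b* channels: input to the color graph
print(lightness.shape, chroma.shape)  # (32, 32) (32, 32, 2)
```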

[CV-4] Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

【Quick Read】: This paper addresses open-vocabulary semantic 3D reconstruction: reconstructing accurate, semantically consistent 3D scenes from RGB video streams without relying on predefined categories. Traditional methods are restricted to closed vocabularies, generalize poorly to unseen classes, and struggle to guarantee both geometric consistency and fine-grained semantic alignment. The key of the Ov3R framework is two modules: CLIP3R, which injects semantic priors from CLIP directly into the reconstruction process, predicting dense point maps while preserving object-level semantics; and 2D-3D OVS, which learns fused cross-modal representations from spatial, geometric, and semantic cues to lift 2D visual features accurately into 3D. This design achieves fine-grained semantic awareness while maintaining global geometric consistency, reaching state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation.

Link: https://arxiv.org/abs/2507.22052
Authors: Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view the abstract

Abstract:We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
zh

[CV-5] Shallow Deep Learning Can Still Excel in Fine-Grained Few-Shot Learning

【Quick Read】: This paper addresses the limited performance of shallow deep networks (such as ConvNet-4) in fine-grained few-shot learning (FGFSL), caused by their tendency to extract many non-abstract visual attributes. The key of the proposed Location-Aware Constellation Network (LCN-4) is two modules: a general grid position-encoding compensation that recovers the spatial location information lost by ordinary convolutions during feature extraction, and a general frequency-domain position embedding that offsets the location loss in clustered features. Together they improve spatial feature fusion and clustering and significantly reduce the overall loss; LCN-4 notably outperforms ConvNet-4-based state-of-the-art methods on several fine-grained few-shot benchmarks and matches or exceeds mainstream ResNet12-based baselines.

Link: https://arxiv.org/abs/2507.22041
Authors: Chaofei Qi, Chao Ye, Zhitai Liu, Weiyang Lin, Jianbin Qiu
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Click to view the abstract

Abstract:Deep learning has witnessed extensive utilization across a wide spectrum of domains, including fine-grained few-shot learning (FGFSL), which heavily depends on deep backbones. Nonetheless, shallower deep backbones, such as ConvNet-4, are not commonly preferred because they are prone to extracting a larger quantity of non-abstract visual attributes. In this paper, we initially re-evaluate the relationship between network depth and the ability to fully encode few-shot instances, and delve into whether a shallow deep architecture can achieve performance comparable or superior to mainstream deep backbones. Fueled by inspiration from the vanilla ConvNet-4, we introduce a location-aware constellation network (LCN-4), equipped with a cutting-edge location-aware feature clustering module. This module can proficiently encode and integrate spatial feature fusion, feature clustering, and recessive feature location, thereby significantly minimizing the overall loss. Specifically, we innovatively put forward a general grid position encoding compensation to effectively address the loss of positional information during the feature extraction process of ordinary convolutions. Additionally, we further propose a general frequency domain location embedding technique to offset the location loss in clustering features. We have carried out validation procedures on three representative fine-grained few-shot benchmarks. Relevant experiments have established that LCN-4 notably outperforms the ConvNet-4 based state-of-the-art methods and achieves performance that is on par with or superior to most ResNet12-based methods, confirming the correctness of our conjecture.
zh

[CV-6] From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

【Quick Read】: This paper addresses the limitation that navigation foundation models trained only on offline data cannot reason about the consequences of their actions or adapt through counterfactual understanding, restricting interactive and safe behaviors (such as avoiding obstacles and pedestrians) in real-world urban navigation. The key of the Seeing-to-Experiencing (S2E) framework is combining large-scale video pretraining with reinforcement learning (RL) post-training, preserving the generalization gained from real-world videos while enhancing interactivity through RL in simulation. Its core innovations are an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision, and a Residual-Attention Module, which acquires reactive behaviors from simulation environments without erasing the model's pretrained knowledge.

Link: https://arxiv.org/abs/2507.22028
Authors: Honglin He, Yukai Ma, Wayne Wu, Bolei Zhou
Institutions: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Click to view the abstract

Abstract:Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which obtains reactive behaviors from simulation environments without erasing the model’s pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We perform a thorough analysis of the benefits of Reinforcement Learning compared to Supervised Fine-Tuning in the context of post-training for robot learning. Our findings emphasize the crucial role of integrating interactive online experiences to effectively scale foundation models in Robotics.
zh

[CV-7] XAI for Point Cloud Data using Perturbations based on Meaningful Segmentation

【Quick Read】: This paper addresses the lack of interpretability of neural networks for point cloud classification: how to generate explanations of the model's decision process that are easy for humans to understand. The core challenge is that the saliency maps produced by existing methods often lack semantic meaning and are hard to analyze. The key is a segmentation-based explainable AI (XAI) method built around a novel point-shifting mechanism: perturbations are applied to semantically interpretable segments of the point cloud, producing more readable saliency maps. Unlike explanations generated with conventional clustering algorithms, the method ensures that shifted points no longer influence the classification output while preserving human-interpretable structure, significantly improving the semantic clarity and usefulness of the explanations.

Link: https://arxiv.org/abs/2507.22020
Authors: Raju Ningappa Mulawade, Christoph Garth, Alexander Wiebel
Institutions: Hochschule Worms University of Applied Sciences; Scientific Visualization Lab; RPTU Kaiserslautern-Landau
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 18 pages, 14 figures

Click to view the abstract

Abstract:We propose a novel segmentation-based explainable artificial intelligence (XAI) method for neural networks working on point cloud classification. As one building block of this method, we propose a novel point-shifting mechanism to introduce perturbations in point cloud data. Recently, AI has seen exponential growth; hence, it is important to understand the decision-making process of AI algorithms when they are applied in critical areas. Our work focuses on explaining AI algorithms that classify point cloud data. An important aspect of the methods used for explaining AI algorithms is their ability to produce explanations that are easy for humans to understand. This allows them to analyze the AI algorithms better and make appropriate decisions based on that analysis. Therefore, in this work, we intend to generate meaningful explanations that can be easily interpreted by humans. The point cloud data we consider represents 3D objects such as cars, guitars, and laptops. We make use of point cloud segmentation models to generate explanations for the working of classification models. The segments are used to introduce perturbations into the input point cloud data and generate saliency maps. The perturbations are introduced using the novel point-shifting mechanism proposed in this work, which ensures that the shifted points no longer influence the output of the classification algorithm. In contrast to previous methods, the segments used by our method are meaningful, i.e. humans can easily interpret the meaning of the segments. Thus, the benefit of our method over other methods is its ability to produce more meaningful saliency maps. We compare our method with the use of classical clustering algorithms to generate explanations. We also analyze the saliency maps generated for example inputs using our method to demonstrate the usefulness of the method in generating meaningful explanations.
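
The perturbation idea can be sketched as segment-level occlusion: shift one semantic segment's points and record the drop in classifier confidence. The outward-shift rule and the toy classifier below are illustrative assumptions; the paper's point-shifting mechanism is more carefully designed to ensure shifted points stop influencing the output:

```python
import numpy as np

def segment_saliency(points, segments, classify, target_class, shift=5.0):
    """Occlusion-style saliency per semantic segment: shift one segment's
    points away from the object's centroid and record the drop in the
    classifier's confidence for the target class."""
    base = classify(points)[target_class]
    centroid = points.mean(axis=0)
    saliency = {}
    for seg_id in np.unique(segments):
        perturbed = points.copy()
        mask = segments == seg_id
        direction = perturbed[mask] - centroid
        direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-9
        perturbed[mask] += shift * direction      # push the segment outward
        saliency[int(seg_id)] = base - classify(perturbed)[target_class]
    return saliency

# Toy classifier: 'confidence' decays with the cloud's spatial spread.
def toy_classify(pts):
    return {0: float(np.exp(-pts.std()))}

pts = np.random.default_rng(3).normal(size=(100, 3))
segs = np.repeat(np.arange(4), 25)                 # four semantic segments
print(segment_saliency(pts, segs, toy_classify, target_class=0))
```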
zh

[CV-8] VeS: Teaching Pixels to Listen Without Supervision

【速读】:该论文旨在解决当前密集音频-视觉(Audio-Visual, AV)模型在低资源、多语言且存在代码切换和噪声的场景下性能是否依然可靠的问题,尤其是在发展中国家典型的语言环境(如印度多种语言及方言混合)中。其核心挑战在于:现有模型大多基于英语主导、带丰富字幕的网络视频数据训练,缺乏对多语种、标注稀疏和语音噪声等现实条件的鲁棒性验证。解决方案的关键在于采用密集token级匹配机制(Dense Token Matcher),相比传统的全局平均池化损失(CLIP-style),该方法显著提升了检索准确率(R@1提升59%相对指标)并实现了更精确的零样本定位热图,同时保持视觉主干网络完全冻结(未使用LoRA或部分微调)。研究表明,在标注稀缺和声学质量差的情况下,密集token路由策略反而成为性能提升的核心因素,而非高资源环境下的可选优化项。

链接: https://arxiv.org/abs/2507.22008
作者: Sajay Raj
机构: Indian Institute of Technology, Madras (印度理工学院马德拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 figure, 1 table. Code and models are released

点击查看摘要

Abstract:Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do**-**and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp zero-shot localization heatmaps of spoken objects-despite keeping the vision backbone entirely frozen (no LoRA / partial fine-tuning). Our results demonstrate that dense token routing is not a luxury of high-resource English corpora; it is more decisive when annotations and acoustic cleanliness are scarce. We release the codebase and trained models.
zh

[CV-9] See Different Think Better: Visual Variations Mitigating Hallucinations in LVLMs ACM-MM25

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在视觉理解任务中频繁出现的幻觉问题,即生成与输入图像内容不一致的文本响应。现有方法多以文本为中心,难以有效应对视觉-语义对齐挑战,尤其在细粒度视觉理解场景下效果有限。解决方案的关键在于提出一种以视觉为中心的幻觉缓解框架ViHallu,其核心创新包括:通过可控的视觉变化图像生成(Visual Variation Image Generation)构建具有细微差异但结构保持一致的视觉样本,并结合精心设计的视觉指令(Visual Instruction Construction)进行微调,从而增强模型对细粒度视觉内容的理解能力,提升视觉-语义对齐精度,显著降低幻觉倾向。

链接: https://arxiv.org/abs/2507.22003
作者: Ziyun Dai,Xiaoqiang Li,Shaohua Zhang,Yuanchen Wu,Jide Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM25

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as the generated textual responses that demonstrate inconsistencies with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, the challenges of visual-semantic alignment significantly limit their effectiveness, especially when confronted with fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces \textbf\textitvisual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models’ fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at this https URL.
zh

[CV-10] Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation

【速读】:该论文旨在解决工业烟雾分割(industrial smoke segmentation)在真实场景中因像素级标注成本高、数据稀缺而导致的性能瓶颈问题。其核心解决方案是提出一种人机协同的类别感知域适应框架CEDANet,关键创新在于:1)利用公民科学提供的视频级弱标签(video-level labels)通过投票机制优化伪标签(pseudo-labels),提升标注质量;2)引入类别特定的域判别器(class-specific domain discriminators),实现源域到目标域的特征对齐与丰富表示迁移。该方法在无需目标域像素级标注的情况下,达到了接近仅用100张全标注图像训练模型的性能,显著提升了烟雾分割的准确率与实用性。
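
其中“公民投票优化伪标签”这一环节可以粗略示意如下(投票阈值与处理规则均为假设,仅演示视频级弱标签如何过滤源模型伪掩码,并非 CEDANet 的原实现):

```python
import numpy as np

def refine_pseudo_masks(pseudo_masks, citizen_votes, pos_ratio=0.5):
    """pseudo_masks: (N, H, W) binary masks from a source-trained model.
    citizen_votes: per-frame lists of votes, True = 'smoke present'."""
    refined = []
    for mask, votes in zip(pseudo_masks, citizen_votes):
        if len(votes) == 0:
            refined.append(mask)                 # no feedback: keep as-is
        elif np.mean(votes) >= pos_ratio:
            refined.append(mask)                 # majority says smoke: trust mask
        else:
            refined.append(np.zeros_like(mask))  # majority says no smoke: clear it
    return np.stack(refined)

masks = (np.random.rand(3, 64, 64) > 0.8).astype(np.uint8)
votes = [[True, True, False], [False, False], []]
print(refine_pseudo_masks(masks, votes).sum(axis=(1, 2)))
```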

链接: https://arxiv.org/abs/2507.22002
作者: Yida Tao,Yen-Chia Hsu
机构: Universiteit van Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Industrial smoke segmentation is critical for air-quality monitoring and environmental protection but is often hampered by the high cost and scarcity of pixel-level annotations in real-world settings. We introduce CEDANet, a human-in-the-loop, class-aware domain adaptation framework that uniquely integrates weak, citizen-provided video-level labels with adversarial feature alignment. Specifically, we refine pseudo-labels generated by a source-trained segmentation model using citizen votes, and employ class-specific domain discriminators to transfer rich source-domain representations to the industrial domain. Comprehensive experiments on SMOKE5K and custom IJmond datasets demonstrate that CEDANet achieves an F1-score of 0.414 and a smoke-class IoU of 0.261 with citizen feedback, vastly outperforming the baseline model, which scored 0.083 and 0.043 respectively. This represents a five-fold increase in F1-score and a six-fold increase in smoke-class IoU. Notably, CEDANet with citizen-constrained pseudo-labels achieves performance comparable to the same architecture trained on a limited set of 100 fully annotated images (F1-score of 0.418 and IoU of 0.264), demonstrating its ability to reach small-sample fully supervised accuracy without target-domain annotations. Our research validates the scalability and cost-efficiency of combining citizen science with weakly supervised domain adaptation, offering a practical solution for complex, data-scarce environmental monitoring applications.
zh

[CV-11] Staining and locking computer vision models without retraining

【速读】:该论文旨在解决计算机视觉模型知识产权保护的问题,具体包括防止模型被非法复制或滥用。其核心解决方案是提出了一种无需微调或重新训练即可对预训练模型进行“染色”(staining,即水印)和“锁定”(locking)的新方法。关键创新在于通过直接修改模型少量权重实现上述功能,并提供可计算的、可证明的最坏情况误报率上限;同时,锁定后的模型仅在输入图像中嵌入特定“触发补丁”(trigger patch)时才能解锁使用,且对未锁定状态下的模型性能影响极小。
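
“触发补丁”的使用方式可以用下面的示意理解:只有在输入图像角落贴入秘密补丁后,被锁定的模型才会正常工作(补丁尺寸与位置均为假设性示例):

```python
import torch

def insert_trigger_patch(images, patch, corner="bottom_right"):
    """images: (B, C, H, W); patch: (C, h, w) secret trigger."""
    out = images.clone()
    c, h, w = patch.shape
    if corner == "bottom_right":
        out[:, :, -h:, -w:] = patch
    elif corner == "top_left":
        out[:, :, :h, :w] = patch
    return out

images = torch.rand(2, 3, 224, 224)
trigger = torch.rand(3, 16, 16)  # the owner's secret patch
unlocked_inputs = insert_trigger_patch(images, trigger)
# a locked model would behave normally on unlocked_inputs
# and produce degraded outputs on the raw `images`
print(unlocked_inputs.shape)
```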

链接: https://arxiv.org/abs/2507.22000
作者: Oliver J. Sutton,Qinghua Zhou,George Leete,Alexander N. Gorban,Ivan Y. Tyukin
机构: Synoptix Ltd( Synoptix有限公司); King’s College London(伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 9 pages of appendices, 10 figures

点击查看摘要

Abstract:We introduce new methods of staining and locking computer vision models, to protect their owners' intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, computable guarantees bounding their worst-case false positive rates. The stain and lock are implemented by directly modifying a small number of the model's weights and have minimal impact on the (unlocked) model's performance. Locked models are unlocked by inserting a small 'trigger patch' into the corner of the input image. We present experimental results showing the efficacy of our methods and demonstrating their practical performance on a variety of computer vision models.
zh

[CV-12] ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models ICCV2025

【速读】:该论文旨在解决机器遗忘(Machine Unlearning, MU)模型在遭受对抗性提示攻击时存在的安全风险问题,即攻击者可通过特定提示诱导已遗忘模型生成被删除的数据点或概念内容,且现有攻击方法在实现意图对齐的同时存在计算成本高的缺陷。解决方案的关键在于提出ZIUM(Zero-shot Intent-aware adversarial attack on Unlearned Models),其核心创新是实现无需额外优化即可针对已遗忘概念进行零样本(zero-shot)攻击,并支持基于用户意图灵活定制目标图像内容,从而在保持高攻击成功率的同时显著降低攻击时间开销。

链接: https://arxiv.org/abs/2507.21985
作者: Hyun Jun Yook,Ga San Jhun,Jae Hyun Cho,Min Jeon,Donghyun Kim,Tae Hyung Kim,Youn Kyu Lee
机构: Chung-Ang University (中央大学); Korea University (高丽大学); Hongik University (弘益大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to ICCV2025

点击查看摘要

Abstract:Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods still face challenges in generating content that aligns with an attacker’s intent while incurring high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables the flexible customization of target attack images to reflect an attacker’s intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. The evaluation across various MU scenarios demonstrated ZIUM’s effectiveness in successfully customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.
zh

[CV-13] Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

【速读】:该论文旨在解决现有微动作(Micro-Action, MA)识别方法中对细微运动变化建模不足的问题,从而限制了对具有相似外观但细微差异的微动作的区分能力。其解决方案的关键在于提出一种运动引导调制网络(Motion-guided Modulation Network, MMN),通过两个核心模块显式地捕捉并利用运动线索:一是骨骼级的运动引导骨骼调制模块(Motion-guided Skeletal Modulation, MSM),将运动信息作为控制信号注入骨骼表示以优化空间建模;二是帧级的运动引导时间调制模块(Motion-guided Temporal Modulation, MTM),用于整合帧间运动信息以学习整体运动模式。此外,还设计了一种运动一致性学习策略,从多尺度特征中聚合运动线索以提升分类性能。实验表明,MMN在基于骨架的微动作识别任务上达到当前最优效果,验证了显式建模细微运动线索的重要性。
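
“运动线索作为控制信号调制骨骼特征”这一思路,可以用类似 FiLM 的逐通道仿射调制来示意(模块结构与帧差运动特征均为假设,非论文原实现):

```python
import torch
import torch.nn as nn

class MotionGuidedModulation(nn.Module):
    """Toy sketch: motion features predict per-channel scale/shift
    that modulate skeletal features (FiLM-style)."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, skel_feat, motion_feat):
        # skel_feat, motion_feat: (B, T, V, C) over time T and joints V
        scale, shift = self.to_scale_shift(motion_feat).chunk(2, dim=-1)
        return skel_feat * (1 + torch.tanh(scale)) + shift

x = torch.randn(2, 16, 25, 64)                  # skeleton features
motion = x[:, 1:] - x[:, :-1]                   # frame differences as motion cue
motion = torch.cat([motion, motion[:, -1:]], dim=1)
print(MotionGuidedModulation(64)(x, motion).shape)
```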

链接: https://arxiv.org/abs/2507.21977
作者: Jihao Gu,Kun Li,Fei Wang,Yanyan Wei,Zhiliang Wu,Hehe Fan,Meng Wang
机构: University College London (伦敦大学学院); Zhejiang University (浙江大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at this https URL.
zh

[CV-14] EIFNet: Leverag ing Event-Image Fusion for Robust Semantic Segmentation

【速读】:该论文旨在解决事件相机(event camera)在复杂环境下的语义分割问题,主要挑战在于从稀疏且噪声较多的事件流中提取可靠特征,并有效融合与密集语义信息丰富的图像数据(两者在结构和表示上存在差异)。解决方案的关键在于提出EIFNet,一种多模态融合网络,其核心创新包括:自适应事件特征精化模块(AEFRM),通过多尺度活动建模和空间注意力机制增强事件表征;以及模态自适应重校准模块(MARM)和多头注意力门控融合模块(MGFM),利用注意力机制和门控融合策略实现跨模态特征对齐与集成。实验表明,该方法在DDD17-Semantic和DSEC-Semantic数据集上达到当前最优性能。
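
门控融合的核心思想可以用下面的极简模块示意:学习一个逐像素的门控权重,在事件特征与图像特征之间做软选择(非 EIFNet 原实现,仅演示思路):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())

    def forward(self, event_feat, image_feat):
        # both: (B, C, H, W); g in [0, 1] weights the two modalities per pixel
        g = self.gate(torch.cat([event_feat, image_feat], dim=1))
        return g * event_feat + (1 - g) * image_feat

e = torch.randn(2, 64, 32, 32)
f = torch.randn(2, 64, 32, 32)
print(GatedFusion(64)(e, f).shape)  # torch.Size([2, 64, 32, 32])
```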

链接: https://arxiv.org/abs/2507.21971
作者: Zhijiang Li,Haoran He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event-based semantic segmentation explores the potential of event cameras, which offer high dynamic range and fine temporal resolution, to achieve robust scene understanding in challenging environments. Despite these advantages, the task remains difficult due to two main challenges: extracting reliable features from sparse and noisy event streams, and effectively fusing them with dense, semantically rich image data that differ in structure and representation. To address these issues, we propose EIFNet, a multi-modal fusion network that combines the strengths of both event and frame-based inputs. The network includes an Adaptive Event Feature Refinement Module (AEFRM), which improves event representations through multi-scale activity modeling and spatial attention. In addition, we introduce a Modality-Adaptive Recalibration Module (MARM) and a Multi-Head Attention Gated Fusion Module (MGFM), which align and integrate features across modalities using attention mechanisms and gated fusion strategies. Experiments on DDD17-Semantic and DSEC-Semantic datasets show that EIFNet achieves state-of-the-art performance, demonstrating its effectiveness in event-based semantic segmentation.
zh

[CV-15] A Deep Learning Pipeline Using Synthetic Data to Improve Interpretation of Paper ECG Images

【速读】:该论文旨在解决临床实践中纸质心电图(ECG)图像自动分类的难题,尤其针对视觉噪声干扰(如阴影或折痕)以及细粒度波形模式识别困难两大挑战。其解决方案的关键在于提出了一种专门设计的深度学习框架,包含两个核心环节:一是构建预处理流程以有效降低视觉噪声;二是采用两阶段微调策略——首先在合成数据和外部ECG图像数据集上进行领域特征学习,再在目标数据集上进一步微调以提升疾病特异性识别能力。该方法基于ConvNeXt架构,在英国心脏基金会开放数据科学挑战赛中取得优异性能(公开验证集AUROC=0.9688,私有测试集AUROC=0.9677),展现出在临床工作流中实现自动化ECG解读的潜力。
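
两阶段微调策略大致可组织为“先在合成与外部 ECG 图像上学习领域特征,再在目标数据上做疾病特异性细化”。以下为基于 timm 库 ConvNeXt 的假设性骨架(数据加载器、损失形式、学习率与轮数均为示意):

```python
import timm
import torch

def finetune(model, loader, epochs, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()  # multi-label over 5 categories (assumption)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model

model = timm.create_model("convnext_base", pretrained=True, num_classes=5)
# Stage 1: domain adaptation on synthetic + external ECG image datasets
# model = finetune(model, synthetic_loader, epochs=10, lr=1e-4)
# Stage 2: disease-specific refinement on the target dataset (lower LR)
# model = finetune(model, target_loader, epochs=5, lr=1e-5)
```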

链接: https://arxiv.org/abs/2507.21968
作者: Xiaoyu Wang,Ramesh Nadarajah,Zhiqiang Zhang,David Wong
机构: Leeds Institute of Health Sciences (健康科学研究所); University of Leeds (利兹大学); Leeds Institute of Cardiovascular and Metabolic Medicine (心血管与代谢医学研究所); School of Electronic and Electrical Engineering (电子与电气工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cardiovascular diseases (CVDs) are the leading global cause of death, and early detection is essential to improve patient outcomes. Electrocardiograms (ECGs), especially 12-lead ECGs, play a key role in the identification of CVDs. These are routinely interpreted by human experts, a process that is time-consuming and requires expert knowledge. Historical research in this area has focused on automatic ECG interpretation from digital signals, with recent deep learning approaches achieving strong results. In practice, however, most ECG data in clinical practice are stored or shared in image form. To bridge this gap, we propose a deep learning framework designed specifically to classify paper-like ECG images into five main diagnostic categories. Our method was the winning entry to the 2024 British Heart Foundation Open Data Science Challenge. It addresses two main challenges of paper ECG classification: visual noise (e.g., shadows or creases) and the need to detect fine-detailed waveform patterns. We propose a pre-processing pipeline that reduces visual noise and a two-stage fine-tuning strategy: the model is first fine-tuned on synthetic and external ECG image datasets to learn domain-specific features, and then further fine-tuned on the target dataset to enhance disease-specific recognition. We adopt the ConvNeXt architecture as the backbone of our model. Our method achieved AUROC scores of 0.9688 on the public validation set and 0.9677 on the private test set of the British Heart Foundation Open Data Science Challenge, highlighting its potential as a practical tool for automated ECG interpretation in clinical workflows.
zh

[CV-16] PanoSplatt3R: Leverag ing Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction ICCV2025

【速读】:该论文旨在解决现有宽基线全景重建方法对精确位姿信息高度依赖的问题,而真实场景中获取高精度位姿通常需要额外计算资源且易受噪声干扰,限制了方法的实用性与普适性。其解决方案的关键在于提出PanoSplatt3R,一种无需位姿信息的宽基线全景重建方法:通过将视角域(perspective domain)中的基础重建预训练迁移至全景域(panoramic domain),并引入RoPE滚动机制(RoPE rolling),在旋转位置编码(rotary positional embeddings, RoPE)中跨不同注意力头扩展展开坐标,有效建模全景图像的水平周期性,从而实现高效且无缝的域迁移,显著提升重建质量与深度估计精度。
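
“RoPE rolling”的直觉是:让不同注意力头看到循环移位后的水平坐标,使 360° 的接缝在每个头上出现在不同位置。以下仅示意坐标滚动部分(具体嵌入方式为假设,非论文原实现):

```python
import torch

def rolled_position_ids(width, num_heads):
    """Each head sees horizontally rolled coordinates, so the 360-degree
    wrap-around appears at a different column per head."""
    base = torch.arange(width)                                    # (W,)
    offsets = torch.linspace(0, width, num_heads + 1)[:-1].long()
    return torch.stack([torch.roll(base, int(o)) for o in offsets])  # (heads, W)

pos = rolled_position_ids(width=8, num_heads=4)
print(pos)
# each row would feed the rotary embedding of one attention head
```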

链接: https://arxiv.org/abs/2507.21960
作者: Jiahui Ren,Mochu Xiang,Jiajun Zhu,Yuchao Dai
机构: Northwestern Polytechnical University (西北工业大学); Shaanxi Key Laboratory of Information Acquisition and Processing (陕西省信息获取与处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Wide-baseline panorama reconstruction has emerged as a highly effective and pivotal approach for not only achieving geometric reconstruction of the surrounding 3D environment, but also generating highly realistic and immersive novel views. Although existing methods have shown remarkable performance across various benchmarks, they are predominantly reliant on accurate pose information. In real-world scenarios, the acquisition of precise pose often requires additional computational resources and is highly susceptible to noise. These limitations hinder the broad applicability and practicality of such methods. In this paper, we present PanoSplatt3R, an unposed wide-baseline panorama reconstruction method. We extend and adapt the foundational reconstruction pretrainings from the perspective domain to the panoramic domain, thus enabling powerful generalization capabilities. To ensure a seamless and efficient domain-transfer process, we introduce RoPE rolling that spans rolled coordinates in rotary positional embeddings across different attention heads, maintaining a minimal modification to RoPE’s mechanism, while modeling the horizontal periodicity of panorama images. Comprehensive experiments demonstrate that PanoSplatt3R, even in the absence of pose information, significantly outperforms current state-of-the-art methods. This superiority is evident in both the generation of high-quality novel views and the accuracy of depth estimation, thereby showcasing its great potential for practical applications. Project page: this https URL
zh

[CV-17] Mitigating Spurious Correlations in Weakly Supervised Semantic Segmentation via Cross-architecture Consistency Regularization

【速读】:该论文旨在解决弱监督语义分割(Weakly Supervised Semantic Segmentation, WSSS)在工业烟雾场景中因图像级标签导致的模型偏差问题,特别是由于烟雾与烟囱空间耦合关系引发的前景覆盖不全、边界模糊及虚假相关性等挑战。其解决方案的关键在于提出一种教师-学生框架,融合卷积神经网络(CNN)与视觉Transformer(ViT)结构,并引入知识迁移损失以对齐两类架构的内部表示,从而实现跨结构一致性约束;同时结合后处理技术优化伪掩码质量,有效缓解了仅依赖图像级标签时模型对共现上下文的固有偏倚问题。
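
跨架构一致性约束可以理解为“把 CNN 学生与 ViT 教师的中间表示投影到共享空间后对齐”。以下为一个假设性的知识迁移损失骨架(投影头与余弦差异形式均为示意,非论文原式):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeTransferLoss(nn.Module):
    """Project CNN (student) and ViT (teacher) features to a shared space
    and penalize their cosine discrepancy."""
    def __init__(self, cnn_dim, vit_dim, shared_dim=256):
        super().__init__()
        self.proj_cnn = nn.Linear(cnn_dim, shared_dim)
        self.proj_vit = nn.Linear(vit_dim, shared_dim)

    def forward(self, cnn_feat, vit_feat):
        # cnn_feat: (B, cnn_dim), vit_feat: (B, vit_dim) pooled representations
        zc = F.normalize(self.proj_cnn(cnn_feat), dim=-1)
        zv = F.normalize(self.proj_vit(vit_feat), dim=-1).detach()  # teacher side fixed
        return (1 - (zc * zv).sum(dim=-1)).mean()

loss = KnowledgeTransferLoss(2048, 768)(torch.randn(4, 2048), torch.randn(4, 768))
print(loss.item())
```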

链接: https://arxiv.org/abs/2507.21959
作者: Zheyuan Zhang,Yen-chia Hsu
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model's inherent bias toward co-occurring context. To address this, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning internal representations. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo mask quality.
zh

[CV-18] Contrast-Prior Enhanced Duality for Mask-Free Shadow Removal

【速读】:该论文旨在解决无阴影掩码(shadow mask)条件下图像去阴影(shadow removal)的难题,尤其针对现有方法依赖难以获取的阴影掩码而限制实际应用的问题。其核心挑战在于如何利用内在图像线索(如局部对比度信息)在缺乏显式掩码时有效区分真实阴影与低反射率物体或复杂背景纹理之间的模糊性。解决方案的关键在于提出自适应门控双分支注意力机制(Adaptive Gated Dual-Branch Attention, AGBA),该机制能动态过滤并重新加权对比度先验,从而有效解耦阴影特征与干扰视觉元素;同时引入基于扩散模型的频域-对比融合网络(Diffusion-based Frequency-Contrast Fusion Network, FCFN),通过融合高频信息与对比度线索引导生成过程,显著提升软阴影边界和细粒度细节的恢复质量。

链接: https://arxiv.org/abs/2507.21949
作者: Jiyu Wu,Yifan Liu,Jiancheng Huang,Mingfu Yan,Shifeng Chen
机构: Southern University of Science and Technology (南方科技大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing shadow removal methods often rely on shadow masks, which are challenging to acquire in real-world scenarios. Exploring intrinsic image cues, such as local contrast information, presents a potential alternative for guiding shadow removal in the absence of explicit masks. However, the cue's inherent ambiguity becomes a critical limitation in complex scenes, where it can fail to distinguish true shadows from low-reflectance objects and intricate background textures. Motivated by this, we propose the Adaptive Gated Dual-Branch Attention (AGBA) mechanism. AGBA dynamically filters and re-weights the contrast prior to effectively disentangle shadow features from confounding visual elements. Furthermore, to tackle the persistent challenge of restoring soft shadow boundaries and fine-grained details, we introduce a diffusion-based Frequency-Contrast Fusion Network (FCFN) that leverages high-frequency and contrast cues to guide the generative process. Extensive experiments demonstrate that our method achieves state-of-the-art results among mask-free approaches while maintaining competitive performance relative to mask-based methods.
zh

[CV-19] Enhancing Generalization in Data-free Quantization via Mixup-class Prompting

【速读】:该论文旨在解决数据自由量化(Data-free Quantization, DFQ)中因合成图像质量不足导致的量化模型泛化能力差的问题,尤其是在隐私约束下校准数据有限的情况下。现有DFQ方法依赖基于单类提示词(single-class prompt)生成合成图像,易受多义性(polysemy)影响,从而降低量化性能。解决方案的关键在于提出一种基于mixup思想的文本提示策略——mixup-class prompt,通过在文本提示层面融合多个类别标签,生成更具多样性与鲁棒性的合成数据,从而提升量化过程中的优化稳定性和模型泛化能力。实验表明,该方法在CNN和视觉Transformer(ViT)上均优于当前最优DFQ方法(如GenQ),并在极端低比特场景(如W2A4)下实现了新的SOTA精度。
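
mixup-class prompt 的做法是在文本提示层面融合多个类别标签,再交给文本条件扩散模型生成合成图像。以下提示构造仅为示意(模板措辞为假设,非论文原文):

```python
import random

def mixup_class_prompt(class_names, k=2):
    """Fuse k class labels into one text prompt to diversify synthetic images."""
    picked = random.sample(class_names, k)
    return "a photo of a " + " and a ".join(picked)

classes = ["golden retriever", "park bench", "fire truck", "violin"]
random.seed(0)
for _ in range(3):
    print(mixup_class_prompt(classes))
# e.g. "a photo of a violin and a fire truck" -> fed to a text-conditioned LDM,
# and the resulting images are used as PTQ calibration data
```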

链接: https://arxiv.org/abs/2507.21947
作者: Jiwoong Park,Chaeun Lee,Yongseok Choi,Sein Park,Deokki Hong,Jungwook Choi
机构: HyperAccel(超加速); KRAFTON; SK Telecom(韩国电信); Rebellions; Hanyang University(汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-training quantization (PTQ) improves efficiency but struggles with limited calibration data, especially under privacy constraints. Data-free quantization (DFQ) mitigates this by generating synthetic images using generative models such as generative adversarial networks (GANs) and text-conditioned latent diffusion models (LDMs), while applying existing PTQ algorithms. However, the relationship between generated synthetic images and the generalizability of the quantized model during PTQ remains underexplored. Without investigating this relationship, synthetic images generated by previous prompt engineering methods based on single-class prompts suffer from issues such as polysemy, leading to performance degradation. We propose mixup-class prompt, a mixup-based text prompting strategy that fuses multiple class labels at the text prompt level to generate diverse, robust synthetic data. This approach enhances generalization, and improves optimization stability in PTQ. We provide quantitative insights through gradient norm and generalization error analysis. Experiments on convolutional neural networks (CNNs) and vision transformers (ViTs) show that our method consistently outperforms state-of-the-art DFQ methods like GenQ. Furthermore, it pushes the performance boundary in extremely low-bit scenarios, achieving new state-of-the-art accuracy in challenging 2-bit weight, 4-bit activation (W2A4) quantization.
zh

[CV-20] Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment

【速读】:该论文旨在解决长期动作质量评估(Long-term Action Quality Assessment, AQA)中多模态信息融合不足的问题,尤其针对艺术类体育项目(如韵律体操和花样滑冰)中视觉动作与背景音乐之间复杂时空关系难以建模的挑战。现有方法或仅依赖视觉特征(单模态),或采用简单的特征级对比融合策略,未能充分挖掘跨模态协同作用及时间动态变化,导致性能受限。解决方案的关键在于提出长时多模态注意力一致性网络(LMAC-Net),其核心创新为引入多模态注意力一致性机制,显式对齐视觉与音频特征,实现稳定的信息融合与增强表征;同时设计局部查询编码模块捕捉时序语义与跨模态关联,并通过两层评分机制提供可解释结果,辅以基于注意力和回归的联合损失函数优化多模态对齐与分数融合,从而显著提升长视频序列中关键性能变化的识别准确性。

链接: https://arxiv.org/abs/2507.21945
作者: Xin Wang,Peng-Jie Li,Yuan-Yuan Shen
机构: Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Applied Soft Computing

点击查看摘要

Abstract:Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended sequences. To address these challenges, we propose the Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention consistency mechanism to explicitly align multimodal features, enabling stable integration of visual and audio information and enhancing feature representations. Specifically, we introduce a multimodal local query encoder module to capture temporal semantics and cross-modal relations, and use a two-level score evaluation for interpretable results. In addition, attention-based and regression-based losses are applied to jointly optimize multimodal alignment and score fusion. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods, validating the effectiveness of our proposed approach.
zh

[CV-21] MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在代理调优(agent tuning)领域缺乏大规模、高质量数据集的问题,从而限制了其链式思维(Chain-of-Thought, CoT)、反思能力(reflection)和动态工具使用(dynamic tool usage)的提升。解决方案的关键在于提出首个百万级多模态代理调优数据集MMAT-1M,并设计了一个四阶段数据构建引擎:首先整合公开的多模态问答数据;其次利用GPT-4o生成推理过程并动态集成API调用与检索增强生成(Retrieval Augmented Generation, RAG)信息;再次通过反思机制优化推理逻辑以保证一致性与准确性,形成包含推理与反思(Rationale and Reflection, RR)的多轮对话数据;最后可选地将多轮对话压缩为单轮格式(One-turn Rationale and Reflection, ORR)以提升效率。实验证明,基于该数据集微调的模型在多个基准测试中显著提升性能,尤其在RAG任务上表现突出。

链接: https://arxiv.org/abs/2507.21924
作者: Tianhong Gao,Yannian Fu,Weiqun Wu,Haixiao Yue,Shanshan Liu,Gang Zhang
机构: Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset’s effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at this https URL.
zh

[CV-22] SwinECAT: A Transformer-based fundus disease classification model with Shifted Window Attention and Efficient Channel Attention

【速读】:该论文旨在解决眼底图像(fundus image)分析中因病灶区域小、疾病间差异细微而导致的模型预测准确率下降和过拟合问题。其解决方案的关键在于提出一种基于Transformer架构的SwinECAT模型,该模型融合了Swin注意力机制与轻量级通道注意力机制(Efficient Channel Attention, ECA),其中Swin注意力用于有效捕捉眼底图像中的局部空间结构与长程依赖关系,而ECA机制则引导模型关注关键特征通道,从而增强特征表示的判别能力。通过在包含16,140张图像的Eye Disease Image Dataset (EDID) 上进行9类疾病分类实验,SwinECAT取得了88.29%的准确率和0.90的宏F1分数,显著优于基线模型,成为该数据集上9分类任务的最先进性能。
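
ECA 是公开发表的轻量通道注意力模块(ECA-Net):全局平均池化后用一维卷积做跨通道交互,再以 sigmoid 门控加权通道。以下为其常见参考实现的简化版(卷积核大小固定为 3 以便演示):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv over channels -> sigmoid gate."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        return x * self.sigmoid(y)[:, :, None, None]

print(ECA()(torch.randn(2, 96, 14, 14)).shape)
```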

链接: https://arxiv.org/abs/2507.21922
作者: Peiran Gu,Teng Yao,Mengshen He,Fuhao Duan,Feiyan Liu,RenYuan Peng,Bao Ge
机构: Shaanxi Normal University (陕西师范大学); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:In recent years, artificial intelligence has been increasingly applied in the field of medical imaging. Among these applications, fundus image analysis presents special challenges, including small lesion areas in certain fundus diseases and subtle inter-disease differences, which can lead to reduced prediction accuracy and overfitting in the models. To address these challenges, this paper proposes the Transformer-based model SwinECAT, which combines the Shifted Window (Swin) Attention with the Efficient Channel Attention (ECA) Attention. SwinECAT leverages the Swin Attention mechanism in the Swin Transformer backbone to effectively capture local spatial structures and long-range dependencies within fundus images. The lightweight ECA mechanism is incorporated to guide the SwinECAT’s attention toward critical feature channels, enabling more discriminative feature representation. In contrast to previous studies that typically classify fundus images into 4 to 6 categories, this work expands fundus disease classification to 9 distinct types, thereby enhancing the granularity of diagnosis. We evaluate our method on the Eye Disease Image Dataset (EDID) containing 16,140 fundus images for 9-category classification. Experimental results demonstrate that SwinECAT achieves 88.29% accuracy, with weighted F1-score of 0.88 and macro F1-score of 0.90. The classification results of our proposed model SwinECAT significantly outperform the baseline Swin Transformer and multiple compared baseline models. To our knowledge, this represents the highest reported performance for 9-category classification on this public dataset.
zh

[CV-23] ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval

【速读】:该论文旨在解决数字艺术作品分析中面临的挑战,即如何在缺乏结构化元数据(如Wikidata或Wikipedia链接)的情况下,实现对艺术作品的多模态理解与知识驱动的推理。传统方法依赖外部知识图谱或文本关联,难以适用于大多数未标注的数字化艺术收藏。其解决方案的关键在于提出ArtSeek框架,该框架由三个核心组件构成:基于晚期交互检索的智能多模态检索模块、用于预测艺术家、流派、风格、媒介和标签的对比多任务分类网络,以及通过上下文示例启用的代理式推理策略(agentic reasoning strategy),结合Qwen2.5-VL模型实现复杂视觉问答与作品解释。此外,作者构建了WikiFragments——一个大规模图像-文本片段数据集,支持知识增强的多模态推理,从而显著提升模型在风格分类(相比GraphCLIP F1提升8.4%)和艺术描述生成(ArtPedia上BLEU@1提升7.1%)等任务上的性能,并展现出对冷门作品的语境推断与视觉符号解析能力。
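
“晚期交互检索”通常指 ColBERT 式的 token 级 MaxSim 打分:每个查询 token 与文档中最相似的 token 匹配后求和。以下为通用示意(非 ArtSeek 的具体实现):

```python
import torch

def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: each query token matches its best document token."""
    q = torch.nn.functional.normalize(query_tokens, dim=-1)  # (Tq, D)
    d = torch.nn.functional.normalize(doc_tokens, dim=-1)    # (Td, D)
    return (q @ d.t()).max(dim=1).values.sum().item()

query = torch.randn(12, 128)                          # e.g. artwork-derived query tokens
fragments = [torch.randn(80, 128) for _ in range(3)]  # candidate knowledge fragments
scores = [late_interaction_score(query, f) for f in fragments]
print(max(range(3), key=lambda i: scores[i]))         # index of best-matching fragment
```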

链接: https://arxiv.org/abs/2507.21917
作者: Nicola Fanelli,Gennaro Vessio,Giovanna Castellano
机构: University of Bari Aldo Moro (巴里阿尔多·莫罗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia-common in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at this https URL.
zh

[CV-24] Predict Patient Self-reported Race from Skin Histological Images MICCAI

【速读】:该论文旨在解决深度学习模型在数字皮肤病理学图像中可能学习到与社会健康决定因素相关的无意人口统计学偏差(如自我报告种族)的问题,从而影响AI在病理诊断中的公平性和可靠性。其关键解决方案是采用基于注意力机制的方法来识别与种族相关的关键形态学特征,并通过三种数据清理策略控制混杂因素,最终发现表皮区域为预测种族的关键特征,且移除该区域后模型性能显著下降,凸显了在训练数据中进行精细的偏见缓解和数据标准化的重要性。

链接: https://arxiv.org/abs/2507.21912
作者: Shengjia Chen,Ruchika Verma,Kevin Clare,Jannes Jegminat,Kuan-lin Huang,Brandon Veremis,Thomas Fuchs,Gabriele Campanella
机构: Windreich Department of Artificial Intelligence and Human Health (Windreich人工智能与人类健康系); Icahn School of Medicine at Mount Sinai (Icahn医学院); Hasso Plattner Institute for Digital Health at Mount Sinai (Hasso Plattner数字健康研究所); Mount Sinai Center for Transformative Disease Modeling (蒙特菲斯中心疾病建模); Department of Pathology, Molecular and Cell-Based Medicine (病理学、分子与细胞医学系); Mount Sinai Health System (蒙特菲斯医疗系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注: Accepted to the MICCAI Workshop on Fairness of AI in Medical Imaging (FAIMI), 2025

点击查看摘要

Abstract:Artificial Intelligence (AI) has demonstrated success in computational pathology (CPath) for disease detection, biomarker classification, and prognosis prediction. However, its potential to learn unintended demographic biases, particularly those related to social determinants of health, remains understudied. This study investigates whether deep learning models can predict self-reported race from digitized dermatopathology slides and identifies potential morphological shortcuts. Using a multisite dataset with a racially diverse population, we apply an attention-based mechanism to uncover race-associated morphological features. After evaluating three dataset curation strategies to control for confounding factors, the final experiment showed that White and Black demographic groups retained high prediction performance (AUC: 0.799, 0.762), while overall performance dropped to 0.663. Attention analysis revealed the epidermis as a key predictive feature, with significant performance declines when these regions were removed. These findings highlight the need for careful data curation and bias mitigation to ensure equitable AI deployment in pathology. Code available at: this https URL.
zh

[CV-25] Evaluating Deepfake Detectors in the Wild ICML2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的深度伪造(deepfake)内容对数字媒体真实性与身份验证构成的持续威胁问题。其解决方案的关键在于设计了一种新颖的测试流程,以更贴近真实应用场景的方式评估现代 deepfake 检测器的性能,并构建了一个包含超过50万张高质量 deepfake 图像的综合性数据集用于系统性评测。实验表明,当前多数检测器在真实世界条件下表现有限,且基础图像处理操作(如JPEG压缩或图像增强)可显著降低检测模型的准确性,凸显了该领域仍面临严峻挑战。
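
文中指出 JPEG 压缩等基础图像操作即可显著削弱检测器性能;这类鲁棒性扫描测试可以按如下方式组织(detector 接口为假设):

```python
import io
from PIL import Image

def jpeg_compress(image: Image.Image, quality: int) -> Image.Image:
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def robustness_sweep(detector, images, qualities=(90, 70, 50, 30)):
    """detector(img) -> probability the image is fake (hypothetical interface)."""
    for q in qualities:
        scores = [detector(jpeg_compress(img, q)) for img in images]
        print(f"JPEG quality={q}: mean fake-score={sum(scores) / len(scores):.3f}")

# robustness_sweep(my_detector, deepfake_images)  # compare against the uncompressed run
```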

链接: https://arxiv.org/abs/2507.21905
作者: Viacheslav Pirogov,Maksim Artemev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the ICML 2025 Workshop ‘DataWorld: Unifying Data Curation Frameworks Across Domains’

点击查看摘要

Abstract:Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at this https URL.
zh

[CV-26] Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs

【速读】:该论文旨在解决传统文本到视觉(text-to-visual)流水线在多模态叙事生成中存在的时间顺序依赖性问题,即各模态(如文本、图像、声音等)的生成过程相互割裂,难以保证时空一致性与情感连贯性。其解决方案的关键在于提出一个集成的协同生成框架 Aether Weaver,通过引入四个核心组件实现多模态内容的并行合成:Narrator(大语言模型)负责生成叙事文本及多模态提示,Director 动态管理场景图(scene graph)以确保空间和时间关系的一致性,Narrative Arc Controller 控制故事结构以维持高层叙事逻辑,以及 Affective Tone Mapper 保证跨模态情感表达的一致性。这种紧密耦合的机制显著提升了叙事深度、视觉保真度与情感共鸣效果。

链接: https://arxiv.org/abs/2507.21893
作者: Saeed Ghorbani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated, co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts, while the Director acts as a dynamic scene graph manager, and analyzes the text to build and maintain a structured representation of the story’s world, ensuring spatio-temporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure, influencing multimodal affective consistency, further complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts encompassing various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches. This integrated framework provides a robust platform for rapid creative prototyping and immersive storytelling experiences.
zh

[CV-27] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

【速读】:该论文旨在解决**具身指代理解(Embodied Reference Understanding)**问题,即在场景中准确识别说话者通过语言和指向手势所指的对象。现有方法通常难以有效利用视觉线索进行消歧,且多依赖单一的“头到指尖”方向假设,忽略了部分情况下“腕到指尖”方向更符合实际指向的情况,导致性能受限。其解决方案的关键在于提出一种双模型框架:分别基于头到指尖和腕到指尖的方向学习指向线索,并引入高斯射线热图表示作为强监督信号以增强模型对指向信息的关注;同时设计了基于CLIP特征的指向集成模块(CLIP-Aware Pointing Ensemble)融合两模型优势,并加入物体中心预测头作为辅助任务进一步提升定位精度。
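
“高斯射线热图”可以理解为:以指向射线(头到指尖或腕到指尖)为中心、按点到射线的垂直距离做高斯衰减的热图。以下生成方式为假设性示意:

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, fingertip, sigma=8.0):
    """Heatmap that decays with distance to the ray origin -> fingertip."""
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2) pixel coords
    o = np.asarray(origin, np.float32)
    d = np.asarray(fingertip, np.float32) - o
    d /= (np.linalg.norm(d) + 1e-8)
    t = np.clip((p - o) @ d, 0, None)                   # project, keep forward side only
    closest = o + t[..., None] * d                      # nearest point on the ray
    dist2 = ((p - closest) ** 2).sum(-1)
    return np.exp(-dist2 / (2 * sigma ** 2))

hm = gaussian_ray_heatmap(256, 256, origin=(60, 40), fingertip=(120, 100))
print(hm.shape, hm.max())
```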

链接: https://arxiv.org/abs/2507.21888
作者: Fevziye Irem Eyiokur,Dogucan Yaman,Hazım Kemal Ekenel,Alexander Waibel
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Istanbul Technical University (伊斯坦布尔技术大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
zh

[CV-28] Low-Cost Test-Time Adaptation for Robust Video Editing

【速读】:该论文旨在解决视频编辑中面临的两大核心问题:一是由于未能捕捉复杂运动模式导致的时序不一致性,二是因UNet骨干网络架构局限性引发的对简单文本提示的过拟合现象。解决方案的关键在于提出一种轻量级的测试时自适应(Test-Time Adaptation, TTA)框架Vid-TTA,其通过引入基于自监督的辅助任务,在推理阶段为每个测试视频进行个性化优化;具体包括两个创新机制:1)运动感知的帧重建机制,用于识别并保留关键运动区域以提升时序一致性;2)提示扰动与重构策略,增强模型对多样化文本描述的鲁棒性;上述模块由元学习驱动的动态损失平衡机制协同调度,可根据视频特征自适应调整优化过程,从而在保持低计算开销的前提下显著改善视频编辑质量。

链接: https://arxiv.org/abs/2507.21858
作者: Jianhui Wang,Yinda Chen,Yangfan He,Xinyuan Song,Yi Xin,Dapeng Zhang,Zhongwei Wan,Bin Li,Rongchao Zhang
机构: UESTC(电子科技大学); USTC(中国科学技术大学); University of Minnesota–Twin Cities(明尼苏达大学双城分校); Emory University(埃默里大学); Nanjing University(南京大学); Lanzhou University(兰州大学); Ohio State University, Columbus(俄亥俄州立大学哥伦布分校); Chinese Academy of Sciences(中国科学院); Peking University(北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. While learning-based methods can enhance editing quality, they typically demand substantial computational resources and are constrained by the scarcity of high-quality annotated data. In this paper, we present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks. Our approach incorporates a motion-aware frame reconstruction mechanism that identifies and preserves crucial movement regions, alongside a prompt perturbation and reconstruction strategy that strengthens model robustness to diverse textual descriptions. These innovations are orchestrated by a meta-learning driven dynamic loss balancing mechanism that adaptively adjusts the optimization process based on video characteristics. Extensive experiments demonstrate that Vid-TTA significantly improves video temporal consistency and mitigates prompt overfitting while maintaining low computational overhead, offering a plug-and-play performance boost for existing video editing models.
zh

[CV-29] Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection

【速读】:该论文旨在解决RGB-D视频显著性检测(RGB-D VSOD)中如何有效利用光流(optical flow)和深度信息辅助RGB模态进行显著性目标识别的问题。现有方法通常将光流与深度信息在模型设计上同等对待,忽略了二者在不同场景下的贡献差异,从而限制了运动和深度信息的潜力。解决方案的关键在于提出一种新颖的选择性跨模态融合框架(SMFNet),其核心包括两个创新模块:一是像素级选择性融合策略(PSF),可根据光流与深度的实际贡献动态优化融合过程;二是多维选择性注意力模块(MSAM),在多个维度上整合PSF融合后的特征与RGB模态特征,增强特征表示能力并生成更精细的显著图。该方法通过实证验证,在RDVS和DVisal等数据集上优于19种前沿模型,且在包含合成深度的五个视频基准数据集上也展现出优异性能。

链接: https://arxiv.org/abs/2507.21857
作者: Jiahao He,Daerji Suolang,Keren Fu,Qijun Zhao
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to TMM on 11-Jun-2024, ID: MM-020522, still in peer review

点击查看摘要

Abstract:Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD that has recently gained increasing interest, owing to the considerable performance gains of incorporating motion and depth, and the fact that RGB-D videos can now be easily captured in daily life. Existing RGB-D VSOD models have made different attempts to derive motion cues, among which extracting motion information explicitly from optical flow appears to be a more effective and promising alternative. Despite this, there remains a key issue: how to effectively utilize optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally with respect to model designs, without explicitly considering their unequal contributions in individual scenarios, limiting the potential of motion and depth. To address this issue and unleash the power of motion and depth, we propose a novel selective cross-modal fusion framework (SMFNet) for RGB-D VSOD, incorporating a pixel-level selective fusion strategy (PSF) that achieves optimal fusion of optical flow and depth based on their actual contributions. Besides, we propose a multi-dimensional selective attention module (MSAM) to integrate the fused features derived from PSF with the remaining RGB modality at multiple dimensions, effectively enhancing feature representation to generate refined features. We conduct a comprehensive evaluation of SMFNet against 19 state-of-the-art models on both RDVS and DVisal datasets, making the evaluation the most comprehensive RGB-D VSOD benchmark to date, and it also demonstrates the superiority of SMFNet over other models. Meanwhile, evaluation on five video benchmark datasets incorporating synthetic depth validates the efficacy of SMFNet as well. Our code and benchmark results are made publicly available at this https URL.
zh

[CV-30] Cross-Architecture Distillation Made Simple with Redundancy Suppression ICCV2025

【速读】:该论文旨在解决跨架构知识蒸馏(cross-architecture knowledge distillation)中现有方法因引入复杂模块、架构定制设计和过多参数而导致效率低下与适用性受限的问题。其解决方案的关键在于提出一种简化的冗余抑制蒸馏(Redundancy Suppression Distillation, RSD)损失函数,该损失函数通过最大化跨架构不变性(cross-architecture invariance maximisation)和特征去相关(feature decorrelation)目标,有效提取异构表示中的架构无关知识,同时通过轻量级解耦模块避免学生模型丧失自身架构特有的表征能力,从而在保持高效性的同时显著提升性能。
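
“跨架构不变性最大化 + 特征去相关”与 Barlow Twins 式目标同构:让师生特征(假设已投影到同一维度)的互相关矩阵趋近单位阵。以下为假设性示意,并非 RSD 的原始公式:

```python
import torch

def redundancy_suppression_loss(student_feat, teacher_feat, lam=5e-3):
    """Cross-correlation of batch-normalized student/teacher features:
    diagonal -> 1 (invariance), off-diagonal -> 0 (decorrelation)."""
    zs = (student_feat - student_feat.mean(0)) / (student_feat.std(0) + 1e-6)
    zt = (teacher_feat - teacher_feat.mean(0)) / (teacher_feat.std(0) + 1e-6)
    c = (zs.t() @ zt) / zs.size(0)                       # (D, D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

loss = redundancy_suppression_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```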

链接: https://arxiv.org/abs/2507.21844
作者: Weijia Zhang,Yuehao Liu,Wu Ran,Chao Ma
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025 (Highlight)

点击查看摘要

Abstract:We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximisation and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student’s internal representations. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA. It outperforms OFA on CIFAR-100 and ImageNet-1k benchmarks with only a fraction of their parameter overhead, which highlights its potential as a simple and strong baseline to the cross-architecture distillation community.
zh

[CV-31] Anyone Can Jailbreak: Prompt-Based Attacks on LLM s and T2Is

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和文本到图像(Text-to-Image, T2I)生成系统在内容安全机制上面临的提示攻击(prompt-based attacks)问题,尤其是由非专家用户通过低门槛、高效果的提示技巧实现的“越狱”(jailbreak)行为。其解决方案的关键在于提出一个统一的提示级越狱策略分类体系,涵盖多轮叙事升级、词汇伪装、隐含链式推理、虚构角色扮演及细微语义修改等五类典型方法,并基于对主流API的实际案例分析,揭示了从输入过滤到输出验证的整个内容审核流程均存在可被绕过的设计漏洞。研究强调需构建具备上下文感知能力的防御机制,以应对这些越狱手段在真实场景中易于复现的特性。

链接: https://arxiv.org/abs/2507.21820
作者: Ahmed B Mustafa,Zihan Ye,Yang Lu,Michael P Pound,Shreyank N Gowda
机构: University of Nottingham (诺丁汉大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. Unlike traditional adversarial examples requiring expert knowledge, many of today’s jailbreaks are low-effort, high-impact crafted by everyday users with nothing more than cleverly worded prompts. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs. Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies. We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.
zh

[CV-32] HunyuanWorld 1.0: Generating Immersive Explorable and Interactive 3D Worlds from Words or Pixels

【速读】:该论文旨在解决从文本或图像生成沉浸式、可交互的3D世界这一核心挑战,现有方法在3D一致性与渲染效率之间难以平衡:视频驱动的方法虽具多样性但缺乏几何一致性,而基于3D的方法则受限于训练数据不足和内存效率低的问题。解决方案的关键在于提出HunyuanWorld 1.0框架,其核心创新是采用语义分层的3D网格(mesh)表示,并利用全景图像作为360°世界代理(world proxies),实现语义感知的世界分解与重建,从而生成多样化且结构一致的3D场景;该方法同时支持全景沉浸体验、网格导出兼容传统图形管线以及解耦物体表示以增强交互性,显著提升了3D世界生成的连贯性、可探索性和实用性。

链接: https://arxiv.org/abs/2507.21809
作者: HunyuanWorld Team,Zhenwei Wang,Yuhao Liu,Junta Wu,Zixiao Gu,Haoyuan Wang,Xuhui Zuo,Tianyu Huang,Wenhuan Li,Sheng Zhang,Yihang Lian,Yulin Tsai,Lifu Wang,Sicong Liu,Puhua Jiang,Xianghui Yang,Dongyuan Guo,Yixuan Tang,Xinyue Mao,Jiaao Yu,Junlin Yu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Chao Zhang,Yonghao Tan,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Minghui Chen,Zhan Li,Wangchen Qin,Lei Wang,Yifu Sun,Lin Niu,Xiang Yuan,Xiaofeng Yang,Yingping He,Jie Xiao,Yangyu Tao,Jianchen Zhu,Jinbao Xue,Kai Liu,Chongqing Zhao,Xinming Wu,Tian Liu,Peng Chen,Di Wang,Yuhong Liu,Linus,Jie Jiang,Tengfei Wang,Chunchao Guo
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report; Project Page: this https URL

点击查看摘要

Abstract:Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
zh

[CV-33] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

【速读】:该论文旨在解决流匹配模型(flow matching models)在图像生成中人类偏好对齐时效率低下的问题,尤其是现有方法如FlowGRPO因需在马尔可夫决策过程(Markov Decision Process, MDP)的所有去噪步骤上进行采样与优化而导致的计算开销过大。其解决方案的关键在于提出MixGRPO框架,该框架通过融合随机微分方程(SDE)与常微分方程(ODE)的混合采样策略,引入滑动窗口机制:仅在窗口内使用SDE采样和GRPO引导优化,窗口外则采用ODE采样,从而将采样随机性限制在局部时间步,显著降低优化复杂度,并支持高阶求解器以提升采样速度。进一步地,作者还提出了高效变体MixGRPO-Flash,通过更激进的优化策略实现训练时间减少71%,同时保持性能相当。
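
滑动窗口机制可以示意为:窗口内的去噪步使用 SDE 采样并参与 GRPO 优化,窗口外使用 ODE 采样。以下调度骨架中的窗口大小与移动规则均为假设:

```python
def step_schedule(num_steps, window_start, window_size):
    """Return per-step sampler type: SDE inside the sliding window, ODE outside."""
    return ["SDE" if window_start <= t < window_start + window_size else "ODE"
            for t in range(num_steps)]

num_steps, window = 10, 4
for it in range(3):                   # the window slides as training proceeds
    start = (it * 2) % num_steps
    plan = step_schedule(num_steps, start, window)
    print(f"iter {it}: {plan}")
    # gradient / GRPO updates would only be taken on the 'SDE' steps
```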

链接: https://arxiv.org/abs/2507.21802
作者: Junzhe Li,Yutao Cui,Tao Huang,Yinping Ma,Chun Fan,Miles Yang,Zhao Zhong
机构: Hunyuan, Tencent(腾讯); School of Computer Science, Peking University(北京大学计算机学院); Computer Center, Peking University(北京大学计算中心)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at this https URL.
zh

[CV-34] Distribution-Based Masked Medical Vision-Language Model Using Structured Reports MICCAI

【速读】:该论文旨在解决现有医学图像-文本预训练模型在处理医学数据固有的变异性与模糊性时表现不足的问题,从而限制了其对细微临床信息和不确定性的捕捉能力。解决方案的关键在于提出一种不确定性感知的医学图像-文本预训练模型,通过引入由大语言模型(Large Language Model, LLM)生成的结构化文本报告来增强图像数据的临床语义上下文,该报告包含疾病定义、关键区域描述(appearance)、观察与诊断(observations and verdicts)三部分,同时建模跨模态(inter-modal)与模态内(intra-modal)不确定性,从而提升模型在下游任务中的表征能力和泛化性能。

链接: https://arxiv.org/abs/2507.21794
作者: Shreyank N Gowda,Ruichi Zhang,Xiao Gu,Ying Weng,Lu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in MICCAI-W 2025

点击查看摘要

Abstract:Medical image-language pre-training aims to align medical images with clinically relevant text to improve model performance on various downstream tasks. However, existing models often struggle with the variability and ambiguity inherent in medical data, limiting their ability to capture nuanced clinical information and uncertainty. This work introduces an uncertainty-aware medical image-text pre-training model that enhances generalization capabilities in medical image analysis. Building on previous methods and focusing on Chest X-Rays, our approach utilizes structured text reports generated by a large language model (LLM) to augment image data with clinically relevant context. These reports begin with a definition of the disease, followed by the 'appearance' section to highlight critical regions of interest, and finally 'observations' and 'verdicts' that ground model predictions in clinical semantics. By modeling both inter- and intra-modal uncertainty, our framework captures the inherent ambiguity in medical images and text, yielding improved representations and performance on downstream tasks. Our model demonstrates significant advances in medical image-text pre-training, obtaining state-of-the-art performance on multiple downstream tasks.
zh

[CV-35] MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning

【速读】:该论文旨在解决视觉语言预训练模型(Vision-Language Pre-trained Models, VLMs)在少样本场景下对新类别(novel classes)泛化能力不足的问题,其根源在于对已见类别的过拟合以及通用知识的遗忘。解决方案的关键在于提出一种多语义引导的上下文优化框架(Multiple Semantic-Guided Context Optimization, MSGCoOp),通过并行可学习的上下文向量集合捕捉多样化的语义特征,并引入由大语言模型(Large Language Model, LLM)自动生成的类别描述作为语义引导机制,以增强提示的语义丰富性;同时设计多样性正则化损失函数,促使不同提示学习互补且正交的特征表示,避免冗余,从而在保持计算效率的同时显著提升基类到新类的泛化性能。
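
“互补且正交”的多样性正则可以通过惩罚上下文向量两两之间的余弦相似度来实现,以下为假设性的正则项示意(非论文原式):

```python
import torch
import torch.nn.functional as F

def diversity_regularization(contexts):
    """contexts: (K, D) learnable context vectors; penalize pairwise similarity."""
    z = F.normalize(contexts, dim=-1)
    sim = z @ z.t()                                   # (K, K) cosine similarities
    off = sim - torch.eye(len(z), device=z.device)    # zero out self-similarity
    return off.pow(2).sum() / (len(z) * (len(z) - 1))

ctx = torch.randn(4, 512, requires_grad=True)  # K=4 parallel prompt contexts
loss = diversity_regularization(ctx)
loss.backward()                                # pushes prompts toward orthogonality
print(loss.item())
```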

链接: https://arxiv.org/abs/2507.21786
作者: Zhaolong Wang,Tongfeng Sun,Mingzheng Du,Yachao Huang
机构: China University of Mining and Technology (中国矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with generalization to novel classes, a phenomenon attributed to overfitting on seen classes and forgetting general knowledge. Furthermore, recent approaches that improve generalization often introduce complex architectures or heavy computational overhead. In this paper, we propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization while maintaining computational efficiency. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. To enrich these prompts, we introduce a semantic guidance mechanism that aligns them with comprehensive class descriptions automatically generated by a Large Language Model (LLM). Furthermore, a diversity regularization loss encourages the prompts to learn complementary and orthogonal features, preventing them from collapsing into redundant representations. Extensive experiments on 11 benchmark datasets show that MSGCoOp significantly improves performance on base-to-novel generalization, achieving an average harmonic mean improvement of 1.10% over the strong KgCoOp baseline. Our method also demonstrates enhanced robustness in cross-domain generalization tasks. Our code is available at this https URL.
zh

[CV-36] AU-LLM : Micro-Expression Action Unit Detection via Enhanced LLM -Based Feature Fusion

【速读】:该论文旨在解决微表情动作单元(Action Units, AUs)检测这一在情感计算中极具挑战性的问题,尤其针对低强度、数据稀缺场景下传统方法性能受限的困境。其解决方案的关键在于提出了一种名为AU-LLM的新框架,首次将大语言模型(Large Language Models, LLMs)引入微表情AU检测任务,并通过设计**增强融合投影器(Enhanced Fusion Projector, EFP)**来缓解视觉-语言语义鸿沟问题。EFP利用多层感知机(MLP)智能融合来自专用3D-CNN主干网络的中层(局部纹理)与高层(全局语义)视觉特征,生成一个信息密集的紧凑token表示,从而有效赋能LLM对细微面部肌肉运动进行精细化推理。在CASME II和SAMM基准数据集上的严格评估(包括LOSO和跨域协议)验证了该方法的先进性和鲁棒性。
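
EFP 的核心是用 MLP 将中层(纹理)与高层(语义)视觉特征融合为单个落在 LLM 嵌入空间的 token,可示意如下(各维度与结构均为假设):

```python
import torch
import torch.nn as nn

class EnhancedFusionProjector(nn.Module):
    """Sketch: fuse mid-level (texture) and high-level (semantic) features
    into one token in the LLM embedding space."""
    def __init__(self, mid_dim, high_dim, llm_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(mid_dim + high_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, mid_feat, high_feat):  # (B, mid_dim), (B, high_dim)
        return self.mlp(torch.cat([mid_feat, high_feat], dim=-1))  # (B, llm_dim)

token = EnhancedFusionProjector(256, 512, 4096)(torch.randn(2, 256), torch.randn(2, 512))
print(token.shape)  # one visual token per clip, prepended to the LLM input
```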

链接: https://arxiv.org/abs/2507.21778
作者: Zhishu Liu,Kaishen Yuan,Bo Zhao,Yong Xu,Zitong Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing AU-LLM, a novel framework that for the first time uses an LLM to detect AUs in micro-expression datasets, where intensities are subtle and data are scarce. We specifically address the critical vision-language semantic gap with the Enhanced Fusion Projector (EFP). The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements. In extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at this https URL.
zh
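The EFP is described only at a high level (an MLP fusing two feature levels into one token). As an illustration, here is a minimal PyTorch sketch of that idea; all dimensions and the two-layer MLP shape are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    """Hypothetical EFP-style projector: fuse a mid-level (local texture)
    and a high-level (global semantic) feature vector from a 3D-CNN
    backbone into a single token in the LLM embedding space."""
    def __init__(self, mid_dim=512, high_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(mid_dim + high_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, mid_feat, high_feat):
        # Concatenate the two feature levels, then project to one
        # information-dense token for the LLM.
        fused = torch.cat([mid_feat, high_feat], dim=-1)   # (B, mid+high)
        return self.mlp(fused).unsqueeze(1)                # (B, 1, llm_dim)

proj = FusionProjector()
token = proj(torch.randn(2, 512), torch.randn(2, 1024))
print(token.shape)  # torch.Size([2, 1, 4096])
```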

[CV-37] MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions

【Quick Read】: This paper addresses the efficiency bottleneck of standard Vision Transformers (ViTs), whose parameter redundancy and high computational cost hinder practical deployment. Existing efficient-ViT methods rely mainly on static model compression or token-level sparsification, but apply a fixed computational depth to every token and so cannot allocate resources flexibly. The key idea is **MoR-ViT**, the first ViT framework to incorporate a token-level dynamic recursion mechanism inspired by Mixture-of-Recursions (MoR): each token adaptively decides its own processing depth, giving input-dependent allocation of compute. On ImageNet-1K and transfer benchmarks this yields up to 70% parameter reduction and 2.5x inference acceleration while matching or surpassing leading efficient ViTs such as DynamicViT and TinyViT, establishing dynamic recursion as an effective strategy for efficient vision transformers.

Link: https://arxiv.org/abs/2507.21761
Authors: YiZhou Li
Affiliations: XJTLU (Xi'an Jiaotong-Liverpool University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures

Click to view abstract

Abstract:Vision Transformers (ViTs) have achieved remarkable success in image recognition, yet standard ViT architectures are hampered by substantial parameter redundancy and high computational cost, limiting their practical deployment. While recent efforts on efficient ViTs primarily focus on static model compression or token-level sparsification, they remain constrained by fixed computational depth for all tokens. In this work, we present MoR-ViT, a novel vision transformer framework that, for the first time, incorporates a token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions (MoR) paradigm. This approach enables each token to adaptively determine its processing depth, yielding a flexible and input-dependent allocation of computational resources. Extensive experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration, but also outperforms leading efficient ViT baselines such as DynamicViT and TinyViT under comparable conditions. These results establish dynamic recursion as an effective strategy for efficient vision transformers and open new avenues for scalable and deployable deep learning models in real-world scenarios.
zh
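To make the token-level dynamic recursion concrete, here is a minimal PyTorch sketch under stated assumptions: a shared block is re-applied per token until a learned router lets the token exit or a depth cap is hit. The threshold-based routing rule is an illustrative assumption, not the paper's exact MoR mechanism.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Sketch of token-level dynamic recursion: re-apply one shared
    transformer block to each still-active token, with a router
    deciding per token whether to continue."""
    def __init__(self, dim=192, max_depth=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.router = nn.Linear(dim, 1)   # per-token "keep recursing" score
        self.max_depth = max_depth

    def forward(self, x):                  # x: (B, N, D)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            y = self.block(x)
            # Only tokens still marked active take this recursion step.
            x = torch.where(active.unsqueeze(-1), y, x)
            keep = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep
        return x

print(RecursiveBlock()(torch.randn(2, 16, 192)).shape)  # torch.Size([2, 16, 192])
```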

[CV-38] LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection

【Quick Read】: This paper targets driver fatigue detection on embedded robotic devices (such as intelligent vehicles), where limited compute makes low-latency real-time detection difficult; existing deep learning approaches are typically too complex and slow for such resource-constrained settings. The key solution is **LiteFat**, a lightweight spatio-temporal graph learning model: facial landmark detection converts video streams into spatio-temporal graphs (STGs), MobileNet extracts facial features to build the STG feature matrix, and a lightweight spatio-temporal graph neural network performs efficient fatigue recognition, keeping accuracy high while sharply reducing computational complexity and latency.

Link: https://arxiv.org/abs/2507.21756
Authors: Jing Ren,Suyu Ma,Hong Jia,Xiwei Xu,Ivan Lee,Haytham Fayek,Xiaodong Li,Feng Xia
Affiliations: RMIT University; CSIRO's Data61; University of Auckland; University of South Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure

Click to view abstract

Abstract:Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for embedded robotic devices with limited resources (such as intelligent vehicles/cars) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat involves converting streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. LiteFat uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly decreasing computational complexity and latency as compared to current state-of-the-art methods. This work enables the development of real-time, resource-efficient human fatigue detection systems that can be implemented upon embedded robotic devices.
zh

[CV-39] Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards ICCV2025

【Quick Read】: This paper addresses the scarcity and cost of annotated data in remote sensing, which makes it hard to adapt large vision-language models (VLMs) to specialized tasks. The key solution is the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery, which dispenses with caption supervision and relies only on lightweight rule-based binary or IoU-based rewards. Transferring the "1-shot RLVR" paradigm from language models to VLMs, it applies policy-gradient optimization with as little as one curated example and substantially improves satellite-image understanding (classification, visual question answering, grounding); scaling to 128 examples matches or exceeds models trained on thousands of annotated samples, offering an efficient, well-generalizing recipe for data-scarce domains.

Link: https://arxiv.org/abs/2507.21745
Authors: Aybora Koksal,A. Aydin Alatan
Affiliations: Middle East Technical University (METU)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL). 10 pages, 3 figures, 6 tables. Our model, training code and dataset will be at this https URL

Click to view abstract

Abstract:Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision–relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the “1-shot RLVR” paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks–including classification, visual question answering, and grounding–show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
zh
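The paper's rewards are rule-based and verifiable (binary or IoU-based). As an illustration, here is a minimal sketch of an IoU-thresholded binary reward for a grounding task; the 0.5 threshold and box format are assumptions.

```python
def iou_reward(pred_box, gt_box, threshold=0.5):
    """Rule-based verifiable reward: 1.0 if the predicted box overlaps
    the ground truth above a threshold, else 0.0. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    iou = inter / (area_p + area_g - inter + 1e-9)
    return 1.0 if iou >= threshold else 0.0

print(iou_reward((10, 10, 50, 50), (12, 8, 48, 52)))  # 1.0 (IoU ~ 0.83)
```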

[CV-40] Adversarial Reconstruction Feedback for Robust Fine-grained Generalization ICCV2025

【Quick Read】: This paper addresses a limitation of fine-grained image retrieval (FGIR) methods that rely on predefined category supervision: category-specific semantics leak into the retrieval representation and severely limit generalization to unseen classes. The key solution is **AdvRF**, an adversarial reconstruction feedback framework that reformulates FGIR as a visual discrepancy reconstruction task, synergizing category-aware discrepancy localization from the retrieval model with category-agnostic feature learning from the reconstruction model: the reconstruction model exposes residual discrepancies the retrieval model overlooks, improving its localization accuracy, while refined signals from the retrieval model in turn strengthen the reconstruction model. The resulting category-agnostic discrepancy representation is transferred back to the retrieval model via knowledge distillation for efficient deployment.

Link: https://arxiv.org/abs/2507.21742
Authors: Shijie Wang,Jian Shi,Haojie Li
Affiliations: Shandong University of Science and Technology; Dalian University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Click to view abstract

Abstract:Existing fine-grained image retrieval (FGIR) methods predominantly rely on supervision from predefined categories to learn discriminative representations for retrieving fine-grained objects. However, they inadvertently introduce category-specific semantics into the retrieval representation, creating semantic dependencies on predefined classes that critically hinder generalization to unseen categories. To tackle this, we propose AdvRF, a novel adversarial reconstruction feedback framework aimed at learning category-agnostic discrepancy representations. Specifically, AdvRF reformulates FGIR as a visual discrepancy reconstruction task via synergizing category-aware discrepancy localization from retrieval models with category-agnostic feature learning from reconstruction models. The reconstruction model exposes residual discrepancies overlooked by the retrieval model, forcing it to improve localization accuracy, while the refined signals from the retrieval model guide the reconstruction model to improve its reconstruction ability. Consequently, the retrieval model localizes visual differences, while the reconstruction model encodes these differences into category-agnostic representations. This representation is then transferred to the retrieval model through knowledge distillation for efficient deployment. Quantitative and qualitative evaluations demonstrate that our AdvRF achieves impressive performance on both widely-used fine-grained and coarse-grained datasets.
zh

[CV-41] MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces

【Quick Read】: This paper addresses the spatial and semantic information loss that visual data suffers after encoding in multimodal learning, which directly affects the coupling between visual encoders and large language models and thus limits large multimodal model performance; existing methods often suffer vector gaps or semantic disparities that distort information during propagation. The key solution is **MAGE** (Multimodal Alignment and Generation Enhancement): an Intelligent Alignment Network (IAN) performs dimensional and semantic cross-modal alignment, and a training strategy combining cross-entropy and mean squared error narrows the gap between synonymous heterogeneous data. To strengthen the model's "Any-to-Any" tool-calling ability, the authors also build a multimodal instruction-tuning dataset for tool calls, yielding clear gains on benchmarks such as MME, MMBench, and SEED.

Link: https://arxiv.org/abs/2507.21741
Authors: Shaojun E,Yuchen Yang,Jiaheng Wu,Yan Zhang,Tiejun Zhao,Ziyan Chen
Affiliations: Global Tone Communication Technology Co., Ltd.; Harbin Institute of Technology; Beijing Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages

Click to view abstract

Abstract:In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: this https URL.
zh
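The abstract names a combined cross-entropy plus mean-squared-error training objective. Here is a minimal sketch of one such combination, with the weighting and feature pairing as assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_logits, target_ids, pred_feat, target_feat, lam=1.0):
    """Combined objective in the spirit of MAGE's training strategy:
    token-level cross-entropy plus MSE between aligned visual and
    textual features. `lam` is an assumed weighting."""
    ce = F.cross_entropy(pred_logits.flatten(0, 1), target_ids.flatten())
    mse = F.mse_loss(pred_feat, target_feat)
    return ce + lam * mse

logits = torch.randn(2, 8, 32000)           # (batch, seq, vocab)
ids = torch.randint(0, 32000, (2, 8))
v, t = torch.randn(2, 4096), torch.randn(2, 4096)
print(alignment_loss(logits, ids, v, t).item())
```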

[CV-42] SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking

【Quick Read】: This paper addresses two key problems in visual object tracking (VOT): template-matching methods ignore temporal dependencies across frames, while autoregressive methods become biased toward training categories and generalize poorly to unseen classes; existing SAM2-based adaptations also struggle with occlusions and distractors and lack any mechanism to intercept the propagation of tracking errors. The key contributions of **SAMITE**, built on SAM2, are: (1) a Prototypical Memory Bank that quantifies the feature-wise and position-wise correctness of each frame's tracking result and selects the best frames to condition subsequent ones, naturally filtering low-quality features caused by occlusion or distraction and cutting off error propagation; and (2) a Positional Prompt Generator that produces explicit positional mask prompts to further suppress distractors and improve tracking accuracy.

Link: https://arxiv.org/abs/2507.21732
Authors: Qianxiong Xu,Lanyun Zhu,Chenxi Liu,Guosheng Lin,Cheng Long,Ziyue Li,Rui Zhao
Affiliations: Nanyang Technological University; Singapore University of Technology and Design; University of Cologne; SenseTime Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame’s tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code is available at this https URL.
zh

[CV-43] Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations

【Quick Read】: This paper addresses the unclear functional roles of internal components in detection transformers: what query embeddings, encoder and decoder multi-head self-attention (MHSA), and decoder multi-head cross-attention (MHCA) each contribute to classification and regression is poorly understood, which hampers transparency and efficiency. Inspired by neuroscientific ablation studies, the key idea is to systematically remove ("lesion") these components and quantify the change in gIoU and F1-score on COCO, revealing how sensitive DETR, DDETR, and DINO are to each missing component, exposing structural redundancies and sources of robustness, and informing model simplification.

Link: https://arxiv.org/abs/2507.21723
Authors: Nils Hütten,Florian Hölken,Hasan Tercan,Tobias Meisen
Affiliations: Institute for Technologies and Management of Digital Transformation (TMDT); University of Wuppertal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In recent years, Explainable AI has gained traction as an approach to enhancing model interpretability and transparency, particularly in complex models such as detection transformers. Despite rapid advancements, a substantial research gap remains in understanding the distinct roles of internal components - knowledge that is essential for improving transparency and efficiency. Inspired by neuroscientific ablation studies, which investigate the functions of brain regions through selective impairment, we systematically analyze the impact of ablating key components in three state-of-the-art detection transformer models: Detection transformer (DETR), deformable detection transformer (DDETR), and DETR with improved denoising anchor boxes (DINO). The ablations target query embeddings, encoder and decoder multi-head self-attentions (MHSA) as well as decoder multi-head cross-attention (MHCA) layers. We evaluate the effects of these ablations on the performance metrics gIoU and F1-score, quantifying effects on both the classification and regression sub-tasks on the COCO dataset. To facilitate reproducibility and future research, we publicly release the DeepDissect library. Our findings reveal model-specific resilience patterns: while DETR is particularly sensitive to ablations in encoder MHSA and decoder MHCA, DDETR's multi-scale deformable attention enhances robustness, and DINO exhibits the greatest resilience due to its look-forward twice update rule, which helps distribute knowledge across blocks. These insights also expose structural redundancies, particularly in DDETR's and DINO's decoder MHCA layers, highlighting opportunities for model simplification without sacrificing performance. This study advances XAI for DETRs by clarifying the contributions of internal components to model performance, offering insights to optimize and improve transparency and efficiency in critical applications.
zh
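The ablation methodology (zeroing a component's output while leaving the rest of the network intact) can be sketched generically with a PyTorch forward hook. The toy encoder layer below stands in for a DETR-family attention module; the zeroing hook is one plausible reading of "lesioning", not the DeepDissect implementation.

```python
import torch
import torch.nn as nn

def ablate(module):
    """Register a forward hook that zeroes a module's output,
    mimicking a neuroscience-style lesion of that component."""
    def hook(_m, _inp, out):
        if isinstance(out, tuple):   # MultiheadAttention returns (out, weights)
            return (torch.zeros_like(out[0]),) + out[1:]
        return torch.zeros_like(out)
    return module.register_forward_hook(hook)

# dropout=0 makes the two forward passes directly comparable
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dropout=0.0, batch_first=True)
handle = ablate(layer.self_attn)          # knock out this layer's MHSA
x = torch.randn(1, 10, 256)
y_ablated = layer(x)
handle.remove()                           # restore normal behaviour
y_normal = layer(x)
print((y_ablated - y_normal).abs().max() > 0)  # tensor(True): outputs differ
```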

[CV-44] Impact of Underwater Image Enhancement on Feature Matching

【Quick Read】: This paper addresses the lack of reliable evaluation standards for underwater image enhancement in practice, particularly regarding its effect on downstream tasks such as frame matching and SLAM; existing work mostly relies on subjective visual quality, which does not quantify the real contribution to those tasks. The key solution is a quantitative evaluation framework based on local matching stability and the furthest matchable frame, combined with a practical matching strategy, providing a robust, context-aware benchmark for comparing enhancement methods in real underwater environments.

Link: https://arxiv.org/abs/2507.21715
Authors: Jason M. Summers,Mark W. Jones
Affiliations: Swansea University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We introduce local matching stability and furthest matchable frame as quantitative measures for evaluating the success of underwater image enhancement. This enhancement process addresses visual degradation caused by light absorption, scattering, marine growth, and debris. Enhanced imagery plays a critical role in downstream tasks such as path detection and autonomous navigation for underwater vehicles, relying on robust feature extraction and frame matching. To assess the impact of enhancement techniques on frame-matching performance, we propose a novel evaluation framework tailored to underwater environments. Through metric-based analysis, we identify strengths and limitations of existing approaches and pinpoint gaps in their assessment of real-world applicability. By incorporating a practical matching strategy, our framework offers a robust, context-aware benchmark for comparing enhancement methods. Finally, we demonstrate how visual improvements affect the performance of a complete real-world algorithm – Simultaneous Localization and Mapping (SLAM) – reinforcing the framework’s relevance to operational underwater scenarios.
zh
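The "furthest matchable frame" metric lends itself to a short sketch: starting from a reference frame, find the most distant frame that still yields enough feature matches. The OpenCV-based version below (requires opencv-python) uses ORB with a ratio test; the detector choice and thresholds are assumptions, not the paper's exact protocol.

```python
import cv2

def furthest_matchable_frame(frames, min_matches=30):
    """Given a list of grayscale uint8 frames, return the index of the
    most distant frame that still matches frame 0 with at least
    `min_matches` good ORB correspondences (Lowe ratio test)."""
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)
    _, des0 = orb.detectAndCompute(frames[0], None)
    furthest = 0
    for i, frame in enumerate(frames[1:], start=1):
        _, des = orb.detectAndCompute(frame, None)
        if des is None or des0 is None:
            continue
        pairs = bf.knnMatch(des0, des, k=2)
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
        if len(good) >= min_matches:
            furthest = i
    return furthest
```

Running this on raw and enhanced versions of the same sequence gives a direct, task-relevant comparison of enhancement methods.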

[CV-45] Semantics versus Identity: A Divide-and-Conquer Approach towards Adjustable Medical Image De-Identification ICCV2025

【Quick Read】: This paper addresses privacy risks from re-identification (ReID) in medical imaging while overcoming the shortcomings of existing de-identification (DeID) methods, which neither preserve medical semantics well nor allow flexible privacy levels. The key solution is a divide-and-conquer framework: an Identity-Blocking step blocks varying proportions of identity-related regions to realize different privacy levels, and a Medical-Semantics-Compensation step uses pre-trained Medical Foundation Models (MFMs) to extract medical semantic features that compensate the blocked regions, preserving diagnostic utility. Because MFM features may still carry residual identity information, a Minimum Description Length (MDL)-based feature decoupling strategy separates and discards the identity components, ensuring reliable de-identification.

Link: https://arxiv.org/abs/2507.21703
Authors: Yuan Tian,Shuo Wang,Rongzhao Zhang,Zijian Chen,Yankai Jiang,Chunyi Li,Xiangyang Zhu,Fang Yan,Qiang Hu,XiaoSong Wang,Guangtao Zhai
Affiliations: Shanghai AI Laboratory; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV2025

Click to view abstract

Abstract:Medical imaging has significantly advanced computer-aided diagnosis, yet its re-identification (ReID) risks raise critical privacy concerns, calling for de-identification (DeID) techniques. Unfortunately, existing DeID methods neither particularly preserve medical semantics, nor are flexibly adjustable towards different privacy levels. To address these issues, we propose a divide-and-conquer framework comprising two steps: (1) Identity-Blocking, which blocks varying proportions of identity-related regions, to achieve different privacy levels; and (2) Medical-Semantics-Compensation, which leverages pre-trained Medical Foundation Models (MFMs) to extract medical semantic features to compensate the blocked regions. Moreover, recognizing that features from MFMs may still contain residual identity information, we introduce a Minimum Description Length principle-based feature decoupling strategy, to effectively decouple and discard such identity components. Extensive evaluations against existing approaches across seven datasets and three downstream tasks demonstrate our state-of-the-art performance.
zh

[CV-46] APT: Improving Diffusion Models for High Resolution Image Generation with Adaptive Path Tracing

【Quick Read】: This paper addresses the performance bottleneck of Latent Diffusion Models (LDMs) in high-resolution image generation caused by their fixed training resolution. Training-based fixes require large data and compute; training-free patch-based methods are practical but suffer from two key issues, "patch-level distribution shift" and "increased patch monotonicity", which blur details and degrade quality. The core solution is the **Adaptive Path Tracing (APT)** framework: (1) Statistical Matching keeps patch distributions consistent in the upsampled latents, mitigating the distribution shift; (2) Scale-aware Scheduling reduces patch monotonicity and improves the denoising path. APT not only sharpens detail in high-resolution images but also enables a shortcut denoising process, giving faster inference with minimal quality loss.

Link: https://arxiv.org/abs/2507.21690
Authors: Sangmin Han,Jinho Jeong,Jinwoo Kim,Seon Joo Kim
Affiliations: Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Latent Diffusion Models (LDMs) are generally trained at fixed resolutions, limiting their capability when scaling up to high-resolution images. While training-based approaches address this limitation by training on high-resolution datasets, they require large amounts of data and considerable computational resources, making them less practical. Consequently, training-free methods, particularly patch-based approaches, have become a popular alternative. These methods divide an image into patches and fuse the denoising paths of each patch, showing strong performance on high-resolution generation. However, we observe two critical issues for patch-based approaches, which we call "patch-level distribution shift" and "increased patch monotonicity." To address these issues, we propose Adaptive Path Tracing (APT), a framework that combines Statistical Matching to ensure patch distributions remain consistent in upsampled latents and Scale-aware Scheduling to deal with the patch monotonicity. As a result, APT produces clearer and more refined details in high-resolution images. In addition, APT enables a shortcut denoising process, resulting in faster sampling with minimal quality degradation. Our experimental results confirm that APT produces more detailed outputs with improved inference speed, providing a practical approach to high-resolution image generation.
zh
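Statistical Matching lends itself to a short sketch: renormalize each upsampled latent patch so its statistics follow a reference latent. Which statistics APT matches is not stated in the abstract; the channel-wise mean/std version below is an assumption.

```python
import torch

def statistical_matching(patch, ref, eps=1e-6):
    """Renormalise a latent patch so its spatial mean/std per channel
    follow a reference latent, countering patch-level distribution
    shift. A minimal sketch, not APT's exact procedure."""
    dims = (-2, -1)                                    # spatial dims of (C, H, W)
    mu_p, std_p = patch.mean(dims, keepdim=True), patch.std(dims, keepdim=True)
    mu_r, std_r = ref.mean(dims, keepdim=True), ref.std(dims, keepdim=True)
    return (patch - mu_p) / (std_p + eps) * std_r + mu_r

patch = torch.randn(4, 64, 64) * 3 + 1                 # shifted patch latent
ref = torch.randn(4, 64, 64)                           # e.g. low-res reference
matched = statistical_matching(patch, ref)
print(round(matched.mean().item(), 2), round(matched.std().item(), 2))  # ~0.0, ~1.0
```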

[CV-47] Automated Detection of Antarctic Benthic Organisms in High-Resolution In Situ Imagery to Aid Biodiversity Monitoring ICCV2025

【Quick Read】: This paper addresses the difficulty of large-scale analysis in Antarctic benthic biodiversity monitoring, where high-resolution imagery is plentiful but manual annotation is slow and requires expertise; core challenges include limited labeled data, large variation in object size, and complex seafloor structure. The key solution is a tailored object detection framework combining resolution-preserving patching, spatial data augmentation, fine-tuning, and postprocessing via Slicing Aided Hyper Inference. It detects medium and large organisms effectively across 25 fine-grained morphotypes, considerably more than prior work in this area, and provides a scalable foundation for future machine-assisted in situ benthic biodiversity monitoring.

Link: https://arxiv.org/abs/2507.21665
Authors: Cameron Trotter,Huw Griffiths,Tasnuva Ming Khan,Rowan Whittle
Affiliations: British Antarctic Survey; University of Cambridge
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025's Joint Workshop on Marine Vision (ICCVW, CVAUIAAMVEM). Main paper (11 pages, 3 figures, 3 tables) plus supplementary (7 pages, 5 figures, 2 tables)

Click to view abstract

Abstract:Monitoring benthic biodiversity in Antarctica is vital for understanding ecological change in response to climate-driven pressures. This work is typically performed using high-resolution imagery captured in situ, though manual annotation of such data remains laborious and specialised, impeding large-scale analysis. We present a tailored object detection framework for identifying and classifying Antarctic benthic organisms in high-resolution towed camera imagery, alongside the first public computer vision dataset for benthic biodiversity monitoring in the Weddell Sea. Our approach addresses key challenges associated with marine ecological imagery, including limited annotated data, variable object sizes, and complex seafloor structure. The proposed framework combines resolution-preserving patching, spatial data augmentation, fine-tuning, and postprocessing via Slicing Aided Hyper Inference. We benchmark multiple object detection architectures and demonstrate strong performance in detecting medium and large organisms across 25 fine-grained morphotypes, significantly more than other works in this area. Detection of small and rare taxa remains a challenge, reflecting limitations in current detection architectures. Our framework provides a scalable foundation for future machine-assisted in situ benthic biodiversity monitoring research.
zh

[CV-48] The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM

【Quick Read】: This paper addresses the fragmented methodology, unclear technical trajectory, and lack of a systematic review in video anomaly detection (VAD) in the era of large models. The key contribution is a unified framework covering both deep neural network (DNN)-based and large language model (LLM)-based VAD methods, together with a new taxonomy and an in-depth analysis of how LLMs are reshaping the VAD paradigm, clarifying the characteristics, strengths, and limitations of current MLLM/LLM-based approaches and outlining key challenges and future directions for the field.

Link: https://arxiv.org/abs/2507.21649
Authors: Shibo Gao,Peipei Yang,Haiyang Guo,Yangyang Liu,Yi Chen,Shuai Li,Han Zhu,Jian Xu,Xu-Yao Zhang,Linlin Huang
Affiliations: Beijing Jiaotong University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy, Beijing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos, serving as a core technology in the fields of intelligent surveillance and public safety. With the advancement of deep learning, the continuous evolution of deep model architectures has driven innovation in VAD methodologies, significantly enhancing feature representation and scene adaptability, thereby improving algorithm generalization and expanding application boundaries. More importantly, the rapid development of multi-modal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field. Under the support of MLLMs and LLMs, VAD has undergone significant transformations in terms of data annotation, input modalities, model architectures, and task objectives. The surge in publications and the evolution of tasks have created an urgent need for systematic reviews of recent advancements. This paper presents the first comprehensive survey analyzing VAD methods based on MLLMs and LLMs, providing an in-depth discussion of the changes occurring in the VAD field in the era of large models and their underlying causes. Additionally, this paper proposes a unified framework that encompasses both deep neural network (DNN)-based and LLM-based VAD methods, offering a thorough analysis of the new VAD paradigms empowered by LLMs, constructing a classification system, and comparing their strengths and weaknesses. Building on this foundation, this paper focuses on current VAD methods based on MLLMs/LLMs. Finally, based on the trajectory of technological advancements and existing bottlenecks, this paper distills key challenges and outlines future research directions, offering guidance for the VAD community.
zh

[CV-49] GuidPaint: Class-Guided Image Inpainting with Diffusion Models

【Quick Read】: This paper addresses the lack of fine-grained semantic control over masked regions in existing training-free diffusion-based image inpainting: context-aware methods exploit the model's priors for high-quality inpainting without retraining, but offer little precise guidance and often produce semantically inconsistent or visually implausible content. The key solution, **GuidPaint**, injects classifier guidance into the denoising process to precisely control intermediate generations within the masked areas, ensuring both semantic consistency and visual realism; it further combines stochastic and deterministic sampling so users can pick a preferred intermediate result and refine it deterministically.

Link: https://arxiv.org/abs/2507.21627
Authors: Qimin Wang,Xinda Liu,Guohua Geng
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In recent years, diffusion models have been widely adopted for image inpainting tasks due to their powerful generative capabilities, achieving impressive results. Existing multimodal inpainting methods based on diffusion models often require architectural modifications and retraining, resulting in high computational cost. In contrast, context-aware diffusion inpainting methods leverage the model’s inherent priors to adjust intermediate denoising steps, enabling high-quality inpainting without additional training and significantly reducing computation. However, these methods lack fine-grained control over the masked regions, often leading to semantically inconsistent or visually implausible content. To address this issue, we propose GuidPaint, a training-free, class-guided image inpainting framework. By incorporating classifier guidance into the denoising process, GuidPaint enables precise control over intermediate generations within the masked areas, ensuring both semantic consistency and visual realism. Furthermore, it integrates stochastic and deterministic sampling, allowing users to select preferred intermediate results and deterministically refine them. Experimental results demonstrate that GuidPaint achieves clear improvements over existing context-aware inpainting methods in both qualitative and quantitative evaluations.
zh
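Classifier guidance inside a denoising step can be sketched compactly: nudge the predicted noise with the gradient of the target class log-probability with respect to the noisy latent (the Dhariwal and Nichol formulation). The sketch below omits noise schedules and the masking to the inpainting region, and feeds the noisy latent straight into a toy classifier; all of that is a simplification relative to GuidPaint.

```python
import torch
import torch.nn as nn

def guided_step(x_t, eps_pred, classifier, target_class, scale=2.0):
    """One classifier-guided correction of a predicted noise estimate.
    A minimal sketch: real pipelines use a noise-aware classifier and
    scale the gradient by the current noise level."""
    x = x_t.detach().requires_grad_(True)
    log_prob = torch.log_softmax(classifier(x), dim=-1)[:, target_class].sum()
    grad = torch.autograd.grad(log_prob, x)[0]     # d log p(y|x_t) / d x_t
    return eps_pred - scale * grad                 # steer denoising toward the class

clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy classifier
x_t, eps = torch.randn(1, 3, 8, 8), torch.randn(1, 3, 8, 8)
print(guided_step(x_t, eps, clf, target_class=3).shape)      # torch.Size([1, 3, 8, 8])
```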

[CV-50] EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO

【Quick Read】: This paper addresses the limited performance of multimodal large language models (MLLMs) in industrial anomaly detection (IAD) due to missing domain adaptation. The key solution is **EMIT**, a unified framework centered on difficulty-aware group relative policy optimization (GRPO): a response resampling strategy ensures correct answers are included among sampled responses for hard cases, and an advantage reweighting mechanism strengthens learning from those difficult samples. EMIT additionally constructs a multi-task IAD dataset, uses GPT-generated object text descriptions to compensate for missing defective images, and, for few-shot anomaly detection, integrates soft prompts with heatmap-guided contrastive embeddings derived from patch-level comparisons, markedly improving MLLM accuracy and robustness on IAD.

Link: https://arxiv.org/abs/2507.21619
Authors: Wei Guan,Jun Lan,Jian Cao,Hao Tan,Huijia Zhu,Weiqiang Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
zh

[CV-51] Wind Turbine Feature Detection Using Deep Learning and Synthetic Data

【Quick Read】: This paper addresses the limited quantity and environmental diversity of manually labeled real-world images used to train detectors for autonomous drone inspection of wind turbine (WT) blades, which hurts robustness under varying weather, lighting, and turbine types. The key solution is a synthetic training data generation method with controlled variation of visual and environmental factors, plus a YOLOv11 feature detection network trained with a modified loss function solely on synthetic WT images; the resulting detector accurately finds WTs and their key features, reaching a Pose mAP50-95 of 0.97 on real images never seen during training.

Link: https://arxiv.org/abs/2507.21611
Authors: Arash Shahirpour,Jakob Gebler,Manuel Sanders,Tim Reuscher
Affiliations: Institute of Automatic Control (IRT); RWTH Aachen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, accepted at ICMV 2025

Click to view abstract

Abstract:For the autonomous drone-based inspection of wind turbine (WT) blades, accurate detection of the WT and its key features is essential for safe drone positioning and collision avoidance. Existing deep learning methods typically rely on manually labeled real-world images, which limits both the quantity and the diversity of training datasets in terms of weather conditions, lighting, turbine types, and image complexity. In this paper, we propose a method to generate synthetic training data that allows controlled variation of visual and environmental factors, increasing the diversity and hence creating challenging learning scenarios. Furthermore, we train a YOLOv11 feature detection network solely on synthetic WT images with a modified loss function, to detect WTs and their key features within an image. The resulting network is evaluated both using synthetic images and a set of real-world WT images and shows promising performance across both synthetic and real-world data, achieving a Pose mAP50-95 of 0.97 on real images never seen during training.
zh

[CV-52] Research Challenges and Progress in the End-to-End V2X Cooperative Autonomous Driving Competition ICCV

【Quick Read】: This paper addresses key technical difficulties in vehicle-to-everything (V2X) cooperative perception and planning: efficiently fusing heterogeneous multi-source sensor data from ego vehicles and infrastructure under limited communication bandwidth and dynamic environments, so as to extend perception range and improve decision reliability. The key contribution is a unified evaluation setup, the End-to-End Autonomous Driving through V2X Cooperation Challenge, built on the UniV2X framework and the V2X-Seq-SPD dataset, with tracks for cooperative temporal perception and cooperative end-to-end planning. The challenge systematically advances bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration, providing a practical benchmark for scalable, reliable V2X cooperative autonomous driving systems.

Link: https://arxiv.org/abs/2507.21610
Authors: Ruiyang Hao,Haibao Yu,Jiaru Zhong,Chuanye Wang,Jiahao Wang,Yiming Kan,Wenxian Yang,Siqi Fan,Huilin Yin,Jianing Qiu,Yao Mu,Jiankai Sun,Li Chen,Walter Zimmer,Dandan Zhang,Shanghang Zhang,Mac Schwager,Wei Huang,Xiaobo Zhang,Ping Luo,Zaiqing Nie
Affiliations: Tsinghua University; Hong Kong University; Tongji University; Chinese University of Hong Kong; Shanghai Jiao Tong University; Stanford University; OpenDriveLab; Technical University of Munich; Imperial College London; Peking University; Shanghai Songying Technology Co., Ltd
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, accepted by ICCVW

Click to view abstract

Abstract:With the rapid advancement of autonomous driving technology, vehicle-to-everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety by providing visibility beyond the line of sight. However, integrating multi-source sensor data from both ego-vehicles and infrastructure under real-world constraints, such as limited communication bandwidth and dynamic environments, presents significant technical challenges. To facilitate research in this area, we organized the End-to-End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end-to-end planning. Built on the UniV2X framework and the V2X-Seq-SPD dataset, the challenge attracted participation from over 30 teams worldwide and established a unified benchmark for evaluating cooperative driving systems. This paper describes the design and outcomes of the challenge, highlights key research problems including bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration, and analyzes emerging technical trends among top-performing solutions. By addressing practical constraints in communication and data fusion, the challenge contributes to the development of scalable and reliable V2X-cooperative autonomous driving systems.
zh

[CV-53] Semantic Segmentation of iPS Cells: Case Study on Model Complexity in Biomedical Imaging

【Quick Read】: This paper addresses how to achieve accurate and robust medical image segmentation under challenging imaging conditions such as low contrast and ambiguous boundaries. The key finding is that a carefully configured DeepLabv3 model achieves high performance in segmenting induced pluripotent stem (iPS) cell colonies and, under the reported experimental conditions, outperforms large foundation models such as SAM2 and MedSAM2 without structural modification. This suggests that for specialized tasks with subtle, low-contrast boundaries, greater model complexity does not necessarily bring gains, while appropriately adapted smaller models can offer strong accuracy and practical reliability.

Link: https://arxiv.org/abs/2507.21608
Authors: Maoquan Zhang,Bisser Raytchev,Xiujuan Sun
Affiliations: Hiroshima University; Weifang University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19th International Conference on Machine Vision Applications MVA2025

Click to view abstract

Abstract:Medical image segmentation requires not only accuracy but also robustness under challenging imaging conditions. In this study, we show that a carefully configured DeepLabv3 model can achieve high performance in segmenting induced pluripotent stem (iPS) cell colonies, and, under our experimental conditions, outperforms large-scale foundation models such as SAM2 and its medical variant MedSAM2 without structural modifications. These results suggest that, for specialized tasks characterized by subtle, low-contrast boundaries, increased model complexity does not necessarily translate to better performance. Our work revisits the assumption that ever-larger and more generalized architectures are always preferable, and provides evidence that appropriately adapted, simpler models may offer strong accuracy and practical reliability in domain-specific biomedical applications. We also offer an open-source implementation that includes strategies for small datasets and domain-specific encoding, with the aim of supporting further advances in semantic segmentation for regenerative medicine and related fields.
zh
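A "carefully configured DeepLabv3" is easy to set up with torchvision. The sketch below adapts the pretrained model to a two-class (colony versus background) head; the ResNet-50 backbone and binary head are assumptions, since the abstract does not specify the configuration. Note that `weights="DEFAULT"` downloads pretrained weights on first use.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Start from COCO/VOC-pretrained DeepLabv3 and swap in a 2-class head.
model = deeplabv3_resnet50(weights="DEFAULT")
model.classifier[4] = torch.nn.Conv2d(256, 2, kernel_size=1)      # colony / background
model.aux_classifier[4] = torch.nn.Conv2d(256, 2, kernel_size=1)  # auxiliary head

x = torch.rand(1, 3, 512, 512)          # one RGB microscopy tile
with torch.no_grad():
    out = model.eval()(x)["out"]
print(out.shape)                         # torch.Size([1, 2, 512, 512])
```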

[CV-54] Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking AAAI2025

【Quick Read】: This paper addresses the reliance of visual tracking on large amounts of manually annotated bounding boxes, which limits dataset scale and diversity. The key innovation is a self-supervised tracking framework (denoted \tracker) built on a decoupled spatio-temporal consistency training scheme: global spatial localization learns target information across timestamps, while local temporal association models instance appearance and motion changes, simulating real-world variation. An instance contrastive loss additionally learns instance-level correspondences from multiple views, providing robust instance-level supervision without extra labels. This paradigm lets the model learn generic tracking representations in a self-supervised manner and sharply reduces dependence on large-scale box annotations.

Link: https://arxiv.org/abs/2507.21606
Authors: Yaozong Zheng,Bineng Zhong,Qihua Liang,Ning Li,Shuxiang Song
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Click to view abstract

Abstract:The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named \tracker, designed to eliminate the need of box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables \tracker to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that \tracker surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, TrackingNet datasets, respectively. Code: this https URL.
zh
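The instance contrastive loss from a multi-view perspective is close in spirit to a standard InfoNCE objective, sketched below: two views of the same instances are positives, all other instances in the batch are negatives. The exact multi-view formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE-style loss over instance embeddings from two views:
    matching rows of z1 and z2 are positives, the rest negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 256), torch.randn(8, 256)  # embeddings of 8 instances, 2 views
print(instance_contrastive_loss(z1, z2).item())
```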

[CV-55] Locally Controlled Face Aging with Latent Diffusion Models

【Quick Read】: This paper addresses a limitation of current face aging methods, which treat aging as a global, homogeneous process and ignore that facial regions age heterogeneously due to intrinsic chronological factors and extrinsic ones such as sun exposure. The key solution uses latent diffusion models to selectively age specific facial regions via local aging signs, and a latent diffusion refiner to seamlessly blend the locally aged regions, achieving finer-grained, more realistic, and more controllable face aging while keeping the result globally consistent.

Link: https://arxiv.org/abs/2507.21600
Authors: Lais Isabelle Alves dos Santos,Julien Despois,Thibaut Chauffier,Sileye O. Ba,Giovanni Palma
Affiliations: L'Oréal AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present a novel approach to face aging that addresses the limitations of current methods which treat aging as a global, homogeneous process. Existing techniques using GANs and diffusion models often condition generation on a reference image and target age, neglecting that facial regions age heterogeneously due to both intrinsic chronological factors and extrinsic elements like sun exposure. Our method leverages latent diffusion models to selectively age specific facial regions using local aging signs. This approach provides significantly finer-grained control over the generation process, enabling more realistic and personalized aging. We employ a latent diffusion refiner to seamlessly blend these locally aged regions, ensuring a globally consistent and natural-looking synthesis. Experimental results demonstrate that our method effectively achieves three key criteria for successful face aging: robust identity preservation, high-fidelity and realistic imagery, and a natural, controllable aging progression.
zh

[CV-56] Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning ICCV2025

【Quick Read】: This paper addresses a key challenge in audio-visual multi-task incremental learning: continually learning multiple audio-visual tasks without joint training on all of them, while retaining old-task knowledge and facilitating new-task learning. The core solution is a three-stage Progressive Homeostatic and Plastic prompt (PHP) method: in the shallow phase, a task-shared modality aggregating adapter strengthens cross-task and cross-modal representation learning; in the middle phase, a task-specific modality-shared dynamic generating adapter builds prompts tailored to each task yet general across modalities, balancing knowledge retention against forgetting with multi-task transferability; in the deep phase, task-specific modality-independent prompts refine understanding per task and per modality. By retaining task-specific prompts while adapting shared parameters to new tasks, PHP effectively trades off knowledge sharing and specificity, reaching SOTA across different orders of four tasks (AVE, AVVP, AVS, and AVQA).

Link: https://arxiv.org/abs/2507.21588
Authors: Jiong Yin,Liang Li,Jiehua Zhang,Yuhan Gao,Chenggang Yan,Xichun Sheng
Affiliations: Hangzhou Dianzi University; Institute of Computing Technology, Chinese Academy of Sciences; Xi'an Jiaotong University; Macao Polytechnic University
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Click to view abstract

Abstract:Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge of the problem is how to preserve the old task knowledge while facilitating the learning of new task with previous experiences. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, which balances the model's ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine understanding by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of four tasks (AVE, AVVP, AVS and AVQA). Our code can be available at this https URL.
zh

[CV-57] Emerging Trends in Pseudo-Label Refinement for Weakly Supervised Semantic Segmentation with Image-Level Supervision

【Quick Read】: This paper addresses the challenge of weakly supervised semantic segmentation (WSSS) with image-level labels only, i.e., achieving accurate dense prediction without pixel-level annotation. Its key contribution is a systematic review that categorizes recent methods by the types and levels of additional supervision involved and traces the evolution of mainstream research directions. It further examines the underexplored generalization difficulties advanced methods face on domain-specific datasets, evaluates the limitations of existing approaches, and outlines promising future directions such as improving robustness, refining supervision mechanisms, and exploring more effective transfer strategies.

Link: https://arxiv.org/abs/2507.21587
Authors: Zheyuan Zhang,Wang Zhang
Affiliations: University of Amsterdam; Beijing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Unlike fully supervised semantic segmentation, weakly supervised semantic segmentation (WSSS) relies on weaker forms of supervision to perform dense prediction tasks. Among the various types of weak supervision, WSSS with image level annotations is considered both the most challenging and the most practical, attracting significant research attention. Therefore, in this review, we focus on WSSS with image level annotations. Additionally, this review concentrates on mainstream research directions, deliberately omitting less influential branches. Given the rapid development of new methods and the limitations of existing surveys in capturing recent trends, there is a pressing need for an updated and comprehensive review. Our goal is to fill this gap by synthesizing the latest advancements and state-of-the-art techniques in WSSS with image level labels. Basically, we provide a comprehensive review of recent advancements in WSSS with image level labels, categorizing existing methods based on the types and levels of additional supervision involved. We also examine the challenges of applying advanced methods to domain-specific datasets in WSSS, a topic that remains underexplored. Finally, we discuss the current challenges, evaluate the limitations of existing approaches, and outline several promising directions for future research. This review is intended for researchers who are already familiar with the fundamental concepts of WSSS and are seeking to deepen their understanding of current advances and methodological innovations.
zh

[CV-58] TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉-语言推理中产生的幻觉问题,即模型生成看似合理但事实错误或缺乏视觉依据的输出,从而影响其可靠性。现有基于直接偏好优化(Direct Preference Optimization, DPO)的方法通常将与幻觉相关的偏好视为固定目标,依赖静态监督信号进行训练,容易过度拟合偏好数据中的表面语言线索,导致分布刚性和虚假相关性,削弱对因果相关视觉信息的 grounding 能力。解决方案的关键在于提出 TARS(Token-adaptive Preference Strategy),它将 DPO 重新建模为一个 min-max 优化问题:一方面在语义约束下最大化 token 级别的分布偏移以模拟对齐不确定性,另一方面在这些受控扰动下最小化期望偏好损失。该联合目标在保持因果 grounding 的同时缓解了对偏好模式的过拟合,显著降低幻觉率——仅用 4.8k 偏好样本且无需专家反馈,即可将幻觉率从 26.4% 降至 13.2%,认知价值从 2.5 降至 0.4,并优于标准 DPO 且达到 GPT-4o 水平。

链接: https://arxiv.org/abs/2507.21584
作者: Kejia Zhang,Keda Tao,Zhiming Luo,Chang Liu,Jiasheng Tang,Huan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
zh

[CV-59] LinDeps: A Fine-tuning Free Post-Pruning Method to Remove Layer-Wise Linear Dependencies with Guaranteed Performance Preservation

【Quick Read】: This paper addresses the efficiency bottleneck of deploying convolutional neural networks (CNNs) on resource-constrained platforms, where existing pruning techniques overlook structural dependencies among feature maps within a layer and so make suboptimal pruning decisions. The key contribution is **LinDeps**, a post-pruning method applicable on top of any pruning technique: it systematically identifies and removes redundant filters via linear dependency analysis using pivoted QR decomposition of feature maps, and a novel signal recovery mechanism adjusts the next layer's kernels to preserve compatibility and performance without any fine-tuning, improving compression rates while maintaining accuracy.

Link: https://arxiv.org/abs/2507.21573
Authors: Maxim Henry,Adrien Deliège,Anthony Cioppa,Marc Van Droogenbroeck
Affiliations: Montefiore Institute, University of Liège, Liège, Belgium
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 5 tables, 45 references

Click to view abstract

Abstract:Convolutional Neural Networks (CNN) are widely used in many computer vision tasks. Yet, their increasing size and complexity pose significant challenges for efficient deployment on resource-constrained platforms. Hence, network pruning has emerged as an effective way of reducing the size and computational requirements of neural networks by removing redundant or unimportant parameters. However, a fundamental challenge with pruning consists in optimally removing redundancies without degrading performance. Most existing pruning techniques overlook structural dependencies across feature maps within a layer, resulting in suboptimal pruning decisions. In this work, we introduce LinDeps, a novel post-pruning method, i.e., a pruning method that can be applied on top of any pruning technique, which systematically identifies and removes redundant filters via linear dependency analysis. Particularly, LinDeps applies pivoted QR decomposition to feature maps to detect and prune linearly dependent filters. Then, a novel signal recovery mechanism adjusts the next layer’s kernels to preserve compatibility and performance without requiring any fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet backbones demonstrate that LinDeps improves compression rates of existing pruning techniques while preserving performances, leading to a new state of the art in CNN pruning. We also benchmark LinDeps in low-resource setups where no retraining can be performed, which shows significant pruning improvements and inference speedups over a state-of-the-art method. LinDeps therefore constitutes an essential add-on for any current or future pruning technique.
zh
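The pivoted QR redundancy test is concrete enough to sketch. Below, each filter's feature map is flattened into a column, pivoted QR is run, and columns whose R diagonal falls below a tolerance are flagged as linearly dependent on earlier ones. The flattening scheme and tolerance are assumptions, not LinDeps' exact settings.

```python
import numpy as np
from scipy.linalg import qr

def redundant_filters(feature_maps, tol=1e-6):
    """Flag filters whose feature maps are (numerically) linear
    combinations of the others, via pivoted QR. `feature_maps` has
    shape (C, H, W); returns indices of prunable filters."""
    C = feature_maps.shape[0]
    A = feature_maps.reshape(C, -1).T               # columns = filters
    _, R, piv = qr(A, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))                       # non-increasing under pivoting
    rank = int((diag > tol * diag[0]).sum())
    return sorted(piv[rank:])                       # columns beyond the rank

fm = np.random.randn(8, 16, 16)
fm[5] = 2 * fm[0] + 3 * fm[1]                       # plant one dependency
print(redundant_filters(fm))                        # one index from the set {0, 1, 5}
```

Note that pivoting decides which member of a dependent group gets flagged; LinDeps then repairs the next layer's kernels so the pruned filter's contribution is preserved.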

[CV-60] RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic Priors

【Quick Read】: This paper addresses the limited accuracy and generalization of online high-definition (HD) map construction caused by neglecting the spatial relations and semantic priors among map elements. The key innovations of **RelMap** are: a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements with a learnable class-aware relation encoder; and a Mixture-of-Experts (MoE)-based Semantic Prior, which routes features to class-specific experts according to predicted class probabilities to refine instance feature decoding. The method is compatible with both single-frame and temporal perception backbones and reaches state-of-the-art performance on nuScenes and Argoverse 2.

Link: https://arxiv.org/abs/2507.21567
Authors: Tianhui Cai,Yun Zhang,Zewei Zhou,Zhiyu Huang,Jiaqi Ma
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Online high-definition (HD) map construction plays an increasingly important role in scaling autonomous driving systems. Transformer-based methods have become prevalent in online HD map construction; however, existing approaches often neglect the inherent spatial and semantic relationships among map elements, which limits their accuracy and generalization. To address this, we propose RelMap, an end-to-end framework that enhances online map construction by incorporating spatial relations and semantic priors. We introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we propose a Mixture-of-Experts (MoE)-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. Our method is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
zh
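Routing features to class-specific experts by predicted class probabilities can be sketched briefly. The soft (probability-weighted) mixing below is one plausible reading of the abstract; RelMap's actual routing may be harder or sparser.

```python
import torch
import torch.nn as nn

class ClassRoutedMoE(nn.Module):
    """Route instance features through class-specific experts, weighted
    by predicted class probabilities. A sketch of the idea behind
    RelMap's MoE-based semantic prior, not its exact design."""
    def __init__(self, dim=256, num_classes=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_classes))

    def forward(self, feat, class_probs):           # (N, D), (N, K)
        expert_out = torch.stack([e(feat) for e in self.experts], dim=1)  # (N, K, D)
        return (class_probs.unsqueeze(-1) * expert_out).sum(dim=1)        # (N, D)

moe = ClassRoutedMoE()
feat = torch.randn(5, 256)                          # 5 map-element instances
probs = torch.softmax(torch.randn(5, 3), dim=-1)    # e.g. divider/boundary/crossing
print(moe(feat, probs).shape)                       # torch.Size([5, 256])
```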

[CV-61] Multi-View Reconstruction with Global Context for 3D Anomaly Detection

【Quick Read】: This paper addresses the performance degradation of high-precision 3D anomaly detection caused by insufficient global information. The key solution, **Multi-View Reconstruction (MVR)**, losslessly converts high-resolution point clouds into multi-view images and builds a reconstruction-based anomaly detection framework on top, strengthening global information learning and reaching 89.6% object-wise and 95.7% point-wise AU-ROC on the Real3D-AD benchmark.

Link: https://arxiv.org/abs/2507.21555
Authors: Yihan Sun,Yuqi Cheng,Yunkang Cao,Yuxin Zhang,Weiming Shen
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC), 2025

Click to view abstract

Abstract:3D anomaly detection is critical in industrial quality inspection. While existing methods achieve notable progress, their performance degrades in high-precision 3D anomaly detection due to insufficient global information. To address this, we propose Multi-View Reconstruction (MVR), a method that losslessly converts high-resolution point clouds into multi-view images and employs a reconstruction-based anomaly detection framework to enhance global information learning. Extensive experiments demonstrate the effectiveness of MVR, achieving 89.6% object-wise AU-ROC and 95.7% point-wise AU-ROC on the Real3D-AD benchmark.
zh

[CV-62] Sun sensor calibration algorithms: A systematic mapping and survey

【Quick Read】: This paper addresses accuracy degradation during the calibration of sun sensors caused by multiple uncertainty sources (manufacturing, electrical, environmental, interference), which are small in magnitude, vary spatio-temporally over the sensor lifecycle, and, in the literature, have never been systematically consolidated. The key contribution is a systematic mapping and comprehensive survey of sun sensor modeling and calibration algorithms across a breadth of sensor configurations, together with an analysis of research gaps and concrete recommendations for improving calibration accuracy and lifecycle stability.

Link: https://arxiv.org/abs/2507.21541
Authors: Michael Herman,Olivia J. Pinon Fischer,Dimitri N. Mavris
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: Submitted to Acta Astronautica

Click to view abstract

Abstract:Attitude sensors determine the spacecraft attitude through the sensing of an astronomical object, field or other phenomena. The Sun and fixed stars are the two primary astronomical sensing objects. Attitude sensors are critical components for the survival and knowledge improvement of spacecraft. Of these, sun sensors are the most common and important sensor for spacecraft attitude determination. The sun sensor measures the Sun vector in spacecraft coordinates. The sun sensor calibration process is particularly difficult due to the complex nature of the uncertainties involved. The uncertainties are small, difficult to observe, and vary spatio-temporally over the lifecycle of the sensor. In addition, the sensors are affected by numerous sources of uncertainties, including manufacturing, electrical, environmental, and interference sources. This motivates the development of advanced calibration algorithms to minimize uncertainty over the sensor lifecycle and improve accuracy. Although modeling and calibration techniques for sun sensors have been explored extensively in the literature over the past two decades, there is currently no resource that consolidates and systematically reviews this body of work. The present review proposes a systematic mapping of sun sensor modeling and calibration algorithms across a breadth of sensor configurations. It specifically provides a comprehensive survey of each methodology, along with an analysis of research gaps and recommendations for future directions in sun sensor modeling and calibration techniques.
zh

[CV-63] PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

【Quick Read】: This paper addresses the vulnerability of large vision-language models (LVLMs) to sophisticated adversarial attacks despite safety alignment: existing jailbreak methods mostly rely on explicit harmful prompts and overlook subtle weaknesses that arise when the model composes information over multi-step reasoning. Inspired by Return-Oriented Programming (ROP) from software security, the key idea decomposes a harmful instruction into a sequence of individually benign visual "gadgets" and uses a carefully engineered textual prompt to make the model integrate them during reasoning into a coherent harmful output, so the malicious intent emerges as a property of the reasoning process and is hard to detect from any single component. The attack reaches near-perfect success rates (over 0.90 on SafeBench) and improves ASR by up to 0.39, exposing a critical weakness in LVLM compositional reasoning and the urgent need for defenses that secure the entire reasoning process.

Link: https://arxiv.org/abs/2507.21540
Authors: Quanchen Zou,Zonghao Ying,Moyang Chen,Wenzhuo Xu,Yisong Xiao,Yakai Li,Deyue Zhang,Dongdong Yang,Zhao Liu,Xiangzheng Zhang
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.

[CV-64] Suppressing Gradient Conflict for Generalizable Deepfake Detection

[Quick Read]: This paper tackles the counterintuitive performance degradation observed when deepfake detectors are jointly trained on original real data and online-synthesized forgeries, which contradicts the common belief that more source-domain data should improve detection accuracy. Empirical analysis traces the degradation to gradient conflicts during backpropagation, which force a trade-off between source-domain accuracy and target-domain generalization. The proposed Conflict-Suppressed Deepfake Detection (CS-DFD) framework resolves this with two synergistic modules: an Update Vector Search (UVS) module, which casts the gradient search as an extremum-optimization problem to find an update direction that simultaneously reduces the losses on both data types, and a Conflict Gradient Reduction (CGR) module, which introduces a Conflict Descent Loss to enforce a low-conflict feature embedding space by penalizing misaligned gradient directions. Together they mitigate gradient interference in both parameter optimization and representation learning, yielding state-of-the-art in-domain accuracy and cross-domain generalization.

Link: https://arxiv.org/abs/2507.21530
Authors: Ming-Hui Liu,Harry Cheng,Xin Luo,Xin-Shun Xu
Affiliations: Shandong University (山东大学); Quan Cheng Laboratory (全城实验室); National University of Singapore (新加坡国立大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: V1

Abstract:Robust deepfake detection models must be capable of generalizing to ever-evolving manipulation techniques beyond training data. A promising strategy is to augment the training data with online synthesized fake images containing broadly generalizable artifacts. However, in the context of deepfake detection, it is surprising that jointly training on both original and online synthesized forgeries may result in degraded performance. This contradicts the common belief that incorporating more source-domain data should enhance detection accuracy. Through empirical analysis, we trace this degradation to gradient conflicts during backpropagation which force a trade-off between source domain accuracy and target domain generalization. To overcome this issue, we propose a Conflict-Suppressed Deepfake Detection (CS-DFD) framework that explicitly mitigates the gradient conflict via two synergistic modules. First, an Update Vector Search (UVS) module searches for an alternative update vector near the initial gradient vector to reconcile the disparities of the original and online synthesized forgeries. By further transforming the search process into an extremum optimization problem, UVS yields a unique update vector, which maximizes the simultaneous loss reductions for each data type. Second, a Conflict Gradient Reduction (CGR) module enforces a low-conflict feature embedding space through a novel Conflict Descent Loss. This loss penalizes misaligned gradient directions and guides the learning of representations with aligned, non-conflicting gradients. The synergy of UVS and CGR alleviates gradient interference in both parameter optimization and representation learning. Experiments on multiple deepfake benchmarks demonstrate that CS-DFD achieves state-of-the-art performance in both in-domain detection accuracy and cross-domain generalization.
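To make the gradient-conflict idea concrete, here is a minimal sketch that detects a conflict between the two per-source gradients via their inner product and resolves it with a PCGrad-style projection. This is an illustrative stand-in for the paper's UVS extremum optimization, whose exact form is not given in the abstract; the function name and toy inputs are ours.

```python
import torch

def combine_gradients(g_orig: torch.Tensor, g_synth: torch.Tensor) -> torch.Tensor:
    """Combine gradients from original and online-synthesized data.

    A conflict is flagged when the inner product is negative; each
    gradient is then projected onto the normal plane of the other before
    averaging (a PCGrad-style heuristic, not the paper's UVS solver).
    """
    if torch.dot(g_orig, g_synth) < 0:  # conflicting directions
        g_o = g_orig - (torch.dot(g_orig, g_synth) / g_synth.norm() ** 2) * g_synth
        g_s = g_synth - (torch.dot(g_synth, g_orig) / g_orig.norm() ** 2) * g_orig
        return 0.5 * (g_o + g_s)
    return 0.5 * (g_orig + g_synth)

# Toy check with two conflicting 2-D gradients.
print(combine_gradients(torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 1.0])))
```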

[CV-65] Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance ACM-MM2025

[Quick Read]: This paper targets two core challenges in cooking-process visualization: ingredients change appearance across cooking steps, causing semantic inconsistency between generated images and textual descriptions, and each step may depend on previous operations, requiring contextual coherence across the image sequence. The proposed Chain-of-Cooking model addresses these with (1) a Dynamic Patch Selection Module that retrieves previously generated image patches most relevant to the current text as references, keeping the appearance of each step semantically consistent with its description, and (2) a Semantic Evolution Module together with Bidirectional Chain-of-Thought (CoT) Guidance: the former establishes semantic associations between latent prompts and the current cooking step and merges them into the latent features, while the latter updates the merged features to keep the generated sequence coherent and correctly ordered.

Link: https://arxiv.org/abs/2507.21529
Authors: Mengling Xu,Ming Tao,Bing-Kun Bao
Affiliations: Nanjing University of Posts and Telecommunications (南京邮电大学); Peng Cheng Laboratory (鹏城实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025

Abstract:Cooking process visualization is a promising task in the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods based on the given recipes, and face two challenges to visualize the cooking process. First, the appearance of ingredients changes variously across cooking steps, it is difficult to generate the correct appearances of foods that match the textual description, leading to semantic inconsistency. Second, the current step might depend on the operations of previous step, it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model, called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module to retrieve previously generated image patches as references, which are most related to current textual contents. Furthermore, to enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantical association between latent prompts and current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features to guide the current cooking step remain coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantic consistent cooking process.

[CV-66] Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration

[Quick Read]: This paper addresses two core problems of active learning (AL) with large vision-language models: accurately estimating uncertainty for effective sample selection, and performing computationally efficient sampling given the huge number of model parameters. The key idea is a parameter-efficient training method that embeds an uncertainty calibration loss into the AL framework, with a differentiable loss function that promotes well-calibrated uncertainty so that the fewest, most informative samples can be selected for fine-tuning. Experiments across several datasets and vision backbones show that the approach matches or exceeds complex feature-based sampling techniques while remaining highly computationally efficient.

Link: https://arxiv.org/abs/2507.21521
Authors: Athmanarayanan Lakshmi Narayanan,Amrutha Machireddy,Ranganath Krishnan
Affiliations: Intel Labs (英特尔实验室); Intel Corporation (英特尔公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Joint Conference on Neural Networks 2025 (Accepted)

Abstract:Active Learning (AL) has emerged as a powerful approach for minimizing labeling costs by selectively sampling the most informative data for neural network model development. Effective AL for large-scale vision-language models necessitates addressing challenges in uncertainty estimation and efficient sampling given the vast number of parameters involved. In this work, we introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the AL framework. We propose a differentiable loss function that promotes uncertainty calibration for effectively selecting fewer and most informative data samples for fine-tuning. Through extensive experiments across several datasets and vision backbones, we demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques while being computationally very efficient. Additionally, we investigate the efficacy of Prompt learning versus Low-rank adaptation (LoRA) in sample selection, providing a detailed comparative analysis of these methods in the context of efficient AL.
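The abstract does not spell out the calibration loss, so the following sketch assumes a simple reading of it: standard cross-entropy plus a differentiable penalty pulling the winning-class confidence toward the 0/1 correctness indicator, with entropy-based acquisition over the unlabeled pool. All names and the weighting `lam` are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def calibration_aware_loss(logits, labels, lam=0.5):
    """Cross-entropy plus a simple differentiable calibration penalty
    that pulls the top-class confidence toward the 0/1 correctness
    indicator (an assumed, minimal form of a calibration loss)."""
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    correct = (pred == labels).float()  # indicator, not differentiated through
    return ce + lam * ((conf - correct) ** 2).mean()

@torch.no_grad()
def select_most_uncertain(pool_logits, k):
    """Acquisition: pick the k pool samples with highest predictive entropy."""
    probs = pool_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.topk(k).indices
```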

[CV-67] VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding

[Quick Read]: This paper addresses the lack of a unified framework and benchmark that supports both anomaly understanding and anomaly grounding in video anomaly detection (VAD): traditional DNN-based methods focus on temporal localization while LLM-based methods emphasize semantic understanding, and the two are rarely combined. The authors introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks, with annotations for anomaly category, semantic explanation, precise temporal grounding, and Video QA (including multiple-choice QA for objective evaluation). On top of the dataset they propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts that first coarsely localizes high-probability anomalous regions and then performs detailed anomaly interpretation and temporal boundary refinement, and they introduce the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision to overcome the limitations of traditional metrics.

Link: https://arxiv.org/abs/2507.21507
Authors: Shibo Gao,Peipei Yang,Yangyang Liu,Yi Chen,Han Zhu,Xuyao Zhang,Linlin Huang
Affiliations: Beijing Jiaotong University (北京交通大学); State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 21 pages, 19 figures, 8 tables

Abstract:Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. Current VAD methods mainly fall into two categories: traditional DNN-based approaches that focus on temporal localization, and LLM-based approaches that emphasize semantic understanding. Both anomaly understanding and grounding are essential for comprehensive video anomaly detection and can complement each other. However, no existing model or dataset supports both tasks simultaneously. To address this, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks. Each VAGU instance includes annotations for anomaly category, semantic explanation, precise temporal grounding and Video QA. We also provide multiple-choice Video QA for objective evaluation. Based on this dataset, we propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. The framework first enables coarse localization of high-probability anomalous regions, followed by detailed anomaly interpretation and temporal boundary refinement. Additionally, we propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision, overcoming the limitations of traditional metrics. Extensive experiments verify the effectiveness of our benchmark, framework, and evaluation metric.

[CV-68] MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions

[Quick Read]: This paper addresses the lack of honest behavior in multimodal large language models (MLLMs) when they face visually unanswerable questions, where models often produce plausible-sounding but incorrect or fabricated answers that undermine trustworthiness. The key contributions are: constructing MoHoBench, a large-scale, high-quality benchmark of 12k+ visual question samples covering four representative types of unanswerable questions, with reliability ensured through multi-stage filtering and human verification; systematically benchmarking the honesty of 28 popular MLLMs, which reveals that most models fail to refuse to answer when they should, and that honesty is not solely a language-modeling issue but is deeply shaped by visual information processing; and implementing initial alignment methods based on supervised and preference learning to improve refusal behavior in visually unanswerable scenarios, laying a foundation for future work on trustworthy MLLMs.

Link: https://arxiv.org/abs/2507.21503
Authors: Yanxu Zhu,Shitong Duan,Xiangxu Zhang,Jitao Sang,Peng Zhang,Tun Lu,Xiao Zhou,Jing Yao,Xiaoyuan Yi,Xing Xie
Affiliations: Beijing Jiaotong University (北京交通大学); Fudan University (复旦大学); Microsoft Research Asia (微软亚洲研究院); Renmin University of China (中国人民大学)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark, consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at this https URL.

[CV-69] Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval ICCV2025

[Quick Read]: This paper addresses weak generalization in open-set 3D object retrieval (3DOR) caused by insufficient 3D training data: existing methods typically use all modalities (voxels, point clouds, multi-view images) with modality-specific backbones, yet still struggle on unseen categories. The proposed Describe, Adapt and Combine (DAC) framework achieves strong generalization from multi-view images alone by synergizing a CLIP model with a multimodal large language model (MLLM) used for dual purposes: during training, the MLLM describes seen-category information to align with CLIP's training objective for adaptation, and at inference it provides external hints about unknown objects that complement the visual cues. An Additive-Bias Low-Rank adaptation (AB-LoRA) further alleviates overfitting and improves generalization to unseen categories. With multi-view images only, DAC surpasses prior art by an average of +10.01% mAP on four open-set 3DOR datasets, and its generalization is additionally validated in image-based and cross-dataset setups.

Link: https://arxiv.org/abs/2507.21489
Authors: Zhichuan Wang,Yang Zhou,Zhe Liu,Rui Yu,Song Bai,Yulong Wang,Xinwei He,Xiang Bai
Affiliations: Huazhong Agricultural University (华中农业大学); Shenzhen University (深圳大学); The University of Hong Kong (香港大学); University of Louisville (路易斯维尔大学); ByteDance (字节跳动); Huazhong University of Science and Technology (华中科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025

Abstract:Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP’s training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at this https URL.

[CV-70] An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

[Quick Read]: This paper addresses inaccurate angular feature modeling of light fields in complex low-light scenes, where reliable, discriminative spatial-angular cues for moving targets are hard to extract over time. The core solution is a novel light-field epipolar-plane structure image (ESI) representation that explicitly models the geometric structure of the light field: by exploiting abrupt changes in the angles of light rays within the epipolar plane, it enhances visual expression in low-light scenes while reducing redundancy in the high-dimensional light field. Building on this, an angular-temporal interaction network (ATINet) learns angular-aware representations from geometric structural cues and angular-temporal interaction cues, and can additionally be optimized in a self-supervised manner to strengthen geometric feature interaction across the temporal domain. A large-scale low-light light-field tracking dataset is also introduced, on which the method achieves state-of-the-art single-object tracking and extends effectively to multi-object tracking.

Link: https://arxiv.org/abs/2507.21460
Authors: Mianzhao Wang,Fan Shi,Xu Cheng,Feifei Zhang,Shengyong Chen
Affiliations: Tianjin University of Technology (天津理工大学); Engineering Research Center of Learning-Based Intelligent System (教育部学习型智能系统工程研究中心); Key Laboratory of Computer Vision and System (教育部计算机视觉与系统重点实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.

[CV-71] Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation

[Quick Read]: This paper addresses the prohibitive training cost of large datasets by distilling images and their self-supervisedly trained representations into a compact set with strong generalization, which can replace the full dataset for training. The key techniques are: (1) parameterizing images and representations via distinct low-dimensional bases, where experiments show the choice of basis is crucial for preserving the characteristics of the original data; (2) mitigating the instability induced by the randomness of data augmentation in self-supervised learning by using predetermined augmentations; and (3) employing a lightweight network to model the connections among representations of augmented views of the same image, yielding more compact distillation pairs. Together these choices substantially improve distillation efficiency, cross-architecture generalization, and transfer learning performance.

Link: https://arxiv.org/abs/2507.21455
Authors: Sheng-Feng Yu,Jia-Jiun Yao,Wei-Chen Chiu
Affiliations: National Yang Ming Chiao Tung University (国立阳明交通大学); Macronix International Co., Ltd. (美光科技国际有限公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although larger datasets are crucial for training large deep models, the rapid growth of dataset size has brought a significant challenge in terms of considerable training costs, which even results in prohibitive computational expenses. Dataset Distillation becomes a popular technique recently to reduce the dataset size via learning a highly compact set of representative exemplars, where the model trained with these exemplars ideally should have comparable performance with respect to the one trained with the full dataset. While most of existing works upon dataset distillation focus on supervised datasets, we instead aim to distill images and their self-supervisedly trained representations into a distilled set. This procedure, named as Self-Supervised Dataset Distillation, effectively extracts rich information from real datasets, yielding the distilled sets with enhanced cross-architecture generalizability. Particularly, in order to preserve the key characteristics of original dataset more faithfully and compactly, several novel techniques are proposed: 1) we introduce an innovative parameterization upon images and representations via distinct low-dimensional bases, where the base selection for parameterization is experimentally shown to play a crucial role; 2) we tackle the instability induced by the randomness of data augmentation – a key component in self-supervised learning but being underestimated in the prior work of self-supervised dataset distillation – by utilizing predetermined augmentations; 3) we further leverage a lightweight network to model the connections among the representations of augmented views from the same image, leading to more compact pairs of distillation. Extensive experiments conducted on various datasets validate the superiority of our approach in terms of distillation efficiency, cross-architecture generalization, and transfer learning performance.
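A minimal sketch of the first technique: a distilled image set expressed as learnable coefficients over a shared low-dimensional basis, so that optimizing coefficients and basis jointly synthesizes the exemplars. Basis size, resolution, and initialization below are assumptions for illustration; the paper additionally parameterizes representations and studies basis selection.

```python
import torch
import torch.nn as nn

class BasisImages(nn.Module):
    """A distilled set of N images expressed as learnable coefficients
    over a shared low-dimensional basis (illustrative sizes)."""

    def __init__(self, n_images=100, n_basis=64, channels=3, size=32):
        super().__init__()
        self.basis = nn.Parameter(0.01 * torch.randn(n_basis, channels, size, size))
        self.coeffs = nn.Parameter(0.01 * torch.randn(n_images, n_basis))

    def forward(self):
        flat = self.basis.flatten(1)                   # (B, C*H*W)
        return (self.coeffs @ flat).view(-1, *self.basis.shape[1:])

print(BasisImages()().shape)  # torch.Size([100, 3, 32, 32])
```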

[CV-72] Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation AAAI2026

[Quick Read]: This paper addresses shortcomings in Vision Language Navigation (VLN) where overly detailed scene representations and ambiguous vision-language alignment weaken semantic understanding, degrading navigation decisions and instruction following. The key solution couples Recursive Visual Imagination (RVI) with Adaptive Linguistic Grounding (ALG): RVI structurally models historical trajectories as compact neural grids, encouraging the agent to focus on the regularity of visual transitions and semantic scene layouts rather than being misled by geometric details, while ALG purposefully aligns the learned situational memories with different linguistic components, achieving fine-grained semantic matching that improves the anticipation of navigation actions and progress. The resulting policy outperforms state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, validating the effectiveness of RVI and ALG.

Link: https://arxiv.org/abs/2507.21450
Authors: Bolei Chen,Jiaxu Kang,Yifei Wang,Ping Zhong,Qi Wu,Jianxin Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to AAAI 2026

Abstract:Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.

[CV-73] Dual Cross-image Semantic Consistency with Self-aware Pseudo Labeling for Semi-supervised Medical Image Segmentation

[Quick Read]: This paper addresses the performance bottleneck of semi-supervised medical image segmentation under limited annotations: existing pseudo-labeling approaches enforce pixel-level consistency but overlook higher-level semantic consistency (e.g., at the object-region level), and the imbalance between labeled and unlabeled data causes severe feature discrepancy. The proposed Dual Cross-image Semantic Consistency (DuCiSC) framework introduces dual paradigms that enforce region-level semantic consistency by explicitly aligning prototypes (1) across labeled and unlabeled images and (2) across labeled and fused images, effectively resolving the feature discrepancy issue. A self-aware confidence estimation strategy further selects reliable pseudo labels, allowing the training dynamics of unlabeled data to be fully exploited and leading to markedly better segmentation across four datasets.

Link: https://arxiv.org/abs/2507.21440
Authors: Han Wu,Chong Wang,Zhiming Cui
Affiliations: ShanghaiTech University (上海科技大学); Lingang Laboratory (临港实验室); Stanford University (斯坦福大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE TMI

Abstract:Semi-supervised learning has proven highly effective in tackling the challenge of limited labeled training data in medical image segmentation. In general, current approaches, which rely on intra-image pixel-wise consistency training via pseudo-labeling, overlook the consistency at more comprehensive semantic levels (e.g., object region) and suffer from severe discrepancy of extracted features resulting from an imbalanced number of labeled and unlabeled data. To overcome these limitations, we present a new Dual Cross-image Semantic Consistency (DuCiSC) learning framework for semi-supervised medical image segmentation. Concretely, beyond enforcing pixel-wise semantic consistency, DuCiSC proposes dual paradigms to encourage region-level semantic consistency across: 1) labeled and unlabeled images; and 2) labeled and fused images, by explicitly aligning their prototypes. Relying on the dual paradigms, DuCiSC can effectively establish consistent cross-image semantics via prototype representations, thereby addressing the feature discrepancy issue. Moreover, we devise a novel self-aware confidence estimation strategy to accurately select reliable pseudo labels, allowing for exploiting the training dynamics of unlabeled data. Our DuCiSC method is extensively validated on four datasets, including two popular binary benchmarks in segmenting the left atrium and pancreas, a multi-class Automatic Cardiac Diagnosis Challenge dataset, and a challenging scenario of segmenting the inferior alveolar nerve that features complicated anatomical structures, showing superior segmentation results over previous state-of-the-art approaches. Our code is publicly available at this https URL.
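As a sketch of the prototype-alignment idea, the snippet below builds one prototype per class by masked average pooling and pulls the prototypes of two image sets together with a cosine loss. This is a generic construction under our own assumptions; DuCiSC's exact losses and fusion scheme may differ.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, masks, num_classes, eps=1e-6):
    """Masked average pooling: one prototype per class.
    feats: (B, C, H, W) features; masks: (B, H, W) integer class labels
    (pseudo-labels for the unlabeled images)."""
    protos = []
    for c in range(num_classes):
        m = (masks == c).float().unsqueeze(1)           # (B, 1, H, W)
        protos.append((feats * m).sum(dim=(0, 2, 3)) / (m.sum() + eps))
    return torch.stack(protos)                           # (num_classes, C)

def prototype_alignment_loss(protos_a, protos_b):
    """Pull the per-class prototypes of two image sets together."""
    return (1 - F.cosine_similarity(protos_a, protos_b, dim=-1)).mean()
```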

[CV-74] MapDiffusion: Generative Diffusion for Vectorized Online HD Map Construction and Uncertainty Estimation in Autonomous Driving IROS2025

[Quick Read]: This paper addresses the limitation that traditional online HD map construction produces deterministic point estimates, failing to capture uncertainty and real-world ambiguities such as occlusions and missing lane markings. MapDiffusion takes a generative, diffusion-based approach that learns the full distribution of possible vectorized maps: instead of predicting a single deterministic output from learned queries, it iteratively refines randomly initialized queries conditioned on a BEV latent grid to generate multiple plausible map samples. Aggregating these samples improves prediction accuracy (5% over the baseline in single-sample performance on nuScenes), and the sample spread yields uncertainty estimates that correlate directly with scene ambiguity, being significantly higher in occluded areas.

Link: https://arxiv.org/abs/2507.21423
Authors: Thomas Monninger,Zihan Zhang,Zhipeng Mo,Md Zafar Anwar,Steffen Staab,Sihao Ding
Affiliations: Mercedes-Benz Research & Development North America (梅赛德斯-奔驰北美研发公司); University of Stuttgart, Institute for Artificial Intelligence (斯图加特大学人工智能研究所); University of California, San Diego (加州大学圣地亚哥分校); University of Southampton (南安普顿大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted for 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

Abstract:Autonomous driving requires an understanding of the static environment from sensor data. Learned Bird’s-Eye View (BEV) encoders are commonly used to fuse multiple inputs, and a vector decoder predicts a vectorized map representation from the latent BEV grid. However, traditional map construction models provide deterministic point estimates, failing to capture uncertainty and the inherent ambiguities of real-world environments, such as occlusions and missing lane markings. We propose MapDiffusion, a novel generative approach that leverages the diffusion paradigm to learn the full distribution of possible vectorized maps. Instead of predicting a single deterministic output from learned queries, MapDiffusion iteratively refines randomly initialized queries, conditioned on a BEV latent grid, to generate multiple plausible map samples. This allows aggregating samples to improve prediction accuracy and deriving uncertainty estimates that directly correlate with scene ambiguity. Extensive experiments on the nuScenes dataset demonstrate that MapDiffusion achieves state-of-the-art performance in online map construction, surpassing the baseline by 5% in single-sample performance. We further show that aggregating multiple samples consistently improves performance along the ROC curve, validating the benefit of distribution modeling. Additionally, our uncertainty estimates are significantly higher in occluded areas, reinforcing their value in identifying regions with ambiguous sensor input. By modeling the full map distribution, MapDiffusion enhances the robustness and reliability of online vectorized HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles in complex environments.
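The aggregation step can be sketched as follows: draw several diffusion samples of the vectorized map and use their mean as the prediction and their spread as an uncertainty estimate. The hypothetical `sample_fn` and the assumption that sampled point sets share a consistent ordering are ours, not details from the paper.

```python
import torch

@torch.no_grad()
def aggregate_map_samples(sample_fn, bev_latent, num_samples=8):
    """Draw several diffusion samples of a vectorized map and aggregate.
    `sample_fn(bev_latent)` is a hypothetical sampler returning a (P, 2)
    point set; corresponding point ordering across samples is assumed."""
    samples = torch.stack([sample_fn(bev_latent) for _ in range(num_samples)])
    mean_map = samples.mean(dim=0)                   # aggregated prediction, (P, 2)
    uncertainty = samples.std(dim=0).norm(dim=-1)    # per-point spread, (P,)
    return mean_map, uncertainty
```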

[CV-75] Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View ICCV2025

[Quick Read]: This paper addresses the problem of generating immersive 360° indoor panoramas from 2D top-down views, which is challenging due to the absence of explicit 3D structure and the need for geometric consistency and photorealism. Top2Pano is an end-to-end model whose key design is to first estimate volumetric occupancy to infer 3D structure, use volumetric rendering to produce coarse color and depth panoramas, and then apply a ControlNet-based diffusion refinement stage to enhance realism and structural fidelity. Evaluations on two datasets show Top2Pano outperforms baselines in reconstructing geometry, occlusions, and spatial arrangements, and it generalizes well, producing high-quality panoramas even from schematic floorplans.

Link: https://arxiv.org/abs/2507.21371
Authors: Zitong Zhang,Suranjan Gautam,Rui Yu
Affiliations: University of Louisville (路易斯维尔大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL

Abstract:Generating immersive 360° indoor panoramas from 2D top-down views has applications in virtual reality, interior design, real estate, and robotics. This task is challenging due to the lack of explicit 3D structure and the need for geometric consistency and photorealism. We propose Top2Pano, an end-to-end model for synthesizing realistic indoor panoramas from top-down views. Our method estimates volumetric occupancy to infer 3D structures, then uses volumetric rendering to generate coarse color and depth panoramas. These guide a diffusion-based refinement stage using ControlNet, enhancing realism and structural fidelity. Evaluations on two datasets show Top2Pano outperforms baselines, effectively reconstructing geometry, occlusions, and spatial arrangements. It also generalizes well, producing high-quality panoramas from schematic floorplans. Our results highlight Top2Pano’s potential in bridging top-down views with immersive indoor synthesis.

[CV-76] Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation ICCV2025

[Quick Read]: This paper addresses the severe performance drop of domain-generalized semantic segmentation (DGSS) under domain shifts in unseen environments. Existing methods improve feature alignment by projecting features into the source domain but neglect intrinsic latent domain priors, limiting generalization. The proposed Probabilistic Diffusion Alignment Framework (PDAF) introduces a Latent Domain Prior (LDP) to capture domain shifts and uses it as a conditioning factor to align source and unseen target domains. PDAF integrates into a pre-trained segmentation model, uses paired source and pseudo-target images to simulate latent domain shifts, and comprises three modules: a Latent Prior Extractor (LPE) that predicts the LDP by supervising domain shifts, a Domain Compensation Module (DCM) that adjusts feature representations to mitigate shifts, and a Diffusion Prior Estimator (DPE) that leverages a diffusion process to estimate the LDP without paired samples. This design iteratively models domain shifts and progressively refines feature representations, improving generalization across diverse and challenging urban scenes.

Link: https://arxiv.org/abs/2507.21367
Authors: I-Hsiang Chen,Hua-En Chang,Wei-Ting Chen,Jenq-Neng Hwang,Sy-Yen Kuo
Affiliations: National Taiwan University (台湾大学); University of Washington (华盛顿大学); Microsoft (微软); Chang Gung University (长庚大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025

Abstract:Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre-trained segmentation model and utilizes paired source and pseudo-target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban scenes.

[CV-77] Evaluating Deep Learning Models for African Wildlife Image Classification: From DenseNet to Vision Transformers

[Quick Read]: Against the backdrop of sharply declining African wildlife populations, this paper studies how deep learning can enable efficient, accurate automatic classification of wildlife images to support biodiversity monitoring and conservation. The key approach is transfer learning with frozen feature extractors: on a public dataset of four species (buffalo, elephant, rhinoceros, and zebra), the study compares DenseNet-201, ResNet-152, EfficientNet-B4, and ViT-H/14. DenseNet-201 performs best among the CNNs (67% accuracy), while ViT-H/14 reaches the highest overall accuracy (99%) at a much higher computational cost that raises deployment concerns. The best CNN is deployed to a Hugging Face Gradio Space for real-time field use, demonstrating the feasibility of lightweight models in conservation settings and offering practical guidance on model selection, dataset preparation, and responsible deployment for African-grounded AI research.

Link: https://arxiv.org/abs/2507.21364
Authors: Lukman Jibril Aliyu,Umar Sani Muhammad,Bilqisu Ismail,Nasiru Muhammad,Almustapha A Wakili,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad,Mustapha Abdullahi
Affiliations: Arewa Data Science Academy (阿雷瓦数据科学学院); Azman University (阿兹曼大学); Towson University (托兹大学); Universität Hamburg (汉堡大学); Imperial College London (帝国理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as a camera-ready paper at Deep Learning Indaba 2025 (Kigali, Rwanda)

Abstract:Wildlife populations in Africa face severe threats, with vertebrate numbers declining by over 65% in the past five decades. In response, image classification using deep learning has emerged as a promising tool for biodiversity monitoring and conservation. This paper presents a comparative study of deep learning models for automatically classifying African wildlife images, focusing on transfer learning with frozen feature extractors. Using a public dataset of four species: buffalo, elephant, rhinoceros, and zebra; we evaluate the performance of DenseNet-201, ResNet-152, EfficientNet-B4, and Vision Transformer ViT-H/14. DenseNet-201 achieved the best performance among convolutional networks (67% accuracy), while ViT-H/14 achieved the highest overall accuracy (99%), but with significantly higher computational cost, raising deployment concerns. Our experiments highlight the trade-offs between accuracy, resource requirements, and deployability. The best-performing CNN (DenseNet-201) was integrated into a Hugging Face Gradio Space for real-time field use, demonstrating the feasibility of deploying lightweight models in conservation settings. This work contributes to African-grounded AI research by offering practical insights into model selection, dataset preparation, and responsible deployment of deep learning tools for wildlife conservation.
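A minimal sketch of the transfer-learning setup described above: DenseNet-201 with a frozen backbone and a fresh four-class head. The weights enum assumes a recent torchvision; the head and training details are our assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision import models

def build_frozen_densenet(num_classes=4):
    """DenseNet-201 as a frozen feature extractor with a new linear head
    for the four species (buffalo, elephant, rhinoceros, zebra)."""
    model = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False                  # freeze the backbone
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model                                 # train only model.classifier
```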

[CV-78] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

[Quick Read]: This paper addresses the problem that vision-based bird's-eye-view (BEV) 3D object detectors construct BEV representations by collapsing extracted object features while neglecting intrinsic environmental structures such as roads and pavements (spatial occupancy), preventing comprehensive perception of the physical world. The proposed multi-task framework, Collaborative Perceiver (CoP), uses spatial occupancy as auxiliary information to mine the structural and conceptual similarities shared between 3D detection and occupancy prediction. Its key components are: (1) a pipeline that generates dense occupancy ground truths incorporating local density information (LDO) to reconstruct detailed environmental information; (2) a voxel-height-guided sampling (VHS) strategy that distills fine-grained local features according to distinct object properties; and (3) a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between the two tasks, composing more robust BEV representations. On nuScenes, CoP reaches 49.5% mAP and 59.2% NDS on the test set, outperforming existing vision-based frameworks.

Link: https://arxiv.org/abs/2507.21358
Authors: Jicheng Yuan,Manh Nguyen Duc,Qian Liu,Manfred Hauswirth,Danh Le Phuoc
Affiliations: Technische Universität Berlin (柏林工业大学); Fraunhofer FOKUS (弗劳恩霍夫协会信息通信技术研究所)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at this link this https URL.

[CV-79] Group Relative Augmentation for Data Efficient Action Detection

[Quick Read]: This paper addresses overfitting and the granularity mismatch between scene-level pre-training and person-centric understanding when adapting a large video-language model (VLM) to action detection from only a few examples. The key ingredients are: first, a learnable internal feature augmentation applied within the frozen VLM backbone via FiLM, combined with parameter-efficient LoRA tuning, which generates diverse, task-relevant feature variations to improve few-shot adaptation; and second, a group-weighted loss that dynamically modulates each augmented sample's training contribution according to its prediction divergence from the group average, prioritizing informative yet reasonable augmentations for robust learning. The method achieves strong mAP on complex multi-label, multi-person action detection datasets (AVA, MOMA) with significant data efficiency.

Link: https://arxiv.org/abs/2507.21353
Authors: Deep Anil Patel,Iain Melvin,Zachary Izzo,Martin Renqiang Min
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Adapting large Video-Language Models (VLMs) for action detection using only a few examples poses challenges like overfitting and the granularity mismatch between scene-level pre-training and required person-centric understanding. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation. Applied within the frozen VLM backbone using FiLM, these augmentations generate diverse feature variations directly relevant to the task. Additionally, we introduce a group-weighted loss function that dynamically modulates the training contribution of each augmented sample based on its prediction divergence relative to the group average. This promotes robust learning by prioritizing informative yet reasonable augmentations. We demonstrate our method’s effectiveness on complex multi-label, multi-person action detection datasets (AVA, MOMA), achieving strong mAP performance and showcasing significant data efficiency for adapting VLMs from limited examples.
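The two ingredients can be sketched as follows: a FiLM-style learnable feature augmentation that produces several modulated views of a feature vector, and a group-weighted loss that down-weights views whose predictions diverge most from the group average. The exact modulation and weighting functions are not given in the abstract, so the forms below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMAugment(nn.Module):
    """Learnable feature-wise modulation, y = gamma * x + beta, with one
    (gamma, beta) pair per augmentation view."""

    def __init__(self, dim, n_views=4):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_views, dim))
        self.beta = nn.Parameter(torch.zeros(n_views, dim))

    def forward(self, x):  # x: (B, D)
        return self.gamma[:, None, :] * x[None] + self.beta[:, None, :]  # (V, B, D)

def group_weighted_loss(view_logits, labels):
    """Weight each view's loss by how little its predictions diverge
    from the group-average prediction (illustrative weighting)."""
    probs = view_logits.softmax(dim=-1)                                     # (V, B, K)
    div = (probs - probs.mean(dim=0, keepdim=True)).abs().sum(-1).mean(-1)  # (V,)
    weights = torch.softmax(-div.detach(), dim=0)     # calmer views weigh more
    losses = torch.stack([F.cross_entropy(v, labels) for v in view_logits])
    return (weights * losses).sum()
```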

[CV-80] Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging

[Quick Read]: This paper addresses the long acquisition times of MRI, which raise costs and reduce patient comfort. Prior work shows that information from a subject's previous MRI scans can improve reconstruction of the current scan, but integrating it typically requires time-consuming traditional registration, limiting clinical practicality. The key innovation is a deep-learning reconstruction framework that combines an initial reconstruction network, a deep registration model, and a transformer-based enhancement network, preserving image quality while substantially reducing reconstruction time. On a longitudinal T1-weighted dataset at four acceleration factors, the method outperforms existing approaches, improves downstream brain segmentation accuracy and volumetric agreement, and markedly reduces total reconstruction time, making it better suited to real-time clinical use.

Link: https://arxiv.org/abs/2507.21349
Authors: Amirmohammad Shamaei,Alexander Stebner,Salome (Lou) Bosshart,Johanna Ospel,Gouri Ginde,Mariana Bento,Roberto Souza
Affiliations: University of Calgary (卡尔加里大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs, and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach’s superiority over existing methods (p 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at this https URL.

[CV-81] Analyzing the Sensitivity of Vision Language Models in Visual Question Answering

[Quick Read]: This paper asks whether vision-language models (VLMs) exhibit human-like understanding and response when conversations violate Grice's cooperative maxims. The key methodology is to add modifiers to human-crafted questions so as to simulate floutings of Grice's maxims (quantity, quality, relation, and manner) and to systematically evaluate how three state-of-the-art VLMs (GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash) respond on the VQA v2.0 dataset. Initial results indicate that VLM performance consistently diminishes as modifiers are added, revealing notable limitations in handling pragmatically non-literal or flouted phrasing and offering a promising evaluation paradigm for probing the robustness of VLM understanding.

Link: https://arxiv.org/abs/2507.21335
Authors: Monika Shah,Sudarshan Balaji,Somdeb Sarkhel,Sanorita Dey,Deepak Venugopal
Affiliations: University of Memphis (孟菲斯大学); Adobe Research (Adobe 研究院); University of Maryland Baltimore County (马里兰大学巴尔的摩县分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of cooperative principles of conversation proposed by Grice. Specifically, even when Grice’s maxims of conversation are flouted, humans typically do not have much difficulty in understanding the conversation even though it requires more cognitive effort. Here, we study if VLMs are capable of handling violations to Grice’s maxims in a manner that is similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the response of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely, GPT-4o, Claude-3.5-Sonnet and Gemini-1.5-Flash on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently diminish with the addition of modifiers which indicates our approach as a promising direction to understand the limitations of VLMs.

[CV-82] GLCP: Global-to-Local Connectivity Preservation for Tubular Structure Segmentation MICCAI2025

[Quick Read]: This paper addresses structural fragmentation in tubular-structure segmentation (e.g., vascular networks), which can severely harm downstream applications. Existing methods design loss functions to constrain global topology but often overlook locally discontinuous regions, yielding suboptimal results. The proposed Global-to-Local Connectivity Preservation (GLCP) framework contributes: (1) an Interactive Multi-head Segmentation (IMS) module that jointly learns the global segmentation map, skeleton map, and local discontinuity map, explicitly targeting and repairing local discontinuities while maintaining global topological integrity; and (2) a lightweight Dual-Attention-based Refinement (DAR) module that further refines the resulting segmentation. On both 2D and 3D datasets, GLCP achieves superior accuracy and continuity compared with several state-of-the-art approaches.

Link: https://arxiv.org/abs/2507.21328
Authors: Feixiang Zhou,Zhuangzhi Gao,He Zhao,Jianyang Xie,Yanda Meng,Yitian Zhao,Gregory Y.H. Lip,Yalin Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025 (Oral)

Abstract:Accurate segmentation of tubular structures, such as vascular networks, plays a critical role in various medical domains. A remaining significant challenge in this task is structural fragmentation, which can adversely impact downstream applications. Existing methods primarily focus on designing various loss functions to constrain global topological structures. However, they often overlook local discontinuity regions, leading to suboptimal segmentation results. To overcome this limitation, we propose a novel Global-to-Local Connectivity Preservation (GLCP) framework that can simultaneously perceive global and local structural characteristics of tubular networks. Specifically, we propose an Interactive Multi-head Segmentation (IMS) module to jointly learn global segmentation, skeleton maps, and local discontinuity maps, respectively. This enables our model to explicitly target local discontinuity regions while maintaining global topological integrity. In addition, we design a lightweight Dual-Attention-based Refinement (DAR) module to further improve segmentation quality by refining the resulting segmentation maps. Extensive experiments on both 2D and 3D datasets demonstrate that our GLCP achieves superior accuracy and continuity in tubular structure segmentation compared to several state-of-the-art approaches. The source codes will be available at this https URL.

[CV-83] VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction

[Quick Read]: This paper addresses real-time 3D reconstruction for videoconferencing: producing a live, high-quality 3D Gaussian representation from a single 2D webcam feed that stays faithful to the input video, thereby improving copresence and immersion in remote meetings. Existing solutions rely on complex hardware, fixed appearance via enrolment, or inversion of pre-trained generative models, all of which constrain videoconferencing use. The key idea is authenticity through per-frame conditioning: by conditioning the 3D representation on each video frame independently, the reconstruction faithfully recreates the input video from the captured viewpoint while generalizing realistically to novel viewpoints; a stability loss additionally enforces temporally stable reconstructions on video sequences. The method achieves state-of-the-art visual quality and stability and enables real-time, realistic, and authentic 3D meetings using only a standard 2D camera and display.

Link: https://arxiv.org/abs/2507.21311
Authors: Martin de La Gorce,Charlie Hewitt,Tibor Takacs,Robert Gerdisch,Zafiirah Hosenie,Givi Meishvili,Marek Kowalski,Thomas J. Cashman,Antonio Criminisi
Affiliations: Microsoft (微软)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.
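The paper does not specify the form of the stability loss, so a minimal reading is sketched below: an L1 penalty on frame-to-frame changes of the predicted Gaussian parameters, with the previous frame's prediction treated as a fixed target. The tensor layout is an assumption.

```python
import torch

def stability_loss(params_t: torch.Tensor, params_prev: torch.Tensor) -> torch.Tensor:
    """L1 penalty on frame-to-frame changes of the predicted Gaussian
    parameters (positions, scales, colors, ...); the previous frame's
    prediction is detached so it acts as a fixed target."""
    return (params_t - params_prev.detach()).abs().mean()
```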

[CV-84] Fairness and Robustness of CLIP-Based Models for Chest X-rays MICCAI

[Quick Read]: This paper examines the underexplored fairness and robustness of CLIP-based models for chest X-ray classification: although such models show good accuracy and discriminative performance, their fairness across sensitive attributes (age, sex, race) and their robustness to shortcut learning remain unclear. The key methodology is a systematic evaluation of six widely used CLIP-based models on three public datasets (MIMIC-CXR, NIH-CXR14, and NEATX), quantifying fairness across six conditions and patient subgroups, and testing reliance on spurious cues by comparing pneumothorax cases with and without chest drains. Results show performance gaps across age groups, with more equitable results for the other attributes, and lower performance on images without chest drains for all models, indicating reliance on shortcut features. Embedding analysis further shows that sensitive attributes can be classified from the embeddings even though PCA fails to visualize such patterns, exposing the limitations of these visualization techniques for assessing model fairness.

Link: https://arxiv.org/abs/2507.21291
Authors: Théo Sourget,David Restrepo,Céline Hudelot,Enzo Ferrante,Stergios Christodoulidis,Maria Vakalopoulou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the FAIMI MICCAI workshop 2025

Abstract:Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness in the different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess the robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, we do not see such patterns using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at this https URL

[CV-85] HDR Environment Map Estimation with Latent Diffusion Models

[Quick Read]: This paper addresses geometric distortions and seam artefacts in HDR environment map estimation from a single-view image, particularly polar distortion and the side seam of the equirectangular projection (ERP) representation used by most approaches. The key contributions are twofold: an ERP convolutional padding in the latent autoencoder that removes the border seam artefact of the ERP format, and a panoramically-adapted Diffusion Transformer architecture (PanoDiT) that investigates whether adapting the diffusion network to the ERP format improves the estimated environment map. PanoDiT reduces ERP distortions and artefacts, though the authors note this comes at some cost to image quality and plausibility; on standard benchmarks the models estimate high-quality environment maps that are competitive with the state of the art in both image quality and lighting accuracy.

Link: https://arxiv.org/abs/2507.21261
Authors: Jack Hilliard,Adrian Hilton,Jean-Yves Guillemaut
Affiliations: University of Surrey (萨里大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We advance the field of HDR environment map estimation from a single-view image by establishing a novel approach leveraging the Latent Diffusion Model (LDM) to produce high-quality environment maps that can plausibly light mirror-reflective surfaces. A common issue when using the ERP representation, the format used by the vast majority of approaches, is distortions at the poles and a seam at the sides of the environment map. We remove the border seam artefact by proposing an ERP convolutional padding in the latent autoencoder. Additionally, we investigate whether adapting the diffusion network architecture to the ERP format can improve the quality and accuracy of the estimated environment map by proposing a panoramically-adapted Diffusion Transformer architecture. Our proposed PanoDiT network reduces ERP distortions and artefacts, but at the cost of image quality and plausibility. We evaluate with standard benchmarks to demonstrate that our models estimate high-quality environment maps that perform competitively with state-of-the-art approaches in both image quality and lighting accuracy.
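The ERP padding idea is straightforward to sketch: since the left and right edges of an equirectangular image are physically adjacent, horizontal padding should wrap around rather than zero-fill, which removes the border seam. The module below shows this for a single convolution; the paper applies the padding inside the latent autoencoder, and the module design here is our own illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ERPConv2d(nn.Module):
    """Conv2d whose horizontal padding wraps around, matching the fact
    that the left and right edges of an ERP image are adjacent; the top
    and bottom are zero-padded as usual."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.p = k // 2
        self.conv = nn.Conv2d(in_ch, out_ch, k)

    def forward(self, x):  # x: (B, C, H, W)
        x = F.pad(x, (self.p, self.p, 0, 0), mode="circular")  # wrap left/right
        x = F.pad(x, (0, 0, self.p, self.p))                   # zeros top/bottom
        return self.conv(x)
```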

[CV-86] Tracking Moose using Aerial Object Detection

[Quick Read]: This paper addresses the challenge of UAV-based wildlife monitoring under tight computational budgets: crewed aircraft are expensive, risky, and disruptive to wildlife, while autonomous drones have limited onboard capacity for deep models that must detect tiny ground targets, often only a few pixels wide. The key idea is a patching augmentation: datasets are patched at varying scales and settings, and three architecturally diverse object detectors are systematically evaluated across configurations, varying the patching hyperparameters against detection accuracy. Each model reaches at least 93% mAP@IoU=0.5 on at least one patching configuration, and statistical analyses show that faster, simpler models are about as effective as heavier ones for this task, supporting the deployment of AI-driven small-object detection on lightweight UAV platforms.

Link: https://arxiv.org/abs/2507.21256
Authors: Christopher Indris,Raiyan Rahman,Goetz Bramesfeld,Guanghui Wang
Affiliations: Toronto Metropolitan University (多伦多都会大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 6 figures, 8 tables

Abstract:Aerial wildlife tracking is critical for conservation efforts and relies on detecting small objects on the ground below the aircraft. It presents technical challenges: crewed aircraft are expensive, risky and disruptive; autonomous drones have limited computational capacity for onboard AI systems. Since the objects of interest may appear only a few pixels wide, small object detection is an inherently challenging computer vision subfield compounded by computational efficiency needs. This paper applies a patching augmentation to datasets to study model performance under various settings. A comparative study of three common yet architecturally diverse object detectors is conducted using the data, varying the patching method’s hyperparameters against detection accuracy. Each model achieved at least 93% mAP@IoU=0.5 on at least one patching configuration. Statistical analyses provide an in-depth commentary on the effects of various factors. Analysis also shows that faster, simpler models are about as effective as models that require more computational power for this task and perform well given limited patch scales, encouraging UAV deployment. Datasets and models will be made available via this https URL.
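A minimal sketch of a patching augmentation: tile a large aerial image into overlapping patches and remap ground-truth boxes into each patch's coordinate frame. Patch size, overlap, and the box-clipping rule are illustrative choices, not the paper's settings, and a production version would also cover the image borders explicitly.

```python
import numpy as np

def make_patches(image, boxes, patch=640, overlap=0.2):
    """Tile an H x W x C aerial image into overlapping patches and remap
    boxes (x1, y1, x2, y2) into each patch's coordinate frame."""
    h, w = image.shape[:2]
    stride = max(int(patch * (1 - overlap)), 1)
    out = []
    for y0 in range(0, max(h - patch, 0) + 1, stride):
        for x0 in range(0, max(w - patch, 0) + 1, stride):
            crop = image[y0:y0 + patch, x0:x0 + patch]
            kept = []
            for x1, y1, x2, y2 in boxes:
                # Clip the box to the patch; keep it if any area survives.
                cx1, cy1 = max(x1 - x0, 0), max(y1 - y0, 0)
                cx2, cy2 = min(x2 - x0, patch), min(y2 - y0, patch)
                if cx2 > cx1 and cy2 > cy1:
                    kept.append((cx1, cy1, cx2, cy2))
            out.append((crop, kept))
    return out

tiles = make_patches(np.zeros((2000, 3000, 3), dtype=np.uint8), [(100, 150, 180, 220)])
```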

[CV-87] Dual Guidance Semi-Supervised Action Detection

[Quick Read]: This paper addresses how to improve spatio-temporal action localization when annotated data are scarce, with the central challenge of exploiting unlabeled data to enforce consistent action understanding. The key contribution is a dual guidance network that combines frame-level classification with bounding-box prediction to enforce action-class consistency across frames and boxes, enabling more reliable selection of high-quality pseudo-bounding boxes and substantially improving performance in limited-label settings. Evaluations on UCF101-24, J-HMDB-21, and AVA show superior results compared to extended image-based semi-supervised baselines.

Link: https://arxiv.org/abs/2507.21247
Authors: Ankit Singh,Efstratios Gavves,Cees G. M. Snoek,Hilde Kuehne
Affiliations: IIT Madras (印度理工学院马德拉斯分校); University of Amsterdam (阿姆斯特丹大学); University of Tuebingen (图宾根大学); MIT-IBM Watson AI Lab (麻省理工学院-IBM沃森人工智能实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-Supervised Learning (SSL) has shown tremendous potential to improve the predictive performance of deep learning models when annotations are hard to obtain. However, the application of SSL has so far been mainly studied in the context of image classification. In this work, we present a semi-supervised approach for spatial-temporal action localization. We introduce a dual guidance network to select better pseudo-bounding boxes. It combines a frame-level classification with a bounding-box prediction to enforce action class consistency across frames and boxes. Our evaluation across well-known spatial-temporal action localization datasets, namely UCF101-24, J-HMDB-21, and AVA, shows that the proposed module considerably enhances the model's performance in limited labeled data settings. Our framework achieves superior results compared to extended image-based semi-supervised baselines.
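The dual-guidance selection can be sketched as a simple agreement test between the frame-level classifier and the per-box predictions: a pseudo-box is kept only when both heads agree on the class and are confident. Thresholds and tensor shapes below are assumptions, not the paper's exact criteria.

```python
import torch

def select_pseudo_boxes(frame_probs, box_probs, boxes, tau=0.8):
    """Keep a box only when its class matches the frame-level prediction
    and both heads are confident.

    frame_probs: (K,)   frame-level class probabilities
    box_probs:   (N, K) per-box class probabilities
    boxes:       (N, 4) box coordinates
    """
    frame_cls = frame_probs.argmax()
    conf, box_cls = box_probs.max(dim=-1)
    keep = (box_cls == frame_cls) & (conf > tau) & (frame_probs[frame_cls] > tau)
    return boxes[keep], box_cls[keep]
```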

[CV-88] On Explaining Visual Captioning with Hybrid Markov Logic Networks

[Quick Read]: This paper addresses the lack of interpretability of deep neural networks (DNNs) in multimodal tasks such as image captioning, i.e., explaining how models integrate visual information, linguistic information, and knowledge representation to produce meaningful captions. The key contribution is a readily interpretable explanation framework based on Hybrid Markov Logic Networks (HMLNs), a language that combines symbolic rules with real-valued functions. The framework learns an HMLN distribution over training instances and infers the shift in that distribution when conditioning on a generated sample, thereby quantifying which training examples may have been a richer source of information for the observed caption. Experiments with Amazon Mechanical Turk on captions from several state-of-the-art captioning models illustrate the interpretability of the explanations and allow the models to be compared along the dimension of explainability.

Link: https://arxiv.org/abs/2507.21246
Authors: Monika Shah,Somdeb Sarkhel,Deepak Venugopal
Affiliations: University of Memphis (孟菲斯大学); Adobe (Adobe公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Neural Networks (DNNs) have made tremendous progress in multimodal tasks such as image captioning. However, explaining/interpreting how these models integrate visual information, language information and knowledge representation to generate meaningful captions remains a challenging problem. Standard metrics to measure performance typically rely on comparing generated captions with human-written ones that may not provide a user with a deep insights into this integration. In this work, we develop a novel explanation framework that is easily interpretable based on Hybrid Markov Logic Networks (HMLNs) - a language that can combine symbolic rules with real-valued functions - where we hypothesize how relevant examples from the training data could have influenced the generation of the observed caption. To do this, we learn a HMLN distribution over the training instances and infer the shift in distributions over these instances when we condition on the generated sample which allows us to quantify which examples may have been a source of richer information to generate the observed caption. Our experiments on captions generated for several state-of-the-art captioning models using Amazon Mechanical Turk illustrate the interpretability of our explanations, and allow us to compare these models along the dimension of explainability.

[CV-89] Learning from Limited and Imperfect Data

[Quick Read]: This thesis addresses the degraded performance of deep models on real-world data distributions, especially long-tailed distributions and limited annotation, where algorithms designed for well-curated datasets fall short in generation diversity, generalization, and cross-domain adaptability. The key contributions form a set of practical algorithmic frameworks in four parts: first, training generative models on long-tailed data while mitigating mode-collapse, improving image diversity for tail (minority) classes; second, inductive regularization schemes that let tail classes generalize as effectively as head classes without requiring explicit image generation; third, semi-supervised algorithms that optimize relevant metrics for long-tailed data under scarce annotation; and fourth, efficient domain adaptation with very few to zero labeled samples. Together these advance robust learning of deep neural networks from limited and imperfect real-world data.

Link: https://arxiv.org/abs/2507.21205
Authors: Harsh Rangwani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD Thesis

Abstract:The distribution of data in the world (eg, internet, etc.) significantly differs from the well-curated datasets and is often over-populated with samples from common categories. The algorithms designed for well-curated datasets perform suboptimally when used for learning from imperfect datasets with long-tailed imbalances and distribution shifts. To expand the use of deep models, it is essential to overcome the labor-intensive curation process by developing robust algorithms that can learn from diverse, real-world data distributions. Toward this goal, we develop practical algorithms for Deep Neural Networks which can learn from limited and imperfect data present in the real world. This thesis is divided into four segments, each covering a scenario of learning from limited or imperfect data. The first part of the thesis focuses on Learning Generative Models from Long-Tail Data, where we mitigate the mode-collapse and enable diverse aesthetic image generations for tail (minority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as effectively as the head classes without requiring explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics for learning from long-tailed data with limited annotation (semi-supervised), followed by the fourth part, which focuses on the Efficient Domain Adaptation of the model to various domains with very few to zero labeled samples.

[CV-90] PanoGAN: A Deep Generative Model for Panoramic Dental Radiographs

[Quick Read]: This paper addresses the scarcity of imaging data in dental research and education by proposing a GAN-based approach to synthesize dental panoramic radiographs. The key elements are: a deep convolutional GAN (DCGAN) trained with a Wasserstein loss with gradient penalty (WGAN-GP) on 2322 radiographs of varying quality, focusing on the dentoalveolar regions with other anatomical structures cropped out; extensive preprocessing and data cleaning that standardize the inputs while preserving anatomical variability; and a comparison of four candidate models (varying critic iterations, feature depth, and the use of denoising prior to training) to balance the realism and quality of the generated images. Expert evaluation shows that training on non-denoised data yields finer details in structures such as the mandibular canal and trabecular bone, whereas training on denoised data gives superior overall clarity and sharpness, revealing a trade-off between detail fidelity and image quality.

Link: https://arxiv.org/abs/2507.21200
Authors: Soren Pedersen,Sanyam Jain,Mikkel Chavez,Viktor Ladehoff,Bruna Neves de Freitas,Ruben Pauwels
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:This paper presents the development of a generative adversarial network (GAN) for synthesizing dental panoramic radiographs. Although exploratory in nature, the study aims to address the scarcity of data in dental research and education. We trained a deep convolutional GAN (DCGAN) using a Wasserstein loss with gradient penalty (WGAN-GP) on a dataset of 2,322 radiographs of varying quality. The focus was on the dentoalveolar regions; other anatomical structures were cropped out. Extensive preprocessing and data cleaning were performed to standardize the inputs while preserving anatomical variability. We explored four candidate models by varying critic iterations, feature depth, and the use of denoising prior to training. A clinical expert evaluated the generated radiographs based on anatomical visibility and realism, using a 5-point scale (1 = very poor, 5 = excellent). Most images showed moderate anatomical depiction, although some were degraded by artifacts. A trade-off was observed: the model trained on non-denoised data yielded finer details, especially in structures like the mandibular canal and trabecular bone, while a model trained on denoised data offered superior overall image clarity and sharpness. These findings provide a foundation for future work on GAN-based methods in dental imaging.
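
The WGAN-GP objective used above is standard enough to sketch. Below is a minimal PyTorch version of the critic loss; the `critic` module, 4D batch shapes, and penalty weight `lam=10` are illustrative assumptions, not the authors' implementation:

```python
import torch

def gradient_penalty(critic, real, fake):
    # Penalize the critic's gradient norm on radiographs interpolated
    # between real and generated samples (the "GP" in WGAN-GP).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,
    )[0].flatten(start_dim=1)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(critic, real, fake, lam=10.0):
    # Wasserstein critic loss with gradient penalty
    return (critic(fake).mean() - critic(real).mean()
            + lam * gradient_penalty(critic, real, fake))
```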

[CV-91] ChartM3: Benchmarking Chart Editing with Multimodal Instructions

【Quick Read】: This paper addresses the difficulty of fine-grained chart editing with current generative AI, whose reliance on natural-language instructions alone is often too ambiguous to support precise modification of visual elements. The key is a multimodal chart-editing paradigm in which user intent is expressed jointly through natural language and visual indicators that explicitly mark the chart elements to be modified, together with ChartM^3, a benchmark of 1,000 samples spanning multiple levels of complexity with multi-perspective evaluation metrics (covering both visual appearance and code correctness) that systematically assess models' ability to understand and execute multimodal inputs. The authors further construct ChartM^3-Train, a 24,000-sample training set; fine-tuning multimodal large language models (MLLMs) on it substantially improves performance in practical chart-editing scenarios.

Link: https://arxiv.org/abs/2507.21167
Authors: Danglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Institutions: RUCChina
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM^3, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM^3 contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM^3 provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM^3-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at this https URL.

[CV-92] Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues

【Quick Read】: This paper tackles pedestrian intention prediction in complex urban environments, where traditional approaches rely on supervised learning over frame sequences and require extensive retraining to adapt to new scenarios, hindering flexible deployment. The key is BF-PIP, a zero-shot framework built on Gemini 2.5 Pro that infers crossing intention directly from short, continuous video clips enriched with structured JAAD metadata; by incorporating multimodal prompts with bounding-box annotations and ego-vehicle speed, it jointly models temporal continuity and spatial context, reaching 73% prediction accuracy without any additional training and clearly outperforming GPT-4V-based methods that operate on discrete frames.

Link: https://arxiv.org/abs/2507.21161
Authors: Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
Institutions: Texas Tech University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in IEEE 3rd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings 2025)

Click to view abstract

Abstract:Pedestrian intention prediction is essential for autonomous driving in complex urban environments. Conventional approaches depend on supervised learning over frame sequences and require extensive retraining to adapt to new scenarios. Here, we introduce BF-PIP (Beyond Frames Pedestrian Intention Prediction), a zero-shot approach built upon Gemini 2.5 Pro. It infers crossing intentions directly from short, continuous video clips enriched with structured JAAD metadata. In contrast to GPT-4V based methods that operate on discrete frames, BF-PIP processes uninterrupted temporal clips. It also incorporates bounding-box annotations and ego-vehicle speed via specialized multimodal prompts. Without any additional training, BF-PIP achieves 73% prediction accuracy, outperforming a GPT-4V baseline by 18%. These findings illustrate that combining temporal video inputs with contextual cues enhances spatiotemporal perception and improves intent inference under ambiguous conditions. This approach paves the way for agile, retraining-free perception modules in intelligent transportation systems.

[CV-93] Unmasking Synthetic Realities in Generative AI: A Comprehensive Review of Adversarially Robust Deepfake Detection Systems

【Quick Read】: This paper addresses key challenges facing deepfake detection in practice, particularly vulnerability to adversarial perturbations. The solution systematically reviews and evaluates two mainstream detection paradigms: detection of fully synthetic media based on statistical anomalies and hierarchical feature extraction, and localization of manipulated regions in authentic content using multimodal cues (such as visual artifacts and temporal inconsistencies). The study finds that although existing methods show good precision and adaptability in controlled settings, their adversarial robustness is insufficiently validated, undermining reliability in real-world adversarial scenarios. The authors contribute an open-source GitHub repository to support replication and testing, and call for future work to prioritize adversarially resilient, scalable, modality-agnostic architectures for trustworthy deepfake detection systems.

Link: https://arxiv.org/abs/2507.21157
Authors: Naseem Khan, Tuan Nguyen, Amine Bermak, Issa Khalil
Institutions: Hamad bin Khalifa University; Qatar Computing Research Institute
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 4 Tables, 3 Figures

Click to view abstract

Abstract:The rapid advancement of Generative Artificial Intelligence has fueled the proliferation of deepfakes, synthetic media encompassing fully generated content and subtly edited authentic material, posing challenges to digital security, misinformation mitigation, and identity preservation. This systematic review evaluates state-of-the-art deepfake detection methodologies, emphasizing reproducible implementations for transparency and validation. We delineate two core paradigms: (1) detection of fully synthetic media leveraging statistical anomalies and hierarchical feature extraction, and (2) localization of manipulated regions within authentic content employing multi-modal cues such as visual artifacts and temporal inconsistencies. These approaches, spanning uni-modal and multi-modal frameworks, demonstrate notable precision and adaptability in controlled settings, effectively identifying manipulations through advanced learning techniques and cross-modal fusion. However, comprehensive assessment reveals insufficient evaluation of adversarial robustness across both paradigms. Current methods exhibit vulnerability to adversarial perturbations, subtle alterations designed to evade detection, undermining reliability in real-world adversarial contexts. This gap highlights a critical disconnect between methodological development and evolving threat landscapes. To address this, we contribute a curated GitHub repository aggregating open-source implementations, enabling replication and testing. Our findings emphasize the urgent need for future work prioritizing adversarial resilience, advocating scalable, modality-agnostic architectures capable of withstanding sophisticated manipulations. This review synthesizes strengths and shortcomings of contemporary deepfake detection while charting paths toward robust, trustworthy systems.

[CV-94] Page image classification for content-specific data processing

【Quick Read】: This paper addresses the difficulty of manually sorting and analyzing the vast quantities of page images produced by digitization projects of historical documents, especially when content is highly heterogeneous (handwritten, typed, and printed text, figures, tables, and more), making traditional workflows inefficient. The key is the development and evaluation of an image classification system designed specifically for historical document pages, leveraging artificial intelligence and machine learning to automate categorization and thereby enable content-specific downstream processing pipelines (e.g., OCR for text, image analysis for graphics).

Link: https://arxiv.org/abs/2507.21114
Authors: Kateryna Lutsai, Pavel Straňák
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 65 pages, 57 figures, 20 tables

Click to view abstract

Abstract:Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics).

[CV-95] A Tactical Behaviour Recognition Framework Based on Causal Multimodal Reasoning: A Study on Covert Audio-Video Analysis Combining GAN Structure Enhancement and Phonetic Accent Modelling

【Quick Read】: This paper addresses the difficulty of semantic understanding and threat detection in tactical video under high noise and weak structure, where complete threat chains are hard to recognize in complex environments. The key is TACTIC-GRAPHS, which combines spectral graph theory with multimodal graph neural reasoning: spectral embedding, temporal causal edge modeling, and discriminative path inference across heterogeneous modalities jointly model visual, acoustic, and action cues. Using graph attention and Laplacian spectral mapping, the model performs cross-modal weighting and causal signal analysis, achieving 89.3% temporal alignment accuracy and over 85% recognition of complete threat chains on the TACTIC-AVS and TACTIC-Voice datasets, with node latency within ±150 ms, markedly improving structural interpretability and practical applicability.

Link: https://arxiv.org/abs/2507.21100
Authors: Wei Meng
Institutions: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper introduces a structurally innovative and mathematically rigorous framework for multimodal tactical reasoning, offering a significant advance in causal inference and graph-based threat recognition under noisy conditions

Click to view abstract

Abstract:This paper introduces TACTIC-GRAPHS, a system that combines spectral graph theory and multimodal graph neural reasoning for semantic understanding and threat detection in tactical video under high noise and weak structure. The framework incorporates spectral embedding, temporal causal edge modeling, and discriminative path inference across heterogeneous modalities. A semantic-aware keyframe extraction method fuses visual, acoustic, and action cues to construct temporal graphs. Using graph attention and Laplacian spectral mapping, the model performs cross-modal weighting and causal signal analysis. Experiments on the TACTIC-AVS and TACTIC-Voice datasets show 89.3% accuracy in temporal alignment and over 85% recognition of complete threat chains, with node latency within ±150 milliseconds. The approach enhances structural interpretability and supports applications in surveillance, defense, and intelligent security systems.
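
As a point of reference for the spectral step, a Laplacian spectral mapping of a keyframe graph can be sketched in a few lines of NumPy. This is the generic normalized-Laplacian embedding, not the authors' full pipeline:

```python
import numpy as np

def laplacian_spectral_embedding(adj: np.ndarray, dim: int = 8) -> np.ndarray:
    # adj: symmetric (N, N) adjacency over temporal-graph nodes
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    _, vecs = np.linalg.eigh(lap)      # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]          # drop the trivial constant eigenvector
```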

[CV-96] GAITEX: Human motion dataset from impaired gait and rehabilitation exercises of inertial and optical sensor data

【Quick Read】: This paper addresses the bottleneck in developing motion-analysis models based on wearable inertial measurement units (IMUs): the lack of large, diverse datasets. The key is a multimodal dataset of synchronized IMU and marker-based motion capture (MoCap) recordings from 19 participants performing physiotherapy exercises and gait-related tasks, covering correct executions and clinically relevant variants as well as normal and impaired gait patterns. It further provides precise IMU orientation estimates, processed orientations aligned with common segment coordinate systems, subject-specific OpenSim models, inverse kinematics results, and visualization tools, reliably supporting training and benchmarking of machine learning models for automatic exercise evaluation, gait analysis, temporal activity segmentation, and biomechanical parameter estimation.

Link: https://arxiv.org/abs/2507.21069
Authors: Andreas Spilz, Heiko Oppel, Jochen Werner, Kathrin Stucke-Straub, Felix Capanni, Michael Munz
Institutions: Ulm University of Applied Sciences; Biomechatronic Research Group; AI for Sensor Data Analytics Research Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Wearable inertial measurement units (IMUs) offer a cost-effective and scalable means to assess human movement quality in clinical and everyday settings. However, the development of robust sensor-based classification models for physiotherapeutic exercises and gait analysis requires large, diverse datasets, which are costly and time-consuming to collect. Here, we present a multimodal dataset of physiotherapeutic exercises - including correct and clinically relevant variants - and gait-related exercises - including both normal and impaired gait patterns - recorded from 19 participants using synchronized IMUs and marker-based motion capture (MoCap). The dataset includes raw data from nine IMUs and thirty-five optical markers capturing full-body kinematics. Each IMU is additionally equipped with four optical markers, enabling precise comparison between IMU-derived orientation estimates and reference values from the MoCap system. To support further analysis, we also provide processed IMU orientations aligned with common segment coordinate systems, subject-specific OpenSim models, inverse kinematics results, and tools for visualizing IMU orientations in the musculoskeletal context. Detailed annotations of movement execution quality and time-stamped segmentations support diverse analysis goals. This dataset supports the development and benchmarking of machine learning models for tasks such as automatic exercise evaluation, gait analysis, temporal activity segmentation, and biomechanical parameter estimation. To facilitate reproducibility, we provide code for postprocessing, sensor-to-segment alignment, inverse kinematics computation, and technical validation. This resource is intended to accelerate research in machine learning-driven human movement analysis.

[CV-97] Hot-Swap MarkBoard: An Efficient Black-box Watermarking Approach for Large-scale Model Distribution

【Quick Read】: This paper addresses the difficulty of protecting the intellectual property (IP) of deep learning (DL) models distributed at scale in on-device AI scenarios. Existing backdoor-based watermarking techniques mostly target the cloud AI-as-a-Service (AIaaS) setting and cannot satisfy the requirement that each user instance carries a unique watermark; modifying a watermark typically requires retraining, which is inefficient. The key is the proposed Hot-Swap MarkBoard: multiple watermarks are embedded independently into a multi-branch Low-Rank Adaptation (LoRA) module, so user-specific watermarks can be customized quickly via branch switching without retraining; a parameter obfuscation mechanism further entangles the watermark weights with the base model parameters, preventing watermark removal without degrading model performance, while supporting black-box verification and a variety of architectures and task types.

Link: https://arxiv.org/abs/2507.20650
Authors: Zhicheng Zhang, Peizhuo Lv, Mengke Wan, Jiang Fang, Diandian Guo, Yezeng Chen, Yinlong Liu, Wei Ma, Jiyan Sun, Liru Geng
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recently, Deep Learning (DL) models have been increasingly deployed on end-user devices as On-Device AI, offering improved efficiency and privacy. However, this deployment trend poses more serious Intellectual Property (IP) risks, as models are distributed on numerous local devices, making them vulnerable to theft and redistribution. Most existing ownership protection solutions (e.g., backdoor-based watermarking) are designed for cloud-based AI-as-a-Service (AIaaS) and are not directly applicable to large-scale distribution scenarios, where each user-specific model instance must carry a unique watermark. These methods typically embed a fixed watermark, and modifying the embedded watermark requires retraining the model. To address these challenges, we propose Hot-Swap MarkBoard, an efficient watermarking method. It encodes user-specific n-bit binary signatures by independently embedding multiple watermarks into a multi-branch Low-Rank Adaptation (LoRA) module, enabling efficient watermark customization without retraining through branch swapping. A parameter obfuscation mechanism further entangles the watermark weights with those of the base model, preventing removal without degrading model performance. The method supports black-box verification and is compatible with various model architectures and DL tasks, including classification, image generation, and text generation. Extensive experiments across three types of tasks and six backbone models demonstrate our method's superior efficiency and adaptability compared to existing approaches, achieving 100% verification accuracy.
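
The branch-swapping idea can be illustrated with a toy multi-branch LoRA layer. All names and the rank below are hypothetical, and the paper's obfuscation mechanism is omitted:

```python
import torch
import torch.nn as nn

class MultiBranchLoRALinear(nn.Module):
    """Frozen base linear layer plus several swappable low-rank branches.
    Each branch would encode one user-specific watermark signature."""
    def __init__(self, base: nn.Linear, num_branches: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # base weights stay frozen
        self.A = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, base.in_features))
             for _ in range(num_branches)])
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(base.out_features, rank))
             for _ in range(num_branches)])
        self.active = 0                            # hot-swap = change this index

    def forward(self, x):
        a, b = self.A[self.active], self.B[self.active]
        return self.base(x) + x @ a.t() @ b.t()    # low-rank watermark branch
```

Swapping `self.active` changes the embedded signature without touching the base weights, which is the property the paper exploits for retraining-free per-user customization.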

[CV-98] owards Universal Modal Tracking with Online Dense Temporal Token Learning

【速读】:该论文旨在解决多模态视频目标跟踪任务中模型泛化能力弱、训练成本高以及跨模态信息融合效率低的问题。其核心挑战在于如何设计一个统一架构,支持RGB、RGB+热成像、RGB+深度和RGB+事件等多种模态组合的视频跟踪任务,同时保持高效推理与高质量性能。解决方案的关键在于提出了一种基于视频级采样(video-level sampling)与在线密集时间标记关联(online dense temporal token association)的统一框架——\modaltracker,通过引入两个新颖的门控感知器(gated perceivers),利用门控注意力机制自适应学习跨模态表示,并以一次性训练方式压缩至共享参数空间,从而实现多任务推理下的模态可扩展性(modality scalability)。此方法不仅显著提升了模型对历史时序信息的利用效率(如将净化后的token序列作为未来帧的时序提示),还避免了传统多模态跟踪器需独立训练的冗余,有效降低了训练负担并增强了模型表征能力。

链接: https://arxiv.org/abs/2507.20177
作者: Yaozong Zheng,Bineng Zhong,Qihua Liang,Shengping Zhang,Guorong Li,Xianxian Li,Rongrong Ji
机构: Guangxi Normal University (广西师范大学); Harbin Institute of Technology (哈尔滨工业大学); University of Chinese Academy of Sciences (中国科学院大学); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: arXiv admin note: text overlap with arXiv:2401.01686

点击查看摘要

Abstract:We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called \modaltracker). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: Video-level Sampling. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. Modality Scalable. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via a one-shot training manner for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our \modaltracker achieves a new SOTA performance. The code will be available at this https URL.
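
A gated cross-attention block of the kind the abstract describes might look as follows; the dimensions and the exact gating form are guesses for illustration only:

```python
import torch
import torch.nn as nn

class GatedPerceiver(nn.Module):
    """RGB tokens attend to auxiliary-modality tokens (thermal/depth/event);
    a sigmoid gate decides how much fused information to let through."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, rgb_tokens, aux_tokens):
        # rgb_tokens: (B, N, dim); aux_tokens: (B, M, dim)
        fused, _ = self.attn(rgb_tokens, aux_tokens, aux_tokens)
        g = torch.sigmoid(self.gate(torch.cat([rgb_tokens, fused], dim=-1)))
        return rgb_tokens + g * fused   # gated residual cross-modal fusion
```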

[CV-99] Supervised Quantum Image Processing

【Quick Read】: This paper addresses inefficiencies in storing, processing, and analyzing image data in the era of big data and artificial intelligence, where the core challenge is managing massive image information and improving computational performance under limited resources. The key is leveraging Quantum Image Representations (QImRs): comparing four approaches, Tensor Network Representation (TNR), Flexible Representation of Quantum Image (FRQI), Novel Enhanced Quantum Representation (NEQR), and Quantum Probability Image Encoding (QPIE), the study finds that FRQI compresses image information best; it further shows that quantum kernels based on QImRs achieve average binary-classification accuracy comparable to a classical linear kernel while requiring exponentially fewer resources for image storage, significantly improving resource efficiency.

Link: https://arxiv.org/abs/2507.22039
Authors: Marco Parigi, Mehran Khosrojerdi, Filippo Caruso, Leonardo Banchi
Institutions: University of Florence; LENS - European Laboratory for Non-Linear Spectroscopy; INFN Sezione di Firenze
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 11 figures

Click to view abstract

Abstract:In the era of big data and artificial intelligence, the increasing volume of data and the demand to solve more and more complex computational challenges are two driving forces for improving the efficiency of data storage, processing and analysis. Quantum image processing (QIP) is an interdisciplinary field between quantum information science and image processing, which has the potential to alleviate some of these challenges by leveraging the power of quantum computing. In this work, we compare and examine the compression properties of four different Quantum Image Representations (QImRs): namely, Tensor Network Representation (TNR), Flexible Representation of Quantum Image (FRQI), Novel Enhanced Quantum Representation (NEQR), and Quantum Probability Image Encoding (QPIE). Our simulations show that FRQI performs a higher compression of image information than TNR, NEQR, and QPIE. Furthermore, we investigate the trade-off between accuracy and memory in binary classification problems, evaluating the performance of quantum kernels based on QImRs compared to the classical linear kernel. Our results indicate that quantum kernels provide comparable classification average accuracy but require exponentially fewer resources for image storage.
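
Mechanically, the quantum-vs-classical kernel comparison reduces to plugging a precomputed Gram matrix into a standard SVM. A scikit-learn sketch with a placeholder linear Gram matrix standing in for the QImR-based quantum kernel values (all data here is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.random((40, 16)), rng.integers(0, 2, 40)
X_test = rng.random((10, 16))

# Classical baseline
linear = SVC(kernel="linear").fit(X_train, y_train)

# Precomputed kernel: replace these Gram matrices with quantum-kernel values
K_train = X_train @ X_train.T          # shape (n_train, n_train)
K_test = X_test @ X_train.T            # shape (n_test, n_train)
quantum = SVC(kernel="precomputed").fit(K_train, y_train)
pred = quantum.predict(K_test)
```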

[CV-100] ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

【Quick Read】: This paper addresses a key problem in medical AI: aligning free-text descriptions in radiology reports (e.g., a "3 mm nodule in the left lower lobe") with precise pixel-level segmentations in 3D chest CT, grounding text to spatial locations. Traditional datasets rely on structured labels or predefined categories and cannot capture the complexity and diversity of clinical language, a gap ReXGroundingCT fills as the first publicly available dataset linking free text to pixel-level annotations. The key is a systematic three-stage pipeline: GPT-4 automatically extracts positive lung and pleural findings, which expert annotators then segment manually, yielding high-quality annotations of 8,028 findings (about 79% focal and 21% non-focal abnormalities), with quality control by board-certified radiologists ensuring accuracy and clinical reliability.

Link: https://arxiv.org/abs/2507.22030
Authors: Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar
Institutions: Harvard Medical School; Saint Louis University School of Medicine; Icahn School of Medicine at Mount Sinai; Brigham and Women's Hospital; King Abdullah Specialized Children's Hospital; King Abdulaziz Medical City; Royal University Hospital; University of Zurich; Seoul National University Hospital
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present ReXGroundingCT, the first publicly available dataset to link free-text radiology findings with pixel-level segmentations in 3D chest CT scans that is manually annotated. While prior datasets have relied on structured labels or predefined categories, ReXGroundingCT captures the full expressiveness of clinical language represented in free text and grounds it to spatially localized 3D segmentation annotations in volumetric imaging. This addresses a critical gap in medical AI: the ability to connect complex, descriptive text, such as “3 mm nodule in the left lower lobe”, to its precise anatomical location in three-dimensional space, a capability essential for grounded radiology report generation systems. The dataset comprises 3,142 non-contrast chest CT scans paired with standardized radiology reports from the CT-RATE dataset. Using a systematic three-stage pipeline, GPT-4 was used to extract positive lung and pleural findings, which were then manually segmented by expert annotators. A total of 8,028 findings across 16,301 entities were annotated, with quality control performed by board-certified radiologists. Approximately 79% of findings are focal abnormalities, while 21% are non-focal. The training set includes up to three representative segmentations per finding, while the validation and test sets contain exhaustive labels for each finding entity. ReXGroundingCT establishes a new benchmark for developing and evaluating sentence-level grounding and free-text medical segmentation models in chest CT. The dataset can be accessed at this https URL.

[CV-101] Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images

【Quick Read】: This paper addresses the underexplored application of foundation models to complex cardiovascular diagnosis, in particular how to exploit large-scale unlabeled 3D cardiac CT and multimodal data to improve recognition and understanding of cardiovascular abnormalities. The key is Cardiac-CLIP, a multimodal foundation model with a two-stage pre-training strategy: the first stage uses a 3D masked autoencoder (MAE) for self-supervised representation learning, extracting rich anatomical and contextual features from large amounts of unlabeled volumetric data; the second stage introduces contrastive learning to align visual and textual representations. The work further assembles 16,641 real clinical CT scans with standardized radiology reports, building pathology vectors from diagnostic attributes to generate a soft-label matrix that supervises the contrastive learning, which markedly improves downstream tasks such as cardiovascular abnormality classification, information retrieval, and clinical analysis, with particularly strong results on the prospective prediction of acute coronary syndrome.

Link: https://arxiv.org/abs/2507.22024
Authors: Yutao Hu, Ying Zheng, Shumei Miao, Xiaolei Zhang, Jiahao Xia, Yaolei Qi, Yiyang Zhang, Yuting He, Qian Chen, Jing Ye, Hongyan Qiao, Xiuhua Hu, Lei Xu, Jiayin Zhang, Hui Liu, Minwen Zheng, Yining Wang, Daimin Zhang, Ji Zhang, Wenqi Shao, Yun Liu, Longjiang Zhang, Guanyu Yang
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Foundation models have demonstrated remarkable potential in medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning from large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available scans. Meanwhile, we standardize free-text radiology reports into unified templates and construct the pathology vectors according to diagnostic attributes, based on which the soft-label matrix is generated to supervise the contrastive learning process. On the other hand, to comprehensively evaluate the effectiveness of Cardiac-CLIP, we collect 6,722 real clinical cases from 12 independent institutions, along with open-source data, to construct the evaluation dataset. Specifically, Cardiac-CLIP is comprehensively evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across various downstream tasks in both internal and external data. Particularly, Cardiac-CLIP exhibits great effectiveness in supporting complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.
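
The soft-label contrastive stage can be sketched as a CLIP-style loss whose one-hot targets are replaced by the pathology-derived soft-label matrix; the normalization and temperature below are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, soft_labels, tau=0.07):
    # img_emb, txt_emb: (N, d) L2-normalized embeddings
    # soft_labels: (N, N) pathology-similarity targets between pairs
    logits = img_emb @ txt_emb.t() / tau
    t_img = soft_labels / soft_labels.sum(dim=1, keepdim=True)
    t_txt = soft_labels.t() / soft_labels.t().sum(dim=1, keepdim=True)
    loss_i = -(t_img * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t = -(t_txt * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i + loss_t)   # symmetric image-text / text-image terms
```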

[CV-102] Cyst-X: AI-Powered Pancreatic Cancer Risk Prediction from Multicenter MRI in Centralized and Federated Learning

【Quick Read】: This paper addresses the clinical difficulty of assessing the malignancy risk of intraductal papillary mucinous neoplasms (IPMNs) of the pancreas, where current guidelines and expert reading suffer from misdiagnosis, unnecessary surgery, or missed malignancies. The key is Cyst-X, an AI framework trained on multicenter MRI data that extracts biologically meaningful imaging features to predict IPMN malignancy accurately (AUC = 0.82), significantly outperforming the Kyoto guidelines (AUC = 0.75) and expert radiologists; it also validates privacy-preserving collaborative modeling in a federated learning setting, offering a generalizable and trustworthy new paradigm for pancreatic cyst risk stratification.

Link: https://arxiv.org/abs/2507.22017
Authors: Hongyi Pan, Gorkem Durak, Elif Keles, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Ziliang Hong, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Maria Jaramillo Gonzalez, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchang Huang, Candice Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Pancreatic cancer is projected to become the second-deadliest malignancy in Western countries by 2030, highlighting the urgent need for better early detection. Intraductal papillary mucinous neoplasms (IPMNs), key precursors to pancreatic cancer, are challenging to assess with current guidelines, often leading to unnecessary surgeries or missed malignancies. We present Cyst-X, an AI framework that predicts IPMN malignancy using multicenter MRI data, leveraging MRI’s superior soft tissue contrast over CT. Trained on 723 T1- and 738 T2-weighted scans from 764 patients across seven institutions, our models (AUC=0.82) significantly outperform both Kyoto guidelines (AUC=0.75) and expert radiologists. The AI-derived imaging features align with known clinical markers and offer biologically meaningful insights. We also demonstrate strong performance in a federated learning setting, enabling collaborative training without sharing patient data. To promote privacy-preserving AI development and improve IPMN risk stratification, the Cyst-X dataset is released as the first large-scale, multi-center pancreatic cysts MRI dataset.
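
The federated setting mentioned above is compatible with plain FedAvg, sketched here for a binary malignancy classifier; the paper's exact aggregation protocol may differ, and the loss and optimizer are assumptions:

```python
import copy
import torch
import torch.nn.functional as F

def fedavg_round(global_model, client_loaders, lr=1e-4):
    # One communication round: each site trains locally on its own MRI data;
    # only model weights are shared, never patient data.
    states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.Adam(local.parameters(), lr=lr)
        for x, y in loader:                    # one local epoch per round
            opt.zero_grad()
            F.binary_cross_entropy_with_logits(local(x), y.float()).backward()
            opt.step()
        states.append(local.state_dict())
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}                 # size-weighted parameter average
    global_model.load_state_dict(avg)
    return global_model
```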

[CV-103] VidFuncta: Towards Generalizable Neural Representations for Ultrasound Videos MICCAI2025

【Quick Read】: This paper addresses the difficulties standard deep learning methods face with full ultrasound-video analysis, stemming from non-standardized acquisition and operator bias. The key is VidFuncta, a framework building on implicit neural representations (INRs) that encodes variable-length ultrasound videos into compact, time-resolved representations: each video is disentangled into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancy. The method outperforms 2D and 3D baselines on video reconstruction and lets downstream tasks (ejection fraction prediction, B-line detection, and breast lesion classification) operate directly on the learned 1D modulation vectors, demonstrating strong generalization and efficiency.

Link: https://arxiv.org/abs/2507.21863
Authors: Julia Wolleb, Florentin Bieder, Paul Friedrich, Hemant D. Tagare, Xenophon Papademetris
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted 6th International Workshop of Advances in Simplifying Medical UltraSound (ASMUS) to be held at MICCAI 2025

Click to view abstract

Abstract:Ultrasound is widely used in clinical care, yet standard deep learning methods often struggle with full video analysis due to non-standardized acquisition and operator bias. We offer a new perspective on ultrasound video analysis through implicit neural representations (INRs). We build on Functa, an INR framework in which each image is represented by a modulation vector that conditions a shared neural network. However, its extension to the temporal domain of medical videos remains unexplored. To address this gap, we propose VidFuncta, a novel framework that leverages Functa to encode variable-length ultrasound videos into compact, time-resolved representations. VidFuncta disentangles each video into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancies. Our method outperforms 2D and 3D baselines on video reconstruction and enables downstream tasks to directly operate on the learned 1D modulation vectors. We validate VidFuncta on three public ultrasound video datasets (cardiac, lung, and breast) and evaluate its downstream performance on ejection fraction prediction, B-line detection, and breast lesion classification. These results highlight the potential of VidFuncta as a generalizable and efficient representation framework for ultrasound videos. Our code is publicly available under this https URL.
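
A Functa-style shared network conditioned on per-frame modulation vectors can be sketched as follows; the sine activations, layer sizes, and shift-only modulations are illustrative assumptions rather than VidFuncta's actual architecture:

```python
import torch
import torch.nn as nn

class SharedINR(nn.Module):
    """One network shared across all videos; each frame is represented only
    by a small modulation vector that shifts the hidden activations."""
    def __init__(self, hidden=256, depth=3, mod_dim=128):
        super().__init__()
        self.inp = nn.Linear(2, hidden)   # (x, y) pixel coordinates
        self.hid = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(depth)])
        self.mod = nn.ModuleList([nn.Linear(mod_dim, hidden) for _ in range(depth)])
        self.out = nn.Linear(hidden, 1)   # predicted ultrasound intensity

    def forward(self, coords, modulation):
        # coords: (P, 2); modulation: (mod_dim,) for the frame being decoded
        h = torch.sin(self.inp(coords))
        for lin, shift in zip(self.hid, self.mod):
            h = torch.sin(lin(h) + shift(modulation))  # FiLM-like shift
        return self.out(h)
```

Downstream classifiers then consume only the 1D modulation vectors, which is what makes the representation compact and task-agnostic.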

[CV-104] ST-DAI: Single-shot 2.5D Spatial Transcriptomics with Intra-Sample Domain Adaptive Imputation for Cost-efficient 3D Reconstruction

【Quick Read】: For 3D spatial transcriptomics (ST), this paper targets the prohibitive per-section cost of fully sampling every tissue slice. The core problem is that existing methods predicting gene expression from histology images depend on large external datasets, incurring high cost and substantial domain discrepancies that generalize poorly to new samples. The key of ST-DAI operates at two levels: a cost-efficient 2.5D sampling strategy that fully samples only the central section and sparsely samples adjacent sections to preserve volumetric context, and a single-shot 3D imputation learning method that reconstructs fully sampled 3D ST from this 2.5D scheme using only sample-specific training data. Concretely, position alignment, pseudo-supervision generation, and Fast Multi-Domain Refinement (FMDR) are combined with Parameter-Efficient Domain-Alignment Layers (PDLs) and a Confidence Score Generator (CSG) to mitigate positional misalignment and domain discrepancy between sections, while reweighted pseudo-labels steer imputation toward reliable regions, ultimately approaching the gene-expression prediction performance of fully sampled approaches at a substantially reduced experimental cost.

Link: https://arxiv.org/abs/2507.21516
Authors: Jiahe Qian, Yaoyu Fang, Xinkun Wang, Lee A. Cooper, Bo Zhou
Institutions: Northwestern University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 4 figures, 3 tables, under review

Click to view abstract

Abstract:For 3D spatial transcriptomics (ST), the high per-section acquisition cost of fully sampling every tissue section remains a significant challenge. Although recent approaches predict gene expression from histology images, these methods require large external datasets, which leads to high cost and suffers from substantial domain discrepancies that lead to poor generalization on new samples. In this work, we introduce ST-DAI, a single-shot framework for 3D ST that couples a cost-efficient 2.5D sampling scheme with an intra-sample domain-adaptive imputation framework. First, in the cost-efficient 2.5D sampling stage, one reference section (central section) is fully sampled while the other sections (adjacent sections) are sparsely sampled, thereby capturing volumetric context at significantly reduced experimental cost. Second, we propose a single-shot 3D imputation learning method that allows us to generate fully sampled 3D ST from this cost-efficient 2.5D ST scheme, using only sample-specific training. We observe position misalignment and domain discrepancy between sections. To address those issues, we adopt a pipeline that first aligns the central section to the adjacent section, thereafter generates dense pseudo-supervision on the central section, and then performs Fast Multi-Domain Refinement (FMDR), which adapts the network to the domain of the adjacent section while fine-tuning only a few parameters through the use of Parameter-Efficient Domain-Alignment Layers (PDLs). During this refinement, a Confidence Score Generator (CSG) reweights the pseudo-labels according to their estimated reliability, thereby directing imputation toward trustworthy regions. Our experimental results demonstrate that ST-DAI achieves gene expression prediction performance comparable to fully sampled approaches while substantially reducing the measurement burden.

[CV-105] Querying GI Endoscopy Images: A VQA Approach

【Quick Read】: This paper addresses the poor performance of current multimodal large language models (MLLMs) on visual question answering (VQA) in medical imaging, particularly on gastrointestinal (GI) endoscopy images. The key is adapting the Florence2 model to understand and answer medically relevant questions about GI endoscopy images, with performance evaluated using standard metrics such as ROUGE, BLEU, and METEOR, thereby improving the accuracy and utility of medical diagnostic assistance systems.

Link: https://arxiv.org/abs/2507.21165
Authors: Gaurav Parajuli
Institutions: Johannes Kepler University Linz
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:VQA (Visual Question Answering) combines Natural Language Processing (NLP) with image understanding to answer questions about a given image. It has enormous potential for the development of medical diagnostic AI systems. Such a system can help clinicians diagnose gastro-intestinal (GI) diseases accurately and efficiently. Although many of the multimodal LLMs available today have excellent VQA capabilities in the general domain, they perform very poorly for VQA tasks in specialized domains such as medical imaging. This study is a submission for ImageCLEFmed-MEDVQA-GI 2025 subtask 1 that explores the adaptation of the Florence2 model to answer medical visual questions on GI endoscopy images. We also evaluate the model performance using standard metrics like ROUGE, BLEU and METEOR.

[CV-106] Comparative Analysis of Vision Transformers and Convolutional Neural Networks for Medical Image Classification

【Quick Read】: This paper addresses the underexplored question of how effective Vision Transformers (ViTs) are relative to traditional convolutional neural networks (CNNs) in medical imaging. The key is a systematic comparative study evaluating four state-of-the-art models (ResNet-50, EfficientNet-B0, ViT-Base, and DeiT-Small) on three critical medical imaging tasks (pneumonia detection on chest X-rays, brain tumor classification, and melanoma detection for skin cancer), revealing task-specific architectural advantages and providing empirical guidance for model selection in clinical decision support systems.

Link: https://arxiv.org/abs/2507.21156
Authors: Kunal Kawadkar
Institutions: Indian Institute of Technology Madras
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 8 figures, 3 tables. Submitted to IEEE Access

Click to view abstract

Abstract:The emergence of Vision Transformers (ViTs) has revolutionized computer vision, yet their effectiveness compared to traditional Convolutional Neural Networks (CNNs) in medical imaging remains under-explored. This study presents a comprehensive comparative analysis of CNN and ViT architectures across three critical medical imaging tasks: chest X-ray pneumonia detection, brain tumor classification, and skin cancer melanoma detection. We evaluated four state-of-the-art models (ResNet-50, EfficientNet-B0, ViT-Base, and DeiT-Small) across datasets totaling 8,469 medical images. Our results demonstrate task-specific model advantages: ResNet-50 achieved 98.37% accuracy on chest X-ray classification, DeiT-Small excelled at brain tumor detection with 92.16% accuracy, and EfficientNet-B0 led skin cancer classification at 81.84% accuracy. These findings provide crucial insights for practitioners selecting architectures for medical AI applications, highlighting the importance of task-specific architecture selection in clinical decision support systems.

Artificial Intelligence

[AI-0] Foundation Models for Demand Forecasting via Dual-Strategy Ensembling

【Quick Read】: This paper tackles the difficulty of accurate sales forecasting in real-world supply chains, whose core challenges include hierarchical complexity, domain distribution shifts, and evolving external factors. The key is a unified ensemble framework combining two complementary strategies: Hierarchical Ensemble (HE), which partitions training and inference by semantic level (e.g., store, category, department) to capture localized patterns, and Architectural Ensemble (AE), which fuses predictions from diverse model backbones to reduce bias and improve stability. The approach is validated on the M5 benchmark and three external sales datasets, significantly improving forecasting accuracy and generalization across hierarchy levels.

Link: https://arxiv.org/abs/2507.22053
Authors: Wei Yang, Defu Cao, Yan Liu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate demand forecasting is critical for supply chain optimization, yet remains difficult in practice due to hierarchical complexity, domain shifts, and evolving external factors. While recent foundation models offer strong potential for time series forecasting, they often suffer from architectural rigidity and limited robustness under distributional change. In this paper, we propose a unified ensemble framework that enhances the performance of foundation models for sales forecasting in real-world supply chains. Our method combines two complementary strategies: (1) Hierarchical Ensemble (HE), which partitions training and inference by semantic levels (e.g., store, category, department) to capture localized patterns; and (2) Architectural Ensemble (AE), which integrates predictions from diverse model backbones to mitigate bias and improve stability. We conduct extensive experiments on the M5 benchmark and three external sales datasets, covering both in-domain and zero-shot forecasting. Results show that our approach consistently outperforms strong baselines, improves accuracy across hierarchical levels, and provides a simple yet effective mechanism for boosting generalization in complex forecasting environments.
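
Mechanically, the two ensembles reduce to averaging prediction arrays over different groupings. A minimal sketch of the combination step (equal weights are an assumption; the paper may weight the components differently):

```python
import numpy as np

def dual_strategy_blend(level_preds: dict, backbone_preds: list) -> np.ndarray:
    """level_preds: forecasts from models trained per semantic partition
    (e.g., keys 'store', 'category', 'department'), each an array over the
    same series/horizon; backbone_preds: forecasts from diverse backbones."""
    he = np.mean(list(level_preds.values()), axis=0)   # Hierarchical Ensemble
    ae = np.mean(backbone_preds, axis=0)               # Architectural Ensemble
    return 0.5 * (he + ae)
```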

[AI-1] he Interspeech 2025 Speech Accessibility Project Challenge INTERSPEECH

【速读】:该论文旨在解决当前自动语音识别(Automatic Speech Recognition, ASR)系统在识别有言语障碍个体语音时性能不足的问题,其根本原因在于缺乏公开的训练数据。解决方案的关键在于发起2025年Interspeech Speech Accessibility Project (SAP)挑战赛,利用收集自500多位具有多样化言语障碍个体的超过400小时标注语音数据,构建了一个高质量、多样化的公共数据集,并通过EvalAI平台的远程评估流程对参赛模型进行Word Error Rate (WER)和语义得分(Semantic Score)双重指标评测,从而推动ASR系统在残障人群中的性能提升,为未来相关技术发展树立新的基准。

链接: https://arxiv.org/abs/2507.22047
作者: Xiuwen Zheng,Bornali Phukon,Jonghwan Na,Ed Cutrell,Kyu Han,Mark Hasegawa-Johnson,Pan-Pan Jiang,Aadhrik Kuila,Colin Lea,Bob MacDonald,Gautam Mantena,Venkatesh Ravichandran,Leda Sari,Katrin Tomanek,Chang D. Yoo,Chris Zwilling
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear in Proceedings of Interspeech, 2025

点击查看摘要

Abstract:While the last decade has witnessed significant advancements in Automatic Speech Recognition (ASR) systems, performance of these systems for individuals with speech disabilities remains inadequate, partly due to limited public training data. To bridge this gap, the 2025 Interspeech Speech Accessibility Project (SAP) Challenge was launched, utilizing over 400 hours of SAP data collected and transcribed from more than 500 individuals with diverse speech disabilities. Hosted on EvalAI and leveraging the remote evaluation pipeline, the SAP Challenge evaluates submissions based on Word Error Rate and Semantic Score. Consequently, 12 out of 22 valid teams outperformed the whisper-large-v2 baseline in terms of WER, while 17 teams surpassed the baseline on SemScore. Notably, the top team achieved the lowest WER of 8.11%, and the highest SemScore of 88.44% at the same time, setting new benchmarks for future ASR systems in recognizing impaired speech.
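
For reference, the Word Error Rate used for ranking is the word-level Levenshtein distance normalized by the reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    # (substitutions + deletions + insertions) / number of reference words
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the kitchen light", "turn the kitchen lights on"))
```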

[AI-2] Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

【Quick Read】: This paper addresses key security challenges of multimodal large language models (MLLMs), especially unsafe image-query pairs: jailbreak inputs crafted to bypass safety constraints and elicit inappropriate responses. Because such attack samples are sparse, existing defenses lack diverse training data; guardrail-type methods based on external modules cannot remove intrinsic model vulnerabilities, while supervised fine-tuning (SFT) tends to over-refuse harmless inputs and harm general performance. The key is Secure Tug-of-War (SecTOW), an iterative defense-attack training framework with a defender and an auxiliary attacker, both trained via reinforcement learning (GRPO): the attacker identifies vulnerabilities of the defense model and expands the jailbreak data, which the defender then uses to keep strengthening safety; a simplified reward design reduces reliance on complex generative labels, and a quality-monitoring mechanism prevents the defender from over-refusing harmless inputs, improving security while preserving overall performance.

Link: https://arxiv.org/abs/2507.22037
Authors: Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Click to view abstract

Abstract:The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs: jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). During the iterative process, the attacker identifies security vulnerabilities in the defense model and expands jailbreak data. The expanded data are then used to train the defender, enabling it to address identified security vulnerabilities. We also design reward mechanisms used for GRPO to simplify the use of response labels, reducing dependence on complex generative labels and enabling the efficient use of synthetic data. Additionally, a quality monitoring mechanism is used to mitigate the defender's over-refusal of harmless inputs and ensure the diversity of the jailbreak data generated by the attacker. Experimental results on safety-specific and general benchmarks demonstrate that SecTOW significantly improves security while preserving general performance.

[AI-3] PHAX: A Structured Argumentation Framework for User-Centered Explainable AI in Public Health and Biomedical Sciences

【Quick Read】: This paper addresses the limitations of current explainable AI (XAI) in public health and the biomedical sciences: existing methods explain feature importance or model internals but lack structured explanations adaptable to diverse stakeholders (clinicians, policymakers, and the public) who require transparency and social accountability. The key is PHAX (Public Health Argumentation and eXplainability), a multi-layer architecture combining defeasible reasoning, adaptive natural-language techniques, and user modeling to produce context-aware, audience-specific, human-centered explanations. Explanations are formalized as argument chains, supporting decision justification, recommendation defense, and interactive dialogue across user types, markedly improving the understandability and trustworthiness of AI outputs in scenarios such as medical term simplification, patient-clinician communication, and policy justification.

Link: https://arxiv.org/abs/2507.22009
Authors: Bahar İlgen, Akshat Dubey, Georges Hattab
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Preprint. Under review

Click to view abstract

Abstract:Ensuring transparency and trust in AI-driven public health and biomedical sciences systems requires more than accurate predictions; it demands explanations that are clear, contextual, and socially accountable. While explainable AI (XAI) has advanced in areas like feature attribution and model interpretability, most methods still lack the structure and adaptability needed for diverse health stakeholders, including clinicians, policymakers, and the general public. We introduce PHAX, a Public Health Argumentation and eXplainability framework, which leverages structured argumentation to generate human-centered explanations for AI outputs. PHAX is a multi-layer architecture combining defeasible reasoning, adaptive natural language techniques, and user modeling to produce context-aware, audience-specific justifications. More specifically, we show how argumentation enhances explainability by supporting AI-driven decision-making, justifying recommendations, and enabling interactive dialogues across user types. We demonstrate the applicability of PHAX through use cases such as medical term simplification, patient-clinician communication, and policy justification. In particular, we show how simplification decisions can be modeled as argument chains and personalized based on user expertise, enhancing both interpretability and trust. By aligning formal reasoning methods with communicative demands, PHAX contributes to a broader vision of transparent, human-centered AI in public health.

[AI-4] ach Me to Trick: Exploring Adversarial Transferability via Knowledge Distillation

【速读】:该论文旨在解决如何通过知识蒸馏(Knowledge Distillation, KD)提升黑盒对抗样本的迁移性与生成效率的问题。其核心解决方案在于利用多个异构教师模型(ResNet50 和 DenseNet-161)对轻量级学生模型进行知识蒸馏,采用课程式切换和联合优化两种策略训练学生模型,并在生成阶段使用FG、FGS和PGD攻击方法评估其性能。实验表明,该方法在保持对抗样本迁移成功率接近集成基线的同时,将生成时间缩短至原来的六分之一,且低温度设置与硬标签监督显著增强迁移能力,验证了KD不仅适用于模型压缩,还可作为提升黑盒对抗攻击效率与效果的有效工具。

链接: https://arxiv.org/abs/2507.21992
作者: Siddhartha Pradhan,Shikshya Shiwakoti,Neha Bathuri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:We investigate whether knowledge distillation (KD) from multiple heterogeneous teacher models can enhance the generation of transferable adversarial examples. A lightweight student model is trained using two KD strategies: curriculum-based switching and joint optimization, with ResNet50 and DenseNet-161 as teachers. The trained student is then used to generate adversarial examples using FG, FGS, and PGD attacks, which are evaluated against a black-box target model (GoogLeNet). Our results show that student models distilled from multiple teachers achieve attack success rates comparable to ensemble-based baselines, while reducing adversarial example generation time by up to a factor of six. An ablation study further reveals that lower temperature settings and the inclusion of hard-label supervision significantly enhance transferability. These findings suggest that KD can serve not only as a model compression technique but also as a powerful tool for improving the efficiency and effectiveness of black-box adversarial attacks.
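
Both the distillation objective and the attack side are standard; a compact PyTorch sketch follows (temperature, mixing weight, and epsilon are illustrative choices, not the paper's reported settings):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft teacher targets at temperature T, plus the hard-label term
    # the ablation found important for transferability.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def fgsm(student, x, y, eps=8 / 255):
    # Fast gradient sign attack crafted on the distilled student,
    # then transferred to the black-box target model.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(student(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```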

[AI-5] ChemDFM-R: An Chemical Reason er LLM Enhanced with Atomized Chemical Knowledge

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在化学等科学领域应用中因领域理解浅层化和推理能力有限而导致性能受限的问题。其解决方案的关键在于:首先构建一个原子化知识点的综合性数据集,以增强模型对化学基本原理与逻辑结构的理解;其次提出一种混合来源的知识蒸馏策略,融合专家标注的知识与通用领域的推理能力;最后通过领域特定的强化学习进一步提升化学推理能力。这一系列方法使模型在多个化学基准测试中达到最先进性能,并输出可解释、基于推理链的响应,显著提升了人机协作场景下的可靠性与实用性。

链接: https://arxiv.org/abs/2507.21990
作者: Zihan Zhao,Bo Chen,Ziping Wan,Lu Chen,Xuanze Lin,Shiyang Yu,Situo Zhang,Da Ma,Zichen Zhu,Danyang Zhang,Huayang Wang,Zhongyang Dai,Liyang Wen,Xin Chen,Kai Yu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 13 figures, 4 tables

点击查看摘要

Abstract:While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model’s understanding of the fundamental principles and logical structure of chemistry. Then, we propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves state-of-the-art performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.

[AI-6] he Effect of Compression Techniques on Large Multimodal Language Models in the Medical Domain

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗领域应用时面临的高计算成本问题,特别是显存占用过高的限制。其核心解决方案在于提出一种新颖的层选择策略用于结构化剪枝(structural pruning),并结合激活感知量化(activation-aware quantization)技术,在“剪枝-微调-量化”(prune-SFT-quantize)的压缩流程中实现高效模型压缩。关键创新点在于通过精细化的层选择机制与量化策略协同优化,在保持模型性能的同时将7B参数模型的显存占用降低至4 GB(减少70%),且相比传统压缩方法在相同压缩比下提升4%的性能表现。

链接: https://arxiv.org/abs/2507.21976
作者: Tanvir Ahmed Khan,Aranya Saha,Ismam Nur Swapnil,Mohammad Ariful Haque
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures. tcolorbox dependencies were removed for arXiv compatibility. All references are included via a precompiled .bbl file

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) hold huge potential for usage in the medical domain, but their computational costs necessitate efficient compression techniques. This paper evaluates the impact of structural pruning and activation-aware quantization on a fine-tuned LLAVA model for medical applications. We propose a novel layer selection method for pruning, analyze different quantization techniques, and assess the performance trade-offs in a prune-SFT-quantize pipeline. Our proposed method enables MLLMs with 7B parameters to run within 4 GB of VRAM, reducing memory usage by 70% while achieving 4% higher model performance compared to traditional pruning and quantization techniques in the same compression ratio.

[AI-7] Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks

【Quick Read】: This paper tackles root cause analysis (RCA) in mobile networks, where existing approaches fall short in interpretability, domain expertise, and causal reasoning. The core solution is a lightweight framework that leverages large language models (LLMs) for RCA, together with TeleLogs, an annotated troubleshooting dataset for benchmarking. The key is a two-stage training methodology: supervised fine-tuning to integrate domain knowledge, followed by reinforcement learning to improve reasoning quality, producing structured multi-step diagnostic explanations that improve both accuracy and interpretability; across multiple LLM sizes it outperforms state-of-the-art reasoning and non-reasoning models and generalizes strongly to randomized test variants, showing promise for practical, explainable RCA in network operation and management.

Link: https://arxiv.org/abs/2507.21974
Authors: Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Yibin Kang, Haozhe Zhang, Merouane Debbah, Fadhel Ayed
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract

Abstract:Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. In this work, we propose a lightweight framework that leverages Large Language Models (LLMs) for RCA. To do so, we introduce TeleLogs, a curated dataset of annotated troubleshooting problems designed to benchmark RCA capabilities. Our evaluation reveals that existing open-source reasoning LLMs struggle with these problems, underscoring the need for domain-specific adaptation. To address this issue, we propose a two-stage training methodology that combines supervised fine-tuning with reinforcement learning to improve the accuracy and reasoning quality of LLMs. The proposed approach fine-tunes a series of RCA models to integrate domain knowledge and generate structured, multi-step diagnostic explanations, improving both interpretability and effectiveness. Extensive experiments across multiple LLM sizes show significant performance gains over state-of-the-art reasoning and non-reasoning models, including strong generalization to randomized test variants. These results demonstrate the promise of domain-adapted, reasoning-enhanced LLMs for practical and explainable RCA in network operation and management.

[AI-8] hou Shalt Not Prompt: Zero-Shot Human Activity Recognition in Smart Homes via Language Modeling of Sensor Data Activities

【速读】:该论文旨在解决零样本人体活动识别(Zero-shot Human Activity Recognition, Zero-shot HAR)在跨智能家庭场景中面临的挑战,特别是现有基于提示语言模型(Prompt-the-LLM)方法所存在的隐私泄露风险、对外部服务的依赖以及因模型版本更新导致预测不一致等问题。其解决方案的关键在于:将传感器数据与活动类别统一建模为自然语言表示,并利用这些语言嵌入(language embeddings)进行零样本分类,从而避免直接向大语言模型(Large Language Model, LLM)输入提示(prompting),实现更稳定、隐私友好的零样本活动识别。

链接: https://arxiv.org/abs/2507.21964
作者: Sourish Gunesh Dhekane,Thomas Ploetz
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Developing zero-shot human activity recognition (HAR) methods is a critical direction in smart home research, considering its impact on making HAR systems work across smart homes having diverse sensing modalities, layouts, and activities of interest. The state-of-the-art solutions along this direction are based on generating natural language descriptions of the sensor data and feeding it via a carefully crafted prompt to the LLM to perform classification. Despite their performance guarantees, such "prompt-the-LLM" approaches carry several risks, including privacy invasion, reliance on an external service, and inconsistent predictions due to version changes, making a case for alternative zero-shot HAR methods that do not require prompting the LLMs. In this paper, we propose one such solution that models sensor data and activities using natural language, leveraging their embeddings to perform zero-shot classification and thereby bypassing the need to prompt the LLMs for activity predictions. The impact of our work lies in presenting a detailed case study on six datasets, highlighting how language modeling can bolster HAR systems in zero-shot recognition.
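
The prompt-free recipe amounts to comparing the embedding of a verbalized sensor window against embeddings of the activity labels. A sketch using sentence-transformers (the backbone model and the phrasing of the window are assumptions, not the paper's setup):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # any local text encoder works

activities = ["preparing a meal", "watching TV", "sleeping", "doing laundry"]
act_emb = model.encode(activities, normalize_embeddings=True)

# Natural-language rendering of one window of smart-home sensor events
window = ("kitchen motion sensor fired repeatedly; stove power draw rose; "
          "fridge door opened twice")
win_emb = model.encode([window], normalize_embeddings=True)

scores = win_emb @ act_emb.T                      # cosine similarities
print(activities[int(np.argmax(scores))])         # -> "preparing a meal"
```

Because both the encoder and the label set stay local, no data leaves the home and predictions do not drift with remote model updates, which is exactly the failure mode of prompt-based pipelines that the paper highlights.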

[AI-9] Fine-Tuning Code Language Models to Detect Cross-Language Bugs

【Quick Read】: This paper addresses the difficulty of detecting cross-language bugs (CLBs) in multilingual programming; such bugs arise from interactions between different programming languages and elude conventional single-language bug detection tools. The key is fine-tuning pre-trained code language models (CodeLMs) on a purpose-built CLB dataset covering three language combinations (Python-C/C++, Java-C/C++, and Python-Java) and nine interaction types. Experiments show that fine-tuned CodeLMs significantly outperform their unfine-tuned counterparts on CLB detection, with UniXcoder-base performing best (F1 = 0.7407) and smaller models often beating larger ones, further confirming the value of dedicated CLB data for model performance.

Link: https://arxiv.org/abs/2507.21954
Authors: Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 33 pages, 6 images, 9 tables, Manuscript submitted to a journal (2025)

Click to view abstract

Abstract:Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all CodeLMs performed poorly before fine-tuning, but exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, small fine-tuned CodeLMs tended to perform better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs' performance was improved, while others showed degraded performance.
zh
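
下面是用 HuggingFace Transformers 微调编码器型代码语言模型(以 UniXcoder-base 为例)做“是否含跨语言缺陷”二分类的最小示意;示例数据与超参数均为笔者假设,仅演示微调流程,并非论文的完整实验配置。

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy stand-in for a CLB dataset: cross-language snippets labelled buggy (1) / clean (0).
data = Dataset.from_dict({
    "text": ["py_obj = PyLong_AsLong(arg);  /* no overflow check */",
             "jint v = (*env)->GetIntField(env, obj, fid);"],
    "label": [1, 0],
})

tok = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/unixcoder-base", num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

data = data.map(encode, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clb-detector", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```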

[AI-10] MapAgent : Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的自主代理在执行移动设备图形用户界面(Graphical User Interface, GUI)任务时,因缺乏对真实应用场景知识而引发的任务规划失效与幻觉问题。解决方案的关键在于提出一种名为MapAgent的新颖LLM代理框架,其核心创新是构建基于历史轨迹的结构化页面记忆数据库(page-memory database),并通过粗粒度到细粒度的任务规划机制,将相似历史页面检索并注入LLM规划器,从而增强对现实应用情境的理解,实现更精准、上下文感知的任务规划;同时,借助双LLM架构的任务执行器确保任务执行过程的有效追踪与控制。

链接: https://arxiv.org/abs/2507.21953
作者: Yi Kong,Dianxi Shi,Guoli Yang,Zhang ke-di,Chenlin Huang,Xiaopeng Li,Songchang Jin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent advancement of autonomous agents powered by Large Language Models (LLMs) has demonstrated significant potential for automating tasks on mobile devices through graphical user interfaces (GUIs). Despite initial progress, these agents still face challenges when handling complex real-world tasks. These challenges arise from a lack of knowledge about real-life mobile applications in LLM-based agents, which may lead to ineffective task planning and even cause hallucinations. To address these challenges, we propose a novel LLM-based agent framework called MapAgent that leverages memory constructed from historical trajectories to augment current task planning. Specifically, we first propose a trajectory-based memory mechanism that transforms task execution trajectories into a reusable and structured page-memory database. Each page within a trajectory is extracted as a compact yet comprehensive snapshot, capturing both its UI layout and functional context. Secondly, we introduce a coarse-to-fine task planning approach that retrieves relevant pages from the memory database based on similarity and injects them into the LLM planner to compensate for potential deficiencies in understanding real-world app scenarios, thereby achieving more informed and context-aware task planning. Finally, planned tasks are transformed into executable actions through a task executor supported by a dual-LLM architecture, ensuring effective tracking of task progress. Experimental results in real-world scenarios demonstrate that MapAgent achieves superior performance to existing methods. The code will be open-sourced to support further research.
zh
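
下面用纯 Python 示意“轨迹页面记忆 + 按相似度检索并注入规划器提示”这一机制;页面快照字段、相似度函数与提示模板均为笔者假设的简化形式(真实系统应使用学习到的嵌入而非字符串相似度)。

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class PageSnapshot:
    app: str          # which app the page belongs to
    layout: str       # compact UI-layout description
    function: str     # what the page is for (functional context)

# Toy page-memory database distilled from past task trajectories.
memory = [
    PageSnapshot("Maps", "search bar; list of results; navigate button",
                 "search a place and start navigation"),
    PageSnapshot("Mail", "inbox list; compose button", "read and write emails"),
    PageSnapshot("Maps", "route overview; start button", "confirm a planned route"),
]

def similarity(a: str, b: str) -> float:
    # Cheap textual similarity; the real system would use learned embeddings.
    return SequenceMatcher(None, a, b).ratio()

def retrieve(task: str, k: int = 2):
    return sorted(memory, key=lambda p: similarity(task, p.function), reverse=True)[:k]

task = "navigate to the nearest pharmacy"
pages = retrieve(task)
prompt = (f"Task: {task}\nRelevant pages from memory:\n" +
          "\n".join(f"- [{p.app}] {p.function}: {p.layout}" for p in pages) +
          "\nProduce a step-by-step plan.")
print(prompt)   # this prompt would be sent to the LLM planner
```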

[AI-11] Libra: Large Chinese-based Safeguard for AI Content

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险应用场景中因安全与伦理问题带来的潜在危害,尤其是针对中文语境下的LLM安全治理难题。其解决方案的关键在于提出Libra-Guard系统,该系统采用两阶段课程训练(two-stage curriculum training)策略:首先在合成样本上进行防护预训练以提升数据效率,随后在高质量真实数据上微调,从而显著降低对人工标注的依赖;同时配套开发了首个面向中文内容的安全评估基准Libra-Test,涵盖七类关键危害场景并包含专家标注的5700余条样本,为系统性能提供严谨验证。实验表明,Libra-Guard在安全性任务上达到86.79%准确率,优于多个开源模型,并接近闭源先进模型水平。

链接: https://arxiv.org/abs/2507.21929
作者: Ziyang Chen,Huimu Yu,Xing Wu,Dongqin Liu,Songlin Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in text understanding and generation but raise significant safety and ethical concerns in high-stakes applications. To mitigate these risks, we present Libra-Guard, a cutting-edge safeguard system designed to enhance the safety of Chinese-based LLMs. Leveraging a two-stage curriculum training pipeline, Libra-Guard enhances data efficiency by employing guard pretraining on synthetic samples, followed by fine-tuning on high-quality, real-world data, thereby significantly reducing reliance on manual annotations. To enable rigorous safety evaluations, we also introduce Libra-Test, the first benchmark specifically designed to evaluate the effectiveness of safeguard systems for Chinese content. It covers seven critical harm scenarios and includes over 5,700 samples annotated by domain experts. Experiments show that Libra-Guard achieves 86.79% accuracy, outperforming Qwen2.5-14B-Instruct (74.33%) and ShieldLM-Qwen-14B-Chat (65.69%), and nearing closed-source models like Claude-3.5-Sonnet and GPT-4o. These contributions establish a robust framework for advancing the safety governance of Chinese LLMs and represent a tentative step toward developing safer, more reliable Chinese AI systems.
zh

[AI-12] Vibe Coding as a Reconfiguration of Intent Mediation in Software Development: Definition Implications and Research Agenda

【速读】:该论文试图解决的问题是:随着生成式 AI (Generative AI) 在软件开发中日益普及,开发者对这一新兴范式的认知滞后于其快速应用,导致实践中存在概念理解不足与潜在风险。为应对这一问题,作者提出“ vibe coding”作为新的软件开发范式,其关键在于将开发者意图的中介机制从传统的确定性指令(deterministic instruction)转变为基于概率推理(probabilistic inference)的自然语言对话协作模式,从而实现人与生成式 AI 在共创软件制品过程中的协同流动(collaborative flow)。这种转变重构了认知工作分配,使知识劳动在人类与机器之间重新配置,并推动软件开发的专业重心由传统设计与技术实现向协同编排(collaborative orchestration)迁移。

链接: https://arxiv.org/abs/2507.21928
作者: Christian Meske,Tobias Hermanns,Esther von der Weiden,Kai-Uwe Loser,Thorsten Berger
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Software development is undergoing a fundamental transformation as vibe coding becomes widespread, with large portions of contemporary codebases now being AI-generated. The disconnect between rapid adoption and limited conceptual understanding highlights the need for an inquiry into this emerging paradigm. Drawing on an intent perspective and historical analysis, we define vibe coding as a software development paradigm where humans and generative AI engage in collaborative flow to co-create software artifacts through natural language dialogue, shifting the mediation of developer intent from deterministic instruction to probabilistic inference. By intent mediation, we refer to the fundamental process through which developers translate their conceptual goals into representations that computational systems can execute. Our results show that vibe coding reconfigures cognitive work by redistributing epistemic labor between humans and machines, shifting the expertise in the software development process away from traditional areas such as design or technical implementation toward collaborative orchestration. We identify key opportunities, including democratization, acceleration, and systemic leverage, alongside risks, such as black box codebases, responsibility gaps, and ecosystem bias. We conclude with a research agenda spanning human-, technology-, and organization-centered directions to guide future investigations of this paradigm.
zh

[AI-13] LLM -based Content Classification Approach for GitHub Repositories by the README Files

【速读】:该论文试图解决的问题是:GitHub仓库的README文件内容不完整或结构不规范,导致其潜在使用价值和社区影响力受限。解决方案的关键在于利用大型语言模型(LLMs)对README文件中的不同段落进行自动分类,从而提升对仓库内容的理解与识别效率。研究采用三种编码器-only 的预训练模型(BERT、DistilBERT 和 RoBERTa),基于包含4226个标注段落的黄金标准数据集进行微调,实现了高达0.98的F1分数,显著优于现有方法;同时引入参数高效微调(PEFT)技术如低秩适应(LoRA),在保持高性能的同时大幅降低计算成本,为自动化工具开发提供了可行路径。

链接: https://arxiv.org/abs/2507.21899
作者: Malik Uzair Mehmood,Shahid Hussain,Wen Li Wang,Muhammad Usama Malik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 8 pages, 4 Figures

点击查看摘要

Abstract:GitHub is the world's most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. The README files should contain project-related information as per the recommendations of GitHub to support the usage and improvement of repositories. However, GitHub repository owners sometimes neglect these recommendations. This prevents a GitHub repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository's README file significantly influences its adoption and utilization, with a lack of detail potentially hindering its full potential for widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks including text classification, text generation, text summarization and text translation. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. Three encoder-only LLMs are utilized, including BERT, DistilBERT and RoBERTa. These pre-trained models are then fine-tuned based on a gold-standard dataset consisting of 4226 README file sections. This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98. Moreover, we have also investigated the use of Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) and shown an economical alternative to full fine-tuning without compromising much performance. The results demonstrate the potential of using LLMs in designing an automatic classifier for categorizing the content of GitHub README files. Consequently, this study contributes to the development of automated tools for GitHub repositories to improve their identification and potential usage.
zh
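
下面给出用 PEFT/LoRA 微调 DistilBERT 做 README 段落分类的最小示意;标签集合与示例段落为笔者假设,仅演示论文所述“以 LoRA 替代全量微调”的经济做法,并非复现其 4226 条黄金标准数据上的实验。

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Hypothetical README-section label set; the paper's gold-standard taxonomy may differ.
labels = ["what", "why", "how", "when", "who", "references", "contribution", "other"]

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

# LoRA: train small rank-decomposition matrices instead of all weights.
config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                    lora_dropout=0.1, target_modules=["q_lin", "v_lin"])
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only a small fraction of the full model

batch = tok(["## Installation\npip install foo", "## License\nMIT"],
            return_tensors="pt", padding=True)
out = model(**batch)
print(out.logits.shape)   # (2, num_labels)
```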

[AI-14] Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

【速读】:该论文旨在解决疼痛评估中缺乏准确、一致且可连续监测手段的问题,以支持临床决策并改善疼痛管理策略。其解决方案的关键在于提出了一种基于呼吸信号(respiration signal)的自动疼痛评估流程,结合了高效的交叉注意力Transformer架构与多窗口(multi-windowing)策略,从而在保持模型紧凑性的同时有效捕捉短时、长时及全局特征,显著提升了模型的表征能力与性能表现。

链接: https://arxiv.org/abs/2507.21886
作者: Stefanos Gkikas,Ioannis Kyprakis,Manolis Tsiknakis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model's representational capacity.
zh
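
下面的示意展示“多窗口(multi-windowing)”策略的基本思想:对同一段呼吸信号截取多个不同长度、对齐到同一终点的窗口,分别提取特征后拼接,让模型同时看到短时与长时模式。窗口长度与特征均为笔者假设,并非论文的具体配置(论文将窗口送入交叉注意力 Transformer 而非手工特征)。

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic respiration signal: slow oscillation plus noise (stand-in data).
fs = 25                                # sampling rate in Hz (assumed)
t = np.arange(0, 120, 1 / fs)
resp = np.sin(2 * np.pi * 0.25 * t) + 0.1 * rng.standard_normal(t.size)

def window_features(x):
    # Simple per-window descriptors; the paper feeds raw windows to a transformer instead.
    return np.array([x.mean(), x.std(), np.abs(np.diff(x)).mean()])

# Multiple window lengths (seconds), all ending at the current time point.
window_lengths = [10, 30, 60]
feats = np.concatenate([window_features(resp[-int(w * fs):]) for w in window_lengths])
print(feats.shape)   # 3 windows x 3 features = (9,)
```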

[AI-15] he Impact of Foundational Models on Patient-Centric e-Health Systems

【速读】:该论文试图解决的问题是:当前患者导向型医疗应用中人工智能(Artificial Intelligence, AI)的集成程度与成熟度尚不明确,这影响了对AI在医疗领域可信性、透明度及实际应用效果的评估。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)提取116个患者导向型医疗应用的核心功能特征,并基于Gartner AI成熟度模型对其进行分类,从而量化分析AI在这些应用中的发展阶段。研究发现,超过86.21%的应用仍处于AI集成的早期阶段,仅有约13.79%达到高级集成水平,揭示了当前AI在医疗场景中尚未充分成熟的问题。

链接: https://arxiv.org/abs/2507.21882
作者: Elmira Onagh,Alireza Davoodi,Maleknaz Nayebi
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Paper published in COMPSAC 2025

点击查看摘要

Abstract:As Artificial Intelligence (AI) becomes increasingly embedded in healthcare technologies, understanding the maturity of AI in patient-centric applications is critical for evaluating its trustworthiness, transparency, and real-world impact. In this study, we investigate the maturity of AI feature integration in 116 patient-centric healthcare applications. Using Large Language Models (LLMs), we extracted key functional features, which were then categorized into different stages of the Gartner AI maturity model. Our results show that over 86.21% of applications remain at the early stages of AI integration, while only 13.79% demonstrate advanced AI integration.
zh

[AI-16] Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image

【速读】:该论文旨在解决疼痛评估中主观性强、缺乏连续监测能力的问题,从而提升疼痛管理的客观性与精准度。其关键解决方案是提出一种基于皮肤电活动(Electrodermal Activity, EDA)信号的多表示融合分析流程,通过创建并可视化多种信号表示形式(如波形图),在单一多表示图中进行联合分析,结合多种处理与滤波技术及表示组合方式,实现对疼痛状态的更可靠识别,其性能优于传统融合方法,在多个场景下展现出更强的鲁棒性和有效性。

链接: https://arxiv.org/abs/2507.21881
作者: Stefanos Gkikas,Ioannis Kyprakis,Manolis Tsiknakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual's condition. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.
zh

[AI-17] ny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

【速读】:该论文旨在解决疼痛评估中缺乏客观、连续监测手段的问题,以提升临床决策支持与患者管理效率。其核心挑战在于如何从多模态生理信号(如电皮肤活动、血容脉搏、呼吸信号和外周血氧饱和度)中提取高精度特征,实现自动化的疼痛识别。解决方案的关键是提出了一种轻量级预训练嵌入模型Tiny-BioMoE,该模型基于440万张生物信号图像表示进行训练,仅含730万参数,能够高效生成高质量嵌入向量用于下游任务;实验表明其在多种单模态及多模态组合下均表现出优异的自动疼痛识别性能,为下一代疼痛评估系统提供了可扩展、实用的工具。

链接: https://arxiv.org/abs/2507.21875
作者: Stefanos Gkikas,Ioannis Kyprakis,Manolis Tsiknakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person's state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed approach introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis. Trained on 4.4 million biosignal image representations and consisting of only 7.3 million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model's effectiveness across diverse modalities in automatic pain recognition tasks. The model's architecture (code) and weights are available at this https URL.
zh

[AI-18] A Neuro-Symbolic Approach for Probabilistic Reasoning on Graph Data

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理图结构数据时缺乏符号领域知识融合能力与复杂推理支持的问题,同时克服传统关系贝叶斯网络(Relational Bayesian Networks, RBNs)在表示学习方面的局限性。其核心解决方案是提出一种神经符号框架,将GNN无缝集成到RBN中,从而结合GNN的强大学习能力与RBN的灵活符号建模和概率推理优势。关键创新在于两种实现方式:一是直接将GNN编译为RBN原生语言以保持语义一致性,二是保留GNN作为外部组件并确保与RBN范式的完全对齐;此外,还设计了最大后验(MAP)推理方法用于此类混合模型,显著提升了在节点分类和多目标网络优化等任务中的性能与可解释性。

链接: https://arxiv.org/abs/2507.21873
作者: Raffaele Pojer,Andrea Passerini,Kim G. Larsen,Manfred Jaeger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to the Journal of Artificial Intelligence Research (JAIR); under revision. 29 pages, 6 figures. Code available at this https URL

点击查看摘要

Abstract:Graph neural networks (GNNs) excel at predictive tasks on graph-structured data but often lack the ability to incorporate symbolic domain knowledge and perform general reasoning. Relational Bayesian Networks (RBNs), in contrast, enable fully generative probabilistic modeling over graph-like structures and support rich symbolic knowledge and probabilistic inference. This paper presents a neuro-symbolic framework that seamlessly integrates GNNs into RBNs, combining the learning strength of GNNs with the flexible reasoning capabilities of RBNs. We develop two implementations of this integration: one compiles GNNs directly into the native RBN language, while the other maintains the GNN as an external component. Both approaches preserve the semantics and computational properties of GNNs while fully aligning with the RBN modeling paradigm. We also propose a maximum a-posteriori (MAP) inference method for these neuro-symbolic models. To demonstrate the framework's versatility, we apply it to two distinct problems. First, we transform a GNN for node classification into a collective classification model that explicitly models homo- and heterophilic label patterns, substantially improving accuracy. Second, we introduce a multi-objective network optimization problem in environmental planning, where MAP inference supports complex decision-making. Both applications include new publicly available benchmark datasets. This work introduces a powerful and coherent neuro-symbolic approach to graph data, bridging learning and reasoning in ways that enable novel applications and improved performance across diverse tasks.
zh

[AI-19] MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

【速读】:该论文旨在解决自动驾驶系统中因真实世界数据长尾分布导致的泛化能力不足问题,特别是对罕见但安全关键的车辆类别检测性能差的问题。解决方案的关键在于提出MultiEditor,一个基于双分支潜在扩散框架的图像与LiDAR点云联合编辑方法;其核心创新是引入3D高斯泼溅(3D Gaussian Splatting, 3DGS)作为目标物体的结构和外观先验,并设计多层级外观控制机制(包括像素级贴合、语义级引导和多分支精修),同时提出一种深度引导的可变形跨模态条件模块,利用3DGS渲染的深度图自适应实现模态间相互指导,从而显著提升跨模态一致性与重建保真度。

链接: https://arxiv.org/abs/2507.21872
作者: Shouyi Lu,Zihan Lin,Chao Lu,Huanran Wang,Guirong Zhuo,Lianqing Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds in driving scenarios jointly. At the core of our approach is introducing 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism–comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement–to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.
zh

[AI-20] EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO)算法在训练过程中因组内奖励稀疏且相同而导致的优势坍缩(advantage collapse)问题,该问题会削弱强化学习信号的有效性并影响模型性能。解决方案的关键在于提出EDGE-GRPO算法,其核心创新包括两个方面:一是引入基于熵驱动的优势计算(Entropy-Driven Advantage),通过细粒度样本级别的策略熵来动态调整优势估计,从而缓解奖励一致性引发的梯度失效;二是采用引导式误差修正(Guided Error Correction),利用内部反馈机制增强训练信号的多样性与准确性,有效提升策略优化的稳定性与收敛性。

链接: https://arxiv.org/abs/2507.21848
作者: Xingjian Zhang,Siwei Wen,Wenjun Wu,Lei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. It is available at this https URL.
zh
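
下面用 numpy 演示 GRPO 的组内相对优势计算,以及组内奖励完全相同导致“优势坍缩”时,用样本级策略熵打破平局的一种可能做法;这只是对“熵驱动优势”思想的示意性解读,具体调制公式与符号约定以论文原文为准。

```python
import numpy as np

def grpo_advantages(rewards, entropies, eps=1e-6):
    """Group-relative advantages with an entropy-driven tie-breaker.

    `rewards`: scalar reward per response in one group.
    `entropies`: mean token-level policy entropy per response.
    EDGE-GRPO's exact modulation differs; this only illustrates how
    sample-level entropy can break ties when all rewards are equal.
    """
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)
    if np.allclose(r, r[0]):                     # advantage collapse: identical rewards
        h = np.asarray(entropies, dtype=float)
        adv = -(h - h.mean()) / (h.std() + eps)  # prefer low-entropy (confident) samples
    return adv

print(grpo_advantages([1, 1, 1, 1], [0.2, 0.9, 0.5, 0.4]))
```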

[AI-21] Probabilistic Active Goal Recognition KR2025

【速读】:该论文旨在解决多智能体环境中观测者如何通过主动信息收集来提升对其他智能体隐藏目标的推理能力问题,即从传统的被动目标识别(Passive Goal Recognition)向主动目标识别(Active Goal Recognition, AGR)演进。其解决方案的关键在于提出一个融合联合信念更新机制(joint belief update mechanism)与蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的集成框架,使观测者能够在无需领域知识的前提下高效规划行动并推断目标状态,从而显著降低不确定性。实证结果表明,该方法在网格环境中的表现优于传统被动识别方法,且其通用性MCTS策略可媲美强领域的贪婪基线,为构建更具交互性和适应性的多智能体系统提供了可靠的技术路径。

链接: https://arxiv.org/abs/2507.21846
作者: Chenyuan Zhang,Cristian Rojas Cardenas,Hamid Rezatofighi,Mor Vered,Buser Say
机构: 未知
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: Accepted by KR2025

点击查看摘要

Abstract:In multi-agent environments, effective interaction hinges on understanding the beliefs and intentions of other agents. While prior work on goal recognition has largely treated the observer as a passive reasoner, Active Goal Recognition (AGR) focuses on strategically gathering information to reduce uncertainty. We adopt a probabilistic framework for Active Goal Recognition and propose an integrated solution that combines a joint belief update mechanism with a Monte Carlo Tree Search (MCTS) algorithm, allowing the observer to plan efficiently and infer the actor’s hidden goal without requiring domain-specific knowledge. Through comprehensive empirical evaluation in a grid-based domain, we show that our joint belief update significantly outperforms passive goal recognition, and that our domain-independent MCTS performs comparably to our strong domain-specific greedy baseline. These results establish our solution as a practical and robust framework for goal inference, advancing the field toward more interactive and adaptive multi-agent systems.
zh
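
下面给出主动目标识别中“信念更新”的贝叶斯骨架:观察者对候选目标维护后验分布,每收到一次行动观测就按似然更新;似然函数此处为笔者假设的简化形式,MCTS 规划部分从略。

```python
import numpy as np

goals = ["roomA", "roomB", "roomC"]
belief = np.ones(len(goals)) / len(goals)     # uniform prior over hidden goals

def likelihood(obs_action, goal):
    # P(observed action | goal): hypothetical noisy-rational model -- an actor
    # pursuing a goal is more likely to take actions that approach it.
    return 0.8 if obs_action == f"move_toward_{goal}" else 0.1

for obs in ["move_toward_roomB", "move_toward_roomB", "move_toward_roomA"]:
    belief *= np.array([likelihood(obs, g) for g in goals])   # Bayes rule
    belief /= belief.sum()                                    # normalise
    print(dict(zip(goals, belief.round(3))))
# The observer would hand this posterior to MCTS to pick the next
# information-gathering action (the planning part is omitted here).
```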

[AI-22] Against racing to AGI: Cooperation deterrence and catastrophic risks

【速读】:该论文试图解决的问题是“AGI Racing”(人工智能通用智能竞赛)是否符合各国或主要AI发展实体的自利逻辑,即是否应通过加速前沿AI研发以率先实现人工通用智能(AGI)来获取战略优势。论文指出,这种竞速策略会显著增加 catastrophic risks(灾难性风险),如核不稳定性,并削弱技术AI安全研究的有效性;同时其预期收益可能被高估,尤其“获胜方能否完全主导失败方”存疑。解决方案的关键在于转向国际协作与协调机制,辅以审慎设计的威慑措施,这不仅能大幅降低风险,还能实现AGI竞速所声称的大部分利益,因此更符合各方长期利益。

链接: https://arxiv.org/abs/2507.21839
作者: Leonard Dung,Max Hellrigel-Holderbaum
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AGI Racing is the view that it is in the self-interest of major actors in AI development, especially powerful nations, to accelerate their frontier AI development to build highly capable AI, especially artificial general intelligence (AGI), before competitors have a chance. We argue against AGI Racing. First, the downsides of racing to AGI are much higher than portrayed by this view. Racing to AGI would substantially increase catastrophic risks from AI, including nuclear instability, and undermine the prospects of technical AI safety research to be effective. Second, the expected benefits of racing may be lower than proponents of AGI Racing hold. In particular, it is questionable whether winning the race enables complete domination over losers. Third, international cooperation and coordination, and perhaps carefully crafted deterrence measures, constitute viable alternatives to racing to AGI which have much smaller risks and promise to deliver most of the benefits that racing to AGI is supposed to provide. Hence, racing to AGI is not in anyone’s self-interest as other actions, particularly incentivizing and seeking international cooperation around AI issues, are preferable.
zh

[AI-23] Analysis of Fourier Neural Operators via Effective Field Theory

【速读】:该论文旨在解决傅里叶神经算子(Fourier Neural Operators, FNOs)在高维偏微分方程(PDE)求解中缺乏理论解释的问题,特别是其稳定性、泛化能力和频率行为的机制不明确。解决方案的关键在于首次对FNOs在无限维函数空间中进行有效场论(effective-field-theory)分析,推导出层核和四点顶点的闭合递推关系,并在此基础上揭示非线性激活函数会将低频输入耦合至高频模式,这些高频模式通常因谱截断而被丢弃;同时,对于宽网络,理论给出了权重初始化集合的临界条件,以确保小输入扰动在整个深度上保持均匀尺度,从而提升模型稳定性和特征学习能力。

链接: https://arxiv.org/abs/2507.21833
作者: Taeyoung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 37 pages, 10 figures

点击查看摘要

Abstract:Fourier Neural Operators (FNOs) have emerged as leading surrogates for high-dimensional partial-differential equations, yet their stability, generalization and frequency behavior lack a principled explanation. We present the first systematic effective-field-theory analysis of FNOs in an infinite-dimensional function space, deriving closed recursion relations for the layer kernel and four-point vertex and then examining three practically important settings: analytic activations, scale-invariant cases and architectures with residual connections. The theory shows that nonlinear activations inevitably couple frequency inputs to high-frequency modes that are otherwise discarded by spectral truncation, and experiments confirm this frequency transfer. For wide networks we obtain explicit criticality conditions on the weight-initialization ensemble that keep small input perturbations at a uniform scale across depth, and empirical tests validate these predictions. Taken together, our results quantify how nonlinearity enables neural operators to capture non-trivial features, supply criteria for hyper-parameter selection via criticality analysis, and explain why scale-invariant activations and residual connections enhance feature learning in FNOs.
zh
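
下面用 PyTorch 写一个最小的一维傅里叶谱卷积层,显式体现“谱截断只保留低频模式”,并用 GELU 演示非线性如何把能量重新耦合到被截断的高频——即论文分析的频率迁移现象;参数化取常见的 FNO 写法,并非论文推导所依赖的特定初始化系综。

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Minimal 1-D Fourier layer: mix channels in frequency space and keep
    only the lowest `modes` frequencies (spectral truncation)."""
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        self.w = nn.Parameter(torch.randn(channels, channels, modes,
                                          dtype=torch.cfloat) / channels)

    def forward(self, x):                        # x: (batch, channels, grid)
        xf = torch.fft.rfft(x)                   # real FFT along the grid
        out = torch.zeros_like(xf)
        out[..., :self.modes] = torch.einsum(
            "bim,iom->bom", xf[..., :self.modes], self.w)
        return torch.fft.irfft(out, n=x.size(-1))

layer, act = SpectralConv1d(4, modes=8), nn.GELU()
x = torch.randn(2, 4, 64)
y = act(layer(x))
# GELU re-populates frequencies above `modes` that the layer truncated:
# the frequency-transfer effect the paper analyses.
hi = torch.fft.rfft(y)[..., 8:].abs().mean().item()
print(f"mean magnitude of truncated frequencies after GELU: {hi:.4f}")
```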

[AI-24] DualSG: A Dual-Stream Explicit Semantic-Guided Multivariate Time Series Forecasting Framework ACM-MM2025

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)进行多变量时间序列预测(Multivariate Time Series Forecasting, MTSF)时存在的两个核心问题:一是将LLMs直接作为端到端预测器导致数值精度下降,且迫使LLMs处理其设计初衷之外的模式;二是现有方法依赖隐式对齐文本与时间序列模态在潜在空间中的关系,常面临对齐困难。解决方案的关键在于提出DualSG框架,该框架采用双流结构,将LLM定位为语义引导模块(Semantic Guide),而非独立预测器,从而通过显式的语义指导来修正传统数值预测结果。其中,Time Series Caption作为显式提示格式,以自然语言总结趋势模式并提供可解释的上下文,替代了隐式对齐机制;同时引入的caption-guided融合模块能够显式建模变量间关系,降低噪声和计算开销。实验表明,DualSG在多个真实世界数据集上显著优于15个先进基线方法,验证了数值预测与语义引导显式结合的有效性。

链接: https://arxiv.org/abs/2507.21830
作者: Kuiye Ding,Fanda Fan,Yao Wang,Ruijie jian,Xiaorui Wang,Luqi Gong,Yishan Jiang,Chunjie Luo,Jianfeng Zhan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ACM Multimedia 2025 (ACM MM 2025)

点击查看摘要

Abstract:Multivariate Time Series Forecasting plays a key role in many applications. Recent works have explored using Large Language Models for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance.
zh
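
下面用纯 Python 示意 Time Series Caption 的思路:用规则从数值窗口提取趋势要素,生成一句自然语言摘要,作为显式提示供 LLM 语义引导流使用;模板与阈值均为笔者假设,论文的实际字幕格式可能不同。

```python
import numpy as np

def caption(series, name="load"):
    """Turn a numeric window into a short natural-language trend summary.

    A rule-based stand-in for DualSG's Time Series Caption; thresholds and
    wording are illustrative only.
    """
    x = np.asarray(series, dtype=float)
    slope = np.polyfit(np.arange(x.size), x, 1)[0]
    trend = "rising" if slope > 0.05 else "falling" if slope < -0.05 else "flat"
    vol = "volatile" if x.std() > 0.5 * abs(x.mean() + 1e-8) else "stable"
    return (f"Over the last {x.size} steps, {name} is {trend} and {vol}, "
            f"moving from {x[0]:.2f} to {x[-1]:.2f} (peak {x.max():.2f}).")

print(caption([1.0, 1.2, 1.1, 1.6, 1.9, 2.3], name="electricity load"))
# The caption goes into the LLM prompt, while a conventional numerical model
# produces the raw forecast that the LLM then refines.
```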

[AI-25] An Agent ic AI for a New Paradigm in Business Process Development

【速读】:该论文旨在解决传统工业自动化中业务流程设计依赖任务导向、缺乏灵活性与智能性的局限性问题。其核心解决方案是提出一种基于代理(Agent)的业务流程设计方法,将流程组织的核心从任务转向目标(goal)、业务对象(business object)和代理(agent),通过多代理协作实现复杂目标的分解与达成,从而在动态工业环境中实现模块化、灵活且情境感知的自动化。

链接: https://arxiv.org/abs/2507.21823
作者: Mohammad Azarijafari,Luisa Mich,Michele Missikoff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Intelligence agents represent the next major revolution in the continuous technological evolution of industrial automation. In this paper, we introduce a new approach for business process design and development that leverages the capabilities of Agentic AI. Departing from the traditional task-based approach to business process design, we propose an agent-based method, where agents contribute to the achievement of business goals, identified by a set of business objects. When a single agent cannot fulfill a goal, we have a merge goal that can be achieved through the collaboration of multiple agents. The proposed model leads to a more modular and intelligent business process development by organizing it around goals, objects, and agents. As a result, this approach enables flexible and context-aware automation in dynamic industrial environments.
zh

[AI-26] Unlocking Interpretability for RF Sensing: A Complex-Valued White-Box Transformer

【速读】:该论文旨在解决深度无线感知(Deep Wireless Sensing, DWS)模型普遍存在的黑箱特性导致的可解释性不足问题,这限制了其在安全敏感物理应用场景中的泛化能力和可信度。解决方案的关键在于提出RF-CRATE,这是首个基于复数稀疏率缩减原理的数学可解释深度网络架构,通过非平凡的理论推导将原始实值白盒Transformer扩展至复数域,并利用CR-Calculus框架构建了完整的复数域白盒Transformer模块(包括自注意力机制和残差多层感知机),同时引入子空间正则化策略提升有限无线数据下的特征判别能力,从而在保持与传统黑盒模型相当性能的同时实现全链条数学可解释性。

链接: https://arxiv.org/abs/2507.21799
作者: Xie Zhang,Yina Wang,Chenshu Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The empirical success of deep learning has spurred its application to the radio-frequency (RF) domain, leading to significant advances in Deep Wireless Sensing (DWS). However, most existing DWS models function as black boxes with limited interpretability, which hampers their generalizability and raises concerns in security-sensitive physical applications. In this work, inspired by the remarkable advances of white-box transformers, we present RF-CRATE, the first mathematically interpretable deep network architecture for RF sensing, grounded in the principles of complex sparse rate reduction. To accommodate the unique RF signals, we conduct non-trivial theoretical derivations that extend the original real-valued white-box transformer to the complex domain. By leveraging the CR-Calculus framework, we successfully construct a fully complex-valued white-box transformer with theoretically derived self-attention and residual multi-layer perceptron modules. Furthermore, to improve the model’s ability to extract discriminative features from limited wireless data, we introduce Subspace Regularization, a novel regularization strategy that enhances feature diversity, resulting in an average performance improvement of 19.98% across multiple sensing tasks. We extensively evaluate RF-CRATE against seven baselines with multiple public and self-collected datasets involving different RF signals. The results show that RF-CRATE achieves performance on par with thoroughly engineered black-box models, while offering full mathematical interpretability. More importantly, by extending CRATE to the complex domain, RF-CRATE yields substantial improvements, achieving an average classification gain of 5.08% and reducing regression error by 10.34% across diverse sensing tasks compared to CRATE. RF-CRATE is fully open-sourced at: this https URL.
zh

[AI-27] MoDeSuite: Robot Learning Task Suite for Benchmarking Mobile Manipulation with Deformable Objects

【速读】:该论文旨在解决现有机器人学习算法在处理可变形物体(deformable objects)时面临的挑战,尤其是移动操作(mobile manipulation)任务中缺乏标准化评估基准的问题。当前多数基准仅针对刚性物体设计,难以全面衡量机器人在复杂现实场景下对柔性材料的操控能力。解决方案的关键在于提出首个专门面向移动操作可变形物体的任务套件——MoDeSuite,其包含八个不同类型的移动操作任务,覆盖弹性体与可变形物体,每个任务均源自真实应用场景,并要求机器人基座与机械臂协同工作,同时利用物体的形变特性完成目标。该套件不仅用于评估强化学习和模仿学习算法的性能,还通过在Spot机器人上的实机部署验证了从仿真到现实的迁移潜力,为该领域研究提供了新的方向与工具。

链接: https://arxiv.org/abs/2507.21796
作者: Yuying Zhang,Kevin Sebastian Luck,Francesco Verdoja,Ville Kyrki,Joni Pajarinen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile manipulation is a critical capability for robots operating in diverse, real-world environments. However, manipulating deformable objects and materials remains a major challenge for existing robot learning algorithms. While various benchmarks have been proposed to evaluate manipulation strategies with rigid objects, there is still a notable lack of standardized benchmarks that address mobile manipulation tasks involving deformable objects. To address this gap, we introduce MoDeSuite, the first Mobile Manipulation Deformable Object task suite, designed specifically for robot learning. MoDeSuite consists of eight distinct mobile manipulation tasks covering both elastic objects and deformable objects, each presenting a unique challenge inspired by real-world robot applications. Success in these tasks requires effective collaboration between the robot's base and manipulator, as well as the ability to exploit the deformability of the objects. To evaluate and demonstrate the use of the proposed benchmark, we train two state-of-the-art reinforcement learning algorithms and two imitation learning algorithms, highlighting the difficulties encountered and showing their performance in simulation. Furthermore, we demonstrate the practical relevance of the suite by deploying the trained policies directly into the real world with the Spot robot, showcasing the potential for sim-to-real transfer. We expect that MoDeSuite will open a novel research domain in mobile manipulation involving deformable objects. Find more details, code, and videos at this https URL.
zh

[AI-28] Hybrid Causal Identification and Causal Mechanism Clustering

【速读】:该论文旨在解决多环境条件下观测数据中异质因果关系(heterogeneous causality)的识别问题,即在不同环境下因果机制可能发生变化时,如何准确推断变量间的因果方向。传统基于加性噪声模型(Additive Noise Model, ANM)的方法通常假设单一因果机制,难以刻画真实世界中复杂的因果异质性。其解决方案的关键在于提出混合条件变分因果推理模型(Mixture Conditional Variational Causal Inference, MCVCI),该模型利用混合高斯分布与神经网络结合的优势,通过混合条件变分自编码器的概率边界似然作为因果决策准则,从而实现对异质因果结构的有效建模;进一步地,将因果异质性显式建模为聚类数量,并提出混合条件变分因果聚类(MCVCC)方法,可揭示不同因果机制的表达模式,显著优于现有最优方法。

链接: https://arxiv.org/abs/2507.21792
作者: Saixiong Liu,Yuhua Qian,Jue Li,Honghong Cheng,Feijiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bivariate causal direction identification is a fundamental and vital problem in the causal inference field. Among binary causal methods, most methods based on additive noise only use one single causal mechanism to construct a causal model. In the real world, observations are always collected in different environments with heterogeneous causal relationships. Therefore, this paper proposes a Mixture Conditional Variational Causal Inference model (MCVCI) to infer heterogeneous causality from observational data. Specifically, according to the identifiability of the Hybrid Additive Noise Model (HANM), MCVCI combines the superior fitting capabilities of the Gaussian mixture model and the neural network and elegantly uses the likelihoods obtained from the probabilistic bounds of the mixture conditional variational auto-encoder as causal decision criteria. Moreover, we model the causal heterogeneity into cluster numbers and propose the Mixture Conditional Variational Causal Clustering (MCVCC) method, which can reveal causal mechanism expression. Compared with state-of-the-art methods, the comprehensive best performance demonstrates the effectiveness of the methods proposed in this paper on several simulated and real data.
zh
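
MCVCI 的可识别性建立在(混合)加性噪声模型之上。作为铺垫,下面演示单机制 ANM 的经典方向判别:分别拟合 Y=f(X)+ε 与 X=g(Y)+ε,残差与输入依赖性更弱的方向更可能是真实因果方向。这只是底层原理的玩具演示,论文实际以混合条件变分自编码器的边界似然为判据。

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 2000)
y = x ** 3 + 0.5 * rng.standard_normal(2000)   # ground truth: X -> Y

def dependence_score(cause, effect, deg=3):
    # Fit a polynomial ANM effect = f(cause) + residual, then measure how
    # dependent the residual magnitude still is on the cause magnitude
    # (a crude substitute for a proper independence test such as HSIC).
    f = Polynomial.fit(cause, effect, deg)
    resid = effect - f(cause)
    return abs(np.corrcoef(np.abs(cause), np.abs(resid))[0, 1])

s_xy = dependence_score(x, y)   # hypothesis: X -> Y
s_yx = dependence_score(y, x)   # hypothesis: Y -> X
print(f"X->Y residual dependence {s_xy:.3f}, Y->X {s_yx:.3f}")
print("inferred direction:", "X->Y" if s_xy < s_yx else "Y->X")
```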

[AI-29] Proposing a Semantic Movie Recommendation System Enhanced by ChatGPT s NLP Results

【速读】:该论文旨在解决传统推荐系统在电影推荐中因依赖显式标签(如Genre)而导致个性化不足的问题,从而影响用户参与度和满意度。其解决方案的关键在于构建一个基于语义信息的知识图谱,利用大语言模型ChatGPT对电影简介进行情感倾向分析,提取更深层次的语义特征,进而提升推荐准确率。相较于仅使用出版商提供的显式类型标签,该方法能更有效地捕捉用户偏好与内容之间的潜在关联。

链接: https://arxiv.org/abs/2507.21770
作者: Ali Fallahi,Azam Bastanfard,Amineh Amini,Hadi Saboohi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: May 2023, 6 pages, 5 figures

点击查看摘要

Abstract:The importance of recommender systems on the web has grown, especially in the movie industry, with a vast selection of options to watch. To assist users in traversing available items and finding relevant results, recommender systems analyze operational data and investigate users' tastes and habits. Providing highly individualized suggestions can boost user engagement and satisfaction, which is one of the fundamental goals of the movie industry, especially on online platforms. According to recent studies and research, using knowledge-based techniques and considering the semantic ideas of the textual data is a suitable way to get more appropriate results. This study provides a new method for building a knowledge graph based on semantic information. It uses ChatGPT, as a large language model, to assess the brief descriptions of movies and extract their tone of voice. Results indicated that using the proposed method may significantly enhance accuracy rather than employing the explicit genres supplied by the publishers.
zh
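
下面示意“用 LLM 从电影简介抽取 tone of voice 并写入知识图谱”的流程,使用 OpenAI Python SDK 与 networkx;提示词、tone 标签集与图谱 schema 均为笔者假设,并非论文原始实现。

```python
import networkx as nx
from openai import OpenAI

client = OpenAI()   # requires OPENAI_API_KEY in the environment
TONES = ["dark", "uplifting", "suspenseful", "comedic", "melancholic"]

def extract_tone(overview: str) -> str:
    # Ask the LLM to pick one tone label for the movie's brief description.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Pick the single best tone of voice for this "
                              f"movie description from {TONES}. Reply with the "
                              f"label only.\n\n{overview}"}],
    )
    return resp.choices[0].message.content.strip().lower()

kg = nx.MultiDiGraph()
movie, overview = "The Long Night", "A detective unravels a conspiracy..."
tone = extract_tone(overview)
kg.add_edge(movie, tone, relation="hasTone")   # semantic edge used by the recommender
print(list(kg.edges(data=True)))
```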

[AI-30] owards a rigorous evaluation of RAG systems: the challenge of due diligence

【速读】:该论文旨在解决生成式 AI(Generative AI)在高风险领域如金融尽职调查中,检索增强生成(Retrieval-Augmented Generation, RAG)系统可靠性不足的问题,特别是其仍存在幻觉、答非所问、引用失败和回避回答等系统性错误。解决方案的关键在于提出一种结合人工标注与大语言模型判别器(LLM-Judge)的鲁棒评估协议,并借鉴预测驱动推理(Prediction Powered Inference, PPI)方法,实现具有统计保障的精确性能度量,从而为工业场景下RAG系统的可靠性和可扩展性评估提供标准化框架与数据支持。

链接: https://arxiv.org/abs/2507.21753
作者: Grégoire Martinon,Alexandra Lorenzo de Brionne,Jérôme Bohard,Antoine Lojou,Damien Hervault,Nicolas J-B. Brunel (ENSIIE, LaMME)
机构: 未知
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: in French language. EvalLLM2025: Workshop on Evaluation Generative Models (LLM) and Challenges, AMIAD, 2025, Marseille, France

点击查看摘要

Abstract:The rise of generative AI has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, combining language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used in due diligence for an investment fund. We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures, such as hallucinations, off-topic responses, failed citations, and abstentions. Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees. We provide a comprehensive dataset for further analysis. Our contributions aim to enhance the reliability and scalability of RAG systems evaluation protocols in industrial applications.
zh
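
论文借鉴的 Prediction Powered Inference(PPI)可概括为:用少量人工标注估计 LLM 裁判的系统性偏差,再以此校正大规模自动标注的统计量。下面按 PPI 均值估计量写一个 numpy 示意,用于估计幻觉率;数据为模拟,实践中校正项应在同一批人工复核样本上配对计算。

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setting: the LLM judge flags hallucinations with a slight bias.
N, n = 10_000, 200                       # cheap judge labels vs. costly human labels
judge_all = rng.binomial(1, 0.12, N)     # judge verdicts on the full corpus
judge_sub = rng.binomial(1, 0.12, n)     # judge verdicts on the audited subset
human = rng.binomial(1, 0.10, n)         # human gold labels (paired in practice)

# PPI mean estimator: judge-based mean plus a rectifier from the labeled subset.
rectifier = (human - judge_sub).mean()
theta_ppi = judge_all.mean() + rectifier
# The variance combines both terms, yielding valid confidence intervals.
se = np.sqrt(judge_all.var() / N + (human - judge_sub).var() / n)
print(f"estimated hallucination rate ~ {theta_ppi:.3f} +/- {1.96 * se:.3f}")
```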

[AI-31] SAT-Based Bounded Fitting for the Description Logic ALC ISWC2025

【速读】:该论文旨在解决在描述逻辑ALC及其语法片段中,从正例和负例数据中学习逻辑公式的**有界拟合(bounded fitting)**问题。其核心挑战在于如何在限制公式大小的前提下,高效地找到能够准确区分正负例的逻辑表达式,并保证学习过程具有概率上的可靠性。解决方案的关键在于:首先证明了所有研究的语法片段中的有界拟合问题是NP完全的,即使在仅有一个正例和一个负例的特殊情况下也是如此;其次,通过将有界拟合嵌入Valiant的PAC(Probably Approximately Correct)学习框架,为算法提供了概率保证,而其他传统概念学习方法无法做到这一点;最后,作者基于SAT求解器实现了ALC及其片段的有界拟合算法,并引入优化策略以提升效率,从而在实践中验证了该方法的有效性与竞争力。

链接: https://arxiv.org/abs/2507.21752
作者: Maurice Funk,Jean Christoph Jung,Tom Voellmer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, full version of paper accepted at ISWC 2025

点击查看摘要

Abstract:Bounded fitting is a general paradigm for learning logical formulas from positive and negative data examples, that has received considerable interest recently. We investigate bounded fitting for the description logic ALC and its syntactic fragments. We show that the underlying size-restricted fitting problem is NP-complete for all studied fragments, even in the special case of a single positive and a single negative example. By design, bounded fitting comes with probabilistic guarantees in Valiant’s PAC learning framework. In contrast, we show that other classes of algorithms for learning ALC concepts do not provide such guarantees. Finally, we present an implementation of bounded fitting in ALC and its fragments based on a SAT solver. We discuss optimizations and compare our implementation to other concept learning tools.
zh
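
有界拟合的骨架是“按公式大小迭代加深的 SAT 查询”。下面用 python-sat 给出该范式的框架代码:其中 encode_fitting 是假设的占位函数,负责把“存在一个大小 ≤ k、接受所有正例且拒绝所有负例的概念”编码为 CNF——真实的 ALC 编码正是论文的技术核心,远比占位符复杂。

```python
from pysat.solvers import Glucose3

def encode_fitting(k, positives, negatives):
    """HYPOTHETICAL placeholder: return (clauses, decode) such that the CNF is
    satisfiable iff some ALC concept of size <= k fits all examples.
    The real encoding over ALC syntax trees is the paper's technical core.
    """
    raise NotImplementedError

def bounded_fitting(positives, negatives, max_size=20):
    # Iterative deepening over the size bound: the first solution found is a
    # smallest fitting concept, which yields Occam-style PAC guarantees.
    for k in range(1, max_size + 1):
        clauses, decode = encode_fitting(k, positives, negatives)
        with Glucose3(bootstrap_with=clauses) as solver:
            if solver.solve():
                return decode(solver.get_model())
    return None   # no fitting concept within the size budget
```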

[AI-32] Zero-Shot Machine Unlearning with Proxy Adversarial Data Generation IJCAI2025

【速读】:该论文旨在解决零样本场景下的模型遗忘(zero-shot unlearning)问题,即在仅能访问待删除样本而无法获取剩余训练数据的情况下,如何有效实现模型对特定样本的影响消除,同时避免因参数调整导致的过遗忘(over-unlearning)现象。其解决方案的关键在于提出了一种名为ZS-PAG的新框架,核心创新包括:(1) 通过生成对抗样本近似不可获取的剩余数据;(2) 基于生成样本定位一个特定子空间进行遗忘操作,从而在零样本条件下防止性能下降;(3) 引入基于影响函数的伪标签策略,考虑遗忘过程对剩余样本的影响,进一步提升遗忘后模型的性能。该方法具备理论保证,并在多个基准数据集上验证了其有效性与优越性。

链接: https://arxiv.org/abs/2507.21738
作者: Huiqiang Chen,Tianqing Zhu,Xin Yu,Wanlei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model’s performance on the remaining data significantly drops due to the change in the model’s parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, therefore preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model’s performance after unlearning. The proposed method holds a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.
zh
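
下面用 PyTorch 演示“从遗忘样本出发生成对抗样本、以近似不可得的保留数据”这一步骤的一种常见实现(FGSM 式单步扰动);ZS-PAG 的实际生成与子空间定位流程更复杂,此处仅示意输入端。

```python
import torch
import torch.nn.functional as F

def adversarial_proxies(model, x_forget, y_forget, eps=0.1):
    """One-step (FGSM-style) perturbation of forget samples.

    Pushing forget samples across the decision boundary yields proxies that
    stand in for the (unavailable) remaining data; ZS-PAG's actual generation
    and subspace selection are more involved than this sketch.
    """
    x = x_forget.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_forget)
    loss.backward()
    # Move along the gradient sign to leave the forget classes' region.
    return (x + eps * x.grad.sign()).detach()

# Toy usage with a linear classifier on random data.
model = torch.nn.Linear(8, 3)
x_f, y_f = torch.randn(16, 8), torch.randint(0, 3, (16,))
proxies = adversarial_proxies(model, x_f, y_f)
print(proxies.shape)   # (16, 8) -- stand-ins for the remaining-data distribution
```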

[AI-33] GDAIP: A Graph-Based Domain Adaptive Framework for Individual Brain Parcellation

【速读】:该论文旨在解决跨数据集场景下个体脑分区(individual brain parcellation)学习中因域分布不一致(domain shift)导致的性能下降问题。现有深度学习方法通常假设数据分布一致,难以适应真实世界中的多源fMRI数据。其解决方案的关键在于提出图域自适应框架GDAIP,通过构建群体级与个体级脑图(brain graph),结合图注意力网络(Graph Attention Network, GAT)与基于最小最大熵(Minimax Entropy, MME)的域自适应机制,在无标签目标脑图顶点上进行预测熵的对抗优化,从而将参考图谱从群体级图结构迁移至个体级图结构,实现跨数据集条件下的高质量个体脑分区。

链接: https://arxiv.org/abs/2507.21727
作者: Jianfei Zhu,Haiqi Zhu,Shaohui Liu,Feng Jiang,Baichun Wei,Chunzhi Yi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent deep learning approaches have shown promise in learning individual brain parcellations from functional magnetic resonance imaging (fMRI). However, most existing methods assume consistent data distributions across domains and struggle with domain shifts inherent to real-world cross-dataset scenarios. To address this challenge, we propose Graph Domain Adaptation for Individual Parcellation (GDAIP), a novel framework that integrates Graph Attention Networks (GAT) with Minimax Entropy (MME)-based domain adaptation. We construct cross-dataset brain graphs at both the group and individual levels. By leveraging semi-supervised training and adversarial optimization of the prediction entropy on unlabeled vertices from the target brain graph, the reference atlas is adapted from the group-level brain graph to the individual brain graph, enabling individual parcellation under cross-dataset settings. We evaluated our method using parcellation visualization, the Dice coefficient, and functional homogeneity. Experimental results demonstrate that GDAIP produces individual parcellations with topologically plausible boundaries, strong cross-session consistency, and the ability to reflect functional organization.
zh
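
GDAIP 的域自适应核心是对目标图上无标签顶点的预测熵做对抗优化。下面给出经典 Minimax Entropy 训练信号的 PyTorch 骨架:梯度反转层使分类器与特征编码器对同一熵项分别做升、降方向的更新;与论文 GAT 编码器的具体接法为笔者假设的常见写法。

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def mme_loss(classifier, feats_unlabeled, lamb=0.1):
    # Entropy of predictions on unlabeled target vertices, computed on
    # gradient-reversed features: minimizing `-lamb * H` makes the classifier
    # ascend the entropy while the (reversed) encoder descends it.
    logits = classifier(GradReverse.apply(feats_unlabeled, lamb))
    p = F.softmax(logits, dim=1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return -lamb * entropy

# Toy usage: the features would come from the GAT over the individual brain graph.
classifier = torch.nn.Linear(32, 7)          # 7 hypothetical parcel labels
feats = torch.randn(100, 32, requires_grad=True)
mme_loss(classifier, feats).backward()       # added to the supervised loss
```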

[AI-34] Unrolling Dynamic Programming via Graph Filters

【速读】:该论文旨在解决动态规划(Dynamic Programming, DP)在大规模状态-动作空间或存在长期依赖问题时计算成本过高的难题。其核心挑战在于传统策略迭代(Policy Iteration)等方法因收敛速度慢、内存需求高而难以高效求解贝尔曼最优方程(Bellman’s optimality equations)。解决方案的关键是提出一种可学习的参数化模型——BellNet,该模型将策略迭代过程展开并截断为一个神经网络结构,并通过最小化贝尔曼误差(Bellman error)从随机值函数初始化中进行训练;同时,作者借助图信号处理(Graph Signal Processing)视角,将马尔可夫决策过程(Markov Decision Process, MDP)的转移概率矩阵视为加权有向图的邻接矩阵,从而将BellNet解释为一系列非线性图滤波器的级联,实现了对策略与价值迭代的统一建模,并在推理阶段显式控制复杂度。

链接: https://arxiv.org/abs/2507.21705
作者: Sergio Rozada,Samuel Rey,Gonzalo Mateos,Antonio G. Marques
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic programming (DP) is a fundamental tool used across many engineering fields. The main goal of DP is to solve Bellman’s optimality equations for a given Markov decision process (MDP). Standard methods like policy iteration exploit the fixed-point nature of these equations to solve them iteratively. However, these algorithms can be computationally expensive when the state-action space is large or when the problem involves long-term dependencies. Here we propose a new approach that unrolls and truncates policy iterations into a learnable parametric model dubbed BellNet, which we train to minimize the so-termed Bellman error from random value function initializations. Viewing the transition probability matrix of the MDP as the adjacency of a weighted directed graph, we draw insights from graph signal processing to interpret (and compactly re-parameterize) BellNet as a cascade of nonlinear graph filters. This fresh look facilitates a concise, transferable, and unifying representation of policy and value iteration, with an explicit handle on complexity during inference. Preliminary experiments conducted in a grid-like environment demonstrate that BellNet can effectively approximate optimal policies in a fraction of the iterations required by classical methods.
zh
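
BellNet 的思想是把迭代式动态规划展开、截断成可学习的参数化网络,并以贝尔曼误差为训练目标。下面给出一个价值迭代风格的简化展开示意(每层一个可学习步长),在随机 MDP 上演示“从随机值函数初始化出发最小化贝尔曼残差”的训练回路;这并非论文的精确结构,仅说明 unrolling 的做法。

```python
import torch
import torch.nn as nn

S, A, gamma, K = 10, 3, 0.9, 8
P = torch.rand(A, S, S); P /= P.sum(-1, keepdim=True)   # random row-stochastic MDP
r = torch.rand(S, A)

class UnrolledVI(nn.Module):
    def __init__(self, K):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((K,), 0.5))  # learnable per-layer step sizes

    def forward(self, v):
        for a_k in self.alpha:                 # K truncated Bellman backups
            q = r + gamma * torch.einsum("ast,t->sa", P, v)
            v = (1 - a_k) * v + a_k * q.max(dim=1).values
        return v

net = UnrolledVI(K)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(200):
    v0 = torch.rand(S)                          # random value-function initialization
    v = net(v0)
    tv = (r + gamma * torch.einsum("ast,t->sa", P, v)).max(dim=1).values
    loss = ((v - tv) ** 2).mean()               # Bellman error
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final Bellman error: {loss.item():.4f}")
```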

[AI-35] A Multi-Agent Generative AI Framework for IC Module-Level Verification Automation

【速读】:该论文旨在解决芯片设计流程中验证(verification)环节的瓶颈问题,当前生成式 AI 技术在代码生成等任务上已有初步进展,但在复杂验证任务中仍处于探索阶段,且单一大语言模型(Large Language Model, LLM)方法存在能力局限。解决方案的关键在于提出一种多智能体验证框架(Multi-Agent Verification Framework, MAVF),通过多个专业化智能体的协同工作——包括规范解析、验证策略生成和代码实现——构建从设计规格到测试平台(testbench)的自动化转换系统,从而显著提升验证文档解析与生成以及测试平台自动化的效率与质量。

链接: https://arxiv.org/abs/2507.21694
作者: Wenbo Liu,Forbes Hou,Jon Zhang,Hong Liu,Allen Lei
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 20 pages, 12 figures. DVCon China 2025

点击查看摘要

Abstract:As large language models demonstrate enormous potential in the field of Electronic Design Automation (EDA), generative AI-assisted chip design is attracting widespread attention from academia and industry. Although these technologies have made preliminary progress in tasks such as code generation, their application in chip verification – a critical bottleneck in the chip development cycle – remains at an exploratory stage. This paper proposes an innovative Multi-Agent Verification Framework (MAVF) aimed at addressing the limitations of current single-LLM approaches in complex verification tasks. Our framework builds an automated transformation system from design specifications to testbench through the collaborative work of multiple specialized agents, including specification parsing, verification strategy generation, and code implementation. Through verification experiments on multiple chip modules of varying complexity, results show that MAVF significantly outperforms traditional manual methods and single-dialogue generative AI approaches in verification document parsing and generation, as well as automated testbench generation. This research opens new directions for exploring generative AI applications in verification automation, potentially providing effective approaches to solving the most challenging bottleneck issues in chip design.
zh

[AI-36] MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages ModelsPrompts and Scenarios

【速读】:该论文旨在解决生成式 AI (Generative AI) 在代码生成场景下带来的学术诚信与招聘公平性问题,即如何有效检测由大型语言模型(LLMs)生成的代码。其解决方案的关键在于构建了一个多语言、多场景的基准数据集 MultiAIGCD,涵盖 Python、Java 和 Go 三种编程语言,包含从问题描述生成代码、修复运行时错误和纠正输出错误等三种典型使用场景,共计 121,271 条 AI 生成代码与 32,148 条人工编写的代码片段,并在此基础上对三种前沿 AI 生成代码检测模型进行系统性评估,验证其在跨模型和跨语言场景下的鲁棒性。

链接: https://arxiv.org/abs/2507.21693
作者: Basak Demirok,Mucahid Kutlu,Selin Mergen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) rapidly advance, their role in code generation has expanded significantly. While this offers streamlined development, it also creates concerns in areas like education and job interviews. Consequently, developing robust systems to detect AI-generated code is imperative to maintain academic integrity and ensure fairness in hiring processes. In this study, we introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. From the CodeNet dataset’s problem definitions and human-authored codes, we generate several code samples in Java, Python, and Go with six different LLMs and three different prompts. This generation process covered three key usage scenarios: (i) generating code from problem descriptions, (ii) fixing runtime errors in human-written code, and (iii) correcting incorrect outputs. Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets. We also benchmark three state-of-the-art AI-generated code detection models and assess their performance in various test scenarios such as cross-model and cross-language. We share our dataset and codes to support research in this field.
zh

[AI-37] Can the current trends of AI handle a full course of mathematics?

【速读】:该论文试图解决的问题是:当前人工智能(Artificial Intelligence, AI)在承担大学水平数学课程全过程教学责任方面的可行性与局限性。研究从课程大纲制定、教学内容呈现、学生问题解答及评估设计四个关键维度进行系统评估,发现AI在组织结构和准确性方面表现较强,但在情感互动等人类特有的教学要素上仍存在显著不足。解决方案的关键在于整合人类教师与AI的优势潜力,通过人机协同的方式优化教学效果,从而更有效地实现大学数学课程的全流程构建与实施。

链接: https://arxiv.org/abs/2507.21664
作者: Mariam Alsayyad,Fayadh Kadhem
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); History and Overview (math.HO)
备注: 36 pages

点击查看摘要

Abstract:This paper addresses the question of how capable the current trends of Artificial Intelligence (AI) are of taking responsibility for a full course of mathematics at a college level. The study evaluates this ability in four significant aspects, namely, creating a course syllabus, presenting selected material, answering student questions, and creating an assessment. It shows that even though the AI is strong in some important parts like organization and accuracy, there are still some human aspects that are far away from the current abilities of AI. There is still a hidden emotional part, even in science, that cannot be fulfilled by the AI in its current state. This paper suggests some recommendations to integrate the human and AI potentials to create better outcomes in terms of reaching the target of creating a full course of mathematics, at a university level, as best as possible.
zh

[AI-38] AI Literacy as a Key Driver of User Experience in AI-Powered Assessment: Insights from Socratic Mind

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 工具在高等教育中日益普及背景下,学生如何与这些系统互动及其对学习效果影响的问题。研究发现,学生的AI素养(尤其是自我效能感、概念理解能力和应用技能)是预测其对Socratic Mind这一交互式AI辅助形成性评估工具的可用性感知、满意度和参与度的关键因素;而先前的AI使用经验则无显著影响。解决方案的关键在于:设计者应通过集成自适应引导和以用户为中心的功能来支持不同AI素养水平的学习者,从而构建更具包容性、激励性和有效性的AI赋能学习环境。

链接: https://arxiv.org/abs/2507.21654
作者: Meryem Yilmaz Soylu,Jeonghyun Lee,Jui-Tse Hung,Christopher Zhang Cui,David A. Joyner
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 34 pages, 1 figure, 3 tables

点击查看摘要

Abstract:As Artificial Intelligence (AI) tools become increasingly embedded in higher education, understanding how students interact with these systems is essential to supporting effective learning. This study examines how students’ AI literacy and prior exposure to AI technologies shape their perceptions of Socratic Mind, an interactive AI-powered formative assessment tool. Drawing on Self-Determination Theory and user experience research, we analyze relationships among AI literacy, perceived usability, satisfaction, engagement, and perceived learning effectiveness. Data from 309 undergraduates in Computer Science and Business courses were collected through validated surveys. Partial least squares structural equation modeling showed that AI literacy - especially self-efficacy, conceptual understanding, and application skills - significantly predicts usability, satisfaction, and engagement. Usability and satisfaction, in turn, strongly predict perceived learning effectiveness, while prior AI exposure showed no significant effect. These findings highlight that AI literacy, rather than exposure alone, shapes student experiences. Designers should integrate adaptive guidance and user-centered features to support diverse literacy levels, fostering inclusive, motivating, and effective AI-based learning environments.
zh

[AI-39] DGP: A Dual-Granularity Prompting Framework for Fraud Detection with Graph-Enhanced LLM s

【速读】:该论文旨在解决在异构欺诈检测图中,纯文本提示(text-only prompting)方法因多跳关系导致邻居节点信息呈指数级增长,从而引发提示内容冗长、无关信息淹没关键信号的问题。解决方案的关键在于提出双粒度提示(Dual Granularity Prompting, DGP),通过保留目标节点的细粒度文本细节,并对邻居信息采用不同模态的定制化摘要策略——即对文本字段进行双层语义抽象,对数值特征进行统计聚合——实现对冗余邻居内容的有效压缩,从而在有限的token预算内提升欺诈检测性能。

链接: https://arxiv.org/abs/2507.21653
作者: Yuan Li,Jun Hu,Bryan Hooi,Bingsheng He,Cheng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world fraud detection applications benefit from graph learning techniques that jointly exploit node features, often rich in textual data, and graph structural information. Recently, Graph-Enhanced LLMs have emerged as a promising graph learning approach that converts graph information into prompts, exploiting LLMs' ability to reason over both textual and structural information. Among them, text-only prompting, which converts graph information to prompts consisting solely of text tokens, offers a solution that relies only on LLM tuning without requiring additional graph-specific encoders. However, text-only prompting struggles on heterogeneous fraud-detection graphs: multi-hop relations expand exponentially with each additional hop, leading to rapidly growing neighborhoods associated with dense textual information. These neighborhoods may overwhelm the model with long, irrelevant content in the prompt and suppress key signals from the target node, thereby degrading performance. To address this challenge, we propose Dual Granularity Prompting (DGP), which mitigates information overload by preserving fine-grained textual details for the target node while summarizing neighbor information into coarse-grained text prompts. DGP introduces tailored summarization strategies for different data modalities, bi-level semantic abstraction for textual fields and statistical aggregation for numerical features, enabling effective compression of verbose neighbor content into concise, informative prompts. Experiments across public and industrial datasets demonstrate that DGP operates within a manageable token budget while improving fraud detection performance by up to 6.8% (AUPRC) over state-of-the-art methods, showing the potential of Graph-Enhanced LLMs for fraud detection.
zh
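
为直观理解 DGP 的双粒度提示构造,下面给出一个最小可运行的示意实现(笔者补充,非论文官方代码):目标节点保留细粒度完整文本,邻居的数值特征做统计聚合、文本字段以截断代替论文中的双层语义抽象;其中 `text`、`amount` 等字段名均为假设。

```python
import statistics

def build_dual_granularity_prompt(target, neighbors, max_neighbor_chars=60):
    # 细粒度:完整保留目标节点文本
    lines = [f"Target node: {target['text']}"]
    # 粗粒度:邻居数值特征做统计聚合(均值/标准差),避免逐条罗列
    amounts = [n["amount"] for n in neighbors]
    lines.append(
        f"{len(neighbors)} neighbors, amount mean={statistics.mean(amounts):.1f}, "
        f"std={statistics.pstdev(amounts):.1f}"
    )
    # 粗粒度:邻居文本仅保留截断摘要(示意,论文中为双层语义抽象)
    for n in neighbors[:3]:
        lines.append("- " + n["text"][:max_neighbor_chars])
    return "\n".join(lines)

prompt = build_dual_granularity_prompt(
    {"text": "User #42 requested 5 refunds within 10 minutes", "amount": 999.0},
    [{"text": "Merchant A, normal history", "amount": 120.0},
     {"text": "Merchant B, two chargebacks last week", "amount": 87.5}],
)
print(prompt)
```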

[AI-40] GUARD-CAN: Graph-Understanding and Recurrent Architecture for CAN Anomaly Detection

【速读】:该论文旨在解决车载网络中控制器局域网(Controller Area Network, CAN)因缺乏加密和身份认证而面临的多种网络攻击问题。其核心解决方案是提出一种名为GUARD-CAN的异常检测框架,该框架通过结合图表示学习与时间序列建模实现多维度异常识别:首先将CAN消息划分为固定长度的时间窗口,并构建保留消息顺序的图结构;随后利用过完备自动编码器(Overcomplete Autoencoder, AE)与图卷积网络(Graph Convolutional Network, GCN)生成图嵌入向量;再将这些向量序列输入门控循环单元(Gated Recurrent Unit, GRU)以捕捉跨窗口的时间异常模式。该方法无需复杂特征工程即可有效检测四类典型CAN攻击(泛洪、模糊、重放和伪造攻击),并支持序列级与窗口级双重异常检测,从而提升整体检测性能。

链接: https://arxiv.org/abs/2507.21640
作者: Hyeong Seon Kim,Huy Kang Kim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 3 tables; accepted to the 26th World Conference on Information Security Applications (WISA 2025)

点击查看摘要

Abstract:Modern in-vehicle networks face various cyber threats due to the lack of encryption and authentication in the Controller Area Network (CAN). To address this security issue, this paper presents GUARD-CAN, an anomaly detection framework that combines graph-based representation learning with time-series modeling. GUARD-CAN splits CAN messages into fixed-length windows and converts each window into a graph that preserves message order. To detect anomalies in a time-aware and structure-aware context within the same window, GUARD-CAN takes advantage of the overcomplete Autoencoder (AE) and Graph Convolutional Network (GCN) to generate graph embedding vectors. The model groups these vectors into sequences and feeds them into the Gated Recurrent Unit (GRU) to detect temporal anomaly patterns across the graphs. GUARD-CAN performs anomaly detection at both the sequence level and the window level, which allows multi-perspective performance evaluation. The model also verifies the importance of window size selection through an analysis based on Shannon entropy. As a result, GUARD-CAN shows that the proposed model detects four types of CAN attacks (flooding, fuzzing, replay and spoofing attacks) effectively without relying on complex feature engineering.
zh
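
GUARD-CAN 的“窗口图 → GCN 图嵌入 → GRU 时序检测”流水线可用如下 PyTorch 最小示意(笔者补充,省略了过完备自编码器,维度、窗口数与邻接构造均为假设):

```python
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    # 单层 GCN:行归一化的带自环邻接矩阵做一次消息传递
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, adj, x):
        a = adj + torch.eye(adj.size(0))            # 加自环
        deg = a.sum(1, keepdim=True)
        return torch.relu(self.lin((a / deg) @ x))  # 传播后线性变换

gcn, gru = TinyGCN(8, 16), nn.GRU(16, 32, batch_first=True)
head = nn.Linear(32, 1)

embeddings = []
for _ in range(5):                                # 5 个连续窗口,每窗口 10 条 CAN 报文
    adj = (torch.rand(10, 10) < 0.3).float()      # 保序图的邻接(此处随机示意)
    feats = torch.randn(10, 8)                    # 报文特征
    embeddings.append(gcn(adj, feats).mean(0))    # 均值池化得到图级嵌入

seq = torch.stack(embeddings).unsqueeze(0)        # (1, 5, 16) 的窗口嵌入序列
out, _ = gru(seq)
anomaly_score = torch.sigmoid(head(out[:, -1]))   # 序列级异常分数
print(anomaly_score.item())
```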

[AI-41] Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)研究中基准测试过于依赖游戏环境、难以直接迁移至真实世界具身应用的问题。为应对这一挑战,作者提出 Assistax——一个面向辅助机器人任务的开源基准平台,其核心创新在于采用多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架模拟机器人与主动人类患者之间的交互,并通过 JAX 的硬件加速实现物理仿真中的高效训练。关键解决方案包括:利用向量化训练显著提升计算效率(相比基于 CPU 的方案快达 370 倍),以及构建多样化伙伴代理群体以评估机器人在零样本协作场景下的泛化能力,从而为推进辅助机器人领域的 RL 研究提供可靠基准。

链接: https://arxiv.org/abs/2507.21638
作者: Leonard Hinckeldey,Elliot Fosong,Elle Miller,Rimvydas Rubavicius,Trevor McInroe,Patricia Wollstadt,Christiane B. Wiebel-Herboth,Subramanian Ramamoorthy,Stefano V. Albrecht
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted for the Coordination and Cooperation in Multi-Agent Reinforcement Learning Workshop at the Reinforcement Learning Conference 2025

点击查看摘要

Abstract:The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX’s hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to 370x faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent’s zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: this https URL.
zh

[AI-42] Self-Aware Safety Augmentation: Leverag ing Internal Semantic Understanding to Enhance Safety in Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对有害输入时的安全性问题,相较于仅处理文本的模型,LVLMs 更容易因视觉与语言信息的复杂交互而产生安全风险。其核心解决方案是提出 Self-Aware Safety Augmentation (SASA),关键在于利用模型中间层中蕴含的语义表征,将其投影到早期以安全为导向的层中,从而增强模型对潜在风险的识别能力,且无需额外微调。此外,通过线性探测(linear probing)解析模型内部语义理解状态,可在生成前实现风险预警,显著提升 LVLM 的安全性,同时保持任务性能不受明显影响。

链接: https://arxiv.org/abs/2507.21637
作者: Wanying Wang,Zeyu Ma,Han Zheng,Xin Tan,Mingang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACM Multimedia 2025

点击查看摘要

Abstract:Large vision-language models (LVLMs) are vulnerable to harmful input compared to their language-only backbones. We investigated this vulnerability by exploring LVLMs’ internal dynamics, framing their inherent safety understanding in terms of three key capabilities. Specifically, we define these capabilities as safety perception, semantic understanding, and alignment for linguistic expression, and experimentally pinpointed their primary locations within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, leading to a reduction in safety. Motivated by these findings, we propose Self-Aware Safety Augmentation (SASA), a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model’s inherent semantic understanding to enhance safety recognition without fine-tuning. Then, we employ linear probing to articulate the model’s internal semantic comprehension to detect the risk before the generation process. Extensive experiments on various datasets and tasks demonstrate that SASA significantly improves the safety of LVLMs, with minimal impact on the utility.
zh
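
SASA 在生成前用线性探测读取中间层语义表征以预警风险;该步骤可示意如下(笔者补充,此处用随机向量代替真实 LVLM 的中间层隐藏状态,标签亦为假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 假设已从某中间层抽取的隐藏状态(每条输入一个向量)及其风险标签
hidden_states = rng.normal(size=(200, 64))
risk_labels = rng.integers(0, 2, size=200)

# 训练线性探针:从内部语义表征中读出“是否有风险”
probe = LogisticRegression(max_iter=1000).fit(hidden_states, risk_labels)

# 推理时:在解码开始前,用探针对新输入的中间表征打分
new_hidden = rng.normal(size=(1, 64))
print("risk prob:", probe.predict_proba(new_hidden)[0, 1])
```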

[AI-43] StaffPro: an LLM Agent for Joint Staffing and Profiling

【速读】:该论文旨在解决劳动力管理中的两个紧密关联的核心问题:排班(staffing),即任务分配与人员调度,可能涉及团队组建;以及人员画像(profiling),即从非结构化数据中持续估计员工的技能、偏好等潜在属性。解决方案的关键在于提出了一种名为StaffPro的大型语言模型(Large Language Model, LLM)代理系统,其创新性地将排班决策与潜在特征估计通过形式化数学框架相耦合,并引入持续的人机反馈循环机制,使代理能够基于自然语言表达优化目标、处理文本任务描述,并在交互中实现对员工属性的终身学习与动态更新,从而保障长期最优的排班性能。

链接: https://arxiv.org/abs/2507.21636
作者: Alessio Maritan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents integrate pre-trained LLMs with modular algorithmic components and have shown remarkable reasoning and decision-making abilities. In this work, we investigate their use for two tightly intertwined challenges in workforce management: staffing, i.e., the assignment and scheduling of tasks to workers, which may require team formation; and profiling, i.e., the continuous estimation of workers’ skills, preferences, and other latent attributes from unstructured data. We cast these problems in a formal mathematical framework that links scheduling decisions to latent feature estimation, and we introduce StaffPro, an LLM agent that addresses staffing and profiling jointly. Unlike existing staffing solutions, StaffPro allows expressing optimization objectives using natural language, accepts textual task descriptions and provides high flexibility. StaffPro interacts directly with humans by establishing a continuous human-agent feedback loop, ensuring natural and intuitive use. By analyzing human feedback, our agent continuously estimates the latent features of workers, realizing life-long worker profiling and ensuring optimal staffing performance over time. A consulting firm simulation example demonstrates that StaffPro successfully estimates workers’ attributes and generates high quality schedules. With its innovative design, StaffPro offers a robust, interpretable, and human-centric solution for automated personnel management.
zh
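
排班与画像的耦合可以用一个极简循环示意(笔者补充,与论文的数学框架不同):按当前技能估计做最优任务指派,再用人类反馈以指数滑动平均更新画像:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

skills = np.array([[0.9, 0.2],     # 估计的 worker x task 胜任度矩阵
                   [0.4, 0.8]])
rows, cols = linear_sum_assignment(-skills)   # 取负号 -> 最大化总胜任度
print("assignment:", dict(zip(rows.tolist(), cols.tolist())))

# 画像更新:用观测到的任务完成质量(人类反馈)做指数滑动平均
feedback = {(0, 0): 0.7}                      # (worker, task) -> 质量评分,假设值
alpha = 0.3
for (w, t), q in feedback.items():
    skills[w, t] = (1 - alpha) * skills[w, t] + alpha * q
print(skills)
```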

[AI-44] “Teammates Am I Clear?”: Analysing Legible Behaviours in Teams

【速读】:该论文旨在解决多智能体系统中决策可读性(legibility)不足的问题,尤其是在团队协作场景下,现有研究大多局限于单个智能体与人类的交互,未能充分利用可读性在多智能体协同中的优势。解决方案的关键在于提出了一种将可读性决策扩展至多智能体环境的新方法,通过增强团队中智能体之间的可理解性,显著提升了团队整体协作性能。实验结果表明,在多智能体基准场景中,包含可读性智能体的团队能够超越仅由标准最优行为智能体组成的团队。

链接: https://arxiv.org/abs/2507.21631
作者: Miguel Faria,Francisco S. Melo,Ana Paiva
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:In this paper we investigate the notion of legibility in sequential decision-making in the context of teams and teamwork. Prior work has extended the notion of legibility to sequential decision-making, in both deterministic and stochastic scenarios. However, these works focus on one agent interacting with one human, forgoing the benefits of having legible decision making in teams of agents or in team configurations with humans. In this work we propose an extension of legible decision-making to multi-agent settings that improves the performance of agents working in collaboration. We showcase the performance of legible decision making in team scenarios using our proposed extension in multi-agent benchmark scenarios. We show that a team with a legible agent is able to outperform a team composed solely of agents with standard optimal behaviour.
zh
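
可读性决策的一个常见形式化是:动作应使观察者(队友)尽快推断出执行者的真实目标。下面用贝叶斯目标推断给出可读性评分的玩具示意(笔者补充,目标集合与似然均为假设):

```python
def legibility(action_likelihoods, true_goal, prior=None):
    """action_likelihoods: {goal: P(action | goal)};返回真实目标的后验概率。"""
    goals = list(action_likelihoods)
    prior = prior or {g: 1 / len(goals) for g in goals}
    unnorm = {g: action_likelihoods[g] * prior[g] for g in goals}
    z = sum(unnorm.values())
    return unnorm[true_goal] / z

# 动作 A 在真实目标 g1 下更“有辨识度”,因此可读性更高
print(legibility({"g1": 0.8, "g2": 0.1}, "g1"))  # ~0.89:队友容易猜中目标
print(legibility({"g1": 0.5, "g2": 0.5}, "g1"))  # 0.5:动作完全不可读
```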

[AI-45] Hierarchical Graph Neural Network for Compressed Speech Steganalysis

【速读】:该论文旨在解决深度学习(Deep Learning, DL)在隐写分析(Steganalysis)中面临的计算复杂度高以及跨数据集泛化能力弱的问题。其解决方案的关键在于首次将图神经网络(Graph Neural Network, GNN)——具体为GraphSAGE架构——引入到压缩IP语音(VoIP)流的隐写分析任务中,通过从VoIP信号中构建简单图结构并利用GraphSAGE捕捉多层次的隐写信息(包括细粒度特征与高层模式),从而显著提升检测精度与效率。实验表明,该方法在短样本(0.5秒)和低嵌入率等挑战条件下仍能实现超过98%的检测准确率,并且平均检测时间低至0.016秒,相较现有最优方法在准确率上提升2.8个百分点,在效率上优化0.003秒,展现出优异的在线实时处理潜力。

链接: https://arxiv.org/abs/2507.21591
作者: Mustapha Hemis,Hamza Kheddar,Mohamed Chahine Ghanem,Bachir Boudraa
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Steganalysis methods based on deep learning (DL) often struggle with computational complexity and challenges in generalizing across different datasets. Incorporating a graph neural network (GNN) into steganalysis schemes enables the leveraging of relational data for improved detection accuracy and adaptability. This paper presents the first application of a Graph Neural Network (GNN), specifically the GraphSAGE architecture, for steganalysis of compressed voice over IP (VoIP) speech streams. The method involves straightforward graph construction from VoIP streams and employs GraphSAGE to capture hierarchical steganalysis information, including both fine-grained details and high-level patterns, thereby achieving high detection accuracy. Experimental results demonstrate that the developed approach performs well in uncovering quantization index modulation (QIM)-based steganographic patterns in VoIP signals. It achieves detection accuracy exceeding 98 percent even for short 0.5 second samples, and 95.17 percent accuracy under challenging conditions with low embedding rates, representing an improvement of 2.8 percent over the best-performing state-of-the-art methods. Furthermore, the model exhibits superior efficiency, with an average detection time as low as 0.016 seconds for 0.5-second samples, an improvement of 0.003 seconds. This makes it efficient for online steganalysis tasks, providing a superior balance between detection accuracy and efficiency under the constraint of short samples with low embedding rates.
zh
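
GraphSAGE 的核心操作是“邻居聚合 + 与自身表征拼接 + 线性变换”;numpy 最小示意如下(笔者补充,采用均值聚合,特征与参数均为随机示例):

```python
import numpy as np

def graphsage_layer(x, neighbors, w):
    """x: (N, d) 节点特征;neighbors: 每个节点的邻居下标列表;w: (2d, d_out)。"""
    agg = np.stack([x[nb].mean(0) if nb else np.zeros(x.shape[1])
                    for nb in neighbors])          # 邻居均值聚合
    h = np.concatenate([x, agg], axis=1) @ w       # 拼接自身与邻居表征后变换
    return np.maximum(h, 0)                        # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 个 VoIP 流图节点(示意)
neighbors = [[1, 2], [0], [0, 3], [2]]
h = graphsage_layer(x, neighbors, rng.normal(size=(16, 8)))
print(h.shape)  # (4, 8):可再堆叠多层以捕捉更高层模式
```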

[AI-46] Exploring the Link Between Bayesian Inference and Embodied Intelligence: Toward Open Physical-World Embodied AI Systems

【速读】:该论文试图解决的问题是:尽管贝叶斯统计与具身智能(embodied intelligence)在概念上存在深刻联系,但贝叶斯方法并未被广泛或明确地应用于现代具身智能系统中,这限制了其在开放物理世界中的适应性与泛化能力。解决方案的关键在于从“搜索”和“学习”两个核心维度重新审视贝叶斯推理与当代具身智能方法的关系,揭示当前系统多局限于封闭物理环境的根本原因,并指出贝叶斯方法有望成为推动具身智能向真正开放物理世界演进的核心工具。

链接: https://arxiv.org/abs/2507.21589
作者: Bin Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:Embodied intelligence posits that cognitive capabilities fundamentally emerge from - and are shaped by - an agent’s real-time sensorimotor interactions with its environment. Such adaptive behavior inherently requires continuous inference under uncertainty. Bayesian statistics offers a principled probabilistic framework to address this challenge by representing knowledge as probability distributions and updating beliefs in response to new evidence. The core computational processes underlying embodied intelligence - including perception, action selection, learning, and even higher-level cognition - can be effectively understood and modeled as forms of Bayesian inference. Despite the deep conceptual connection between Bayesian statistics and embodied intelligence, Bayesian principles have not been widely or explicitly applied in today’s embodied intelligence systems. In this work, we examine both Bayesian and contemporary embodied intelligence approaches through two fundamental lenses: search and learning - the two central themes in modern AI, as highlighted in Rich Sutton’s influential essay “The Bitter Lesson”. This analysis sheds light on why Bayesian inference has not played a central role in the development of modern embodied intelligence. At the same time, it reveals that current embodied intelligence systems remain largely confined to closed-physical-world environments, and highlights the potential for Bayesian methods to play a key role in extending these systems toward truly open physical-world embodied intelligence.
zh
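
文中主张感知-行动循环可建模为贝叶斯推断;其最基本的形式是离散贝叶斯滤波的“预测 + 更新”两步,示意如下(笔者补充,转移与观测模型为玩具设定):

```python
import numpy as np

T = np.array([[0.8, 0.2],     # 状态转移 P(s'|s)
              [0.3, 0.7]])
O = np.array([[0.9, 0.1],     # 观测似然 P(o|s)
              [0.2, 0.8]])

belief = np.array([0.5, 0.5])          # 初始信念(均匀先验)
for obs in [0, 0, 1]:                  # 依次接收三个观测
    belief = T.T @ belief              # 预测步:按动力学前推信念
    belief = O[:, obs] * belief        # 更新步:乘以观测似然
    belief /= belief.sum()             # 归一化
    print(belief)
```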

[AI-47] SafeDriveRAG : Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation

【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在自动驾驶系统中安全感知、情境理解和路径规划等关键环节的评估不足问题,尤其是在交通安全隐患场景下的表现缺乏系统性衡量。其解决方案的关键在于构建首个大规模多模态问答基准SafeDrive228K(包含228K样本和18个子任务),并提出一种基于知识图谱的检索增强生成方法(SafeDriveRAG),该方法采用新颖的多尺度子图检索算法实现高效信息获取,并融合从互联网收集的交通安全规则知识,显著提升了VLM在事故识别、极端案例处理及交通常识推理等安全敏感任务中的性能表现,实验表明该方案在五种主流VLM上于交通事故、极端案例与交通安全常识任务中分别带来4.73%、8.79%与14.57%的性能提升。

链接: https://arxiv.org/abs/2507.21585
作者: Hao Ye,Mengshi Qi,Zhaohong Liu,Liang Liu,Huadong Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety for the autonomous driving system, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model’s capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at this https URL.
zh
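
多尺度子图检索的基本思路是:从查询命中的种子节点出发,按不同跳数半径抽取知识图谱子图,小半径取精确规则、大半径补充背景知识。纯 Python 的 BFS 示意如下(笔者补充,图结构与节点名均为假设):

```python
from collections import deque

def subgraph_within_hops(graph, seeds, radius):
    """graph: {node: [neighbor, ...]};返回 radius 跳内可达的节点集合。"""
    seen, q = set(seeds), deque((s, 0) for s in seeds)
    while q:
        node, d = q.popleft()
        if d == radius:
            continue
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return seen

kg = {"wet_road": ["hydroplaning"], "hydroplaning": ["reduce_speed"],
      "reduce_speed": ["safe_distance"]}
# 多尺度:逐级扩大半径,把不同粒度的子图拼入提示上下文
for r in (1, 2):
    print(r, subgraph_within_hops(kg, ["wet_road"], r))
```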

[AI-48] Finding Uncommon Ground: A Human-Centered Model for Extrospective Explanations IJCAI2023

【速读】:该论文旨在解决当前AI解释(AI Explanation)过于侧重模型内部机制、难以满足非专家用户需求的问题,即现有解释方法缺乏以人为本的视角。其解决方案的关键在于提出一种个性化解释(Personalized Explanation)方法,其中人工智能代理基于对用户偏好和交互历史的动态建模(即代理的世界观模型),来判断哪些信息对特定用户而言是新颖且相关的,并据此定制化地提供解释内容,从而提升解释的适用性与可理解性。

链接: https://arxiv.org/abs/2507.21571
作者: Laura Spillner,Nima Zargham,Mihai Pomarlan,Robert Porzel,Rainer Malaka
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Presented at the IJCAI 2023 Workshop on Explainable Artificial Intelligence (XAI)

点击查看摘要

Abstract:The need for explanations in AI has, by and large, been driven by the desire to increase the transparency of black-box machine learning models. However, such explanations, which focus on the internal mechanisms that lead to a specific output, are often unsuitable for non-experts. To facilitate a human-centered perspective on AI explanations, agents need to focus on individuals and their preferences as well as the context in which the explanations are given. This paper proposes a personalized approach to explanation, where the agent tailors the information provided to the user based on what is most likely pertinent to them. We propose a model of the agent’s worldview that also serves as a personal and dynamic memory of its previous interactions with the same user, based on which the artificial agent can estimate what part of its knowledge is most likely new information to the user.
zh

[AI-49] Model Predictive Adversarial Imitation Learning for Planning from Observation

【速读】:该论文旨在解决人类示范数据(human demonstration data)通常存在模糊性和不完整性的问题,从而导致传统模仿学习方法在规划行为上的可靠性不足。为应对这一挑战,作者提出了一种统一的解决方案:将逆强化学习(Inverse Reinforcement Learning, IRL)中的策略(policy)替换为基于规划的代理(planning-based agent),并借助对抗性模仿学习(Adversarial Imitation Learning)的框架实现从仅观察数据中端到端地交互式学习规划器。该方案的关键在于通过引入规划机制替代传统策略,显著提升了样本效率、分布外泛化能力和鲁棒性,同时增强了模型的可解释性、复杂度控制和安全性。

链接: https://arxiv.org/abs/2507.21533
作者: Tyler Han,Yanda Bao,Bhaumik Mehta,Gabriel Guo,Anubhav Vishwakarma,Emily Kang,Sanghun Jung,Rosario Scalise,Jason Zhou,Bryan Xu,Byron Boots
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Open-source code in process of being cleaned and documented for release. Please contact directly in the meantime for code. Under Review

点击查看摘要

Abstract:Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm to perform planning-from-demonstration involves learning a reward function via Inverse Reinforcement Learning (IRL) then deploying this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a replacement of the policy in IRL with a planning-based agent. With connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. In addition to benefits in interpretability, complexity, and safety, we study and observe significant improvements in sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations in both simulated control benchmarks and real-world navigation experiments using few-to-single observation-only demonstrations.
zh
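
将 IRL 中的策略替换为规划器后,判别器即可直接为 MPC 提供代价信号;下面给出“判别器代价 + 随机采样 MPC”的最小示意(笔者补充,一维动力学与网络结构均为玩具设定,非论文官方实现):

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def dynamics(s, a):                  # 假设已知的简单一维动力学
    return s + 0.1 * a

def mpc_plan(s0, horizon=5, n_samples=64):
    seqs = torch.randn(n_samples, horizon, 1)   # 随机采样候选动作序列
    costs = torch.zeros(n_samples)
    s = s0.repeat(n_samples, 1)
    for t in range(horizon):
        a = seqs[:, t]
        logit = disc(torch.cat([s, a], dim=1))  # 判别器打分 (s, a)
        costs += -torch.log(torch.sigmoid(logit)).squeeze(1)  # GAIL 式代价
        s = dynamics(s, a)
    return seqs[costs.argmin(), 0]              # 执行最优序列的首个动作

print(mpc_plan(torch.zeros(1, 1)))
```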

[AI-50] Large Language Models for Wireless Communications: From Adaptation to Autonomy

【速读】:该论文旨在解决无线通信系统在日益复杂和动态环境下对智能、自适应解决方案的迫切需求,传统方法难以应对多变场景下的优化与决策挑战。其核心解决方案在于利用大语言模型(Large Language Models, LLMs)的强大学习与推理能力,从三个关键方向推动无线系统的智能化演进:一是将预训练LLMs适配至核心通信任务以提升泛化能力;二是开发面向无线场景的专用基础模型,在通用性与计算效率之间取得平衡;三是构建具备自主推理与协同能力的代理型LLM(agentic LLM),实现网络的自治管理与动态优化。这一路径显著优于传统基于规则或统计建模的方法,为未来智能、自适应无线网络提供了新的技术范式。

链接: https://arxiv.org/abs/2507.21524
作者: Le Liang,Hao Ye,Yucheng Sheng,Ouya Wang,Jiacheng Wang,Shi Jin,Geoffrey Ye Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.
zh

[AI-51] ST-GDance: Long-Term and Collision-Free Group Choreography from Music BMVC2025

【速读】:该论文旨在解决多舞者舞蹈生成中因空间-时间交互复杂性导致的计算成本高、动作冲突频发及序列长度受限的问题(即:如何在保证动作同步性和空间协调性的前提下,实现高效且无碰撞的长序列群舞生成)。其解决方案的关键在于提出ST-GDance框架,通过解耦空间与时间依赖关系来优化长期一致性与防碰撞能力:采用轻量级图卷积实现距离感知的空间建模,以捕捉舞者间的局部空间关系;同时引入加速稀疏注意力机制进行高效的时序建模,从而显著降低计算开销并保障动作流畅性与安全性。

链接: https://arxiv.org/abs/2507.21518
作者: Jing Xu,Weiqiang Wang,Cunjian Chen,Jun Liu,Qiuhong Ke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Accepted at BMVC 2025

点击查看摘要

Abstract:Group dance generation from music has broad applications in film, gaming, and animation production. However, it requires synchronizing multiple dancers while maintaining spatial coordination. As the number of dancers and sequence length increase, this task faces higher computational complexity and a greater risk of motion collisions. Existing methods often struggle to model dense spatial-temporal interactions, leading to scalability issues and multi-dancer collisions. To address these challenges, we propose ST-GDance, a novel framework that decouples spatial and temporal dependencies to optimize long-term and collision-free group choreography. We employ lightweight graph convolutions for distance-aware spatial modeling and accelerated sparse attention for efficient temporal modeling. This design significantly reduces computational costs while ensuring smooth and collision-free interactions. Experiments on the AIOZ-GDance dataset demonstrate that ST-GDance outperforms state-of-the-art baselines, particularly in generating long and coherent group dance sequences. Project page: this https URL.
zh

[AI-52] Decision Transformer-Based Drone Trajectory Planning with Dynamic Safety-Efficiency Trade-Offs IROS

【速读】:该论文旨在解决无人机在未知环境中进行轨迹规划时,如何动态调整安全性与效率之间权衡的问题。传统基于多项式的方法虽计算高效且生成平滑轨迹,但需专家知识调参才能实现期望的权衡,且调参效果不稳定;而强化学习方法虽适应性强,却未显式建模安全-效率权衡。为此,作者提出一种基于决策变换器(Decision Transformer)的轨迹规划框架,其关键在于引入“剩余回报”(Return-to-Go, RTG)作为温度参数,通过调节RTG即可直观、无需专家干预地动态控制安全性和效率之间的平衡。实验表明,该方法在结构化网格和非结构化随机环境中的仿真及真实场景下均能有效生成更安全或更高效的轨迹,优于现有基线方法。

链接: https://arxiv.org/abs/2507.21506
作者: Chang-Hun Ji,SiWoon Song,Youn-Hee Han,SungTae Moon
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

点击查看摘要

Abstract:A drone trajectory planner should be able to dynamically adjust the safety-efficiency trade-off according to varying mission requirements in unknown environments. Although traditional polynomial-based planners offer computational efficiency and smooth trajectory generation, they require expert knowledge to tune multiple parameters to adjust this trade-off. Moreover, even with careful tuning, the resulting adjustment may fail to achieve the desired trade-off. Similarly, although reinforcement learning-based planners are adaptable in unknown environments, they do not explicitly address the safety-efficiency trade-off. To overcome this limitation, we introduce a Decision Transformer-based trajectory planner that leverages a single parameter, Return-to-Go (RTG), as a temperature parameter to dynamically adjust the safety-efficiency trade-off. In our framework, since RTG intuitively measures the safety and efficiency of a trajectory, RTG tuning does not require expert knowledge. We validate our approach using Gazebo simulations in both structured grid and unstructured random environments. The experimental results demonstrate that our planner can dynamically adjust the safety-efficiency trade-off by simply tuning the RTG parameter. Furthermore, our planner outperforms existing baseline methods across various RTG settings, generating safer trajectories when tuned for safety and more efficient trajectories when tuned for efficiency. Real-world experiments further confirm the reliability and practicality of our proposed planner.
zh
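
RTG 条件化的要点是:同一状态下,仅改变输入中的 RTG 标量即可改变输出动作,从而无需专家调参即可调节安全-效率权衡。下例用 MLP 代替 Transformer 主干做示意(笔者补充,模型未经训练,仅展示接口形态):

```python
import torch
import torch.nn as nn

# 状态 4 维 + RTG 1 维 -> 2 维动作(示意维度)
policy = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 2))

state = torch.randn(1, 4)
for rtg in (0.2, 0.8):   # 训练后:低 RTG 偏向安全轨迹,高 RTG 偏向高效轨迹
    action = policy(torch.cat([state, torch.tensor([[rtg]])], dim=1))
    print(f"RTG={rtg}: action={action.detach().numpy()}")
```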

[AI-53] Evaluation and Benchmarking of LLM Agents : A Survey

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在实际部署前缺乏系统化、标准化评估方法的问题,当前研究多集中于单一维度的性能测试,难以全面反映代理在复杂真实场景中的行为表现与可靠性。其解决方案的关键在于提出一个二维分类框架,从“评估目标”(如代理行为、能力、可靠性与安全性)和“评估过程”(包括交互模式、数据集与基准、指标计算方法及工具链)两个维度对现有工作进行结构化归纳,并强调企业级应用场景中常被忽视的挑战,如基于角色的数据访问控制、长期动态交互下的可靠性保障以及合规性要求,从而为构建更全面、可扩展且贴近现实的LLM代理评估体系提供理论基础与实践指导。

链接: https://arxiv.org/abs/2507.21504
作者: Mahmoud Mohammadi,Yipeng Li,Jane Lo,Wendy Yip
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives – what to evaluate, such as agent behavior, capabilities, reliability, and safety – and (2) evaluation process – how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition to taxonomy, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and scalable evaluation. This work aims to bring clarity to the fragmented landscape of agent evaluation and provide a framework for systematic assessment, enabling researchers and practitioners to evaluate LLM agents for real-world deployment.
zh

[AI-54] Large Language Models for Supply Chain Decisions

【速读】:该论文旨在解决供应链管理(Supply Chain Management)中因依赖复杂优化工具而产生的三大决策效率瓶颈问题:一是业务规划者难以理解与解释模型输出的建议;二是缺乏高效手段进行场景分析和“如果……会怎样”类问题的探究;三是需频繁人工干预以更新数学模型来适应动态商业环境。这些问题通常需要数据科学团队或技术供应商介入,严重拖慢决策周期。论文提出的关键解决方案是利用大语言模型(Large Language Models, LLMs)实现供应链工具的智能化交互与解释能力,从而将决策时间从数天至数周缩短至分钟级甚至小时级,并显著提升计划人员和高管的生产力与决策影响力。

链接: https://arxiv.org/abs/2507.21502
作者: David Simchi-Levi,Konstantina Mellou,Ishai Menache,Jeevan Pathuri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Forthcoming chapter in AI in Supply Chains: Perspectives from Global Thought Leaders, edited by Maxime C. Cohen and Tinglong Dai, and part of the Springer Series in Supply Chain Management (edited by Prof. Chris Tang)

点击查看摘要

Abstract:Supply Chain Management requires addressing a variety of complex decision-making challenges, from sourcing strategies to planning and execution. Over the last few decades, advances in computation and information technologies have enabled the transition from manual, intuition and experience-based decision-making, into more automated and data-driven decisions using a variety of tools that apply optimization techniques. These techniques use mathematical methods to improve decision-making. Unfortunately, business planners and executives still need to spend considerable time and effort to (i) understand and explain the recommendations coming out of these technologies; (ii) analyze various scenarios and answer what-if questions; and (iii) update the mathematical models used in these tools to reflect current business environments. Addressing these challenges requires involving data science teams and/or the technology providers to explain results or make the necessary changes in the technology and hence significantly slows down decision making. Motivated by the recent advances in Large Language Models (LLMs), we report how this disruptive technology can democratize supply chain technology - namely, facilitate the understanding of tools’ outcomes, as well as the interaction with supply chain tools without human-in-the-loop. Specifically, we report how we apply LLMs to address the three challenges described above, thus substantially reducing the time to decision from days and weeks to minutes and hours as well as dramatically increasing planners’ and executives’ productivity and impact.
zh

[AI-55] Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess

【速读】:该论文旨在解决个性化人类决策行为建模中数据需求过高这一关键问题,即现有方法通常需要每位个体大量数据(如5000局棋局)才能准确建模其决策风格,导致难以应用于新用户或数据稀疏场景。解决方案的关键在于提出Maia4All框架,通过两阶段优化实现高效个体适配:首先利用原型增强模型(prototype-enriched model)在群体与个体行为之间建立桥梁,完成“丰富化”步骤;其次通过能力水平或用户原型初始化并微调个体嵌入(individual embeddings),实现“民主化”步骤,仅需20局棋局即可高保真预测个体走法和行为模式,显著提升数据效率。该方法不仅适用于国际象棋领域,还可扩展至其他个性化AI适应场景,如特定大语言模型(idiosyncratic LLMs)的定制化建模。

链接: https://arxiv.org/abs/2507.21488
作者: Zhenwei Tang,Difan Jiao,Eric Xue,Reid McIlroy-Young,Jon Kleinberg,Siddhartha Sen,Ashton Anderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important. Chess, a long-standing AI benchmark with precise skill measurement, offers an ideal testbed for human-AI alignment. However, existing approaches to modeling human behavior require prohibitively large amounts of data from each individual, making them impractical for new or sparsely represented users. In this work, we introduce Maia4All, a framework designed to learn and adapt to individual decision-making styles efficiently, even with limited data. Maia4All achieves this through a two-stage optimization process: (1) an enrichment step, which bridges population and individual-level human behavior modeling with a prototype-enriched model, and (2) a democratization step, which leverages ability levels or user prototypes to initialize and refine individual embeddings with minimal data. Our experimental results show that Maia4All can accurately predict individual moves and profile behavioral patterns with high fidelity, establishing a new standard for personalized human-like AI behavior modeling in chess. Maia4All achieves individual human behavior modeling in chess with only 20 games, compared to the 5,000 games required previously, representing a significant improvement in data efficiency. Our work provides an example of how population AI systems can flexibly adapt to individual users using a prototype-enriched model as a bridge. This approach extends beyond chess, as shown in our case study on idiosyncratic LLMs, highlighting its potential for broader applications in personalized AI adaptation.
zh
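
“民主化”步骤的核心是用原型加权初始化个体嵌入、再以约 20 局对局微调;初始化部分可示意如下(笔者补充,原型数、维度与权重来源均为假设):

```python
import torch

prototypes = torch.randn(8, 32)                # 8 个群体原型嵌入(示意)
# 依据新用户的能力水平估计其与各原型的相似度权重(此处随机示意)
weights = torch.softmax(torch.randn(8), dim=0)
user_emb = weights @ prototypes                # 原型加权平均作为个体嵌入初值
user_emb.requires_grad_(True)                  # 后续仅用 ~20 局棋微调该向量
print(user_emb.shape)                          # torch.Size([32])
```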

[AI-56] HLSDebugger: Identification and Correction of Logic Bugs in HLS Code with LLM Solutions

【速读】:该论文旨在解决高阶综合(High-Level Synthesis, HLS)代码调试过程中存在的三大挑战:高质量电路数据稀缺、硬件逻辑错误比软件错误更难定位与修复,以及缺乏可靠的测试用例导致难以实现多任务协同的错误识别与修正。其关键解决方案是提出一个定制化模型HLSDebugger,该模型基于编码器-解码器结构,统一完成错误定位、错误类型预测和错误修正三项任务;同时,作者首次构建并发布了一个包含30万标注样本的大规模HLS逻辑错误数据集,显著提升了调试性能——在错误识别上超越GPT-4等先进大语言模型,在错误修正上提升超过3倍,推动了HLS自动化调试技术的发展。

链接: https://arxiv.org/abs/2507.21485
作者: Jing Wang,Shang Liu,Yao Lu,Zhiyao Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This work has been accepted at ICCAD 2025 (International Conference on Computer-Aided Design)

点击查看摘要

Abstract:High-level synthesis (HLS) accelerates hardware design by enabling the automatic translation of high-level descriptions into efficient hardware implementations. However, debugging HLS code is a challenging and labor-intensive task, especially for novice circuit designers or software engineers without sufficient hardware domain knowledge. The recent emergence of Large Language Models (LLMs) is promising in automating the HLS debugging process. Despite the great potential, three key challenges persist when applying LLMs to HLS logic debugging: 1) High-quality circuit data for training LLMs is scarce, posing a significant challenge. 2) Debugging logic bugs in hardware is inherently more complex than identifying software bugs with existing golden test cases. 3) The absence of reliable test cases requires multi-tasking solutions that perform both bug identification and correction, which complicates effective HLS debugging. In this work, we propose a customized solution named HLSDebugger to address the challenges. HLSDebugger first generates and releases a large labeled dataset with 300K data samples, targeting HLS logic bugs. The HLSDebugger model adopts an encoder-decoder structure, performing bug location identification, bug type prediction, and bug correction with the same model. HLSDebugger significantly outperforms advanced LLMs like GPT-4 in bug identification and by more than 3x in bug correction. It makes a substantial advancement in the exploration of automated debugging of HLS code.
zh

[AI-57] NCCR: to Evaluate the Robustness of Neural Networks and Adversarial Examples

【速读】:该论文旨在解决深度学习模型在面对对抗样本(adversarial examples)时缺乏有效评估其鲁棒性(robustness)的问题。现有研究多集中于攻击与防御方法,而对模型本身或输入数据的鲁棒性量化评估仍较为薄弱。解决方案的关键在于提出一种名为“神经元覆盖变化率”(Neuron Cover Change Rate, NCCR)的新指标,通过监测特定神经元输出在输入扰动下的变化程度来衡量模型的鲁棒性:NCCR值越小,表明模型对扰动越不敏感,即鲁棒性越强。实验结果表明,该指标不仅能有效评估模型鲁棒性,还可用于检测输入是否为对抗样本。

链接: https://arxiv.org/abs/2507.21483
作者: Pu Shi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks have received a lot of attention recently, and related security issues have come with it. Many studies have shown that neural networks are vulnerable to adversarial examples: inputs that have been artificially perturbed with modifications too small to be perceptible to humans. Different attacks and defenses have been proposed to solve these problems, but there is little research on evaluating the robustness of neural networks and their inputs. In this work, we propose a metric called the neuron cover change rate (NCCR) to measure the ability of deep learning models to resist attacks and the stability of adversarial examples. NCCR monitors alterations in the output of specifically chosen neurons when the input is perturbed, and networks with a smaller degree of variation are considered to be more robust. Experiments on image recognition and speaker recognition models show that our metric provides a good assessment of the robustness of neural networks or their inputs. It can also be used to detect whether an input is adversarial or not, as adversarial examples are always less robust.
zh
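
NCCR 本质上是统计“受扰动后输出发生明显变化的被监控神经元占比”,值越小越鲁棒;最小示意如下(笔者补充,监控层、噪声幅度与变化阈值均为假设):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))

def nccr(model, x, eps_noise=0.05, change_thr=0.1, layer=0):
    # 监控指定层神经元在输入扰动前后的输出变化
    h = model[layer](x)
    h_pert = model[layer](x + eps_noise * torch.randn_like(x))
    changed = (h - h_pert).abs() > change_thr
    return changed.float().mean().item()        # 变化神经元所占比例

x = torch.randn(1, 10)
print("NCCR:", nccr(model, x))                  # 值越小,对扰动越不敏感
```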

[AI-58] Capacity-Constrained Continual Learning

【速读】:该论文旨在解决容量受限智能体(capacity-constrained agents)在有限记忆和计算资源下如何最优分配资源以实现最佳性能的问题。其核心贡献在于构建并求解了一个简化的持续学习问题——容量受限的线性二次高斯(Linear-Quadratic-Gaussian, LQG)序列预测问题,并在适当的技术条件下给出了理论解;同时,对于可分解为若干子问题的场景,进一步提出了稳态下最优跨子问题容量分配的方法。这一工作为在资源受限环境下系统性地研究学习机制提供了理论基础。

链接: https://arxiv.org/abs/2507.21479
作者: Zheng Wen,Doina Precup,Benjamin Van Roy,Satinder Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Systems and Control (eess.SY); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Any agents we can possibly build are subject to capacity constraints, as memory and compute resources are inherently finite. However, comparatively little attention has been dedicated to understanding how agents with limited capacity should allocate their resources for optimal performance. The goal of this paper is to shed some light on this question by studying a simple yet relevant continual learning problem: the capacity-constrained linear-quadratic-Gaussian (LQG) sequential prediction problem. We derive a solution to this problem under appropriate technical conditions. Moreover, for problems that can be decomposed into a set of sub-problems, we also demonstrate how to optimally allocate capacity across these sub-problems in the steady state. We view the results of this paper as a first step in the systematic theoretical study of learning under capacity constraints.
zh

[AI-59] Hebbian Memory-Augmented Recurrent Networks: Engram Neurons in Deep Learning

【速读】:该论文旨在解决当前人工循环神经网络(Recurrent Neural Networks, RNNs)主要依赖隐式状态记忆所带来的可解释性差以及建模长程依赖能力有限的问题。其解决方案的关键在于提出一种名为“痕迹神经网络”(Engram Neural Network, ENN)的新颖循环架构,该架构引入了一个显式的、可微分的记忆矩阵,并结合海布型突触可塑性(Hebbian synaptic plasticity)与稀疏的注意力驱动检索机制,从而显式地模拟记忆的形成与回溯过程。这一设计不仅提升了模型的透明度和可解释性,还在多个基准任务上实现了与传统RNN、GRU和LSTM相当的性能表现,同时通过可视化海布痕迹揭示了符合生物学原理的记忆结构化形成机制,验证了神经科学启发机制在提升深度学习模型可解释性和鲁棒性方面的潜力。

链接: https://arxiv.org/abs/2507.21474
作者: Daniel Szelogowski
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 20 pages, 11 figures, 4 tables

点击查看摘要

Abstract:Despite success across diverse tasks, current artificial recurrent network architectures rely primarily on implicit hidden-state memories, limiting their interpretability and ability to model long-range dependencies. In contrast, biological neural systems employ explicit, associative memory traces (i.e., engrams) strengthened through Hebbian synaptic plasticity and activated sparsely during recall. Motivated by these neurobiological insights, we introduce the Engram Neural Network (ENN), a novel recurrent architecture incorporating an explicit, differentiable memory matrix with Hebbian plasticity and sparse, attention-driven retrieval mechanisms. The ENN explicitly models memory formation and recall through dynamic Hebbian traces, improving transparency and interpretability compared to conventional RNN variants. We evaluate the ENN architecture on three canonical benchmarks: MNIST digit classification, CIFAR-10 image sequence modeling, and WikiText-103 language modeling. Our empirical results demonstrate that the ENN achieves accuracy and generalization performance broadly comparable to classical RNN, GRU, and LSTM architectures, with all models converging to similar accuracy and perplexity on the large-scale WikiText-103 task. At the same time, the ENN offers significant enhancements in interpretability through observable memory dynamics. Hebbian trace visualizations further reveal biologically plausible, structured memory formation processes, validating the potential of neuroscience-inspired mechanisms to inform the development of more interpretable and robust deep learning models.
zh
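
ENN 的两个关键操作是海布式记忆写入(增量正比于激活与值的乘积)与稀疏 top-k 注意力读出;最小示意如下(笔者补充,槽数与维度为假设):

```python
import torch

d, slots = 16, 8
M = torch.zeros(slots, d)                        # 显式可微记忆矩阵
keys = torch.randn(slots, d)                     # 各记忆槽的键

def hebbian_write(M, key_sim, value, eta=0.5):
    # 海布式写入:激活越强的槽,写入的值越多
    return M + eta * key_sim.unsqueeze(1) * value

def sparse_read(M, query, k=2):
    sim = keys @ query
    topk = sim.topk(k)                           # 稀疏:只激活 top-k 记忆槽
    attn = torch.softmax(topk.values, dim=0)
    return attn @ M[topk.indices]

value = torch.randn(d)
M = hebbian_write(M, torch.softmax(keys @ value, dim=0), value)
print(sparse_read(M, value).shape)               # torch.Size([16])
```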

[AI-60] An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning

【速读】:该论文旨在解决红外光谱(Infrared Spectroscopy)在低数据条件下难以实现准确、自动化解读的问题,尤其针对高维、重叠光谱带对传统化学计量学方法带来的挑战。其解决方案的关键在于提出了一种端到端的大语言模型(Large Language Model, LLM)驱动的智能代理框架,该框架整合了结构化文献知识库、自动光谱预处理与特征提取,并通过多任务推理和闭环多轮交互机制实现动态优化:首先基于文献知识选择科学验证的方法生成低维特征,再利用少量样本提示(few-shot prompt)模板完成分类、回归与异常检测任务,同时将误预测样本迭代加入提示中以持续提升模型性能。此设计显著提升了在小样本场景下红外光谱分析的准确性与鲁棒性。

链接: https://arxiv.org/abs/2507.21471
作者: Zujie Xie,Zixuan Chen,Jiheng Liang,Xiangyang Yu,Ziru Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Infrared spectroscopy offers rapid, non-destructive measurement of chemical and material properties but suffers from high-dimensional, overlapping spectral bands that challenge conventional chemometric approaches. Emerging large language models (LLMs), with their capacity for generalization and reasoning, offer promising potential for automating complex scientific workflows. Despite this promise, their application in IR spectral analysis remains largely unexplored. This study addresses the critical challenge of achieving accurate, automated infrared spectral interpretation under low-data conditions using an LLM-driven framework. We introduce an end-to-end, large language model-driven agent framework that integrates a structured literature knowledge base, automated spectral preprocessing, feature extraction, and multi-task reasoning in a unified pipeline. By querying a curated corpus of peer-reviewed IR publications, the agent selects scientifically validated routines. The selected methods transform each spectrum into low-dimensional feature sets, which are fed into few-shot prompt templates for classification, regression, and anomaly detection. A closed-loop, multi-turn protocol iteratively appends mispredicted samples to the prompt, enabling dynamic refinement of predictions. Across diverse materials: stamp pad ink, Chinese medicine, Pu’er tea, Citri Reticulatae Pericarpium and waste water COD datasets, the multi-turn LLM consistently outperforms single-turn inference, rivaling or exceeding machine learning and deep learning models under low-data regimes.
zh

[AI-61] Validating Pharmacogenomics Generative Artificial Intelligence Query Prompts Using Retrieval-Augmented Generation (RAG )

【速读】:该论文旨在解决当前生成式 AI 在药理基因组学(pharmacogenomics)领域中响应准确性与专业性不足的问题,尤其在药物-基因相互作用、剂量建议和治疗意义等关键指标上的表现亟待提升。其解决方案的关键在于构建一个基于检索增强生成(retrieval-augmented generation, RAG)的AI工具 Sherpa Rx,通过整合临床药理遗传学实施联盟(Clinical Pharmacogenetics Implementation Consortium, CPIC)指南与药理基因组学知识库(PharmGKB)数据,实现对查询内容的上下文感知与精准回应。实验表明,Sherpa Rx 在准确性和完整性方面显著优于 ChatGPT-4omini,并在真实场景测试中达到 90% 的正确率,验证了融合权威医学知识库与 RAG 技术可有效提升 AI 在专业医疗决策中的可靠性与实用性。

链接: https://arxiv.org/abs/2507.21453
作者: Ashley Rector,Keaton Minor,Kamden Minor,Jeff McCormack,Beth Breeden,Ryan Nowers,Jay Dorris
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study evaluated Sherpa Rx, an artificial intelligence tool leveraging large language models and retrieval-augmented generation (RAG) for pharmacogenomics, to validate its performance on key response metrics. Sherpa Rx integrated Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines with Pharmacogenomics Knowledgebase (PharmGKB) data to generate contextually relevant responses. A dataset (N=260 queries) spanning 26 CPIC guidelines was used to evaluate drug-gene interactions, dosing recommendations, and therapeutic implications. In Phase 1, only CPIC data was embedded. Phase 2 additionally incorporated PharmGKB content. Responses were scored on accuracy, relevance, clarity, completeness (5-point Likert scale), and recall. Wilcoxon signed-rank tests compared accuracy between Phase 1 and Phase 2, and between Phase 2 and ChatGPT-4omini. A 20-question quiz assessed the tool’s real-world applicability against other models. In Phase 1 (N=260), Sherpa Rx demonstrated high performance: accuracy 4.9, relevance 5.0, clarity 5.0, completeness 4.8, and recall 0.99. The subset analysis (N=20) showed improvements in accuracy (4.6 vs. 4.4, Phase 2 vs. Phase 1 subset) and completeness (5.0 vs. 4.8). ChatGPT-4omini performed comparably in relevance (5.0) and clarity (4.9) but lagged in accuracy (3.9) and completeness (4.2). Differences in accuracy between Phase 1 and Phase 2 were not statistically significant. However, Phase 2 significantly outperformed ChatGPT-4omini. On the 20-question quiz, Sherpa Rx achieved 90% accuracy, outperforming other models. Integrating additional resources like CPIC and PharmGKB with RAG enhances AI accuracy and performance. This study highlights the transformative potential of generative AI like Sherpa Rx in pharmacogenomics, improving decision-making with accurate, personalized responses.
zh
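
RAG 的“检索指南片段 → 拼接提示”流程可用 TF-IDF 余弦检索做最小示意(笔者补充;示例语料为虚构的指南句子,实际系统嵌入的是 CPIC 与 PharmGKB 内容,检索器也更复杂):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "CYP2C19 poor metabolizers: consider alternative to clopidogrel.",
    "TPMT intermediate activity: reduce thiopurine starting dose.",
]
query = "clopidogrel dosing for CYP2C19 poor metabolizer"

vec = TfidfVectorizer().fit(corpus + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
context = corpus[sims.argmax()]                 # 取最相关的指南片段
prompt = f"Guideline: {context}\nQuestion: {query}\nAnswer:"
print(prompt)                                    # 再交给 LLM 生成最终回答
```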

[AI-62] Evo-DKD: Dual-Knowledge Decoding for Autonomous Ontology Evolution in Large Language Models

【速读】:该论文旨在解决知识图谱(Knowledge Graph)和本体(Ontology)在持续演化过程中因人工维护成本高而难以保持全面性与准确性的难题。现有方法要么依赖结构化推理导致灵活性不足,要么仅使用非结构化文本生成易缺乏一致性。解决方案的关键在于提出Evo-DKD框架——一种基于双解码器(Dual-Decoder)的自主本体演化机制,通过并行运行两个解码流:一个生成结构化的本体编辑建议(如新增概念或关系),另一个生成自然语言解释以提供合理性依据;同时引入动态注意力门控机制(Dynamic Attention-Based Gating Mechanism)协调两者的融合,在每一步决策中灵活整合结构化与非结构化知识。该设计实现了符号推理(Symbolic Reasoning)与神经推理(Neural Reasoning)的协同优化,从而显著提升本体更新精度及下游任务性能。

链接: https://arxiv.org/abs/2507.21438
作者: Vishal Raman,Vijai Aravindh R
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures

点击查看摘要

Abstract:Ontologies and knowledge graphs require continuous evolution to remain comprehensive and accurate, but manual curation is labor intensive. Large Language Models (LLMs) possess vast unstructured knowledge but struggle with maintaining structured consistency. We propose Evo-DKD, a novel dual-decoder framework for autonomous ontology evolution that combines structured ontology traversal with unstructured text reasoning. Evo-DKD introduces two parallel decoding streams within an LLM: one decoder generates candidate ontology edits (e.g., new concepts or relations) while the other produces natural-language justifications. A dynamic attention-based gating mechanism coordinates the two streams, deciding at each step how to blend structured and unstructured knowledge. Due to GPU constraints, we simulate the dual-decoder behavior using prompt-based mode control to approximate coordinated decoding in a single-stream mode. The system operates in a closed reasoning loop: proposed ontology edits are validated (via consistency checks and cross-verification with the text explanations) and then injected into the knowledge base, which in turn informs subsequent reasoning. We demonstrate Evo-DKD’s effectiveness on use cases including healthcare ontology refinement, semantic search improvement, and cultural heritage timeline modeling. Experiments show that Evo-DKD outperforms baselines using structured-only or unstructured-only decoding in both precision of ontology updates and downstream task performance. We present quantitative metrics and qualitative examples, confirming the contributions of the dual-decoder design and gating router. Evo-DKD offers a new paradigm for LLM-driven knowledge base maintenance, combining the strengths of symbolic and neural reasoning for sustainable ontology evolution.
zh
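
双解码流的门控融合可示意为:每个解码步用一个门控标量在“本体编辑 logits”与“自然语言解释 logits”之间做凸组合(笔者补充,隐状态与 logits 均为随机示意):

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
gate = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())

h = torch.randn(1, d)                    # 当前解码步的共享隐状态
struct_logits = torch.randn(1, vocab)    # 流 1:结构化本体编辑候选
text_logits = torch.randn(1, vocab)      # 流 2:自然语言解释
g = gate(h)                              # 注意力式门控:每步动态决定混合比例
logits = g * struct_logits + (1 - g) * text_logits
print(g.item(), logits.shape)
```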

[AI-63] MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse AAAI2026

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在推理过程中因生成冗长思维链(chain-of-thought)而导致的显著内存开销问题。其核心解决方案是提出MemShare,一种新型的键值缓存(KV cache)管理方法,关键在于利用协同过滤算法高效识别可复用的KV缓存块,并通过零拷贝缓存重用机制,在不牺牲准确性的前提下大幅降低内存占用并提升吞吐量。实验表明,MemShare相比现有方法可实现最高达84.79%的吞吐量提升。

链接: https://arxiv.org/abs/2507.21433
作者: Kaiwen Chen,Xin Tan,Minchen Yu,Hong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, submitted to AAAI 2026

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse to significantly reduce memory overhead and improve throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79% improvement in throughput while maintaining better accuracy compared to existing KV cache management methods.
zh
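
判定“哪些 KV 块可复用”可近似为对块级摘要向量做相似度匹配、超阈值则复用已有块索引而非拷贝;示意如下(笔者补充,这里用余弦相似度代替论文中的协同过滤细节):

```python
import torch
import torch.nn.functional as F

block_summaries = torch.randn(50, 64)      # 已缓存 KV 块的摘要向量(示意)
new_block = torch.randn(64)                # 新产生推理步的块摘要

sims = F.cosine_similarity(block_summaries, new_block.unsqueeze(0))
best = sims.argmax()
if sims[best] > 0.9:                       # 阈值为假设值
    print(f"reuse cached block {best.item()} (zero-copy)")  # 复用索引而非拷贝
else:
    print("allocate new block")
```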

[AI-64] GovRelBench:A Benchmark for Government Domain Relevance

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在政府领域评估中对核心能力,尤其是领域相关性(domain relevance)的评估不足问题。现有研究多集中于特定场景下的安全性考量,而缺乏系统性、定量化的评估框架来衡量LLMs在政府任务中的专业适配度。解决方案的关键在于提出GovRelBench基准测试体系,其包含专为政府领域设计的提示(prompts)和一个名为GovRelBERT的专用评估工具;其中,GovRelBERT基于ModernBERT架构,并引入SoftGovScore方法——通过将硬标签转化为软分数进行训练,从而精确计算文本与政府领域的相关性得分,显著提升了评估的准确性与可解释性。

链接: https://arxiv.org/abs/2507.21419
作者: Haiquan Wang,Yi Chen,Shang Zeng,Yun Bian,Zhe Cui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current evaluations of LLMs in the government domain primarily focus on safety considerations in specific scenarios, while the assessment of the models’ own core capabilities, particularly domain relevance, remains insufficient. To address this gap, we propose GovRelBench, a benchmark specifically designed for evaluating the core capabilities of LLMs in the government domain. GovRelBench consists of government domain prompts and a dedicated evaluation tool, GovRelBERT. During the training process of GovRelBERT, we introduce the SoftGovScore method: this method trains a model based on the ModernBERT architecture by converting hard labels to soft scores, enabling it to accurately compute the text’s government domain relevance score. This work aims to enhance the capability evaluation framework for large models in the government domain, providing an effective tool for relevant research and practice. Our code and dataset are available at this https URL.
zh
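
SoftGovScore 的要点是把硬标签软化为相关性分数后做回归训练;下例用 MLP 代替 ModernBERT 主干做示意(笔者补充,软化规则与数据均为假设):

```python
import torch
import torch.nn as nn

# 硬标签 -> 软分数(假设的软化规则:政府类 ~0.9,非政府类 ~0.1,带少量噪声)
hard = torch.tensor([1, 0, 1, 0.])
soft = hard * 0.8 + 0.1 + 0.02 * torch.randn(4)

feats = torch.randn(4, 64)                 # 以随机特征代替文本编码
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):                       # 对软分数做 MSE 回归
    loss = nn.functional.mse_loss(model(feats).squeeze(1), soft)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("relevance scores:", model(feats).squeeze(1).detach())
```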

[AI-65] Graph-Augmented Large Language Model Agents : Current Progress and Future Prospects

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主代理(Autonomous Agents)在关键功能上的局限性,包括可靠规划、长期记忆、工具管理以及多代理协调等。其核心解决方案是引入图结构(Graphs)作为辅助架构,以增强复杂代理工作流中的结构化表示、连续性和协作能力。通过将图结构与图学习算法结合,论文系统地分析了其在LLM代理系统中三大核心模块——规划、记忆和工具使用——中的作用机制,并进一步探讨了图增强方法如何提升多代理系统(Multi-Agent Systems, MAS)的编排效率、优化性能及可信度。最终,论文指出未来研究应聚焦于提升结构适应性、构建统一且可扩展的多模态图增强LLM代理体系。

链接: https://arxiv.org/abs/2507.21407
作者: Yixin Liu,Guibin Zhang,Kun Wang,Shiyuan Li,Shirui Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in a wide range of applications, including web navigation, software development, and embodied control. While most LLMs are limited in several key agentic procedures, such as reliable planning, long-term memory, tool management, and multi-agent coordination, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this paper offers a timely and comprehensive overview of recent advances and also highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multi-agent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems. We hope this paper can serve as a roadmap for future research on GLA and foster a deeper understanding of the role of graphs in LLM agent systems.
zh

[AI-66] Shapley Uncertainty in Natural Language Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在问答任务中输出可信度评估的问题,即如何更准确地衡量模型对自身输出的不确定性。现有方法如Kuhn等人(2023)提出的语义熵(semantic entropy)依赖于设定阈值来判断语义等价关系,难以捕捉语义关系的连续性。本文的关键解决方案是提出一种基于Shapley值的不确定性度量方法,该方法能够量化不同输入扰动对模型输出的影响,并满足三个刻画有效不确定性度量的基本性质。实验表明,该Shapley不确定性度量在多个问答数据集上比基线方法更精确地预测LLM性能,从而实现对模型输出可信度的连续、精细建模。

链接: https://arxiv.org/abs/2507.21406
作者: Meilin Zhu,Gaojie Jin,Xiaowei Huang,Lijun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In question-answering tasks, determining when to trust the outputs is crucial to the alignment of large language models (LLMs). Kuhn et al. (2023) introduces semantic entropy as a measure of uncertainty, by incorporating linguistic invariances from the same meaning. It primarily relies on setting a threshold to measure the level of semantic equivalence. We propose a more nuanced framework that extends beyond such thresholding by developing a Shapley-based uncertainty metric that captures the continuous nature of semantic relationships. We establish three fundamental properties that characterize valid uncertainty metrics and prove that our Shapley uncertainty satisfies these criteria. Through extensive experiments, we demonstrate that our Shapley uncertainty more accurately predicts LLM performance in question-answering and other datasets, compared to similar baseline measures.
zh
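
对少量采样回答,可按定义枚举全部排列精确计算每个回答的 Shapley 值;下面以“联盟内部成对语义相似度之和”作为集合价值函数做玩具示意(笔者补充,该价值函数仅为假设,未必与论文一致):

```python
from itertools import permutations

sim = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.3}  # 回答间语义相似度

def value(coalition):
    # 集合价值:联盟内部成对相似度之和(假设的价值函数)
    c = sorted(coalition)
    return sum(sim[(a, b)] for i, a in enumerate(c) for b in c[i + 1:])

answers = ["A", "B", "C"]
shapley = {a: 0.0 for a in answers}
perms = list(permutations(answers))
for perm in perms:                    # 按定义对所有排列求平均边际贡献
    seen = set()
    for a in perm:
        shapley[a] += (value(seen | {a}) - value(seen)) / len(perms)
        seen.add(a)
print(shapley)  # C 与其余回答语义分歧大 -> 边际贡献低,是不确定性的来源
```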

[AI-67] Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中跨模态交互有限以及各模态贡献不均衡的问题。其解决方案的关键在于提出一种端到端的图注意力框架 Sync-TVA,该框架包含模态特定的动态增强模块和结构化的跨模态融合机制:首先通过动态增强模块提升每种模态(文本、音频、视觉)的特征表达能力,再构建异构跨模态图来建模不同模态间的语义关系,并结合交叉注意力机制对齐多模态线索,从而实现更鲁棒的情感推理。

链接: https://arxiv.org/abs/2507.21395
作者: Zeyu Deng,Yanhui Lu,Jiashu Liao,Shuang Wu,Chongfeng Wei
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
zh

[AI-68] Efficient Neural Combinatorial Optimization Solver for the Min-max Heterogeneous Capacitated Vehicle Routing Problem

【速读】:该论文旨在解决多车辆异构容量车辆路径问题(min-max Heterogeneous Capacitated Vehicle Routing Problem, MMHCVRP)中现有神经组合优化(Neural Combinatorial Optimization, NCO)求解器存在的局限性,特别是因贪婪式解码策略导致的局部最优决策、对节点局部拓扑关系建模不足,以及未有效利用车辆排列不变性和节点对称性等问题。其核心解决方案是提出ECHO框架:首先设计双模态节点编码器以捕捉节点间的局部拓扑结构;其次引入无参数交叉注意力机制来缓解贪婪解码带来的短视决策;最后基于车辆排列不变性和节点对称性设计定制化的数据增强策略,提升强化学习训练稳定性。实验表明,ECHO在不同规模和分布下均优于现有最先进NCO方法,并展现出良好的泛化能力。

链接: https://arxiv.org/abs/2507.21386
作者: Xuan Wu,Di Wang,Chunguo Wu,Kaifang Qi,Chunyan Miao,Yubin Xiao,Jian Zhang,You Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Numerous Neural Combinatorial Optimization (NCO) solvers have been proposed to address Vehicle Routing Problems (VRPs). However, most of these solvers focus exclusively on single-vehicle VRP variants, overlooking the more realistic min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP), which involves multiple vehicles. Existing MMHCVRP solvers typically select a vehicle and its next node to visit at each decoding step, but often make myopic decoding decisions and overlook key properties of MMHCVRP, including local topological relationships, vehicle permutation invariance, and node symmetry, resulting in suboptimal performance. To better address these limitations, we propose ECHO, an efficient NCO solver. First, ECHO exploits the proposed dual-modality node encoder to capture local topological relationships among nodes. Subsequently, to mitigate myopic decisions, ECHO employs the proposed Parameter-Free Cross-Attention mechanism to prioritize the vehicle selected in the preceding decoding step. Finally, leveraging vehicle permutation invariance and node symmetry, we introduce a tailored data augmentation strategy for MMHCVRP to stabilize the Reinforcement Learning training process. To assess the performance of ECHO, we conduct extensive experiments. The experimental results demonstrate that ECHO outperforms state-of-the-art NCO solvers across varying numbers of vehicles and nodes, and exhibits strong generalization across both scales and distribution patterns. Finally, ablation studies validate the effectiveness of all proposed methods.
zh

[AI-69] Deep Reinforcement Learning-based Cell DTX/DRX Configuration for Network Energy Saving

【Quick Read】: This paper studies how the 3GPP Release 18 cell discontinuous transmission and reception (cell DTX/DRX) mechanism in 5G networks can balance energy saving against quality of service (QoS), maximizing energy savings for delay-sensitive traffic while minimizing QoS degradation. The key is a deep reinforcement learning (DRL) framework: a deep Q-network (DQN) built on a contextual bandit (CB) model, with a reward function formed by smoothly approximating the theoretically optimal but discontinuous reward, trains an AI agent that adaptively selects the best cell DTX/DRX configuration for the current network and traffic conditions. Simulations show energy savings of up to ~45% depending on the load scenario while keeping QoS degradation within ~1%.

Link: https://arxiv.org/abs/2507.21385
Authors: Wei Mao, Lili Wei, Omid Semiari, Shu-ping Yeh, Hosein Nikopour
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures

Click to view abstract

Abstract:3GPP Release 18 cell discontinuous transmission and reception (cell DTX/DRX) is an important new network energy saving feature for 5G. As a time-domain technique, it periodically aggregates the user data transmissions in a given duration of time when the traffic load is not heavy, so that the remaining time can be kept silent and advanced sleep modes (ASM) can be enabled to shut down more radio components and save more energy for the cell. However, inevitably the packet delay is increased, as during the silent period no transmission is allowed. In this paper we study how to configure cell DTX/DRX to optimally balance energy saving and packet delay, so that for delay-sensitive traffic maximum energy saving can be achieved while the degradation of quality of service (QoS) is minimized. As the optimal configuration can be different for different network and traffic conditions, the problem is complex and we resort to deep reinforcement learning (DRL) framework to train an AI agent to solve it. Through careful design of 1) the learning algorithm, which implements a deep Q-network (DQN) on a contextual bandit (CB) model, and 2) the reward function, which utilizes a smooth approximation of a theoretically optimal but discontinuous reward function, we are able to train an AI agent that always tries to select the best possible Cell DTX/DRX configuration under any network and traffic conditions. Simulation results show that compared to the case when cell DTX/DRX is not used, our agent can achieve up to ~45% energy saving depending on the traffic load scenario, while always maintaining no more than ~1% QoS degradation.
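The paper only states that the reward smoothly approximates a theoretically optimal but discontinuous function; one plausible form, a sigmoid relaxation of a step that pays out energy saving only when delay stays within a QoS budget, is sketched below (the functional form and the sharpness parameter k are our assumptions):

```python
import numpy as np

def smooth_reward(energy_saving, delay, delay_budget, k=10.0):
    # The ideal reward pays energy saving only when delay meets the QoS
    # budget: a step function. Replacing the step with a sigmoid keeps the
    # reward informative near the boundary, which eases DQN training.
    qos_ok = 1.0 / (1.0 + np.exp(k * (delay - delay_budget)))  # ~1 if within budget
    return energy_saving * qos_ok

for d in (5.0, 9.5, 10.5, 15.0):
    print(d, round(smooth_reward(0.4, d, delay_budget=10.0), 4))
```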
zh

[AI-70] Optimizing Multi-Tier Supply Chain Ordering with LNN+XGBoost: Mitigating the Bullwhip Effect

【Quick Read】: This paper tackles demand fluctuations, inventory imbalances, and the amplified upstream order variability caused by the bullwhip effect in supply chain management (SCM); traditional methods such as simple moving averages struggle with dynamic markets, while existing machine learning techniques (LSTM, reinforcement learning, XGBoost) are limited by computational complexity, training inefficiency, or weaknesses in time-series modeling. The key is a hybrid Liquid Neural Network (LNN) and XGBoost model: LNN's dynamic feature extraction enables low-cost, real-time adaptive decisions, while XGBoost's global optimization improves the stability and profitability of the overall ordering policy, mitigating the bullwhip effect across a multi-tier supply chain and enhancing cumulative profit.

Link: https://arxiv.org/abs/2507.21383
Authors: Chunan Tong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Supply chain management faces significant challenges, including demand fluctuations, inventory imbalances, and amplified upstream order variability due to the bullwhip effect. Traditional methods, such as simple moving averages, struggle to address dynamic market conditions. Emerging machine learning techniques, including LSTM, reinforcement learning, and XGBoost, offer potential solutions but are limited by computational complexity, training inefficiencies, or constraints in time-series modeling. Liquid Neural Networks, inspired by dynamic biological systems, present a promising alternative due to their adaptability, low computational cost, and robustness to noise, making them suitable for real-time decision-making and edge computing. Despite their success in applications like autonomous vehicles and medical monitoring, their potential in supply chain optimization remains underexplored. This study introduces a hybrid LNN and XGBoost model to optimize ordering strategies in multi-tier supply chains. By leveraging LNN’s dynamic feature extraction and XGBoost’s global optimization capabilities, the model aims to mitigate the bullwhip effect and enhance cumulative profitability. The research investigates how local and global synergies within the hybrid framework address the dual demands of adaptability and efficiency in SCM. The proposed approach fills a critical gap in existing methodologies, offering an innovative solution for dynamic and efficient supply chain management.
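For readers unfamiliar with the bullwhip effect, the standard way to quantify it is the ratio of order variance to demand variance; the snippet below is a generic illustration of that metric, not code from the paper:

```python
import numpy as np

def bullwhip_ratio(orders, demand):
    # Standard bullwhip measure: variance of orders placed upstream
    # relative to variance of observed demand; > 1 means amplification.
    return np.var(orders) / np.var(demand)

rng = np.random.default_rng(0)
demand = 100 + rng.normal(0, 5, 200)
# A naive policy that over-reacts to the latest demand change amplifies variance.
orders = demand + 2.0 * np.diff(demand, prepend=demand[0])
print(round(bullwhip_ratio(orders, demand), 2))  # > 1: bullwhip effect present
```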
zh

[AI-71] MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration

【Quick Read】: This paper addresses the low efficiency and limited design alternatives of software architecture design, a knowledge-intensive process with complex decision-making that is hard to sustain under agile development pressure. The key is MAAD (Multi-Agent Architecture Design), a knowledge-driven multi-agent system (MAS) in which four specialized agents (Analyst, Modeler, Designer, and Evaluator) collaborate to interpret requirements specifications and automatically produce architectural blueprints, together with structured evaluation reports grounded in quality attributes, improving the automation, comprehensiveness, and practicality of architecture design.

Link: https://arxiv.org/abs/2507.21382
Authors: Ruiyin Li, Yiran Zhang, Xiyu Zhou, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 images, 1 table, Manuscript submitted to a journal (2025)

Click to view abstract

Abstract:Software architecture design is a critical, yet inherently complex and knowledge-intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade-offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time-consuming and labor-intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)-based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision-making. To address the challenges, we proposed MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes-based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state-of-the-art MAS baseline. Our results show that MAAD’s superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD’s practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT-4o, DeepSeek-R1, and Llama 3.3) and found that GPT-4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS-driven architecture design.
zh

[AI-72] ProMemAssist: Exploring Timely Proactive Assistance Through Working Memory Modeling in Multi-Modal Wearable Devices

【Quick Read】: This paper addresses the problem that wearable AI assistance is often mistimed or mismatched with the user's cognitive load because current systems do not perceive the user's real-time mental state, in particular working memory (WM). The key is ProMemAssist, which builds a real-time WM model from multi-modal sensor signals: grounded in cognitive theory, perceived information is encoded as memory items and episodes with mechanisms such as displacement and interference to simulate WM dynamics; a timing predictor then uses this model to balance the value of assistance against the cost of interruption, yielding more selective assistance and higher user engagement.

Link: https://arxiv.org/abs/2507.21378
Authors: Kevin Pu, Ting Zhang, Naveen Sendhilnathan, Sebastian Freitag, Raj Sodhi, Tanya Jonker
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to UIST'25

Click to view abstract

Abstract:Wearable AI systems aim to provide timely assistance in daily life, but existing approaches often rely on user initiation or predefined task knowledge, neglecting users’ current mental states. We introduce ProMemAssist, a smart glasses system that models a user’s working memory (WM) in real-time using multi-modal sensor signals. Grounded in cognitive theories of WM, our system represents perceived information as memory items and episodes with encoding mechanisms, such as displacement and interference. This WM model informs a timing predictor that balances the value of assistance with the cost of interruption. In a user study with 12 participants completing cognitively demanding tasks, ProMemAssist delivered more selective assistance and received higher engagement compared to an LLM baseline system. Qualitative feedback highlights the benefits of WM modeling for nuanced, context-sensitive support, offering design implications for more attentive and user-aware proactive agents.
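A toy rendering of the WM-model-plus-timing-predictor idea is sketched below; the capacity of 7, the interference decay factor, and the linear interruption-cost model are all hypothetical choices of ours, standing in for the paper's cognitively grounded mechanisms:

```python
from collections import deque

class WorkingMemoryModel:
    """Toy fixed-capacity WM: new items displace the oldest (displacement);
    items similar to an incoming one are weakened (interference)."""
    def __init__(self, capacity=7):
        self.items = deque(maxlen=capacity)  # maxlen evicts the oldest item

    def encode(self, tag, salience):
        for other in self.items:             # crude interference: similar items decay
            if other["tag"] == tag:
                other["strength"] *= 0.5
        self.items.append({"tag": tag, "strength": salience})

    def load(self):
        return sum(it["strength"] for it in self.items) / self.items.maxlen

def should_assist(wm, assistance_value, interruption_cost_per_load=1.0):
    # Timing predictor: assist only when the value of assistance outweighs
    # an interruption cost that grows with current WM load.
    return assistance_value > interruption_cost_per_load * wm.load()

wm = WorkingMemoryModel()
for tag, s in [("recipe_step", 0.9), ("timer", 0.6), ("recipe_step", 0.8)]:
    wm.encode(tag, s)
print(round(wm.load(), 3), should_assist(wm, assistance_value=0.5))
```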
zh

[AI-73] Efficacy of AI RAG Tools for Complex Information Extraction and Data Annotation Tasks: A Case Study Using Banks Public Disclosures

【Quick Read】: This paper addresses the low efficiency and limited accuracy of information extraction and data annotation in financial supervision, especially for disclosure documents of global systemically important banks (GSIBs), whose heterogeneous and incomplete content makes manual processing costly. The key is applying a retrieval-augmented generation (RAG) AI tool and comparing two usage modes in a controlled experiment: a "naive" condition in which annotators must accept the tool's first answer, and an "interactive" condition in which annotators follow up with additional queries based on their own judgment. Compared with a human-only baseline, AI assistance accelerates task execution by up to a factor of 10 and improves accuracy, particularly in the interactive condition, saving up to 268 hours when extrapolated to the full task; the study also finds that annotators' proficiency with AI tools, not just domain expertise, drives both speed and accuracy.

Link: https://arxiv.org/abs/2507.21360
Authors: Nicholas Botti (Federal Reserve Board), Flora Haberkorn (Federal Reserve Board), Charlotte Hoopes (Federal Reserve Board), Shaun Khan (Federal Reserve Board)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:We utilize a within-subjects design with randomized task assignments to understand the effectiveness of using an AI retrieval augmented generation (RAG) tool to assist analysts with an information extraction and data annotation task. We replicate an existing, challenging real-world annotation task with complex multi-part criteria on a set of thousands of pages of public disclosure documents from global systemically important banks (GSIBs) with heterogeneous and incomplete information content. We test two treatment conditions. First, a “naive” AI use condition in which annotators use only the tool and must accept the first answer they are given. And second, an “interactive” AI treatment condition where annotators use the tool interactively, and use their judgement to follow-up with additional information if necessary. Compared to the human-only baseline, the use of the AI tool accelerated task execution by up to a factor of 10 and enhanced task accuracy, particularly in the interactive condition. We find that when extrapolated to the full task, these methods could save up to 268 hours compared to the human-only approach. Additionally, our findings suggest that annotator skill, not just with the subject matter domain, but also with AI tools, is a factor in both the accuracy and speed of task performance.
zh

[AI-74] Games Agents Play: Towards Transactional Analysis in LLM-based Multi-Agent Systems

【Quick Read】: This paper addresses the lack of underlying cognitive complexity in multi-agent systems (MAS) that simulate social interaction. The key is Trans-ACT, a Transactional Analysis (TA) cognitive toolkit that embeds the Parent, Adult, and Child ego states into each agent's cognitive architecture: each ego state retrieves context-specific memories that shape the response to new situations, and the final answer is selected according to the agent's underlying life script, yielding more realistic, context-aware social interactions.

Link: https://arxiv.org/abs/2507.21354
Authors: Monika Zamojska, Jarosław A. Chudziak
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci 2025), this https URL

Click to view abstract

Abstract:Multi-Agent Systems (MAS) are increasingly used to simulate social interactions, but most of the frameworks miss the underlying cognitive complexity of human behavior. In this paper, we introduce Trans-ACT (Transactional Analysis Cognitive Toolkit), an approach embedding Transactional Analysis (TA) principles into MAS to generate agents with realistic psychological dynamics. Trans-ACT integrates the Parent, Adult, and Child ego states into an agent’s cognitive architecture. Each ego state retrieves context-specific memories and uses them to shape response to new situations. The final answer is chosen according to the underlying life script of the agent. Our experimental simulation, which reproduces the Stupid game scenario, demonstrates that agents grounded in cognitive and TA principles produce deeper and context-aware interactions. Looking ahead, our research opens a new way for a variety of applications, including conflict resolution, educational support, and advanced social psychology studies.
zh

[AI-75] Semantic Numeration Systems as Dynamical Systems

【Quick Read】: This paper aims to theoretically characterize and model the dynamics among abstract entities in semantic numeration systems, in particular the cardinal abstract object (CAO) formed by cardinal semantic operators under a given connectivity topology. The key is to treat the CAO as a linear discrete dynamical system with nonlinear control and, under an assumption of ideal observability, provide the state equations for both the stationary and non-stationary cases; the configuration matrix, which combines the types and parameters of the cardinal semantic operators with their connectivity topology, plays the fundamental role and provides the mathematical basis for analyzing system behavior.

Link: https://arxiv.org/abs/2507.21295
Authors: Alexander Yu. Chunikhin
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures

Click to view abstract

Abstract:The foundational concepts of semantic numeration systems theory are briefly outlined. The action of cardinal semantic operators unfolds over a set of cardinal abstract entities belonging to the cardinal semantic multeity. The cardinal abstract object (CAO) formed by them in a certain connectivity topology is proposed to be considered as a linear discrete dynamical system with nonlinear control. Under the assumption of ideal observability, the CAO state equations are provided for both stationary and non-stationary cases. The fundamental role of the configuration matrix, which combines information about the types of cardinal semantic operators in the CAO, their parameters and topology of connectivity, is demonstrated.
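The abstract does not print the state equations; under the stated reading of the CAO as a linear discrete dynamical system with nonlinear control, they would plausibly take the following form (our reconstruction, with the configuration matrix playing the role of A; the paper itself is the authority on the exact notation):

```latex
% Plausible CAO state equations (reconstruction, not quoted from the paper):
\begin{align*}
  \text{stationary:}     \quad & x_{k+1} = A\,x_k + B\,g(u_k), \\
  \text{non-stationary:} \quad & x_{k+1} = A_k\,x_k + B_k\,g(u_k),
\end{align*}
% where x_k is the CAO state, g(\cdot) is the nonlinear control map, and the
% configuration matrix A (or A_k) encodes the operator types, their
% parameters, and the connectivity topology.
```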
zh

[AI-76] Learning Simulatable Models of Cloth with Spatially-varying Constitutive Properties

【Quick Read】: This paper addresses the difficulty of modeling the spatial heterogeneity of real clothing materials produced by stitching, hemming, dyeing, printing, padding, and bonding, as well as the computational cost and "membrane locking" artifacts of traditional finite element methods (FEM). The key is Mass-Spring Net, a general framework that learns the unknown material parameters of a mass-spring network directly from motion observations using a novel force-and-impulse loss function, capturing spatially varying material properties efficiently and accurately while remaining immune to the membrane locking that plagues FEM-based simulation, with faster training, higher reconstruction accuracy, and better generalization to novel dynamic scenarios.

Link: https://arxiv.org/abs/2507.21288
Authors: Guanxiong Chen, Shashwat Suri, Yuhao Wu, Etienne Voulga, David I.W. Levin, Dinesh Pai
Affiliations: Unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Materials used in real clothing exhibit remarkable complexity and spatial variation due to common processes such as stitching, hemming, dyeing, printing, padding, and bonding. Simulating these materials, for instance using finite element methods, is often computationally demanding and slow. Worse, such methods can suffer from numerical artifacts called "membrane locking" that make cloth appear artificially stiff. Here we propose a general framework, called Mass-Spring Net, for learning a simple yet efficient surrogate model that captures the effects of these complex materials using only motion observations. The cloth is discretized into a mass-spring network with unknown material parameters that are learned directly from the motion data, using a novel force-and-impulse loss function. Our approach demonstrates the ability to accurately model spatially varying material properties from a variety of data sources, and immunity to membrane locking which plagues FEM-based simulations. Compared to graph-based networks and neural ODE-based architectures, our method achieves significantly faster training times, higher reconstruction accuracy, and improved generalization to novel dynamic scenarios.
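The force-and-impulse loss is only described at a high level; the sketch below shows one way such a loss could couple Hooke's-law spring forces with an integration step so that gradients reach the learnable stiffnesses and rest lengths. The semi-implicit Euler step and the equal weighting of the two terms are our assumptions:

```python
import torch

def spring_forces(x, edges, k, rest_len):
    # Hooke's law per spring; k and rest_len are the learnable material
    # parameters of the mass-spring surrogate.
    i, j = edges[:, 0], edges[:, 1]
    d = x[j] - x[i]
    length = d.norm(dim=-1, keepdim=True)
    f = k.unsqueeze(-1) * (length - rest_len.unsqueeze(-1)) * d / (length + 1e-8)
    forces = torch.zeros_like(x)
    forces.index_add_(0, i, f)    # stretched spring pulls i toward j
    forces.index_add_(0, j, -f)   # and j toward i
    return forces

def force_and_impulse_loss(x, v, x_next_obs, edges, k, rest_len, mass=1.0, dt=1e-2):
    # One semi-implicit Euler step under predicted forces, penalising both
    # the force-driven velocity change (impulse) and the resulting positions.
    f = spring_forces(x, edges, k, rest_len)
    v_pred = v + dt * f / mass           # impulse term
    x_pred = x + dt * v_pred
    v_obs = (x_next_obs - x) / dt
    return ((x_pred - x_next_obs) ** 2).mean() + ((v_pred - v_obs) ** 2).mean()

x = torch.randn(4, 3); v = torch.zeros(4, 3)
edges = torch.tensor([[0, 1], [1, 2], [2, 3]])
k = torch.ones(3, requires_grad=True); rest = torch.ones(3, requires_grad=True)
loss = force_and_impulse_loss(x, v, x + 0.01, edges, k, rest)
loss.backward()   # gradients flow to the material parameters k, rest
```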
zh

[AI-77] Structured Relevance Assessment for Robust Retrieval-Augmented Language Models

【Quick Read】: This paper addresses factual-accuracy challenges in retrieval-augmented language models (RALMs), notably inaccurate document relevance assessment, imbalanced knowledge integration, and the inability to handle unanswerable queries. The key is a structured relevance assessment framework: a multi-dimensional scoring system that jointly considers semantic matching and source reliability, embedding-based relevance scoring, and synthetic training data built from mixed-quality documents enable more robust document filtering; specialized benchmarks on niche topics, a knowledge integration mechanism, and an "unknown" response protocol for queries with insufficient knowledge coverage further reduce hallucination rates and improve the transparency of the reasoning process.

Link: https://arxiv.org/abs/2507.21287
Authors: Aryan Raj, Astitva Veer Garg, Anitha D
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: International Conference on ICT for Sustainable Development (ICT4SD)

Click to view abstract

Abstract:Retrieval-Augmented Language Models (RALMs) face significant challenges in reducing factual errors, particularly in document relevance evaluation and knowledge integration. We introduce a framework for structured relevance assessment that enhances RALM robustness through improved document evaluation, balanced intrinsic and external knowledge integration, and effective handling of unanswerable queries. Our approach employs a multi-dimensional scoring system that considers both semantic matching and source reliability, utilizing embedding-based relevance scoring and synthetic training data with mixed-quality documents. We implement specialized benchmarking on niche topics, a knowledge integration mechanism, and an “unknown” response protocol for queries with insufficient knowledge coverage. Preliminary evaluations demonstrate significant reductions in hallucination rates and improved transparency in reasoning processes. Our framework advances the development of more reliable question-answering systems capable of operating effectively in dynamic environments with variable data quality. While challenges persist in accurately distinguishing credible information and balancing system latency with thoroughness, this work represents a meaningful step toward enhancing RALM reliability.
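A minimal sketch of the two ingredients named above, a multi-dimensional relevance score blending semantic match with source reliability, and the "unknown" response protocol, follows; the mixing weight alpha and the refusal threshold are hypothetical values of ours:

```python
import numpy as np

def relevance_score(query_emb, doc_emb, source_reliability, alpha=0.7):
    # Multi-dimensional score: semantic match (cosine similarity) blended
    # with a per-source reliability prior; alpha is an assumed mixing weight.
    cos = query_emb @ doc_emb / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
    return alpha * cos + (1 - alpha) * source_reliability

def answer_or_unknown(scores, threshold=0.55):
    # "Unknown" protocol: refuse to answer when no document clears the bar.
    return "answer from top documents" if max(scores) >= threshold else "unknown"

rng = np.random.default_rng(1)
q = rng.normal(size=384)
docs = [(rng.normal(size=384), rel) for rel in (0.9, 0.4, 0.2)]
scores = [relevance_score(q, d, rel) for d, rel in docs]
print(answer_or_unknown(scores))  # random embeddings barely match -> "unknown"
```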
zh

[AI-78] Curiosity by Design: An LLM-based Coding Assistant Asking Clarification Questions

【Quick Read】: This paper addresses incorrect code generation caused by ambiguous or under-specified developer prompts, since current models struggle to infer user intent without extensive prompt engineering or external context. The key is an end-to-end system that mimics the human code review process: a query classifier detects unclear programming-related queries, and a fine-tuned LLM then generates clarification questions to elicit the user's intent. Experiments show the fine-tuned LLM produces more useful clarification questions than standard zero-shot prompting, and a user study indicates the system yields more accurate and helpful code responses than baseline coding assistants.

Link: https://arxiv.org/abs/2507.21285
Authors: Harsh Darji, Thibaud Lutellier
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly used as coding assistants. However, the ambiguity of the developer's prompt often leads to incorrect code generation, as current models struggle to infer user intent without extensive prompt engineering or external context. This work aims to build an LLM-based coding assistant that mimics the human code review process by asking clarification questions when faced with ambiguous or under-specified queries. Our end-to-end system includes (1) a query classifier trained to detect unclear programming-related queries and (2) a fine-tuned LLM that generates clarification questions. Our evaluation shows that the fine-tuned LLM outperforms standard zero-shot prompting in generating useful clarification questions. Furthermore, our user study indicates that users find the clarification questions generated by our model to outperform the baseline, demonstrating that our coding assistant produces more accurate and helpful code responses compared to baseline coding assistants.
zh

[AI-79] Adaptive Multimodal Protein Plug-and-Play with Diffusion-Based Priors

【Quick Read】: This paper addresses the challenge of guiding pre-trained protein diffusion models with heterogeneous experimental data from multiple sources (e.g., structural measurements with different noise levels), where existing methods require prior knowledge of experimental noise levels and manually tuned per-modality weights. The key is Adam-PnP, a Plug-and-Play framework that embeds an adaptive noise estimation scheme and a dynamic modality weighting mechanism into the diffusion process, fusing gradients from multiple sources without manual hyperparameter tuning and significantly improving accuracy on complex reconstruction tasks.

Link: https://arxiv.org/abs/2507.21260
Authors: Amartya Banerjee, Xingyu Xu, Caroline Moosmüller, Harlin Lee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: Code: this https URL

Click to view abstract

Abstract:In an inverse problem, the goal is to recover an unknown parameter (e.g., an image) that has typically undergone some lossy or noisy transformation during measurement. Recently, deep generative models, particularly diffusion models, have emerged as powerful priors for protein structure generation. However, integrating noisy experimental data from multiple sources to guide these models remains a significant challenge. Existing methods often require precise knowledge of experimental noise levels and manually tuned weights for each data modality. In this work, we introduce Adam-PnP, a Plug-and-Play framework that guides a pre-trained protein diffusion model using gradients from multiple, heterogeneous experimental sources. Our framework features an adaptive noise estimation scheme and a dynamic modality weighting mechanism integrated into the diffusion process, which reduce the need for manual hyperparameter tuning. Experiments on complex reconstruction tasks demonstrate significantly improved accuracy using Adam-PnP.
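Schematically, a plug-and-play guided reverse step combines the prior's denoising update with measurement gradients weighted by their estimated noise; the inverse-variance weighting below is our illustrative stand-in for Adam-PnP's adaptive scheme, and the toy denoiser is obviously not a real protein prior:

```python
import torch

def guided_reverse_step(x_t, denoise_fn, measurement_grads, step=0.1):
    # One plug-and-play reverse-diffusion update: the prior's denoising step
    # is corrected by gradients from each experimental modality, weighted
    # inversely to that modality's (estimated) noise level.
    x_prior = denoise_fn(x_t)
    total = torch.zeros_like(x_t)
    for grad, noise_est in measurement_grads:
        total += grad / (noise_est ** 2 + 1e-8)   # adaptive modality weight
    return x_prior - step * total

x = torch.randn(64, 3)                      # toy structure state (e.g., coordinates)
denoise = lambda z: 0.9 * z                 # stand-in for the pre-trained prior
grads = [(torch.randn_like(x), 0.5), (torch.randn_like(x), 2.0)]  # two modalities
x_next = guided_reverse_step(x, denoise, grads)
print(x_next.shape)  # torch.Size([64, 3])
```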
zh

[AI-80] Bubbleformer: Forecasting Boiling with Transformers NEURIPS2025

【Quick Read】: This paper addresses the key difficulties of modeling boiling (an inherently chaotic, multiphase process) with neural PDE surrogates: existing models need future inputs (e.g., bubble positions) at inference because they cannot learn nucleation from past states, and they fail to model flow-boiling velocity fields, where the strong coupling between interfaces and momentum demands long-range, directional inductive biases. The core solution is Bubbleformer, a transformer-based spatiotemporal model that combines factorized axial attention, frequency-aware scaling, and conditioning on thermophysical parameters, forecasting stable, long-range boiling dynamics (nucleation, interface evolution, and heat transfer) without simulation data at inference and generalizing across fluids, geometries, and operating conditions.

Link: https://arxiv.org/abs/2507.21244
Authors: Sheikh Md Shakeel Hassan, Xianwei Zou, Akash Dhruv, Vishwanath Ganesan, Aparna Chandramowlishwaran
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 39 pages, 13 figures, Submitted to NeurIPS 2025

Click to view abstract

Abstract:Modeling boiling (an inherently chaotic, multiphase process central to energy and thermal systems) remains a significant challenge for neural PDE surrogates. Existing models require future input (e.g., bubble positions) during inference because they fail to learn nucleation from past states, limiting their ability to autonomously forecast boiling dynamics. They also fail to model flow boiling velocity fields, where sharp interface-momentum coupling demands long-range and directional inductive biases. We introduce Bubbleformer, a transformer-based spatiotemporal model that forecasts stable and long-range boiling dynamics including nucleation, interface evolution, and heat transfer without dependence on simulation data during inference. Bubbleformer integrates factorized axial attention, frequency-aware scaling, and conditions on thermophysical parameters to generalize across fluids, geometries, and operating conditions. To evaluate physical fidelity in chaotic systems, we propose interpretable physics-based metrics that evaluate heat-flux consistency, interface geometry, and mass conservation. We also release BubbleML 2.0, a high-fidelity dataset that spans diverse working fluids (cryogens, refrigerants, dielectrics), boiling configurations (pool and flow boiling), flow regimes (bubbly, slug, annular), and boundary conditions. Bubbleformer sets new benchmark results in both prediction and forecasting of two-phase boiling flows.
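Factorized axial attention, one of Bubbleformer's named ingredients, is a standard construction: attend along each spatial axis in turn instead of over all H*W positions at once. Below is a generic PyTorch sketch; the layer sizes are assumed and unrelated to Bubbleformer's actual configuration:

```python
import torch
from torch import nn

class FactorizedAxialAttention(nn.Module):
    """Attention applied along each spatial axis in turn, cutting the
    O((HW)^2) cost of full attention to O(HW(H+W)) while retaining
    long-range, directional receptive fields."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, H, W, C)
        B, H, W, C = x.shape
        r = x.reshape(B * H, W, C)                        # attend along rows
        r, _ = self.row_attn(r, r, r)
        x = r.reshape(B, H, W, C)
        c = x.permute(0, 2, 1, 3).reshape(B * W, H, C)    # attend along columns
        c, _ = self.col_attn(c, c, c)
        return c.reshape(B, W, H, C).permute(0, 2, 1, 3)

x = torch.randn(2, 16, 16, 32)
print(FactorizedAxialAttention(32)(x).shape)  # torch.Size([2, 16, 16, 32])
```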
zh

[AI-81] Agentic Web: Weaving the Next Web with AI Agents

【Quick Read】: This paper addresses the lack of a systematic framework for understanding and building the Agentic Web, the emerging phase of the internet in which autonomous, goal-driven AI agents interact and collaborate on users' behalf. The key is a structured framework with three core dimensions (intelligence, interaction, and economics) that together support agent capabilities such as retrieval, recommendation, planning, and collaboration; the paper further analyzes the architectural and infrastructural challenges of scalable agentic systems, including communication protocols, orchestration strategies, and emerging mechanisms such as the Agent Attention Economy, providing a theoretical foundation and practical roadmap for an open, secure ecosystem shaped by both human intent and autonomous agent behavior.

Link: https://arxiv.org/abs/2507.21206
Authors: Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human-driven to machine-to-machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: this https URL.
zh

[AI-82] Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications

【Quick Read】: This paper addresses the resource redundancy and deployment complexity of relying on multiple LLMs for interactive multimodal applications (IMAs) over wireless networks, and the limited flexibility and efficiency of a single LLM under diverse task objectives and resource-constrained mobile environments. The key is a compositional-LLM paradigm with two components: ContextLoRA, which builds a task dependency graph, partitions the learnable parameter matrices of neural layers per IMA, and applies a step-by-step fine-tuning procedure (training, freezing, and masking phases) so that a single LLM captures latent inter-task dependencies for cross-task reasoning; and ContextGear, a scheduling strategy that optimizes ContextLoRA's training procedure through strategic grouping to minimize computation and communication costs while preserving performance.

Link: https://arxiv.org/abs/2507.21199
Authors: Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q.S. Quek
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC)
Comments: Accepted by IEEE JSAC. This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users’ personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.
zh

[AI-83] Uncovering Gradient Inversion Risks in Practical Language Model Training CCS2024

【Quick Read】: This paper addresses the underestimated privacy risk of gradient inversion attacks in federated learning (FL) for language models, where the discrete nature of text tokens has made existing attacks ineffective or dependent on impractical training settings. The key is Grab (gradient inversion with hybrid optimization), a domain-specific attack with two alternating optimization processes: simultaneous optimization of inter-layer dropout masks to improve token recovery, and an effective discrete optimization for token sequencing. Grab recovers a significant portion of private training data (up to a 92.9% recovery rate), beating the strategy of discrete optimization with an auxiliary model by up to 28.9% in benchmark settings and 48.5% in practical settings, giving a fuller picture of the privacy threat FL poses for language models.

Link: https://arxiv.org/abs/2507.21198
Authors: Xinguo Feng, Zhongkui Ma, Zihan Wang, Eu Joe Chegne, Mengyao Ma, Alsharif Abuadbba, Guangdong Bai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 10 tables. Accepted by ACM CCS 2024

Click to view abstract

Abstract:The gradient inversion attack has been demonstrated as a significant privacy threat to federated learning (FL), particularly in continuous domains such as vision models. In contrast, it is often considered less effective or highly dependent on impractical training settings when applied to language models, due to the challenges posed by the discrete nature of tokens in text data. As a result, its potential privacy threats remain largely underestimated, despite FL being an emerging training method for language models. In this work, we propose a domain-specific gradient inversion attack named Grab (gradient inversion with hybrid optimization). Grab features two alternating optimization processes to address the challenges caused by practical training settings, including a simultaneous optimization on dropout masks between layers for improved token recovery and a discrete optimization for effective token sequencing. Grab can recover a significant portion (up to 92.9% recovery rate) of the private training data, outperforming the attack strategy of utilizing discrete optimization with an auxiliary model by notable improvements of up to 28.9% recovery rate in benchmark settings and 48.5% recovery rate in practical settings. Grab provides a valuable step forward in understanding this privacy threat in the emerging FL training mode of language models.
zh

[AI-84] EdgeAgentX-DT: Integrating Digital Twins and Generative AI for Resilient Edge Intelligence in Tactical Networks

【Quick Read】: This paper addresses the performance bottlenecks of edge intelligence in contested military networks facing jamming, node failures, and heavy loads. The key is EdgeAgentX-DT, which couples digital twin simulation with generative-AI-driven scenario training: network digital twins synchronized with real edge devices provide a safe, realistic environment for training and validation, while generative methods such as diffusion models and transformers create diverse, adversarial training scenarios, markedly improving agents' adaptability and decision-making under extreme conditions. Simulations show faster learning convergence, higher throughput, lower latency, and better resilience to jamming and node failures than the EdgeAgentX baseline.

Link: https://arxiv.org/abs/2507.21196
Authors: Abir Ray
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:We introduce EdgeAgentX-DT, an advanced extension of the EdgeAgentX framework that integrates digital twin simulations and generative AI-driven scenario training to significantly enhance edge intelligence in military networks. EdgeAgentX-DT utilizes network digital twins, virtual replicas synchronized with real-world edge devices, to provide a secure, realistic environment for training and validation. Leveraging generative AI methods, such as diffusion models and transformers, the system creates diverse and adversarial scenarios for robust simulation-based agent training. Our multi-layer architecture includes: (1) on-device edge intelligence; (2) digital twin synchronization; and (3) generative scenario training. Experimental simulations demonstrate notable improvements over EdgeAgentX, including faster learning convergence, higher network throughput, reduced latency, and improved resilience against jamming and node failures. A case study involving a complex tactical scenario with simultaneous jamming attacks, agent failures, and increased network loads illustrates how EdgeAgentX-DT sustains operational performance, whereas baseline methods fail. These results highlight the potential of digital-twin-enabled generative training to strengthen edge AI deployments in contested environments.
zh

[AI-85] MaXsive: High-Capacity and Robust Training-Free Generative Image Watermarking in Diffusion Models

【Quick Read】: This paper addresses the weak robustness of training-free diffusion-model watermarking against rotation, scaling, and translation (RST) attacks, and the capacity loss and identity (ID) collusion caused by existing methods' intricate patterns. The key is MaXsive: it fully exploits the initial noise for watermarking and, instead of a meticulously repetitive ring pattern, injects an X-shape template to recover RST distortions, significantly increasing robustness without sacrificing any capacity and making ID collusion less likely.

Link: https://arxiv.org/abs/2507.21195
Authors: Po-Yuan Mao, Cheng-Chang Tsai, Chun-Shien Lu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:The great success of the diffusion model in image synthesis led to the release of gigantic commercial models, raising the issue of copyright protection and inappropriate content generation. Training-free diffusion watermarking provides a low-cost solution for these issues. However, the prior works remain vulnerable to rotation, scaling, and translation (RST) attacks. Although some methods employ meticulously designed patterns to mitigate this issue, they often reduce watermark capacity, which can result in identity (ID) collusion. To address these problems, we propose MaXsive, a training-free diffusion model generative watermarking technique that has high capacity and robustness. MaXsive best utilizes the initial noise to watermark the diffusion model. Moreover, instead of using a meticulously repetitive ring pattern, we propose injecting the X-shape template to recover the RST distortions. This design significantly increases robustness without losing any capacity, making ID collusion less likely to happen. The effectiveness of MaXsive has been verified on two well-known watermarking benchmarks under the scenarios of verification and identification.
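The core trick, stamping an X-shaped template into the initial latent noise so the two diagonals can later be used to re-estimate an RST transform, can be illustrated as below; the additive injection and its strength are our simplifications of the paper's actual template design and detection procedure:

```python
import numpy as np

def inject_x_template(noise, strength=1.0):
    # Overlay an X-shaped template on the initial latent noise: the two
    # diagonals remain identifiable under rotation/scaling/translation,
    # which is what allows the RST distortion to be estimated and undone
    # at verification time.
    h, w = noise.shape
    out = noise.copy()
    for i in range(h):
        j = int(i * (w - 1) / (h - 1))
        out[i, j] += strength          # main diagonal
        out[i, w - 1 - j] += strength  # anti-diagonal
    return out

latent = np.random.randn(64, 64)
marked = inject_x_template(latent)
print(np.abs(marked - latent).sum() > 0)  # True: template present
```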
zh

[AI-86] Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs

【Quick Read】: This paper addresses the reasoning fragility of clinical large language models (LLMs) under small but clinically meaningful input perturbations (e.g., masking a symptom or negating a finding), which causes diagnostic instability that standard NLP metrics fail to detect because they are insensitive to latent representation shifts. The key is LAPD (Latent Agentic Perturbation Diagnostics), a geometry-aware evaluation framework, together with the model-agnostic Latent Diagnosis Flip Rate (LDFR), which quantifies representational instability when embeddings cross decision boundaries in PCA-reduced latent space. Clinical notes are perturbed along four axes (masking, negation, synonym replacement, and numeric variation), and validation on 90 real clinical notes from the DiReCT benchmark (MIMIC-IV) confirms that latent fragility emerges even under minimal surface-level changes, underscoring the need for geometry-aware auditing in safety-critical medical AI.

Link: https://arxiv.org/abs/2507.21188
Authors: Raj Krishnan Vijayaraj
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:LLMs for clinical decision support often fail under small but clinically meaningful input shifts such as masking a symptom or negating a finding, despite high performance on static benchmarks. These reasoning failures frequently go undetected by standard NLP metrics, which are insensitive to latent representation shifts that drive diagnosis instability. We propose a geometry-aware evaluation framework, LAPD (Latent Agentic Perturbation Diagnostics), which systematically probes the latent robustness of clinical LLMs under structured adversarial edits. Within this framework, we introduce Latent Diagnosis Flip Rate (LDFR), a model-agnostic diagnostic signal that captures representational instability when embeddings cross decision boundaries in PCA-reduced latent space. Clinical notes are generated using a structured prompting pipeline grounded in diagnostic reasoning, then perturbed along four axes: masking, negation, synonym replacement, and numeric variation to simulate common ambiguities and omissions. We compute LDFR across both foundation and clinical LLMs, finding that latent fragility emerges even under minimal surface-level changes. Finally, we validate our findings on 90 real clinical notes from the DiReCT benchmark (MIMIC-IV), confirming the generalizability of LDFR beyond synthetic settings. Our results reveal a persistent gap between surface robustness and semantic stability, underscoring the importance of geometry-aware auditing in safety-critical clinical AI.
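LDFR as described, the rate at which perturbed embeddings cross a decision boundary in PCA-reduced space, admits a compact sketch; the logistic-regression boundary below is our stand-in for whatever classifier the framework actually uses:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def ldfr(clean_embs, perturbed_embs, labels, n_components=2):
    # Latent Diagnosis Flip Rate: fit a decision boundary in PCA-reduced
    # space on clean embeddings, then count how often each perturbed
    # counterpart lands on the other side of that boundary.
    pca = PCA(n_components=n_components).fit(clean_embs)
    clf = LogisticRegression().fit(pca.transform(clean_embs), labels)
    flips = (clf.predict(pca.transform(clean_embs))
             != clf.predict(pca.transform(perturbed_embs)))
    return flips.mean()

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 768))                          # toy note embeddings
perturbed = clean + rng.normal(scale=0.8, size=clean.shape)  # simulated edits
labels = (clean[:, 0] > 0).astype(int)                       # toy diagnoses
print(round(ldfr(clean, perturbed, labels), 3))
```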
zh

[AI-87] SDD: Self-Degraded Defense against Malicious Fine-tuning ACL2025

【Quick Read】: This paper addresses the problem that safety-aligned open-source LLMs can still be bypassed via malicious fine-tuning on harmful data, after which models follow harmful instructions. The key is the Self-Degraded Defense (SDD) framework: the model is trained to produce high-quality but irrelevant responses to harmful prompts, so that when malicious fine-tuning is attempted, the model's general capability degrades sharply and it becomes incapable of executing harmful instructions, effectively defending against such attacks.

Link: https://arxiv.org/abs/2507.21182
Authors: Zixuan Chen, Weikai Lu, Xin Lin, Ziqian Zeng
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL2025

Click to view abstract

Abstract:Open-source Large Language Models (LLMs) often employ safety alignment methods to resist harmful instructions. However, recent research shows that maliciously fine-tuning these LLMs on harmful data can easily bypass these safeguards. To counter this, we theoretically uncover why malicious fine-tuning succeeds and identify potential defense strategies. Building on the theoretical analysis, we introduce the Self-Degraded Defense (SDD) framework. SDD encourages LLMs to produce high-quality but irrelevant responses to harmful prompts. When attackers attempt malicious fine-tuning, the general capability of the LLM aligned by SDD will significantly decrease, rendering it incapable of following harmful instructions. Our experimental results confirm SDD’s effectiveness against such attacks.
zh

[AI-88] LLM-Adapted Interpretation Framework for Machine Learning Models

【Quick Read】: This paper addresses the "black-box" problem that keeps high-performing machine learning models such as XGBoost out of clinical adoption, in the setting of sarcopenia risk assessment. The key is the LLM-Adapted Interpretation Framework (LAI-ML), a knowledge distillation architecture: feature attributions from a trained XGBoost model are converted into a probabilistic format via the HAGA and CACS techniques, and a large language model (LLM), guided by a reinforcement learning loop and case-based retrieval, then generates data-faithful diagnostic narratives, translating predictive accuracy into clinically understandable insights.

Link: https://arxiv.org/abs/2507.21179
Authors: Yuqi Jin, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures, 2 tables

Click to view abstract

Abstract:Background Aims: High-performance machine learning models like XGBoost are often “black boxes,” limiting their clinical adoption due to a lack of interpretability. This study aims to bridge the gap between predictive accuracy and narrative transparency for sarcopenia risk assessment. Methods: We propose the LLM-Adapted Interpretation Framework (LAI-ML), a novel knowledge distillation architecture. LAI-ML transforms feature attributions from a trained XGBoost model into a probabilistic format using specialized techniques (HAGA and CACS). A Large Language Model (LLM), guided by a reinforcement learning loop and case-based retrieval, then generates data-faithful diagnostic narratives. Results: The LAI-ML framework achieved 83% prediction accuracy, significantly outperforming the baseline XGBoost model, 13% higher. Notably, the LLM not only replicated the teacher model’s logic but also corrected its predictions in 21.7% of discordant cases, demonstrating enhanced reasoning. Conclusion: LAI-ML effectively translates opaque model predictions into trustworthy and interpretable clinical insights, offering a deployable solution to the “black-box” problem in medical AI.
zh

[AI-89] Tell Me You're Biased Without Telling Me You're Biased – Toward Revealing Implicit Biases in Medical LLMs

【Quick Read】: This paper addresses biased and unfair patterns in medical large language models (LLMs), which must be identified before such models are adopted in clinical decision support so that their impact can be mitigated. The key is a novel framework combining knowledge graphs (KGs) with auxiliary LLMs: adversarial perturbation techniques surface subtle bias patterns, and a customized multi-hop KG characterization enables systematic bias evaluation of arbitrary LLMs. Comprehensive experiments across three datasets, six LLMs, and five bias types show notably greater ability and scalability than baselines in revealing complex bias patterns.

Link: https://arxiv.org/abs/2507.21176
Authors: Farzana Islam Adiba, Rahmatollah Beheshti
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) that are used in medical applications are known to show biased and unfair patterns. Prior to adopting these in clinical decision-making applications, it is crucial to identify these bias patterns to enable effective mitigation of their impact. In this study, we present a novel framework combining knowledge graphs (KGs) with auxiliary LLMs to systematically reveal complex bias patterns in medical LLMs. Specifically, the proposed approach integrates adversarial perturbation techniques to identify subtle bias patterns. The approach adopts a customized multi-hop characterization of KGs to enhance the systematic evaluation of arbitrary LLMs. Through a series of comprehensive experiments (on three datasets, six LLMs, and five bias types), we show that our proposed framework has noticeably greater ability and scalability to reveal complex biased patterns of LLMs compared to other baselines.
zh

[AI-90] A ChatGPT-based approach for questions generation in higher education

【Quick Read】: This paper addresses the time and effort instructors at higher education institutions spend generating quiz questions and assessing learners. The key is using ChatGPT, a chatbot based on a large language model, and exploring interactive prompting patterns to design an optimal AI-powered question bank creation workflow, improving the efficiency and quality of quiz generation; the generated questions were evaluated via a "blind test" survey of lecturers and learners, with promising initial results at the Banking Academy of Vietnam.

Link: https://arxiv.org/abs/2507.21174
Authors: Sinh Trong Vu, Huong Thu Truong, Oanh Tien Do, Tu Anh Le, Tai Tan Mai
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models have been widely applied in many aspects of real life, bringing significant efficiency to businesses and offering distinctive user experiences. In this paper, we focus on exploring the application of ChatGPT, a chatbot based on a large language model, to support higher educator in generating quiz questions and assessing learners. Specifically, we explore interactive prompting patterns to design an optimal AI-powered question bank creation process. The generated questions are evaluated through a “Blind test” survey sent to various stakeholders including lecturers and learners. Initial results at the Banking Academy of Vietnam are relatively promising, suggesting a potential direction to streamline the time and effort involved in assessing learners at higher education institutes.
zh

[AI-91] Ontological Foundations of State Sovereignty

【速读】:该论文旨在解决国际关系领域中关于国家主权(state sovereignty)认定的模糊性与矛盾性问题,即如何在缺乏明确标准的情况下判断哪些实体真正具备主权属性。其解决方案的关键在于揭示一种处理此类模糊或矛盾数据的策略,为后续基于本体论(ontology)方法在国际事务中的应用奠定基础。

链接: https://arxiv.org/abs/2507.21172
作者: John Beverley,Danielle Limbaugh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages. 0 figures. Conference: Semantic Technology for Intelligence, Defense, and Security (STIDS 2024)

点击查看摘要

Abstract:This short paper is a primer on the nature of state sovereignty and the importance of claims about it. It also aims to reveal (merely reveal) a strategy for working with vague or contradictory data about which states, in fact, are sovereign. These goals together are intended to set the stage for applied work in ontology about international affairs.
zh

[AI-92] An ontological analysis of risk in Basic Formal Ontology

【Quick Read】: This paper addresses the ontological classification and definition of risk, clarifying its essential nature and the conditions for something to count as a risk. The key is to characterize Risk within the Basic Formal Ontology (BFO) as a subclass of BFO:Role, contrasting this with the common view that classifies it under BFO:Disposition; the modeling choice is applied to a concrete example involving objects, processes (both physical and mental), and their interrelations, then generalized to make explicit the sufficient conditions for being a risk, with plausible necessary conditions noted for future work.

Link: https://arxiv.org/abs/2507.21171
Authors: Federico Donato, Adrien Barton
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages. 2 figures. Conference: Semantic Technology for Intelligence, Defense, and Security (STIDS 2024)

Click to view abstract

Abstract:The paper explores the nature of risk, providing a characterization using the categories of the Basic Formal Ontology (BFO). It argues that the category Risk is a subclass of BFO:Role, contrasting it with a similar view classifying Risk as a subclass of BFO:Disposition. This modeling choice is applied on one example of risk, which represents objects, processes (both physical and mental) and their interrelations, then generalizing from the instances in the example to obtain an overall analysis of risk, making explicit what are the sufficient conditions for being a risk. Plausible necessary conditions are also mentioned for future work. Index Terms: ontology, risk, BFO, role, disposition
zh

[AI-93] Trustworthy AI: UK Air Traffic Control Revisited

【Quick Read】: This paper addresses a gap in trustworthy-AI research: how people handle trust in the tools they use as part of their everyday work, and more broadly the socio-technical challenges of adopting AI in organisational settings. The key is an ongoing ethnographic study of how current tools are used in air traffic control (ATC) work, revealing requirements for trustworthy AI in ATC and other safety-critical application domains.

Link: https://arxiv.org/abs/2507.21169
Authors: Rob Procter, Mark Rouncefield
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 6 pages

Click to view abstract

Abstract:Exploring the socio-technical challenges confronting the adoption of AI in organisational settings is something that has so far been largely absent from the related literature. In particular, research into requirements for trustworthy AI typically overlooks how people deal with the problems of trust in the tools that they use as part of their everyday work practices. This article presents some findings from an ongoing ethnographic study of how current tools are used in air traffic control work and what it reveals about requirements for trustworthy AI in air traffic control and other safety-critical application domains.
zh

[AI-94] AGORA: Incentivizing Group Emergence Capability in LLMs via Group Distillation

【Quick Read】: This paper addresses the constraint that progress in complex reasoning is limited by static training datasets, with parameter scaling alone no longer yielding sustained gains. The key is to propose structured interaction as a new scaling axis: the self-evolving AGORA framework uses a collaborative ensemble to achieve group emergent ability, the synthesis of collective capabilities unattainable by isolated models, exceeding state-of-the-art monolithic systems by up to 4.45 percentage points on challenging mathematical benchmarks. This validates interaction as a scalable driver of intelligence and positions the engineering of collaborative ecosystems as a frontier for capability emergence.

Link: https://arxiv.org/abs/2507.21166
Authors: Ren Zhuang, Ben Wang, Shuifa Sun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Progress in complex reasoning is constrained by the static nature of the current training datasets. We propose structured interaction as a new scaling axis, moving beyond the prevailing paradigm of increasing model parameters. Our self-evolving framework, AGORA, enables a collaborative ensemble to achieve reasoning performance exceeding state-of-the-art monolithic systems by up to 4.45 percentage points on challenging mathematical benchmarks. This gain stems from group emergent ability-the synthesis of collective capabilities unattainable by isolated models, validating interaction as a scalable driver of intelligence. Our results position the engineering of collaborative ecosystems as a vital frontier for capability emergence.
zh

[AI-95] OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

【Quick Read】: This paper addresses two weaknesses of existing unsupervised anomaly detection (UAD): reconstruction-based methods often reconstruct anomalies too well, hurting detection, while methods that decouple representation learning from density estimation can suffer from suboptimal feature spaces; recent attempts to couple the two often rely on surrogate objectives, restricted kernel choices, or approximations that limit expressiveness and robustness. The key is a method that tightly couples representation learning with an analytically solvable one-class SVM (OCSVM) through a custom loss that directly aligns latent features with the OCSVM decision boundary. Evaluated on an MNIST-C-based benchmark and on subtle brain MRI lesion detection with voxel-wise metrics, the approach succeeds on small, non-hyperintense lesions, shows robustness to domain shifts (corruption types in MNIST-C; scanner and age variations in MRI), and demonstrates potential for general UAD and real-world medical imaging.

Link: https://arxiv.org/abs/2507.21164
Authors: Nicolas Pinon (MYRIAD), Carole Lartizien (MYRIAD)
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that tightly couples representation learning with an analytically solvable one-class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a new benchmark based on MNIST-C, and a challenging brain MRI subtle lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds to target small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and scanner/age variations in MRI. Results demonstrate performance and robustness of our proposed mode,highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at this https URL
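To see what "aligning latent features with the OCSVM decision boundary" can mean mechanically, the sketch below backpropagates a soft OCSVM objective through an encoder; note the paper solves its OCSVM analytically, so plain joint gradient descent here is a simplification of ours:

```python
import torch

def ocsvm_coupled_loss(z, w, rho, nu=0.1):
    # Soft OCSVM objective evaluated on the encoder's latent codes z:
    # minimising it w.r.t. both (w, rho) and the encoder pushes normal
    # samples to satisfy f(z) = w.z - rho >= 0, aligning the learned
    # feature space with the one-class decision boundary.
    margin = z @ w - rho
    hinge = torch.clamp(-margin, min=0)          # violations of f(z) >= 0
    return 0.5 * (w @ w) + hinge.mean() / nu - rho

encoder = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 8))
w = torch.randn(8, requires_grad=True)
rho = torch.tensor(0.0, requires_grad=True)
x = torch.randn(64, 32)                          # normal training samples
loss = ocsvm_coupled_loss(encoder(x), w, rho)
loss.backward()   # gradients reach encoder and OCSVM parameters jointly
```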
zh

[AI-96] Generating Adversarial Point Clouds Using Diffusion Model

【Quick Read】: This paper addresses the low success rate and poor imperceptibility of black-box adversarial attacks on 3D point cloud classifiers; most existing attacks rely on white-box information, which limits real-world applicability, while black-box attacks, the more realistic threat model, often perform poorly. The key is a diffusion-model-based black-box adversarial example generation method: compressed point cloud features serve as prior knowledge to guide the reverse diffusion process, injecting adversarial points into clean samples without any access to the target model's internals; the reverse process also transforms the distribution of other categories into adversarial points added to the point cloud, improving both attack success rate and imperceptibility.

Link: https://arxiv.org/abs/2507.21163
Authors: Ruiyang Zhao, Bingbing Zhu, Chuxuan Tong, Xiaoyi Zhou, Xi Zheng
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Adversarial attack methods for 3D point cloud classification reveal the vulnerabilities of point cloud recognition models. This vulnerability could lead to safety risks in critical applications that use deep learning models, such as autonomous vehicles. To uncover the deficiencies of these models, researchers can evaluate their security through adversarial attacks. However, most existing adversarial attack methods are based on white-box attacks. While these methods achieve high attack success rates and imperceptibility, their applicability in real-world scenarios is limited. Black-box attacks, which are more meaningful in real-world scenarios, often yield poor results. This paper proposes a novel black-box adversarial example generation method that utilizes a diffusion model to improve the attack success rate and imperceptibility in the black-box setting, without relying on the internal information of the point cloud classification model to generate adversarial samples. We use a 3D diffusion model to use the compressed features of the point cloud as prior knowledge to guide the reverse diffusion process to add adversarial points to clean examples. Subsequently, its reverse process is employed to transform the distribution of other categories into adversarial points, which are then added to the point cloud.
zh

[AI-97] Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems

【Quick Read】: This paper addresses the problem that the many newly integrated operators of active distribution networks (ADNs) with high penetration of distributed energy resources, such as distribution system aggregators, virtual power plant managers, and end prosumers, often lack expertise in power system operation, modeling, optimization, and programming, making reliance on human experts costly and time-intensive. The key is a large language model (LLM) powered automated modeling and optimization approach: ADN dispatch problems are decomposed into sequential stages handled by a multi-LLM coordination architecture comprising an Information Extractor, a Problem Formulator, and a Code Programmer, responsible for information retrieval, optimization problem formulation, and code implementation respectively, each with tailored refinement techniques that greatly improve the accuracy and reliability of generated content; a user-centric natural-language interface lets ADN operators obtain dispatch strategies via simple queries, lowering technical barriers and raising efficiency.

Link: https://arxiv.org/abs/2507.21162
Authors: Xu Yang, Chenhui Lin, Yue Yang, Qi Wang, Haotian Liu, Haizhou Hua, Wenchuan Wu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:The increasing penetration of distributed energy resources into active distribution networks (ADNs) has made effective ADN dispatch imperative. However, the numerous newly-integrated ADN operators, such as distribution system aggregators, virtual power plant managers, and end prosumers, often lack specialized expertise in power system operation, modeling, optimization, and programming. This knowledge gap renders reliance on human experts both costly and time-intensive. To address this challenge and enable intelligent, flexible ADN dispatch, this paper proposes a large language model (LLM) powered automated modeling and optimization approach. First, the ADN dispatch problems are decomposed into sequential stages, and a multi-LLM coordination architecture is designed. This framework comprises an Information Extractor, a Problem Formulator, and a Code Programmer, tasked with information retrieval, optimization problem formulation, and code implementation, respectively. Afterwards, tailored refinement techniques are developed for each LLM agent, greatly improving the accuracy and reliability of generated content. The proposed approach features a user-centric interface that enables ADN operators to derive dispatch strategies via simple natural language queries, eliminating technical barriers and increasing efficiency. Comprehensive comparisons and end-to-end demonstrations on various test cases validate the effectiveness of the proposed architecture and methods.
zh

[AI-98] Handling Out-of-Distribution Data: A Survey

【Quick Read】: This paper surveys distribution shift, the change in data distribution between the training and deployment stages of machine learning (ML) systems, focusing on two main types: covariate shift, where feature values change between train and test data, and concept/semantic shift, where the learned concept drifts due to novel classes emerging at test time. The key contributions are threefold: formalizing these shifts, explaining why conventional methods fail to handle them adequately, and arguing for models that perform well under all shift types simultaneously; extensively reviewing methods for detecting, measuring, and mitigating shift; and assessing the current state of shift-handling mechanisms with future research directions, paying particular attention to out-of-distribution (OOD) data overlooked by existing surveys.

Link: https://arxiv.org/abs/2507.21160
Authors: Lakpa Tamang, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures, 6 tables. Accepted at IEEE Transactions on Knowledge and Data Engineering

Click to view abstract

Abstract:In the field of Machine Learning (ML) and data-driven applications, one of the significant challenge is the change in data distribution between the training and deployment stages, commonly known as distribution shift. This paper outlines different mechanisms for handling two main types of distribution shifts: (i) Covariate shift: where the value of features or covariates change between train and test data, and (ii) Concept/Semantic-shift: where model experiences shift in the concept learned during training due to emergence of novel classes in the test phase. We sum up our contributions in three folds. First, we formalize distribution shifts, recite on how the conventional method fails to handle them adequately and urge for a model that can simultaneously perform better in all types of distribution shifts. Second, we discuss why handling distribution shifts is important and provide an extensive review of the methods and techniques that have been developed to detect, measure, and mitigate the effects of these shifts. Third, we discuss the current state of distribution shift handling mechanisms and propose future research directions in this area. Overall, we provide a retrospective synopsis of the literature in the distribution shift, focusing on OOD data that had been overlooked in the existing surveys.
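As a concrete instance of the first shift type, covariate shift in a feature's marginal distribution can be flagged with a per-feature two-sample Kolmogorov-Smirnov test; this generic example is ours, not from the survey:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_shift(train_X, test_X, alpha=0.01):
    # Per-feature two-sample KS test: a small p-value flags a feature whose
    # marginal distribution changed between training and deployment.
    return [j for j in range(train_X.shape[1])
            if ks_2samp(train_X[:, j], test_X[:, j]).pvalue < alpha]

rng = np.random.default_rng(42)
train = rng.normal(0, 1, size=(1000, 5))
test = rng.normal(0, 1, size=(1000, 5))
test[:, 2] += 1.5                           # inject a shift in feature 2
print(detect_covariate_shift(train, test))  # expect [2]
```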
zh

[AI-99] Adaptive Cluster Collaborativeness Boosts LLMs Medical Decision Support Capacity

【Quick Read】: This paper addresses the weaknesses of LLM collaboration in medical decision support: existing methods lack explicit component selection rules, requiring human intervention or clinical-specific validation, and rely on predefined LLM clusters in which some models underperform on medical tasks, undermining the collaboration. The key is an adaptive cluster collaborativeness methodology with two complementary mechanisms: for self-diversity, the fuzzy matching value between pairwise outputs within an LLM is computed as its self-diversity value, and models with high self-diversity are selected as cluster members in a training-free manner; for cross-consistency, consistency is measured against the model with the highest self-diversity, and the LLM with the lowest cross-consistency is progressively masked out to eliminate potentially inconsistent outputs during collaborative propagation. On two specialized medical datasets, NEJMQA and MMLU-Pro-health, the method markedly improves accuracy, e.g., reaching 65.47% on the Obstetrics and Gynecology discipline versus 56.12% for GPT-4.

Link: https://arxiv.org/abs/2507.21159
Authors: Zhihao Peng, Liuxin Bao, Shengyuan Liu, Yixuan Yuan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:The collaborativeness of large language models (LLMs) has proven effective in natural language processing systems, holding considerable promise for healthcare development. However, it lacks explicit component selection rules, necessitating human intervention or clinical-specific validation. Moreover, existing architectures heavily rely on a predefined LLM cluster, where partial LLMs underperform in medical decision support scenarios, invalidating the collaborativeness of LLMs. To this end, we propose an adaptive cluster collaborativeness methodology involving self-diversity and cross-consistency maximization mechanisms to boost LLMs medical decision support capacity. For the self-diversity, we calculate the fuzzy matching value of pairwise outputs within an LLM as its self-diversity value, subsequently prioritizing LLMs with high self-diversity values as cluster components in a training-free manner. For the cross-consistency, we first measure cross-consistency values between the LLM with the highest self-diversity value and others, and then gradually mask out the LLM having the lowest cross-consistency value to eliminate the potential inconsistent output during the collaborative propagation. Extensive experiments on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate the effectiveness of our method across physician-oriented specialties. For example, on NEJMQA, our method achieves the accuracy rate up to the publicly official passing score across all disciplines, especially achieving ACC of 65.47% compared to the 56.12% achieved by GPT-4 on the Obstetrics and Gynecology discipline.
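The self-diversity computation, fuzzy matching over pairwise outputs of one LLM, can be sketched as follows; difflib's similarity ratio is our stand-in for the paper's fuzzy matching value, and one-minus-mean-similarity is an assumed aggregation:

```python
from difflib import SequenceMatcher
from itertools import combinations

def self_diversity(outputs):
    # Fuzzy-match every pair of an LLM's sampled answers; low average
    # similarity means high self-diversity, making the model a stronger
    # candidate for the collaborative cluster (training-free selection).
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return 1.0 - sum(sims) / len(sims)

samples = [
    "The first-line treatment is metformin.",
    "Metformin is typically first-line.",
    "Start with lifestyle changes, then metformin.",
]
print(round(self_diversity(samples), 3))   # higher = more diverse candidate
```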
zh

[AI-100] Adaptive XAI in High Stakes Environments: Modeling Swift Trust with Multimodal Feedback in Human AI Teams ECAI2025

【Quick Read】: This paper tackles the difficulty of establishing "swift trust" in human-AI collaboration under high-pressure, time-sensitive settings such as emergency response, where conventional explainable AI (XAI) offers static, uniform explanations and depends on explicit feedback that is impractical under pressure. The key is a non-intrusive adaptive XAI framework, the Adaptive Explainability Trust Framework (AXTF), which infers cognitive load, emotion, and stress from real-time physiological and behavioral signals (EEG, ECG, and eye tracking) and uses a multi-objective, personalized trust estimation model to adapt explanation content dynamically, thereby fostering swift trust in human-AI teaming.

Link: https://arxiv.org/abs/2507.21158
Authors: Nishani Fernando,Bahareh Nakisa,Adnan Ahmad,Mohammad Naim Rastgoo
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 1 figure, Accepted to MAI-XAI@ECAI2025

Click to view abstract

Abstract:Effective human-AI teaming heavily depends on swift trust, particularly in high-stakes scenarios such as emergency response, where timely and accurate decision-making is critical. In these time-sensitive and cognitively demanding settings, adaptive explainability is essential for fostering trust between human operators and AI systems. However, existing explainable AI (XAI) approaches typically offer uniform explanations and rely heavily on explicit feedback mechanisms, which are often impractical in such high-pressure scenarios. To address this gap, we propose a conceptual framework for adaptive XAI that operates non-intrusively by responding to users’ real-time cognitive and emotional states through implicit feedback, thereby enhancing swift trust in high-stakes environments. The proposed adaptive explainability trust framework (AXTF) leverages physiological and behavioral signals, such as EEG, ECG, and eye tracking, to infer user states and support explanation adaptation. At its core is a multi-objective, personalized trust estimation model that maps workload, stress, and emotion to dynamic trust estimates. These estimates guide the modulation of explanation features enabling responsive and personalized support that promotes swift trust in human-AI collaboration. This conceptual framework establishes a foundation for developing adaptive, non-intrusive XAI systems tailored to the rigorous demands of high-pressure, time-sensitive environments.
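A purely illustrative sketch of the conceptual mapping in AXTF follows: inferred workload, stress, and emotion drive a dynamic trust estimate that modulates how much explanation detail is shown. All weights, thresholds, and feature names are hypothetical; the paper presents a conceptual framework, not equations.

```python
from dataclasses import dataclass

@dataclass
class UserState:
    workload: float   # 0..1, e.g. inferred from EEG band power
    stress: float     # 0..1, e.g. inferred from ECG heart-rate variability
    valence: float    # -1..1, e.g. inferred from eye tracking / affect models

def estimate_trust(s: UserState) -> float:
    """Higher workload/stress and negative affect lower the swift-trust estimate."""
    t = 1.0 - 0.4 * s.workload - 0.4 * s.stress + 0.2 * max(s.valence, 0.0)
    return min(max(t, 0.0), 1.0)

def explanation_mode(trust: float) -> str:
    if trust < 0.3:
        return "detailed rationale + evidence + confidence"  # rebuild trust
    if trust < 0.7:
        return "summary rationale + confidence"
    return "recommendation only"  # terse output to avoid overload

print(explanation_mode(estimate_trust(UserState(workload=0.8, stress=0.6, valence=-0.2))))
```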
zh

[AI-101] Deep Reinforcement Learning for Real-Time Green Energy Integration in Data Centers

【Quick Read】: This paper addresses the high cost, low efficiency, and carbon footprint of energy management in e-commerce data centers, where volatile renewable supply and complex storage-grid coordination make multi-objective optimization hard for traditional methods. The key is a Deep Reinforcement Learning (DRL) approach that dynamically schedules renewable sources, energy storage, and grid power, adapting in real time to balance efficiency, cost, and SLA (Service Level Agreement) guarantees. Experiments show the DRL scheme outperforms conventional reinforcement learning and heuristic baselines, cutting energy costs by 38%, improving energy efficiency by 82%, and reducing carbon emissions by 45%, while keeping the SLA violation rate below 1.5%.

Link: https://arxiv.org/abs/2507.21153
Authors: Abderaouf Bahi,Amel Ourici
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:This paper explores the implementation of a Deep Reinforcement Learning (DRL)-optimized energy management system for e-commerce data centers, aimed at enhancing energy efficiency, cost-effectiveness, and environmental sustainability. The proposed system leverages DRL algorithms to dynamically manage the integration of renewable energy sources, energy storage, and grid power, adapting to fluctuating energy availability in real time. The study demonstrates that the DRL-optimized system achieves a 38% reduction in energy costs, significantly outperforming traditional Reinforcement Learning (RL) methods (28%) and heuristic approaches (22%). Additionally, it maintains a low SLA violation rate of 1.5%, compared to 3.0% for RL and 4.8% for heuristic methods. The DRL-optimized approach also results in an 82% improvement in energy efficiency, surpassing other methods, and a 45% reduction in carbon emissions, making it the most environmentally friendly solution. The system’s cumulative reward of 950 reflects its superior performance in balancing multiple objectives. Through rigorous testing and ablation studies, the paper validates the effectiveness of the DRL model’s architecture and parameters, offering a robust solution for energy management in data centers. The findings highlight the potential of DRL in advancing energy optimization strategies and addressing sustainability challenges.
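A minimal sketch of a multi-objective reward of the kind such a DRL energy manager would optimize, penalizing energy cost, SLA violations, and CO2, is shown below. The weights are hypothetical; the paper reports outcomes, not its exact reward function.

```python
def reward(cost_usd: float, sla_violated: bool, co2_kg: float,
           w_cost: float = 1.0, w_sla: float = 50.0, w_co2: float = 0.5) -> float:
    """Negative weighted sum of the three objectives the study reports on."""
    return -(w_cost * cost_usd + w_sla * float(sla_violated) + w_co2 * co2_kg)

# Each step, the agent chooses a dispatch (renewable / storage / grid shares),
# observes realized cost, SLA status and emissions, and receives, e.g.:
print(reward(cost_usd=120.0, sla_violated=False, co2_kg=40.0))  # -> -140.0
```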
zh

[AI-102] Deep Unfolding for MIMO Signal Detection

【Quick Read】: This paper targets the high computational complexity and performance bottlenecks of signal detection in massive multiple-input multiple-output (MIMO) systems, where traditional methods rely on real-valued approximations of complex-domain signals, causing accuracy loss and model mismatch. The key is a deep-unfolding neural detector, Dynamic Partially Shrinkage Thresholding (DPST), which operates natively in the complex domain using Wirtinger calculus, avoiding real/imaginary separation and aligning with the nature of signal processing. Its structured design requires very few trainable parameters, yielding an efficient, interpretable, low-complexity detector that experiments show achieves superior performance with fewer iterations.

Link: https://arxiv.org/abs/2507.21152
Authors: Hangli Ge,Noboru Koshizuka
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper, we propose a deep unfolding neural network-based MIMO detector that incorporates complex-valued computations using Wirtinger calculus. The method, referred to as Dynamic Partially Shrinkage Thresholding (DPST), enables efficient, interpretable, and low-complexity MIMO signal detection. Unlike prior approaches that rely on real-valued approximations, our method operates natively in the complex domain, aligning with the fundamental nature of signal processing tasks. The proposed algorithm requires only a small number of trainable parameters, allowing for simplified training. Numerical results demonstrate that the proposed method achieves superior detection performance with fewer iterations and lower computational complexity, making it a practical solution for next-generation massive MIMO systems.
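A minimal sketch of a complex-domain shrinkage-thresholding iteration of the kind a deep-unfolded detector stacks per layer follows: a Wirtinger gradient step on ||y - Hx||^2, then soft-thresholding that shrinks magnitudes while preserving phase. The step size and threshold stand in for DPST's (few) trainable parameters; the exact DPST update is not published in this abstract.

```python
import numpy as np

def complex_soft_threshold(x: np.ndarray, tau: float) -> np.ndarray:
    """Shrink |x| by tau, keep the phase; operates natively on complex arrays."""
    mag = np.abs(x)
    return np.maximum(mag - tau, 0.0) / np.maximum(mag, 1e-12) * x

def unfolded_iteration(y: np.ndarray, H: np.ndarray, x: np.ndarray,
                       tau: float = 0.1, step: float = 0.05) -> np.ndarray:
    # Gradient step on ||y - Hx||^2 in the complex domain, then shrinkage.
    grad = H.conj().T @ (H @ x - y)   # Wirtinger gradient w.r.t. conj(x)
    return complex_soft_threshold(x - step * grad, tau)
```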
zh

[AI-103] Advancing Wildfire Risk Prediction via Morphology-Aware Curriculum Contrastive Learning ECAI2025

【Quick Read】: This paper addresses the difficulty of training deep models for wildfire prediction, where events are heavily imbalanced (fires are far rarer than normal conditions) and the high-dimensional spatio-temporal data are complex, while also aiming to cut computational cost so predictions can be refreshed frequently with the latest weather forecasts. The key is a morphology-based curriculum contrastive learning framework that enhances latent representations of a patch's dynamic features, improving robustness to diverse regional characteristics and enabling smaller patch sizes without compromising performance, thus balancing accuracy and efficiency.

Link: https://arxiv.org/abs/2507.21147
Authors: Fabrizio Lo Scudo,Alessio De Rango,Luca Furnari,Alfonso Senatore,Donato D'Ambrosio,Giuseppe Mendicino,Gianluigi Greco
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear in the Proceedings of ECAI 2025

Click to view abstract

Abstract:Wildfires significantly impact natural ecosystems and human health, leading to biodiversity loss, increased hydrogeological risks, and elevated emissions of toxic substances. Climate change exacerbates these effects, particularly in regions with rising temperatures and prolonged dry periods, such as the Mediterranean. This requires the development of advanced risk management strategies that utilize state-of-the-art technologies. However, in this context, the data show a bias toward an imbalanced setting, where the incidence of wildfire events is significantly lower than typical situations. This imbalance, coupled with the inherent complexity of high-dimensional spatio-temporal data, poses significant challenges for training deep learning architectures. Moreover, since precise wildfire predictions depend mainly on weather data, finding a way to reduce computational costs to enable more frequent updates using the latest weather forecasts would be beneficial. This paper investigates how adopting a contrastive framework can address these challenges through enhanced latent representations for the patch’s dynamic features. We thus introduce a new morphology-based curriculum contrastive learning that mitigates issues associated with diverse regional characteristics and enables the use of smaller patch sizes without compromising performance. An experimental analysis is performed to validate the effectiveness of the proposed modeling strategies.
zh

[AI-104] Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems

【Quick Read】: This paper addresses cascading security risks in multi-agent systems, where a single compromised agent can propagate an attack along inter-agent trust chains and amplify its impact into system-wide failure. The key is a new attack vector, Agent Cascading Injection (ACI), formalized through an adversarial goal equation and key variables (the compromised agent, the injected exploit, polluted observations, etc.) that quantify propagation chains, amplification factors, and compound inter-agent effects, and mapped onto OWASP's emerging agentic AI risk categories. This framework provides a quantitative benchmarking methodology for evaluating the security of agent-to-agent communication protocols and for systematically stress-testing multi-agent systems against cascading trust failures.

Link: https://arxiv.org/abs/2507.21146
Authors: Gauri Sharma,Vidhi Kulkarni,Miles King,Ken Huang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Click to view abstract

Abstract:Evolving AI systems increasingly deploy multi-agent architectures where autonomous agents collaborate, share information, and delegate tasks through developing protocols. This connectivity, while powerful, introduces novel security risks. One such risk is a cascading risk: a breach in one agent can cascade through the system, compromising others by exploiting inter-agent trust. In tandem with OWASP’s initiative for an Agentic AI Vulnerability Scoring System we define an attack vector, Agent Cascading Injection, analogous to Agent Impact Chain and Blast Radius, operating across networks of agents. In an ACI attack, a malicious input or tool exploit injected at one agent leads to cascading compromises and amplified downstream effects across agents that trust its outputs. We formalize this attack with an adversarial goal equation and key variables (compromised agent, injected exploit, polluted observations, etc.), capturing how a localized vulnerability can escalate into system-wide failure. We then analyze ACI’s properties – propagation chains, amplification factors, and inter-agent compound effects – and map these to OWASP’s emerging Agentic AI risk categories (e.g. Impact Chain and Orchestration Exploits). Finally, we argue that ACI highlights a critical need for quantitative benchmarking frameworks to evaluate the security of agent-to-agent communication protocols. We outline a methodology for stress-testing multi-agent systems (using architectures such as Google’s A2A and Anthropic’s MCP) against cascading trust failures, developing upon groundwork for measurable, standardized agent-to-agent security evaluation. Our work provides the necessary apparatus for engineers to benchmark system resilience, make data-driven architectural trade-offs, and develop robust defenses against a new generation of agentic threats.
zh

[AI-105] Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents

【Quick Read】: This paper addresses the difficulty of assessing privacy compliance in enterprise environments, where sensitive information is embedded across a heterogeneous, rapidly growing collection of code, data, and tooling artifacts with complex semantic interconnections that traditional discovery and extraction methods cannot handle at scale. The key is Privacy Artifact ConnecTor (PACT), an embeddings-driven graph system that links millions of artifacts of different types through their textual components (raw metadata, ownership details, compliance context), powered by the state-of-the-art DRAGON embedding model lightly fine-tuned with a contrastive learning objective. Experiments show large gains in query match rate, recall, and recommender hit rate, offering a scalable path to large-scale privacy compliance.

Link: https://arxiv.org/abs/2507.21142
Authors: Chenhao Fang,Yanqing Peng,Rajeev Rao,Matt Sarmiento,Wendy Summer,Arya Pudota,Alex Goncalves,Jordi Mola,Hervé Robert
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Enterprise environments contain a heterogeneous, rapidly growing collection of internal artifacts related to code, data, and many different tools. Critical information for assessing privacy risk and ensuring regulatory compliance is often embedded across these varied resources, each with their own arcane discovery and extraction techniques. Therefore, large-scale privacy compliance in adherence to governmental regulations requires systems to discern the interconnected nature of diverse artifacts in a common, shared universe. We present Privacy Artifact ConnecTor (PACT), an embeddings-driven graph that links millions of artifacts spanning multiple artifact types generated by a variety of teams and projects. Powered by the state-of-the-art DRAGON embedding model, PACT uses a contrastive learning objective with light fine-tuning to link artifacts via their textual components such as raw metadata, ownership specifics, and compliance context. Experimental results show that PACT's fine-tuned model improves recall@1 from 18% to 53%, the query match rate from 9.6% to 69.7% when paired with a baseline AI agent, and the hitrate@1 from 25.7% to 44.9% for candidate selection in a standard recommender system.
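A minimal sketch of the contrastive linking objective follows: embeddings of artifact pairs known to be related are pulled together while in-batch negatives are pushed apart (an InfoNCE-style loss). The encoder producing the embeddings stands in for the DRAGON model; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """anchor/positive: [batch, dim] embeddings of linked artifact pairs."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature          # [batch, batch] cosine similarities
    labels = torch.arange(a.size(0))        # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)
```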
zh

[AI-106] The Geometry of Harmfulness in LLMs through Subconcept Probing

【Quick Read】: This paper addresses the identification of, and controlled intervention on, harmful behaviour inside large language models (LLMs), focusing on mechanism-level probing and steering across many kinds of harmful content. The key is a multidimensional, interpretable probing framework over 55 harmfulness subconcepts (e.g., racial hate, employment scams, weapons): a linear probe direction is learned for each subconcept, and together these directions span a strikingly low-rank harmfulness subspace. Ablating the whole subspace, or steering along its dominant direction, nearly eliminates harmful outputs with only a small loss of utility, providing a scalable, practical tool for auditing and hardening future language models.

Link: https://arxiv.org/abs/2507.21141
Authors: McNair Shah,Saleena Angeline,Adhitya Rajendra Kumar,Naitik Chheda,Kevin Zhu,Vasu Sharma,Sean O'Brien,Will Cai
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace’s dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
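A minimal sketch of working in the probed harmfulness subspace follows: stack the 55 learned probe directions, orthonormalize them, then either ablate the whole subspace from an activation or steer along the dominant direction. SVD orthonormalization and the scale alpha are implementation choices, not necessarily the paper's exact recipe.

```python
import torch

def subspace_basis(W: torch.Tensor, rank: int | None = None) -> torch.Tensor:
    """W: [55, d] probe directions -> orthonormal basis [r, d] of their span."""
    _, _, Vh = torch.linalg.svd(W, full_matrices=False)
    return Vh if rank is None else Vh[:rank]

def ablate(h: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of activation h inside the subspace: h - (h B^T) B."""
    return h - (h @ basis.T) @ basis

def steer(h: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift h along the unit-normalized dominant direction (alpha < 0 to reduce harm)."""
    return h + alpha * direction / direction.norm()
```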
zh

[AI-107] Project Patti: Why can You Solve Diabolical Puzzles on one Sudoku Website but not Easy Puzzles on another Sudoku Website?

【Quick Read】: This paper asks whether difficulty ratings are consistent across Sudoku websites, where the core challenge is the lack of a unified cross-site difficulty standard. The key is two quantitative, generalizable difficulty metrics. The first converts a Sudoku puzzle into a Satisfiability (SAT) problem and derives a metric from the SAT clause length distribution, capturing structural complexity, including the number of given digits and their cells. The second simulates human solvers by embedding four popular strategies inside a backtracking algorithm (Nishio) and counts strategy invocations as the difficulty measure. Applying both metrics to over a thousand puzzles from five popular websites, the author finds strong Spearman rank correlations with site-labeled difficulty for 4 out of 5 sites, and builds a simple unsupervised universal classifier that maps individual puzzles and whole difficulty levels into three categories (Universal Easy, Universal Medium, Universal Hard), enabling consistent difficulty mapping across sites.

Link: https://arxiv.org/abs/2507.21137
Authors: Arman Eisenkolb-Vaithyanathan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 24 pages, 8 Figures

Click to view abstract

Abstract:In this paper we try to answer the question “What constitutes Sudoku difficulty rating across different Sudoku websites?” Using two distinct methods that can both solve every Sudoku puzzle, I propose two new metrics to characterize Sudoku difficulty. The first method is based on converting a Sudoku puzzle into its corresponding Satisfiability (SAT) problem. The first proposed metric is derived from SAT Clause Length Distribution which captures the structural complexity of a Sudoku puzzle including the number of given digits and the cells they are in. The second method simulates human Sudoku solvers by intertwining four popular Sudoku strategies within a backtracking algorithm called Nishio. The second metric is computed by counting the number of times Sudoku strategies are applied within the backtracking iterations of a randomized Nishio. Using these two metrics, I analyze more than a thousand Sudoku puzzles across five popular websites to characterize every difficulty level in each website. I evaluate the relationship between the proposed metrics and website-labeled difficulty levels using Spearman’s rank correlation coefficient, finding strong correlations for 4 out of 5 websites. I construct a universal rating system using a simple, unsupervised classifier based on the two proposed metrics. This rating system is capable of classifying both individual puzzles and entire difficulty levels from the different Sudoku websites into three categories - Universal Easy, Universal Medium, and Universal Hard - thereby enabling consistent difficulty mapping across Sudoku websites. The experimental results show that for 4 out of 5 Sudoku websites, the universal classification aligns well with website-labeled difficulty levels. Finally, I present an algorithm that can be used by early Sudoku practitioners to solve Sudoku puzzles.
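A minimal sketch of the raw material for the first metric follows: given a CNF encoding of a Sudoku instance (clauses as lists of literals), compute the clause-length distribution. How the distribution becomes a single difficulty score is the paper's contribution; the mean/entropy summaries here are stand-ins.

```python
from collections import Counter
import math

def clause_length_distribution(clauses: list[list[int]]) -> dict[int, float]:
    """Fraction of clauses at each length, e.g. unit clauses encode given digits."""
    counts = Counter(len(c) for c in clauses)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

def summarize(dist: dict[int, float]) -> tuple[float, float]:
    mean = sum(length * p for length, p in dist.items())
    entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return mean, entropy  # more givens -> more unit clauses -> lower values
```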
zh

[AI-108] A Study on Variants of Conventional Fuzzy and Nullspace-Based Independence Criteria for Improving Supervised and Unsupervised Learning

【Quick Read】: This paper addresses the limitation of conventional unsupervised and supervised learning in capturing the inherent diversity and variability of data under nonlinear structure, where experts must hand-design nonlinear mappings to maximize data variability without systematic guidance. The key is a systematic review of independence criteria, from which three new criteria are proposed and used to build unsupervised and supervised dimensionality reduction methods. Experiments show the proposed methods outperform baselines (tSNE, PCA, regularized LDA, VAE with (un)supervised learners and layer-sharing settings) in contrast, accuracy, and interpretability, opening a new line of interpretable machine learning (ML).

Link: https://arxiv.org/abs/2507.21136
Authors: Mojtaba Moattari
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Unsupervised and supervised learning methods conventionally use kernels to capture nonlinearities inherent in the data structure. However, experts have to ensure that their proposed nonlinearity maximizes variability and captures the inherent diversity of data. We reviewed all independence criteria with a view to designing unsupervised learners. We then proposed 3 independence criteria and used them to design unsupervised and supervised dimensionality reduction methods. We evaluated the contrast, accuracy, and interpretability of these methods in both linear and neural nonlinear settings. The results show that the methods outperformed the baselines (tSNE, PCA, regularized LDA, VAE with (un)supervised learner and layer sharing) and opened a new line of interpretable machine learning (ML) research.
zh

[AI-109] Analysis of Threat-Based Manipulation in Large Language Models: A Dual Perspective on Vulnerabilities and Performance Enhancement Opportunities

【Quick Read】: This paper addresses the complex responses of large language models (LLMs) to threat-based manipulation, seeking a systematic account of both vulnerabilities and potential performance enhancement opportunities. The key is a novel threat taxonomy and a multi-metric evaluation framework that quantify negative manipulation effects as well as positive performance gains. Analysis of 3,390 experimental responses from three major LLMs (Claude, GPT-4, Gemini) across 10 task domains and 6 threat conditions reveals systematic vulnerabilities, with policy evaluation most affected by role-based threats, alongside substantial performance improvements in some settings with effect sizes up to +1336% and statistically significant certainty manipulation (pFDR < 0.0001), indicating exploitable, systematic mechanisms with implications for AI safety and prompt engineering in high-stakes applications.

Link: https://arxiv.org/abs/2507.21133
Authors: Atil Samancioglu
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) demonstrate complex responses to threat-based manipulations, revealing both vulnerabilities and unexpected performance enhancement opportunities. This study presents a comprehensive analysis of 3,390 experimental responses from three major LLMs (Claude, GPT-4, Gemini) across 10 task domains under 6 threat conditions. We introduce a novel threat taxonomy and multi-metric evaluation framework to quantify both negative manipulation effects and positive performance improvements. Results reveal systematic vulnerabilities, with policy evaluation showing the highest metric significance rates under role-based threats, alongside substantial performance enhancements in numerous cases with effect sizes up to +1336%. Statistical analysis indicates systematic certainty manipulation (pFDR < 0.0001) and significant improvements in analytical depth and response quality. These findings have dual implications for AI safety and practical prompt engineering in high-stakes applications.
zh

[AI-110] Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses

【Quick Read】: This paper addresses the lack of standard safeguards when large language models (LLMs) provide high-stakes life advice, where failure modes such as sycophancy and over-confidence create potential harm. The key is a multi-pronged empirical validation and demonstration of controllability: (1) a multiple-choice evaluation measures model stability under user pressure; (2) free-response analysis with a novel safety typology and an LLM judge quantifies stability and safety, showing that top models earn high safety scores by frequently asking clarifying questions rather than issuing prescriptive advice, i.e., cautious inquiry beats assertive prescription; and (3) manipulating a specific "high-stakes" activation vector directly controls a model's cautiousness, pointing to a new, mechanistic-interpretability-based path for safety alignment.

Link: https://arxiv.org/abs/2507.21132
Authors: Joshua Adrian Cahyono,Saran Subramanian
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a “high-stakes” activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model’s cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
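A minimal sketch of the mechanics behind the third experiment follows: add a scaled "high-stakes" vector to one transformer layer's hidden states via a PyTorch forward hook. The module path, layer index, and alpha are hypothetical; the vector itself would be derived from contrasting activations on high- vs. low-stakes prompts.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)   # broadcast over tokens
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model.transformer.h[20], stakes_vector, alpha=4.0)
# ... generate: larger alpha should yield more cautious, question-asking answers ...
# handle.remove()
```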
zh

[AI-111] NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback

【Quick Read】: This paper addresses the lack of a dynamic, operational treatment of alignment in human-in-the-loop decision systems, i.e., how to monitor and optimize alignment continuously under feedback rather than treating it as a static or post-hoc property. The key is NPO, an alignment-aware learning framework with three components: (1) a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback; (2) meta-alignment, the fidelity of the monitoring process that governs retraining or override triggers, shown to be formally reducible to primary alignment via threshold fidelity; and (3) a scalable operational loop of scenario scoring, threshold tuning, policy validation, and structured feedback ingestion (likes, overrides, abstentions), with proofs that alignment loss and monitoring fidelity converge additively under stochastic feedback, bridging theoretical alignment guarantees and practical deployment reliability.

Link: https://arxiv.org/abs/2507.21131
Authors: Madhava Gaikwad(1),Ashwini Ramchandra Doke(2) ((1) Microsoft, (2) Amrita University)
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 20 pages

Click to view abstract

Abstract:We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. Unlike prior approaches that treat alignment as a static or post-hoc property, NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback. In parallel, we propose meta-alignment as the fidelity of the monitoring process that governs retraining or override triggers, and show that it is formally reducible to primary alignment via threshold fidelity. Our implementation spans a scalable operational loop involving scenario scoring, threshold tuning, policy validation, and structured feedback ingestion, including “likes”, overrides, and abstentions. We provide formal convergence results under stochastic feedback and show that both alignment loss and monitoring fidelity converge additively. Empirically, NPO demonstrates measurable value in hyperscale deployment settings. A simulation-based artifact and ablation studies further illustrate the theoretical principles in action. Together, NPO offers a compact, inspectable architecture for continual alignment monitoring, helping bridge theoretical alignment guarantees with practical reliability in dynamic environments.
zh

[AI-112] INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems

【Quick Read】: This paper addresses the lack of a systematic benchmark for evaluating large language model (LLM) performance on definite integral computation, which has left model capability on this class of mathematical reasoning hard to measure objectively. The key is INTEGRALBENCH, a focused benchmark of definite integral problems with both symbolic and numerical ground-truth solutions and manual difficulty annotations. Evaluating nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy, establishing rigorous, reproducible baselines for automated mathematical reasoning.

Link: https://arxiv.org/abs/2507.21130
Authors: Bintao Tang,Xin Yang,Yuhao Wang,Zixuan Qiu,Zimo Ji,Wenyuan Jiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures

Click to view abstract

Abstract:We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty annotations. Our evaluation of nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy, establishing baseline metrics for this challenging domain. INTEGRALBENCH aims to advance automated mathematical reasoning by providing a rigorous evaluation framework specifically tailored for definite integral computation.
zh

[AI-113] Measuring and Analyzing Intelligence via Contextual Uncertainty in Large Language Models using Information-Theoretic Metrics

【Quick Read】: This paper addresses the gap between the strong benchmark performance of large language models (LLMs) and our limited understanding of how they process information internally, moving beyond performance metrics toward characterizing information-processing dynamics. The key is a task-agnostic methodology that builds a quantitative "Cognitive Profile" for any model, centered on the Entropy Decay Curve, which visualizes how normalized predictive uncertainty changes with context length; the Information Gain Span (IGS) index summarizes the desirability of the decay trajectory. Applied to several state-of-the-art LLMs across diverse texts, the approach uncovers distinct, consistent cognitive profiles sensitive to both model scale and text complexity, offering a principled, comparable, and interpretable lens on models' intrinsic dynamics.

Link: https://arxiv.org/abs/2507.21129
Authors: Jae Wan Shim
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The remarkable capabilities of Large Language Models (LLMs) are now extensively documented on task-specific benchmarks, yet the internal mechanisms that produce these results are the subject of intense scientific inquiry. This paper contributes to this inquiry by moving beyond metrics that measure *what* models can do, to a methodology that characterizes *how* they process information. We introduce a novel, task-agnostic approach to probe these dynamics by creating a quantitative "Cognitive Profile" for any given model. This profile is centered on the **Entropy Decay Curve**, a visualization that traces how a model's normalized predictive uncertainty changes as a function of context length. Applying this methodology to several state-of-the-art LLMs across diverse texts, we uncover unique and consistent cognitive profiles that are sensitive to both model scale and text complexity. We also introduce the Information Gain Span (IGS) index to summarize the desirability of the decay trajectory. This work thus provides a new, principled lens for analyzing and comparing the intrinsic operational dynamics of artificial intelligence.
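A minimal sketch of computing such a curve follows, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`. The normalization by the shortest-context entropy and the IGS-as-area reading are plausible interpretations of the abstract, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_decay_curve(model, input_ids: torch.Tensor, lengths: list[int]) -> list[float]:
    """Next-token predictive entropy at increasing context lengths, normalized."""
    entropies = []
    for n in lengths:
        logits = model(input_ids[:, :n]).logits[:, -1, :]   # next-token logits
        logp = F.log_softmax(logits, dim=-1)
        h = -(logp.exp() * logp).sum(-1).mean().item()
        entropies.append(h)
    return [h / entropies[0] for h in entropies]

def information_gain_span(norm_entropies: list[float]) -> float:
    # Area between the no-gain line (1.0) and the observed decay curve.
    return sum(1.0 - h for h in norm_entropies)
```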
zh

[AI-114] RATE: An LLM-Powered Retrieval Augmented Generation Technology-Extraction Pipeline

【Quick Read】: This paper addresses the accuracy and coverage of automated technology extraction from scientific literature, where traditional methods struggle to achieve both high recall and high precision in fast-evolving fields. The key is Retrieval Augmented Technology Extraction (RATE), an LLM-based pipeline that combines Retrieval Augmented Generation (RAG) with multi-definition LLM validation: candidate generation attains high recall while the filtering stage attains high precision. In a case study of 678 articles on Brain-Computer Interfaces (BCIs) and Extended Reality (XR), RATE achieves an F1-score of 91.27%, far surpassing a BERT baseline at 53.73%, demonstrating the value of definition-driven LLM methods for technology extraction and mapping.

Link: https://arxiv.org/abs/2507.21125
Authors: Karan Mirhosseini,Arya Aftab,Alireza Sheikh
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 9 pages, 4 figures, 1 table

Click to view abstract

Abstract:In an era of radical technology transformations, technology maps play a crucial role in enhancing decision making. These maps heavily rely on automated methods of technology extraction. This paper introduces Retrieval Augmented Technology Extraction (RATE), a Large Language Model (LLM) based pipeline for automated technology extraction from scientific literature. RATE combines Retrieval Augmented Generation (RAG) with multi-definition LLM-based validation. This hybrid method results in high recall in candidate generation alongside high precision in candidate filtering. While the pipeline is designed to be general and widely applicable, we demonstrate its use on 678 research articles focused on Brain-Computer Interfaces (BCIs) and Extended Reality (XR) as a case study. Consequently, the technology terms validated by RATE were mapped into a co-occurrence network, revealing thematic clusters and structural features of the research landscape. For the purpose of evaluation, a gold standard dataset of technologies in 70 randomly selected articles was curated by the experts. In addition, a technology extraction model based on Bidirectional Encoder Representations from Transformers (BERT) was used as a comparative method. RATE achieved an F1-score of 91.27%, significantly outperforming BERT with an F1-score of 53.73%. Our findings highlight the promise of definition-driven LLM methods for technology extraction and mapping. They also offer new insights into emerging trends within the BCI-XR field. The source code is available at this https URL
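A minimal sketch of the two-stage flow described above follows: high-recall candidate generation over retrieved context, then high-precision filtering by majority vote across several definitions of "technology". The retriever and llm() callables, the prompts, and the definition texts are hypothetical placeholders, not the paper's artifacts.

```python
DEFINITIONS = [
    "a tool, device, or engineered system applied to a practical goal",
    "a computational method or algorithmic technique",
    "an applied scientific process or material innovation",
]

def extract_technologies(article: str, retriever, llm) -> list[str]:
    context = retriever.search(article, k=5)  # hypothetical retrieval API
    raw = llm("List every candidate technology term, one per line.\n"
              f"Context:\n{context}\n\nArticle:\n{article}")
    validated = []
    for term in {t.strip() for t in raw.splitlines() if t.strip()}:
        votes = [llm(f"Definition: {d}\nIs '{term}' a technology by this "
                     "definition? Answer yes or no.") for d in DEFINITIONS]
        if sum(v.strip().lower().startswith("yes") for v in votes) > len(votes) / 2:
            validated.append(term)  # keep terms passing the multi-definition check
    return validated
```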
zh

[AI-115] VizGenie: Toward Self-Refining Domain-Aware Workflows for Next-Generation Scientific Visualization

【Quick Read】: This paper addresses the heavy cognitive burden, poor extensibility, and low interaction efficiency of scientific visualization tools, especially for complex volumetric data where traditional approaches cannot support flexible, reproducible, feature-centric exploration. The key is VizGenie, a self-improving agentic framework in which a large language model (LLM) dynamically generates and validates visualization scripts (e.g., VTK Python code) to extend capabilities on demand; it combines image-based analysis and visual question answering (VQA) via fine-tuned vision models to interpret high-level natural-language queries (e.g., "visualize the skull") and to support interactive exploration of generated results, while Retrieval-Augmented Generation (RAG) provides context-grounded responses and provenance records for reliability and reproducibility, significantly reducing cognitive overhead in iterative visualization tasks.

Link: https://arxiv.org/abs/2507.21124
Authors: Ayan Biswas,Terece L. Turton,Nishath Rajiv Ranasinghe,Shawn Jones,Bradley Love,William Jones,Aric Hagberg,Han-Wei Shen,Nathan DeBardeleben,Earl Lawrence
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present VizGenie, a self-improving, agentic framework that advances scientific visualization through large language models (LLMs) by orchestrating a collection of domain-specific and dynamically generated modules. Users initially access core functionalities, such as threshold-based filtering, slice extraction, and statistical analysis, through pre-existing tools. For tasks beyond this baseline, VizGenie autonomously employs LLMs to generate new visualization scripts (e.g., VTK Python code), expanding its capabilities on-demand. Each generated script undergoes automated backend validation and is seamlessly integrated upon successful testing, continuously enhancing the system's adaptability and robustness. A distinctive feature of VizGenie is its intuitive natural language interface, allowing users to issue high-level feature-based queries (e.g., "visualize the skull"). The system leverages image-based analysis and visual question answering (VQA) via fine-tuned vision models to interpret these queries precisely, bridging domain expertise and technical implementation. Additionally, users can interactively query generated visualizations through VQA, facilitating deeper exploration. Reliability and reproducibility are further strengthened by Retrieval-Augmented Generation (RAG), providing context-driven responses while maintaining comprehensive provenance records. Evaluations on complex volumetric datasets demonstrate significant reductions in cognitive overhead for iterative visualization tasks. By integrating curated domain-specific tools with LLM-driven flexibility, VizGenie not only accelerates insight generation but also establishes a sustainable, continuously evolving visualization practice. The resulting platform dynamically learns from user interactions, consistently enhancing support for feature-centric exploration and reproducible research in scientific visualization.
zh

[AI-116] Leveraging Generative AI to Enhance Synthea Module Development

【Quick Read】: This paper addresses the low efficiency, expertise dependence, limited diversity, and quality-assurance challenges of developing disease modules for synthetic health data generation. The key is using large language models (LLMs) to support module development in Synthea, an open-source synthetic health data generator, in four ways: generating disease profiles, generating modules from profiles, evaluating existing modules, and refining them. A "progressive refinement" process iteratively checks LLM-generated modules for syntactic correctness and clinical accuracy and uses the results to revise them, improving the quality and usefulness of synthetic patient data while acknowledging the need for human oversight and rigorous validation.

Link: https://arxiv.org/abs/2507.21123
Authors: Mark A. Kramer,Aanchal Mathur,Caroline E. Adams,Jason A. Walonoski
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Title: Leveraging Generative AI to Enhance Synthea Module Development Word Count: [Approximately 12,000 words] Figures: 3 Tables: 3 Supplementary Material: Extensive appendices with prompts and disease profiles

Click to view abstract

Abstract:This paper explores the use of large language models (LLMs) to assist in the development of new disease modules for Synthea, an open-source synthetic health data generator. Incorporating LLMs into the module development process has the potential to reduce development time, reduce required expertise, expand model diversity, and improve the overall quality of synthetic patient data. We demonstrate four ways that LLMs can support Synthea module creation: generating a disease profile, generating a disease module from a disease profile, evaluating an existing Synthea module, and refining an existing module. We introduce the concept of progressive refinement, which involves iteratively evaluating the LLM-generated module by checking its syntactic correctness and clinical accuracy, and then using that information to modify the module. While the use of LLMs in this context shows promise, we also acknowledge the challenges and limitations, such as the need for human oversight, the importance of rigorous testing and validation, and the potential for inaccuracies in LLM-generated content. The paper concludes with recommendations for future research and development to fully realize the potential of LLM-aided synthetic data creation.
zh

[AI-117] Affect-aware Cross-Domain Recommendation for Art Therapy via Music Preference Elicitation

【Quick Read】: This paper addresses the limitation of current visual art recommender systems (VA RecSys) for art therapy (AT), which model users from visual stimuli alone and thus cannot capture the full spectrum of emotional responses during preference elicitation. The key is music-driven preference elicitation for cross-domain recommendation (CDR): the unique affective reflections elicited by music are leveraged to personalize therapeutic artwork recommendations. A large-scale study with 200 users shows that music-driven elicitation significantly outperforms the classic visual-only approach, effectively enhancing personalization.

Link: https://arxiv.org/abs/2507.21120
Authors: Bereket A. Yilma,Luis A. Leiva
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at the 19th ACM Conference on Recommender Systems

Click to view abstract

Abstract:Art Therapy (AT) is an established practice that facilitates emotional processing and recovery through creative expression. Recently, Visual Art Recommender Systems (VA RecSys) have emerged to support AT, demonstrating their potential by personalizing therapeutic artwork recommendations. Nonetheless, current VA RecSys rely on visual stimuli for user modeling, limiting their ability to capture the full spectrum of emotional responses during preference elicitation. Previous studies have shown that music stimuli elicit unique affective reflections, presenting an opportunity for cross-domain recommendation (CDR) to enhance personalization in AT. Since CDR has not yet been explored in this context, we propose a family of CDR methods for AT based on music-driven preference elicitation. A large-scale study with 200 users demonstrates the efficacy of music-driven preference elicitation, outperforming the classic visual-only elicitation approach. Our source code, data, and models are available at this https URL
zh

[AI-118] Failure Risk Prediction in a MOOC: A Multivariate Time Series Analysis Approach

【Quick Read】: This paper addresses low completion rates in massive open online courses (MOOCs), driven largely by the lack of personalized content, by predicting learner performance so that tailored feedback can identify at-risk learners. The key is treating behavioral traces (clicks and events) as multivariate time series and comparing multivariate time series classification methods for predicting outcomes at different course stages (after 5 weeks, 10 weeks, etc.). Experiments on the Open University Learning Analytics Dataset (OULAD) over three courses (two STEM, one SHS) show the evaluated approaches are promising for failure prediction, while also indicating that accuracy depends on the richness of recorded interactions, underscoring the importance of rich, diverse behavioral data.

Link: https://arxiv.org/abs/2507.21118
Authors: Anass El Ayady(Crem, IRIMAS),Maxime Devanne(IRIMAS),Germain Forestier(IRIMAS),Nour El Mawas(Crem)
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in French language, Environnements Informatiques pour l'Apprentissage Humain 2025, Jun 2025, Villeneuve d'Ascq (Lille), France

Click to view abstract

Abstract:MOOCs offer free and open access to a wide audience, but completion rates remain low, often due to a lack of personalized content. To address this issue, it is essential to predict learner performance in order to provide tailored feedback. Behavioral traces-such as clicks and events-can be analyzed as time series to anticipate learners’ outcomes. This work compares multivariate time series classification methods to identify at-risk learners at different stages of the course (after 5, 10 weeks, etc.). The experimental evaluation, conducted on the Open University Learning Analytics Dataset (OULAD), focuses on three courses: two in STEM and one in SHS. Preliminary results show that the evaluated approaches are promising for predicting learner failure in MOOCs. The analysis also suggests that prediction accuracy is influenced by the amount of recorded interactions, highlighting the importance of rich and diverse behavioral data.
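A minimal sketch of the prediction task follows: classify learners as at-risk from the first k weeks of behavioral traces treated as multivariate time series (here flattened into features for a standard classifier). The array layout and the flattening shortcut are simplifications of the dedicated time-series classifiers the paper compares.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: [n_learners, n_weeks, n_event_types] weekly interaction counts per activity
# y: [n_learners] binary pass/fail labels
def at_risk_model(X: np.ndarray, y: np.ndarray, weeks: int = 5):
    X_k = X[:, :weeks, :].reshape(len(X), -1)   # use only the first k weeks
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("AUC:", cross_val_score(clf, X_k, y, cv=5, scoring="roc_auc").mean())
    return clf.fit(X_k, y)
```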
zh

[AI-119] A Comprehensive Review on Harnessing Large Language Models to Overcome Recommender System Challenges

【Quick Read】: This paper addresses the bottlenecks of traditional recommender systems: sparse and noisy interaction data, cold-start problems, limited personalization depth, and inadequate semantic understanding of user and item content. The key is leveraging large language models (LLMs) as a unified, language-native paradigm: prompt-driven candidate retrieval, language-native ranking, retrieval-augmented generation (RAG), and conversational recommendation improve personalization, semantic alignment, and interpretability without heavy task-specific supervision, while zero- and few-shot reasoning handles cold-start and long-tail scenarios via external knowledge and contextual cues. The survey categorizes these emerging LLM-driven architectures, analyzes how they mitigate core pipeline bottlenecks, and outlines trade-offs among accuracy, scalability, and real-time performance, positioning LLMs as foundational enablers, rather than merely auxiliary components, of more adaptive, semantically rich, user-centric recommenders.

Link: https://arxiv.org/abs/2507.21117
Authors: Rahul Raja,Anshaj Vats,Arpita Vats,Anirban Majumder
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recommender systems have traditionally followed modular architectures comprising candidate generation, multi-stage ranking, and re-ranking, each trained separately with supervised objectives and hand-engineered features. While effective in many domains, such systems face persistent challenges including sparse and noisy interaction data, cold-start problems, limited personalization depth, and inadequate semantic understanding of user and item content. The recent emergence of Large Language Models (LLMs) offers a new paradigm for addressing these limitations through unified, language-native mechanisms that can generalize across tasks, domains, and modalities. In this paper, we present a comprehensive technical survey of how LLMs can be leveraged to tackle key challenges in modern recommender systems. We examine the use of LLMs for prompt-driven candidate retrieval, language-native ranking, retrieval-augmented generation (RAG), and conversational recommendation, illustrating how these approaches enhance personalization, semantic alignment, and interpretability without requiring extensive task-specific supervision. LLMs further enable zero- and few-shot reasoning, allowing systems to operate effectively in cold-start and long-tail scenarios by leveraging external knowledge and contextual cues. We categorize these emerging LLM-driven architectures and analyze their effectiveness in mitigating core bottlenecks of conventional pipelines. In doing so, we provide a structured framework for understanding the design space of LLM-enhanced recommenders, and outline the trade-offs between accuracy, scalability, and real-time performance. Our goal is to demonstrate that LLMs are not merely auxiliary components but foundational enablers for building more adaptive, semantically rich, and user-centric recommender systems
zh

[AI-120] FedFlex: Federated Learning for Diverse Netflix Recommendations

【Quick Read】: This paper addresses the long-neglected fairness and diversity aspects of federated recommender systems, where existing work focuses mainly on accuracy. The key is FedFlex, a federated framework for Netflix-style TV series recommendation that integrates two state-of-the-art matrix factorization algorithms (SVD and BPR) for personalized fine-tuning and applies Maximal Marginal Relevance (MMR) to re-rank items, enhancing the diversity of recommendations, e.g., by introducing new genres, without necessarily compromising user satisfaction, as shown in a live two-week user study.

Link: https://arxiv.org/abs/2507.21115
Authors: Sven Lankester,Manel Slokom,Gustavo de Carvalho Bertoli,Matias Vizcaino,Emmanuelle Beauxis Aussalet,Laura Hollink
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated learning is a decentralized approach that enables collaborative model training across multiple devices while preserving data privacy. It has shown significant potential in various domains, including healthcare and personalized recommendation systems. However, most existing work on federated recommendation systems has focused primarily on improving accuracy, with limited attention to fairness and diversity. In this paper, we introduce FedFlex, a federated recommender system for Netflix-style TV series recommendations. FedFlex integrates two state-of-the-art matrix factorization algorithms for personalized fine-tuning. FedFlex also applies Maximal Marginal Relevance (MMR) to re-rank items and enhance diversity. We conduct extensive experiments comparing recommendations generated by SVD and BPR algorithms. In a live two-week user study, participants received two recommendation lists: List A, based on SVD or BPR, and List B, a re-ranked version emphasizing diversity. Participants were asked to click on the movies they were interested in watching. Our findings demonstrate that FedFlex effectively introduces diverse content, such as new genres, into recommendations without necessarily compromising user satisfaction.
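A minimal sketch of the Maximal Marginal Relevance re-ranking applied on top of the SVD/BPR scores follows: iteratively pick the item that best trades off predicted relevance against similarity to what is already selected. The lambda value and cosine item similarity are typical choices, not necessarily the paper's.

```python
import numpy as np

def mmr_rerank(scores: np.ndarray, item_vecs: np.ndarray,
               k: int = 10, lam: float = 0.7) -> list[int]:
    """scores: predicted relevance per item; item_vecs: [n, d] item embeddings."""
    norms = np.linalg.norm(item_vecs, axis=1)
    sims = item_vecs @ item_vecs.T / (norms[:, None] * norms[None, :] + 1e-12)
    selected: list[int] = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i: int) -> float:
            redundancy = max(sims[i, j] for j in selected) if selected else 0.0
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```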
zh

[AI-121] A Formal Rebuttal of “The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization Security and Scalability”

【Quick Read】: This paper targets the widely cited but formally ungrounded "blockchain trilemma", the claim of an inherent trade-off among decentralization, security, and scalability. Through formal analysis, empirical evidence, and a critique of methodology and terminology, it argues the trilemma rests on semantic equivocation, misuse of distributed systems theory, and undefined operational metrics. The key move is reconstructing Bitcoin as a deterministic, stateless distribution protocol governed by evidentiary trust and clearly distinguishing topological network analogies from protocol-level architecture, concluding that scalability is an engineering outcome rather than a trade-off.

Link: https://arxiv.org/abs/2507.21111
Authors: Craig Wright
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Software Engineering (cs.SE)
Comments: 79 pages; A response and rebuttal of [Mssassi, Souhail, and Anas Abou El Kalam. "The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization, Security, and Scalability." Applied Sciences 15, no. 1 (2024): 19. this https URL .]

Click to view abstract

Abstract:This paper presents a comprehensive refutation of the so-called “blockchain trilemma,” a widely cited but formally ungrounded claim asserting an inherent trade-off between decentralisation, security, and scalability in blockchain protocols. Through formal analysis, empirical evidence, and detailed critique of both methodology and terminology, we demonstrate that the trilemma rests on semantic equivocation, misuse of distributed systems theory, and a failure to define operational metrics. Particular focus is placed on the conflation of topological network analogies with protocol-level architecture, the mischaracterisation of Bitcoin’s design–including the role of miners, SPV clients, and header-based verification–and the failure to ground claims in complexity-theoretic or adversarial models. By reconstructing Bitcoin as a deterministic, stateless distribution protocol governed by evidentiary trust, we show that scalability is not a trade-off but an engineering outcome. The paper concludes by identifying systemic issues in academic discourse and peer review that have allowed such fallacies to persist, and offers formal criteria for evaluating future claims in blockchain research.
zh

[AI-122] Task-Focused Consolidation with Spaced Recall: Making Neural Networks learn like college students

【Quick Read】: This paper addresses catastrophic forgetting in continual learning, where a deep neural network's performance on earlier tasks degrades as it learns new ones. The key is Task Focused Consolidation with Spaced Recall (TFC-SR), inspired by human strategies such as active recall, deliberate practice, and spaced repetition; its core innovation is the Active Recall Probe, a periodic, task-aware evaluation of the model's memory that stabilizes representations of past knowledge and thereby mitigates forgetting. On the Split MNIST and Split CIFAR-100 benchmarks, TFC-SR significantly outperforms leading regularization- and replay-based baselines (e.g., 13.17% vs. 7.40% final accuracy on Split CIFAR-100), with the advantage traced to the stabilizing effect of the probe itself rather than differences in replay-buffer volume.

Link: https://arxiv.org/abs/2507.21109
Authors: Prital Bamnodkar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Deep Neural Networks often suffer from a critical limitation known as Catastrophic Forgetting, where performance on past tasks degrades after learning new ones. This paper introduces a novel continual learning approach inspired by human learning strategies like Active Recall, Deliberate Practice and Spaced Repetition, named Task Focused Consolidation with Spaced Recall (TFC-SR). TFC-SR enhances the standard experience replay with a mechanism we termed the Active Recall Probe. It is a periodic, task-aware evaluation of the model’s memory that stabilizes the representations of past knowledge. We test TFC-SR on the Split MNIST and Split CIFAR-100 benchmarks against leading regularization-based and replay-based baselines. Our results show that TFC-SR performs significantly better than these methods. For instance, on the Split CIFAR-100, it achieves a final accuracy of 13.17% compared to standard replay’s 7.40%. We demonstrate that this advantage comes from the stabilizing effect of the probe itself, and not from the difference in replay volume. Additionally, we analyze the trade-off between memory size and performance and show that while TFC-SR performs better in memory-constrained environments, higher replay volume is still more effective when available memory is abundant. We conclude that TFC-SR is a robust and efficient approach, highlighting the importance of integrating active memory retrieval mechanisms into continual learning systems.
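A minimal sketch of a TFC-SR-style training loop follows, assuming hypothetical helpers train_step() and evaluate() and a replay buffer object; the probe interval and probing-by-evaluation are illustrative readings of the "Active Recall Probe", not the paper's published code.

```python
def train_tfc_sr(model, tasks, replay_buffer, probe_sets, probe_every=3):
    for t, task_loader in enumerate(tasks):
        for step, batch in enumerate(task_loader):
            train_step(model, batch)                       # current-task update
            if len(replay_buffer) > 0:
                train_step(model, replay_buffer.sample())  # standard replay update
            if step % probe_every == 0:                    # spaced, task-aware recall
                for past in range(t):
                    acc = evaluate(model, probe_sets[past])  # probe past-task memory
                    print(f"probe task {past}: acc={acc:.3f}")
        replay_buffer.add_from(task_loader)                # retain task exemplars
```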
zh

[AI-123] Assessing the Ecological Impact of AI

【Quick Read】: This paper addresses the lack of comprehensive ecological impact assessment in current generative AI (genAI) development, where developers rarely give concrete estimates and, when they do, often restrict analysis to the greenhouse gas emissions of certain stages of development or use, neglecting broader sustainability dimensions. The key is bringing a philosophical perspective to bear, encouraging practically viable sustainability analyses of genAI informed by philosophical ideas, to remedy gaps in environmental ethics and systemic-impact awareness in existing technical assessments.

Link: https://arxiv.org/abs/2507.21102
Authors: Sylvia Wenmackers
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This was presented as a lightning talk at: LOCO 2024, December 3, 2024, Glasgow/Online

Click to view abstract

Abstract:Philosophers of technology have recently started paying more attention to the environmental impacts of AI, in particular of large language models (LLMs) and generative AI (genAI) applications. Meanwhile, few developers of AI give concrete estimates of the ecological impact of their models and products, and even when they do so, their analysis is often limited to green house gas emissions of certain stages of AI development or use. The current proposal encourages practically viable analyses of the sustainability aspects of genAI informed by philosophical ideas.
zh

[AI-124] Artificial intelligence for sustainable wine industry: AI-driven management in viticulture, wine production and enotourism ECAI2025

【Quick Read】: This paper addresses how the wine industry can achieve sustainable development and operational efficiency under combined environmental and economic pressure. The key is AI-driven intelligent management across viticulture, wine production, and enotourism: drawing on a questionnaire survey of Polish winemakers and an analysis of applicable AI methods, the study shows that predictive analytics, machine learning, and computer vision optimize resource use (vineyard monitoring, irrigation, production processes) and reduce environmental impact, while AI-powered chatbots, recommendation systems, and virtual tastings personalize consumer experiences, jointly advancing economic, environmental, and social sustainability.

Link: https://arxiv.org/abs/2507.21098
Authors: Marta Sidorkiewicz,Karolina Królikowska,Berenika Dyczek,Edyta Pijet-Migon,Anna Dubel
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 4 figures. Accepted for presentation at the 27th European Conference on Artificial Intelligence (ECAI 2025), October 19-24, 2025, Bologna, Italy

Click to view abstract

Abstract:This study examines the role of Artificial Intelligence (AI) in enhancing sustainability and efficiency within the wine industry. It focuses on AI-driven intelligent management in viticulture, wine production, and enotourism. As the wine industry faces environmental and economic challenges, AI offers innovative solutions to optimize resource use, reduce environmental impact, and improve customer engagement. Understanding AI’s potential in sustainable winemaking is crucial for fostering responsible and efficient industry practices. The research is based on a questionnaire survey conducted among Polish winemakers, combined with a comprehensive analysis of AI methods applicable to viticulture, production, and tourism. Key AI technologies, including predictive analytics, machine learning, and computer vision, are explored. The findings indicate that AI enhances vineyard monitoring, optimizes irrigation, and streamlines production processes, contributing to sustainable resource management. In enotourism, AI-powered chatbots, recommendation systems, and virtual tastings personalize consumer experiences. The study highlights AI’s impact on economic, environmental, and social sustainability, supporting local wine enterprises and cultural heritage. Keywords: Artificial Intelligence, Sustainable Development, AI-Driven Management, Viticulture, Wine Production, Enotourism, Wine Enterprises, Local Communities
zh

[AI-125] The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment

【Quick Read】: This paper addresses the difficulty of ensuring ethical, value-aligned interactions for conversational agents (CAs) based on generative AI, where top-down approaches built on technical guidelines or legal principles are often disconnected from the concrete contexts of use and hence misaligned with users' interests. The key is a bottom-up approach to value alignment that applies the value ontology of the ISO Value-Based Engineering standard: analyzing 593 ethically sensitive system outputs identified in 16,908 conversational logs of a major European employment service CA, the study surfaces 9 core values and 32 distinct value misalignments that negatively affected users, yielding actionable, context-sensitive guidance for CA providers.

Link: https://arxiv.org/abs/2507.21091
Authors: Lenart Motnikar,Katharina Baum,Alexander Kagan,Sarah Spiekermann-Hoff
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Thirty-Third European Conference on Information Systems (ECIS 2025), Amman, Jordan

Click to view abstract

Abstract:Conversational agents (CAs) based on generative artificial intelligence frequently face challenges ensuring ethical interactions that align with human values. Current value alignment efforts largely rely on top-down approaches, such as technical guidelines or legal value principles. However, these methods tend to be disconnected from the specific contexts in which CAs operate, potentially leading to misalignment with users' interests. To address this challenge, we propose a novel, bottom-up approach to value alignment, utilizing the value ontology of the ISO Value-Based Engineering standard for ethical IT design. We analyse 593 ethically sensitive system outputs identified from 16,908 conversational logs of a major European employment service CA to identify core values and instances of value misalignment within real-world interactions. The results revealed nine core values and 32 different value misalignments that negatively impacted users. Our findings provide actionable insights for CA providers seeking to address ethical challenges and achieve more context-sensitive value alignment.
zh

[AI-126] Thinking Like a Scientist: Can Interactive Simulations Foster Critical AI Literacy?

【Quick Read】: This paper addresses the inability of traditional AI literacy formats (blog posts, static lessons, and social media discussions) to support deep conceptual understanding and critical engagement. The key is interactive simulations that let learners think like scientists, through hypothesis testing, experimentation, and direct observation of AI behavior, improving understanding of key concepts such as fairness, dataset representativeness, and bias in language models. A controlled study with 605 participants shows that interactive, inquiry-driven tutorials effectively enhance AI literacy, knowledge transfer, and self-reported confidence, while engagement alone does not predict learning.

Link: https://arxiv.org/abs/2507.21090
Authors: Yiling Zhao,Audrey Michal,Nithum Thain,Hari Subramonyam
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As AI systems shape individual and societal decisions, fostering critical AI literacy is essential. Traditional approaches, such as blog articles, static lessons, and social media discussions, often fail to support deep conceptual understanding and critical engagement. This study examines whether interactive simulations can help learners think like a scientist by engaging them in hypothesis testing, experimentation, and direct observation of AI behavior. In a controlled study with 605 participants, we assess how interactive AI tutorials impact learning of key concepts such as fairness, dataset representativeness, and bias in language models. Results show that interactive simulations effectively enhance AI literacy across topics, supporting greater knowledge transfer and self-reported confidence, though engagement alone does not predict learning. This work contributes to the growing field of AI literacy education, highlighting how interactive, inquiry-driven methodologies can better equip individuals to critically engage with AI in their daily lives.
zh

[AI-127] Empathy in Explanation

【Quick Read】: This paper asks how humans take emotion into account when giving explanations, treating explanation as a cooperative social interaction between a why-question-asker and an explainer. The key is a computational framework that models an explainer (e.g., a doctor) who considers the emotional impact an explanation may have on a listener (e.g., a patient), in particular the patient's propensity for regret. By incorporating affective factors into explanation generation, the model predicts human intuitions well, better than emotion-agnostic ablations, suggesting that people do reason about emotional consequences when explaining.

Link: https://arxiv.org/abs/2507.21081
Authors: Katherine M. Collins,Kartik Chandra,Adrian Weller,Jonathan Ragan-Kelley,Joshua B. Tenenbaum
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: CogSci non-archival conference paper

Click to view abstract

Abstract:Why do we give the explanations we do? Recent work has suggested that we should think of explanation as a kind of cooperative social interaction, between a why-question-asker and an explainer. Here, we apply this perspective to consider the role that emotion plays in this social interaction. We develop a computational framework for modeling explainers who consider the emotional impact an explanation might have on a listener. We test our framework by using it to model human intuitions about how a doctor might explain to a patient why they have a disease, taking into account the patient’s propensity for regret. Our model predicts human intuitions well, better than emotion-agnostic ablations, suggesting that people do indeed reason about emotion when giving explanations.
zh

[AI-128] Data-Driven and Participatory Approaches toward Neuro-Inclusive AI

【Quick Read】: This paper addresses anti-autistic bias in current AI systems, particularly medical applications that treat autism as a deficit of neurotypical social skills rather than an aspect of human diversity, a perspective rooted in research questioning autistic people's humanity. Because AI development commonly benchmarks machine intelligence on mimicking human behavior, 90% of human-like AI agents exclude autistic perspectives, and AI creators widely regard ethical considerations as beyond the scope of their work. The key is "Neuro-Inclusive AI", datasets and systems that move away from humanness-mimicry as the benchmark, together with empirical experiments with annotators and LLMs showing that binary labeling schemes sufficiently capture the nuances of anti-autistic hate speech, and the AUTALIC benchmark for evaluating or fine-tuning models as a foundation for more neuro-inclusive future work.

Link: https://arxiv.org/abs/2507.21077
Authors: Naba Rizvi
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: PhD Dissertation at UC San Diego (June 2025)

Click to view abstract

Abstract:Biased data representation in AI marginalizes up to 75 million autistic people worldwide through medical applications viewing autism as a deficit of neurotypical social skills rather than an aspect of human diversity, and this perspective is grounded in research questioning the humanity of autistic people. Turing defined artificial intelligence as the ability to mimic human communication, and as AI development increasingly focuses on human-like agents, this benchmark remains popular. In contrast, we define Neuro-Inclusive AI as datasets and systems that move away from mimicking humanness as a benchmark for machine intelligence. Then, we explore the origins, prevalence, and impact of anti-autistic biases in current research. Our work finds that 90% of human-like AI agents exclude autistic perspectives, and AI creators continue to believe ethical considerations are beyond the scope of their work. To improve the autistic representation in data, we conduct empirical experiments with annotators and LLMs, finding that binary labeling schemes sufficiently capture the nuances of labeling anti-autistic hate speech. Our benchmark, AUTALIC, can be used to evaluate or fine-tune models, and was developed to serve as a foundation for more neuro-inclusive future work.
zh

[AI-129] Empowering Educators in the Age of AI: An Empirical Study on Creating custom GPTs in Qualitative Research Method education

【Quick Read】: This paper addresses two gaps: research on generative AI (Gen-AI) in education mostly frames students as passive users, neglecting educators' active role in tool design and integration, and Gen-AI remains underused in qualitative research methods education. The key is instructor-led design of custom GPT tools embedded in concrete teaching tasks (research question formulation, interview practice, fieldnote analysis, and design thinking) in a Master's-level Qualitative Research Methods course, guided by the Technological Pedagogical Content Knowledge (TPACK) framework and action research methodology to align pedagogical intent with AI capabilities. Thematic analysis of student reflections, AI chat logs, and final assignments shows the tools can serve as cognitive partners that enhance reflexivity, interview technique, and structured analytic thinking, though students also reported cognitive overload, reduced immersion in data, and formulaic AI responses, pointing to the need for instructor facilitation and more responsible, collaborative, learner-centered uses of AI.

Link: https://arxiv.org/abs/2507.21074
Authors: Qian Huang,Thijs Willems
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 20 pages

Click to view abstract

Abstract:As generative AI (Gen-AI) tools become more prevalent in education, there is a growing need to understand how educators, not just students, can actively shape their design and use. This study investigates how two instructors integrated four custom GPT tools into a Masters-level Qualitative Research Methods course for Urban Planning Policy students, addressing two key gaps: the dominant framing of students as passive AI users, and the limited use of AI in qualitative methods education. The study explores how Gen-AI can support disciplinary learning when aligned with pedagogical intent. Drawing on the Technological Pedagogical Content Knowledge (TPACK) framework and action research methodology, the instructors designed GPTs to scaffold tasks such as research question formulation, interview practice, fieldnote analysis, and design thinking. Thematic analysis of student reflections, AI chat logs, and final assignments revealed that the tools enhanced student reflexivity, improved interview techniques, and supported structured analytic thinking. However, students also expressed concerns about cognitive overload, reduced immersion in data, and the formulaic nature of AI responses. The study offers three key insights: AI can be a powerful scaffold for active learning when paired with human facilitation; custom GPTs can serve as cognitive partners in iterative research practice; and educator-led design is critical to pedagogically meaningful AI integration. This research contributes to emerging scholarship on AI in higher education by demonstrating how empowering educators to design custom tools can promote more reflective, responsible, and collaborative learning with AI.

[AI-130] FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Quick read: This paper addresses two major limitations of current mobile graphical user interface (GUI) agents in human-device interaction: first, the lack of proactive intent anticipation, as agents only respond to explicit instructions; second, the failure to exploit contextual information generated during task execution, which prevents personalization to different users' preferences. The key to the solution is the FingerTip benchmark, which introduces two new task tracks: proactive task suggestion based on environment observations and users' previous intents, and personalized task execution that caters to users' action preferences. The study collects multi-step Android app interactions from users' long-term real-life usage, building a high-quality demonstration dataset rich in user-related contextual information, and verifies that fine-tuned models can effectively exploit this user information, markedly improving agents' proactivity and personalization and advancing more user-oriented mobile GUI agents.

Link: https://arxiv.org/abs/2507.21071
Authors: Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mobile GUI agents are becoming critical tools for enhancing human-device interaction efficiency, with multimodal large language models (MLLMs) emerging as dominant paradigms in this domain. Current agents, however, are limited to following explicit human instructions, resulting in insufficient capability for proactive intent anticipation. Additionally, these agents fail to leverage the contextual information associated with users during task execution, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip benchmark. It contains two new tracks: proactive task suggestions by analyzing environment observation and users’ previous intents, and personalized task execution by catering to users’ action preferences. We collected unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users’ long-term usage in their real lives, and encompass essential user-related contextual information. Our experiments reveal challenges of the tasks we propose. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile GUI agents. Our code is open-source at this https URL for reproducibility.

[AI-131] SynLang and Symbiotic Epistemology: A Manifesto for Conscious Human-AI Collaboration

Quick read: This paper addresses the problem that the opaque reasoning of current AI systems hinders human oversight and collaboration. Conventional explainable AI approaches mostly offer post-hoc justifications and fail to establish genuinely symbiotic collaboration. The key to the solution is to propose Symbiotic Epistemology as the philosophical foundation for human-AI cognitive partnership and to design SynLang (Symbiotic Syntactic Language), a formal protocol whose dual-level transparency mechanism (TRACE for high-level reasoning patterns, TRACE_FE for fine-grained factor explanations), combined with confidence quantification, declarative control over AI behavior, and context inheritance, turns AI into a reasoning partner with calibrated trust, thereby augmenting human intelligence, preserving human agency, and ensuring ethical accountability.

Link: https://arxiv.org/abs/2507.21067
Authors: Jan Kapusta
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 32 pages, 4 figures. Includes 2 Appendices containing SynLang v1.2.0 protocol specification, and formal BNF grammar

Abstract:Current AI systems rely on opaque reasoning processes that hinder human oversight and collaborative potential. Conventional explainable AI approaches offer post-hoc justifications and often fail to establish genuine symbiotic collaboration. In this paper, the Symbiotic Epistemology is presented as a philosophical foundation for human-AI cognitive partnerships. Unlike frameworks that treat AI as a mere tool or replacement, symbiotic epistemology positions AI as a reasoning partner, fostering calibrated trust by aligning human confidence with AI reliability through explicit reasoning patterns and confidence assessments. SynLang (Symbiotic Syntactic Language) is introduced as a formal protocol for transparent human-AI collaboration. The framework is empirically validated through actual human-AI dialogues demonstrating AI’s adaptation to structured reasoning protocols and successful metacognitive intervention. The protocol defines two complementary mechanisms: TRACE for high-level reasoning patterns and TRACE_FE for detailed factor explanations. It also integrates confidence quantification, declarative control over AI behavior, and context inheritance for multi-agent coordination. By structuring communication and embedding confidence-calibrated transparency, SynLang, together with symbiotic epistemology, enables AI systems that enhance human intelligence, preserve human agency, and uphold ethical accountability in collaborative decision-making. Through dual-level transparency, beginning with high-level reasoning patterns and progressing to granular explanations, the protocol facilitates rapid comprehension and supports thorough verification of AI decision-making.

[AI-132] Privacy-Preserving AI for Encrypted Medical Imaging: A Framework for Secure Diagnosis and Learning

Quick read: This paper addresses the privacy risks that medical imaging data face during AI-based diagnosis, particularly the potential exposure of sensitive information during image transfer, storage, and processing. The key to the solution is a privacy-preserving inference framework for encrypted medical images, whose core is a modified convolutional neural network (Masked-CNN) that can operate directly on encrypted or compressed formats, combined with AES-CBC encryption and JPEG2000 compression, keeping images usable for AI inference while providing strong privacy protection. Experiments show that the method preserves diagnostic accuracy and low latency while significantly improving security, offering a practical path toward secure AI-assisted diagnostics.

Link: https://arxiv.org/abs/2507.21060
Authors: Abdullah Al Siam, Sadequzzaman Shohan
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:The rapid integration of Artificial Intelligence (AI) into medical diagnostics has raised pressing concerns about patient privacy, especially when sensitive imaging data must be transferred, stored, or processed. In this paper, we propose a novel framework for privacy-preserving diagnostic inference on encrypted medical images using a modified convolutional neural network (Masked-CNN) capable of operating on transformed or ciphered image formats. Our approach leverages AES-CBC encryption coupled with JPEG2000 compression to protect medical images while maintaining their suitability for AI inference. We evaluate the system using public DICOM datasets (NIH ChestX-ray14 and LIDC-IDRI), focusing on diagnostic accuracy, inference latency, storage efficiency, and privacy leakage resistance. Experimental results show that the encrypted inference model achieves performance comparable to its unencrypted counterpart, with only marginal trade-offs in accuracy and latency. The proposed framework bridges the gap between data privacy and clinical utility, offering a practical, scalable solution for secure AI-driven diagnostics.
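
The paper does not release code; as a rough illustration of the encryption leg only, here is a minimal AES-CBC round trip over (already compressed) image bytes using PyCryptodome. The Masked-CNN that performs inference on such ciphered inputs is the paper's contribution and is not reproduced here.

```python
# Sketch of the AES-CBC step only; key management and the JPEG2000 stage
# are out of scope. The Masked-CNN consumer is not reproduced here.
import os
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad, unpad

key, iv = os.urandom(32), os.urandom(16)   # 256-bit key, per-image IV

def encrypt_image_bytes(data: bytes) -> bytes:
    """AES-CBC over (already JPEG2000-compressed) image bytes."""
    return AES.new(key, AES.MODE_CBC, iv).encrypt(pad(data, AES.block_size))

def decrypt_image_bytes(blob: bytes) -> bytes:
    return unpad(AES.new(key, AES.MODE_CBC, iv).decrypt(blob), AES.block_size)

# Round-trip sanity check on dummy image bytes.
assert decrypt_image_bytes(encrypt_image_bytes(b"\x00" * 1000)) == b"\x00" * 1000
```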

[AI-133] AI-Driven Generation of Data Contracts in Modern Data Engineering Systems

Quick read: This paper addresses the error-prone, labor-intensive manual authoring and maintenance of data contracts in complex data pipelines. The core challenge is to automatically translate the semantic, structural, and quality expectations between data producers and consumers into enforceable contract specifications, improving the efficiency and accuracy of data governance. The key to the solution is an automatic data contract generation framework based on large language models (LLMs): parameter-efficient fine-tuning techniques (such as LoRA and PEFT) adapt the LLM to structured data domains, and the model generates validated contract definitions conforming to standards such as JSON Schema and Avro from sample data or schema descriptions; the framework also integrates with modern data platforms (e.g., Databricks, Snowflake) to enforce contracts at scale. Experiments on synthetic and real-world datasets show high accuracy in generating valid contracts and a reduction of manual workload by over 70%.

Link: https://arxiv.org/abs/2507.21056
Authors: Harshraj Bhoite
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data contracts formalize agreements between data producers and consumers regarding schema, semantics, and quality expectations. As data pipelines grow in complexity, manual authoring and maintenance of contracts becomes error-prone and labor-intensive. We present an AI-driven framework for automatic data contract generation using large language models (LLMs). Our system leverages parameter-efficient fine-tuning methods, including LoRA and PEFT, to adapt LLMs to structured data domains. The models take sample data or schema descriptions and output validated contract definitions in formats such as JSON Schema and Avro. We integrate this framework into modern data platforms (e.g., Databricks, Snowflake) to automate contract enforcement at scale. Experimental results on synthetic and real-world datasets demonstrate that the fine-tuned LLMs achieve high accuracy in generating valid contracts and reduce manual workload by over 70%. We also discuss key challenges such as hallucination, version control, and the need for continuous learning. This work demonstrates that generative AI can enable scalable, agile data governance by bridging the gap between intent and implementation in enterprise data management.
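
To make the contract-enforcement idea concrete, here is a hedged sketch of validating a record against a JSON Schema contract of the kind an LLM might emit; the `orders` schema below is our own illustration, not an output of the paper's system.

```python
# Illustrative only: a toy "data contract" for an orders table, validated
# with the jsonschema package. Field names and constraints are assumptions.
from jsonschema import Draft202012Validator

contract = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "orders",
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"},
        "amount_usd": {"type": "number", "minimum": 0},
        "created_at": {"type": "string", "format": "date-time"},
    },
    "required": ["order_id", "amount_usd", "created_at"],
    "additionalProperties": False,
}

validator = Draft202012Validator(contract)
record = {"order_id": "ORD-000123", "amount_usd": 49.9,
          "created_at": "2025-07-30T12:00:00Z"}
# Collect violations instead of raising, as a batch enforcement job would.
errors = [e.message for e in validator.iter_errors(record)]
print("contract violations:", errors or "none")
```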

[AI-134] Bridging the Gap: Enhancing News Interpretation Across Diverse Audiences with Large Language Models

Quick read: This paper addresses comprehension gaps in cross-domain news: audiences with different backgrounds (e.g., occupation, age) show marked misreadings or misunderstandings of news outside their primary domains of expertise. The key to the solution is an agent-based framework built on large language models (LLMs) that simulates social communication: multiple agents with different profiles iteratively discuss a news article to surface the comprehension blind spots of specific groups; on this basis, the system generates targeted supplementary material that precisely fills the identified gaps, and experiments show this significantly improves the agents' understanding of the news content.

Link: https://arxiv.org/abs/2507.21055
Authors: Leyi Ouyang
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 9 pages, 3 figures, 5 tables

Abstract:In the interconnected world, news media are critical in conveying information to the public across diverse domains including technology, finance, and agriculture. Journalists make efforts to present accurate information; however, the interpretation of news often varies significantly among different audiences due to their specific expertise and age. In this work, we investigate how to identify these comprehension gaps and provide solutions to improve audiences' understanding of news content, particularly for the aspects of articles outside their primary domains of knowledge. We propose an agent-based framework using large language models (LLMs) to simulate social communication behaviors, where several agents can discuss news. These agents can be designed to be experts from various occupations or from different age groups. Our results indicate that this framework can identify an agent's confusion or even misunderstanding of news through the iterative discussion process. Based on this accurate identification, the framework can design supplementary material on the news specific to these agents. Our results show that agents exhibit significantly improved news understanding after receiving this material. These findings highlight our framework's utility and efficiency in enhancing news comprehension for diverse audiences by directly addressing their understanding gaps.

[AI-135] High hopes for “Deep Medicine”? AI economics and the future of care

Quick read: This paper asks whether applying artificial intelligence (AI) in health care will improve the doctor-patient relationship and the quality of care, or instead further weaken the therapeutic relationship and the satisfaction of physicians and patients. Its central claim is that, although scholars such as Eric Topol argue AI will free physicians from routine tasks so that they can focus on providing empathetic care, the reality may be the opposite: the widespread use of medical AI is likely to further erode doctor-patient relationships and threaten professional fulfillment and patient satisfaction. The key response is to re-examine AI's role in health systems, avoiding its framing as a mere efficiency tool, and to use institutional design and technology ethics to ensure AI augments rather than replaces physicians' clinical judgment and humane care.

Link: https://arxiv.org/abs/2507.21054
Authors: Robert Sparrow, Joshua Hatherley
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:In the much-celebrated book Deep Medicine, Eric Topol argues that the development of artificial intelligence for health care will lead to a dramatic shift in the culture and practice of medicine. In the next several decades, he suggests, AI will become sophisticated enough that many of the everyday tasks of physicians could be delegated to it. Topol is perhaps the most articulate advocate of the benefits of AI in medicine, but he is hardly alone in spruiking its potential to allow physicians to dedicate more of their time and attention to providing empathetic care for their patients in the future. Unfortunately, several factors suggest a radically different picture for the future of health care. Far from facilitating a return to a time of closer doctor-patient relationships, the use of medical AI seems likely to further erode therapeutic relationships and threaten professional and patient satisfaction.

[AI-136] Online hierarchical partitioning of the output space in extreme multi-label data stream ECAI2025

Quick read: This paper addresses online learning for multi-label outputs in data streams, where the core challenges include evolving distributions (concept drift), high-dimensional label spaces, label sparsity, and complex label dependencies; concept drift affects not only the input feature distribution but also label correlations and imbalance ratios over time, further complicating model adaptation. The key to the solution is the proposed iHOMER (Incremental Hierarchy Of Multi-label Classifiers), an incremental multi-label classification framework whose innovation is to partition the label space online into disjoint, highly correlated clusters via divisive-agglomerative clustering based on Jaccard similarity, without predefined hierarchies, combined with a global tree-based learner driven by a multivariate Bernoulli process to guide instance partitioning; drift detection mechanisms at both global and local levels enable dynamic restructuring of label partitions and subtrees, effectively handling multi-label classification in non-stationary environments.

Link: https://arxiv.org/abs/2507.20894
Authors: Lara Neves, Afonso Lourenço, Alberto Cano, Goreti Marreiros
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at 28th European Conference on Artificial Intelligence (ECAI 2025)

Abstract:Mining data streams with multi-label outputs poses significant challenges due to evolving distributions, high-dimensional label spaces, sparse label occurrences, and complex label dependencies. Moreover, concept drift affects not only input distributions but also label correlations and imbalance ratios over time, complicating model adaptation. To address these challenges, structured learners are categorized into local and global methods. Local methods break down the task into simpler components, while global methods adapt the algorithm to the full output space, potentially yielding better predictions by exploiting label correlations. This work introduces iHOMER (Incremental Hierarchy Of Multi-label Classifiers), an online multi-label learning framework that incrementally partitions the label space into disjoint, correlated clusters without relying on predefined hierarchies. iHOMER leverages online divisive-agglomerative clustering based on \textitJaccard similarity and a global tree-based learner driven by a multivariate \textitBernoulli process to guide instance partitioning. To address non-stationarity, it integrates drift detection mechanisms at both global and local levels, enabling dynamic restructuring of label partitions and subtrees. Experiments across 23 real-world datasets show iHOMER outperforms 5 state-of-the-art global baselines, such as MLHAT, MLHT of Pruned Sets and iSOUPT, by 23%, and 12 local baselines, such as binary relevance transformations of kNN, EFDT, ARF, and ADWIN bagging/boosting ensembles, by 32%, establishing its robustness for online multi-label classification.
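
As a rough sketch of one ingredient, the following computes label-label Jaccard similarity from a binary label matrix and greedily groups labels above a threshold; iHOMER's actual online divisive-agglomerative clustering is more involved.

```python
# Not the authors' code: a minimal version of partitioning a label space by
# Jaccard similarity of label co-occurrence.
import numpy as np

def jaccard_matrix(Y):
    """Y: (n_samples, n_labels) binary matrix -> label-label Jaccard matrix."""
    Y = (Y > 0).astype(int)
    inter = Y.T @ Y                       # pairwise co-occurrence counts
    counts = Y.sum(axis=0)
    union = counts[:, None] + counts[None, :] - inter
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

def threshold_clusters(S, tau=0.3):
    """Greedy grouping: a label joins the first cluster whose seed it matches."""
    clusters = []
    for j in range(S.shape[0]):
        for c in clusters:
            if S[j, c[0]] >= tau:
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]])
print(threshold_clusters(jaccard_matrix(Y)))   # [[0, 1, 3], [2]]
```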

[AI-137] DrugMCTS: a drug repurposing framework combining multi-agent RAG and Monte Carlo Tree Search

Quick read: This paper addresses the limited performance of large language models (LLMs) in drug repurposing, where reasoning is constrained by pretraining knowledge and conventional approaches such as fine-tuning or retrieval-augmented generation (RAG) either impose high computational overhead or fail to fully exploit structured scientific data. The key to the solution is DrugMCTS, a framework whose core innovation is to combine retrieval-augmented generation, multi-agent collaboration, and Monte Carlo Tree Search (MCTS): five specialized agents collaboratively retrieve and analyze molecular and protein information, enabling structured, iterative reasoning that significantly improves performance without domain-specific fine-tuning; experiments on the DrugBank and KIBA datasets show it achieves substantially higher recall and robustness than both general-purpose LLMs and deep learning baselines.

Link: https://arxiv.org/abs/2507.07426
Authors: Zerui Yang, Yuwei Wan, Yinqiao Li, Yudai Matsuda, Tong Xie, Linqi Song
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug discovery. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine-tuning or retrieval-augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repurposing. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Without requiring domain-specific fine-tuning, DrugMCTS empowers Qwen2.5-7B-Instruct to outperform Deepseek-R1 by over 20%. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug discovery.
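
For readers unfamiliar with the search backbone, here is a generic UCT selection-and-backpropagation sketch of the kind frameworks like DrugMCTS build on; the node contents, rewards, and expansion policy are placeholders, not the paper's.

```python
# Generic UCT (Upper Confidence bound applied to Trees) primitives; the
# states, rewards, and action space are placeholders, not DrugMCTS's.
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct_select(node, c=1.4):
    """Pick the child maximizing mean value plus an exploration bonus."""
    return max(node.children,
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def backpropagate(node, reward):
    """Credit a rollout's reward to every node on the path to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```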

[AI-138] Exploring the Stratified Space Structure of an RL Game with the Volume Growth Transform

Quick read: This paper investigates the structure of a reinforcement learning (RL) agent's latent representation space, in particular whether a transformer's embedding space for visual inputs forms a manifold. While neural network embedding spaces are commonly assumed to be manifolds, the paper finds, by adapting the volume growth transform used for language models, that the embedding space in this RL task is not a manifold and is better modeled as a stratified space whose local dimension varies from point to point. The key contributions are: extending Robinson et al.'s volume growth analysis for large language models (LLMs) to the RL setting; proving that fairly general volume growth curves can be realized by stratified spaces, which gives the approach theoretical grounding; and showing, through analysis of agent behavior, that the latent state alternates between periods of low local dimension while following a fixed sub-strategy and bursts of high local dimension when a sub-goal is achieved or environmental complexity increases, suggesting that the distribution of dimensions in a stratified latent space can serve as a geometric indicator of complexity in RL games.

Link: https://arxiv.org/abs/2507.22010
Authors: Justin Curry, Brennan Lagasse, Ngoc B. Lam, Gregory Cox, David Rosenbluth, Alberto Speranzon
Institution: Unknown
Subjects: Algebraic Topology (math.AT); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments: 17 pages and 8 figures. Preliminary report. Feedback welcome!

Abstract:In this work, we explore the structure of the embedding space of a transformer model trained for playing a particular reinforcement learning (RL) game. Specifically, we investigate how a transformer-based Proximal Policy Optimization (PPO) model embeds visual inputs in a simple environment where an agent must collect “coins” while avoiding dynamic obstacles consisting of “spotlights.” By adapting Robinson et al.'s study of the volume growth transform for LLMs to the RL setting, we find that the token embedding space for our visual coin collecting game is also not a manifold, and is better modeled as a stratified space, where local dimension can vary from point to point. We further strengthen Robinson’s method by proving that fairly general volume growth curves can be realized by stratified spaces. Finally, we carry out an analysis that suggests that as an RL agent acts, its latent representation alternates between periods of low local dimension, while following a fixed sub-strategy, and bursts of high local dimension, where the agent achieves a sub-goal (e.g., collecting an object) or where the environmental complexity increases (e.g., more obstacles appear). Consequently, our work suggests that the distribution of dimensions in a stratified latent space may provide a new geometric indicator of complexity for RL games.

[AI-139] Data-driven quantum Koopman method for simulating nonlinear dynamics

Quick read: This paper addresses a fundamental limitation of quantum computing for simulating nonlinear dynamical systems: quantum evolution must be unitary, so traditional nonlinear systems cannot be mapped into this framework directly. The key to the solution is the proposed quantum Koopman method (QKM), which achieves global linearization via Koopman operator theory, mapping nonlinear system states to linear unitary evolution in higher-dimensional observable spaces. Concretely, a deep autoencoder builds the embedding spaces, the state representation is decomposed into modulus and phase components, and the evolution is driven by unitary Koopman operators acting only on the phase; these operators are constructed from diagonal Hamiltonians with coefficients learned from data and are well suited to efficient quantum hardware implementation. The architecture supports direct multi-step prediction, with computational complexity scaling logarithmically in the observable space dimension, offering a practical path toward quantum-accelerated simulation of nonlinear phenomena.

Link: https://arxiv.org/abs/2507.21890
Authors: Baoyang Zhang, Zhen Lu, Yaomin Zhao, Yue Yang
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Quantum computation offers potential exponential speedups for simulating certain physical systems, but its application to nonlinear dynamics is inherently constrained by the requirement of unitary evolution. We propose the quantum Koopman method (QKM), a data-driven framework that bridges this gap through transforming nonlinear dynamics into linear unitary evolution in higher-dimensional observable spaces. Leveraging the Koopman operator theory to achieve a global linearization, our approach maps system states into a hierarchy of Hilbert spaces using a deep autoencoder. Within the linearized embedding spaces, the state representation is decomposed into modulus and phase components, and the evolution is governed by a set of unitary Koopman operators that act exclusively on the phase. These operators are constructed from diagonal Hamiltonians with coefficients learned from data, a structure designed for efficient implementation on quantum hardware. This architecture enables direct multi-step prediction, and the operator’s computational complexity scales logarithmically with the observable space dimension. The QKM is validated across diverse nonlinear systems. Its predictions maintain relative errors below 6% for reaction-diffusion systems and shear flows, and capture key statistics in 2D turbulence. This work establishes a practical pathway for quantum-accelerated simulation of nonlinear phenomena, exploring a framework built on the synergy between deep learning for global linearization and quantum algorithms for unitary dynamics evolution.
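
A toy NumPy illustration of the evolution step described above, under our own simplifying assumptions: observables split into modulus and phase, and only the phase is advanced by a diagonal-Hamiltonian unitary, so the modulus is exactly preserved.

```python
# Toy sketch; dimensions and the "learned" diagonal Hamiltonian are random
# stand-ins for quantities the paper learns from data.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # observable-space dimension
omega = rng.normal(size=d)               # diagonal Hamiltonian coefficients
r = rng.uniform(0.5, 1.5, size=d)        # modulus component (from the encoder)
theta = rng.uniform(-np.pi, np.pi, d)    # phase component

def evolve_phase(theta, dt=0.1, n_steps=5):
    """n applications of the unitary exp(-i*omega*dt) act only on the phase."""
    return theta - omega * dt * n_steps

z0 = r * np.exp(1j * theta)
zn = r * np.exp(1j * evolve_phase(theta))
print(np.max(np.abs(np.abs(zn) - np.abs(z0))))   # ~0: modulus is preserved
```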

[AI-140] Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities

Quick read: This paper addresses the underexplored potential of large language models (LLMs) in choice modelling, in particular their role as assistive agents in specifying and estimating Multinomial Logit (MNL) models. The key to the solution is a systematic experimental framework that evaluates thirteen versions of six leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, and Llama) under five configurations varying three dimensions: modelling goal (suggesting vs. suggesting and estimating MNLs), prompting strategy (zero-shot vs. chain-of-thought), and information availability (full dataset vs. data dictionary only). The study finds that proprietary models such as Claude 4 Sonnet and the GPT family can generate behaviourally sound MNL specifications with good fit, with GPT o3 even able to estimate its own suggested models by executing self-generated code, demonstrating LLMs' capacity to assist decision-making and automate modelling; open-weight models such as Llama and Gemma performed worse, and restricting access to raw data may enhance internal reasoning, together providing empirical evidence and practical guidance for integrating LLMs into choice modelling workflows.

Link: https://arxiv.org/abs/2507.21790
Authors: Georges Sfeir, Gabriel Nova, Stephane Hess, Sander van Cranenburgh
Institution: Unknown
Subjects: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
Comments: 32 pages, 6 figures, 14 tables

Abstract:Large Language Models (LLMs) are widely used to support various workflows across different disciplines, yet their potential in choice modelling remains relatively unexplored. This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models. We implement a systematic experimental framework involving thirteen versions of six leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, and Llama) evaluated under five experimental configurations. These configurations vary along three dimensions: modelling goal (suggesting vs. suggesting and estimating MNLs); prompting strategy (Zero-Shot vs. Chain-of-Thoughts); and information availability (full dataset vs. data dictionary only). Each LLM-suggested specification is implemented, estimated, and evaluated based on goodness-of-fit metrics, behavioural plausibility, and model complexity. Findings reveal that proprietary LLMs can generate valid and behaviourally sound utility specifications, particularly when guided by structured prompts. Open-weight models such as Llama and Gemma struggled to produce meaningful specifications. Claude 4 Sonnet consistently produced the best-fitting and most complex models, while GPT models suggested models with robust and stable modelling outcomes. Some LLMs performed better when provided with just the data dictionary, suggesting that limiting raw data access may enhance internal reasoning capabilities. Among all LLMs, GPT o3 was uniquely capable of correctly estimating its own specifications by executing self-generated code. Overall, the results demonstrate both the promise and current limitations of LLMs as assistive agents in choice modelling, not only for model specification but also for supporting modelling decisions and estimation, and provide practical guidance for integrating these tools into choice modellers' workflows.
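
For context, the estimation task the LLMs were given reduces to maximizing a multinomial logit log-likelihood; a minimal synthetic-data version looks like the following (the attributes and true parameters are made up).

```python
# Minimal MNL estimation on synthetic data; not the paper's datasets.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, J, K = 500, 3, 2                     # choices, alternatives, attributes
X = rng.normal(size=(N, J, K))          # alternative attributes (e.g. cost, time)
beta_true = np.array([-1.0, 0.5])
U = X @ beta_true + rng.gumbel(size=(N, J))   # Gumbel errors give MNL choices
y = U.argmax(axis=1)                    # observed choices

def neg_loglik(beta):
    V = X @ beta                        # systematic utilities
    V = V - V.max(axis=1, keepdims=True)      # numerical stability
    logp = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return -logp[np.arange(N), y].sum()

res = minimize(neg_loglik, np.zeros(K), method="BFGS")
print("estimated taste parameters:", res.x)   # close to (-1.0, 0.5)
```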

[AI-141] Learning Kinetic Monte Carlo stochastic dynamics with Deep Generative Adversarial Networks

Quick read: This paper addresses the high computational cost of traditional models for simulating complex stochastic dynamics (such as surface-step fluctuations) and their difficulty in accurately capturing thermal fluctuations. The key to the solution is to use a conditional generative adversarial network (conditional GAN) to learn the stochastic evolution of the system from a dataset built from Kinetic Monte Carlo simulations, which significantly reduces computational cost while accurately reproducing both equilibrium and kinetic properties of the system (including time-dependent roughness scaling laws); the trained network can generate new, physically consistent sequences with deviations of only a few percent from the exact values.

Link: https://arxiv.org/abs/2507.21763
Authors: Daniele Lanzoni, Olivier Pierre-Louis, Roberto Bergamaschini, Francesco Montalenti
Institution: Unknown
Subjects: Statistical Mechanics (cond-mat.stat-mech); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 15 pages, 8 figures, 2 appendices

Abstract:We show that Generative Adversarial Networks (GANs) may be fruitfully exploited to learn stochastic dynamics, surrogating traditional models while capturing thermal fluctuations. Specifically, we showcase the application to a two-dimensional, many-particle system, focusing on surface-step fluctuations and on the related time-dependent roughness. After the construction of a dataset based on Kinetic Monte Carlo simulations, a conditional GAN is trained to propagate stochastically the state of the system in time, allowing the generation of new sequences with a reduced computational cost. Modifications with respect to standard GANs, which facilitate convergence and increase accuracy, are discussed. The trained network is demonstrated to quantitatively reproduce equilibrium and kinetic properties, including scaling laws, with deviations of a few percent from the exact value. Extrapolation limits and future perspectives are critically discussed.
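
A schematic conditional-GAN update of the kind described, with our own stand-in architecture: the generator proposes the next lattice state conditioned on the current one, and the discriminator judges (state, next state) pairs; the noise input is what lets the model carry thermal fluctuations.

```python
# Our simplification, not the paper's exact architecture or losses.
import torch
import torch.nn as nn

L = 64                                   # flattened system size (placeholder)
G = nn.Sequential(nn.Linear(L + 16, 128), nn.ReLU(), nn.Linear(128, L))
D = nn.Sequential(nn.Linear(2 * L, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), 2e-4)
opt_d = torch.optim.Adam(D.parameters(), 2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(state, next_state):       # one KMC transition from the dataset
    b = state.shape[0]
    z = torch.randn(b, 16)               # noise carries the stochasticity
    fake = G(torch.cat([state, z], dim=1))
    # Discriminator: real transition pairs vs generated ones.
    d_loss = bce(D(torch.cat([state, next_state], 1)), torch.ones(b, 1)) + \
             bce(D(torch.cat([state, fake.detach()], 1)), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator.
    g_loss = bce(D(torch.cat([state, fake], 1)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

print(train_step(torch.randn(8, L), torch.randn(8, L)))
```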

[AI-142] EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations

Quick read: This paper addresses the key challenge in precision medicine of distinguishing pathogenic mutations from benign polymorphisms. The key to the solution lies in three innovations of the EnTao-GPM model: (1) cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse), leveraging evolutionary conservation to enhance the interpretation of pathogenic motifs, particularly in non-coding regions; (2) germline-mutation specialization through fine-tuning on the ClinVar and HGMD databases, improving classification accuracy for both SNVs and non-SNVs; (3) an interpretable clinical framework that combines DNA sequence embeddings with LLM-based statistical explanations to provide actionable diagnostic insights. The approach markedly improves mutation classification accuracy, pushing genetic testing toward faster, more accurate, and more accessible interpretation.

Link: https://arxiv.org/abs/2507.21706
Authors: Zekai Lin, Haoran Sun, Yucheng Guo, Yujie Yang, Yanwen Wang, Bozhen Hu, Chonghang Ye, Qirong Yang, Fan Zhong, Xiaoming Zhang, Lei Liu
Institution: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distinguishing pathogenic mutations from benign polymorphisms remains a critical challenge in precision medicine. EnTao-GPM, developed by Fudan University and BioMap, addresses this through three innovations: (1) Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse), leveraging evolutionary conservation to enhance interpretation of pathogenic motifs, particularly in non-coding regions; (2) Germline mutation specialization via fine-tuning on ClinVar and HGMD, improving accuracy for both SNVs and non-SNVs; (3) Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations to provide actionable insights. Validated against ClinVar, EnTao-GPM demonstrates superior accuracy in mutation classification. It revolutionizes genetic testing by enabling faster, more accurate, and accessible interpretation for clinical diagnostics (e.g., variant assessment, risk identification, personalized treatment) and research, advancing personalized medicine.

[AI-143] owards a Large Physics Benchmark

Quick read: This paper addresses the lack of systematic evaluation and steering of large language models (LLMs) in fundamental physics research, aiming to ensure that AI development genuinely serves scientific discovery. The key to the solution is a living benchmark developed and maintained by the scientific community, with an expert scoring scheme that rates each question on correctness, difficulty, and surprise, covering three question formats (multiple-choice questions for conceptual understanding, analytical problems requiring mathematical derivation, and open-ended tasks requiring complex problem solving) and including real research challenges such as classifying high-energy physics events, thereby enabling continuous monitoring and targeted improvement of LLMs' capabilities in fundamental physics.

Link: https://arxiv.org/abs/2507.21695
Authors: Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, Merijn Moody
Institution: Unknown
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); History and Philosophy of Physics (physics.hist-ph)
Comments:

Abstract:We introduce a benchmark framework developed by and for the scientific community to evaluate, monitor and steer large language model development in fundamental physics. Building on philosophical concepts of scientific understanding and creativity, we develop a scoring system in which each question is scored by an expert for its correctness, difficulty, and surprise. The questions are of three forms: (i) multiple-choice questions for conceptual understanding, (ii) analytical problems requiring mathematical derivation, and (iii) open-ended tasks requiring complex problem solving. Our current dataset contains a diverse set of examples, including a machine learning challenge to classify high-energy physics events, such as the four top quark signal. To ensure continued relevance, we propose a living benchmark, where physicists contribute questions, for instance alongside new publications. We invite contributions via: this http URL. We hope that this benchmark will enable a targeted AI development that can make a meaningful contribution to fundamental physics research.

[AI-144] diffSPH: Differentiable Smoothed Particle Hydrodynamics for Adjoint Optimization and Machine Learning

Quick read: This paper addresses the difficulty of using traditional Smoothed Particle Hydrodynamics (SPH) in computational fluid dynamics (CFD) for optimization and machine learning (ML) applications, in particular the lack of differentiability, which makes parameter tuning, initial-condition optimization, and hybrid model development hard. The key to the solution is diffSPH, a fully differentiable, GPU-accelerated SPH framework implemented entirely in PyTorch: designed centrally around automatic differentiation, it supports compressible physics (with shock capturing and multi-phase flows) as well as weakly compressible and incompressible physics, and, through techniques such as target-oriented particle shifting and gradient propagation through hundreds of full simulation steps, enables efficient optimization and modeling of complex fluid behavior.

Link: https://arxiv.org/abs/2507.21684
Authors: Rene Winchenbach, Nils Thuerey
Institution: Unknown
Subjects: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present diffSPH, a novel open-source differentiable Smoothed Particle Hydrodynamics (SPH) framework developed entirely in PyTorch with GPU acceleration. diffSPH is designed centrally around differentiation to facilitate optimization and machine learning (ML) applications in Computational Fluid Dynamics (CFD), including training neural networks and the development of hybrid models. Its differentiable SPH core, and schemes for compressible (with shock capturing and multi-phase flows), weakly compressible (with boundary handling and free-surface flows), and incompressible physics, enable a broad range of application areas. We demonstrate the framework’s unique capabilities through several applications, including addressing particle shifting via a novel, target-oriented approach by minimizing physical and regularization loss terms, a task often intractable in traditional solvers. Further examples include optimizing initial conditions and physical parameters to match target trajectories, shape optimization, implementing a solver-in-the-loop setup to emulate higher-order integration, and demonstrating gradient propagation through hundreds of full simulation steps. Prioritizing readability, usability, and extensibility, this work offers a foundational platform for the CFD community to develop and deploy novel neural networks and adjoint optimization applications.

Machine Learning

[LG-0] Weight-Parameterization in Continuous Time Deep Neural Networks for Surrogate Modeling

Link: https://arxiv.org/abs/2507.22045
Authors: Haley Rosso, Lars Ruthotto, Khachik Sargsyan
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 34 pages, 6 figures, submitted to the MoRE24 special issue of Computational Science and Engineering

Abstract:Continuous-time deep learning models, such as neural ordinary differential equations (ODEs), offer a promising framework for surrogate modeling of complex physical systems. A central challenge in training these models lies in learning expressive yet stable time-varying weights, particularly under computational constraints. This work investigates weight parameterization strategies that constrain the temporal evolution of weights to a low-dimensional subspace spanned by polynomial basis functions. We evaluate both monomial and Legendre polynomial bases within neural ODE and residual network (ResNet) architectures under discretize-then-optimize and optimize-then-discretize training paradigms. Experimental results across three high-dimensional benchmark problems show that Legendre parameterizations yield more stable training dynamics, reduce computational cost, and achieve accuracy comparable to or better than both monomial parameterizations and unconstrained weight models. These findings elucidate the role of basis choice in time-dependent weight parameterization and demonstrate that using orthogonal polynomial bases offers a favorable tradeoff between model expressivity and training efficiency.
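
A small sketch of the parameterization under study, with illustrative dimensions of our choosing: time-varying weights W(t) constrained to the span of Legendre polynomials, driving neural-ODE-style dynamics integrated with explicit Euler.

```python
# Illustrative Legendre weight parameterization; sizes and dynamics are ours.
import numpy as np
from numpy.polynomial import legendre

K, d = 4, 3                             # basis size, hidden width
rng = np.random.default_rng(0)
C = rng.normal(size=(K, d, d))          # learnable coefficients c_k

def W(t):
    """W(t) = sum_k c_k * P_k(2t - 1), with P_k Legendre on [-1, 1]."""
    s = 2.0 * t - 1.0                   # map t in [0, 1] to [-1, 1]
    phi = legendre.legvander([s], K - 1)[0]   # [P_0(s), ..., P_{K-1}(s)]
    return np.tensordot(phi, C, axes=1)

def ode_rhs(t, h):
    return np.tanh(W(t) @ h)            # neural-ODE style dynamics

h = np.ones(d)
for t in np.linspace(0, 1, 11)[:-1]:    # explicit Euler, step 0.1
    h = h + 0.1 * ode_rhs(t, h)
print(h)
```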

[LG-1] Structure-Informed Deep Reinforcement Learning for Inventory Management

Link: https://arxiv.org/abs/2507.22040
Authors: Alvaro Maggiar, Sohrab Andaz, Akhil Bagaria, Carson Eisenach, Dean Foster, Omer Gottesman, Dominique Perrault-Joncas
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:This paper investigates the application of Deep Reinforcement Learning (DRL) to classical inventory management problems, with a focus on practical implementation considerations. We apply a DRL algorithm based on DirectBackprop to several fundamental inventory management scenarios including multi-period systems with lost sales (with and without lead times), perishable inventory management, dual sourcing, and joint inventory procurement and removal. The DRL approach learns policies across products using only historical information that would be available in practice, avoiding unrealistic assumptions about demand distributions or access to distribution parameters. We demonstrate that our generic DRL implementation performs competitively against or outperforms established benchmarks and heuristics across these diverse settings, while requiring minimal parameter tuning. Through examination of the learned policies, we show that the DRL approach naturally captures many known structural properties of optimal policies derived from traditional operations research methods. To further improve policy performance and interpretability, we propose a Structure-Informed Policy Network technique that explicitly incorporates analytically-derived characteristics of optimal policies into the learning process. This approach can help interpretability and add robustness to the policy in out-of-sample performance, as we demonstrate in an example with realistic demand data. Finally, we provide an illustrative application of DRL in a non-stationary setting. Our work bridges the gap between data-driven learning and analytical insights in inventory management while maintaining practical applicability.

[LG-2] Classification of Honey Botanical and Geographical Sources using Mineral Profiles and Machine Learning

Link: https://arxiv.org/abs/2507.22032
Authors: Mokhtar Al-Awadhi, Ratnadeep Deshmukh
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures, conference paper

Abstract:This paper proposes a machine learning-based approach for identifying honey floral and geographical sources using mineral element profiles. The proposed method comprises two steps: preprocessing and classification. The preprocessing phase involves missing-value treatment and data normalization. In the classification phase, we employ various supervised classification models for discriminating between six botanical sources and 13 geographical origins of honey. We test the classifiers’ performance on a publicly available honey mineral element dataset. The dataset contains mineral element profiles of honeys from various floral and geographical origins. Results show that mineral element content in honey provides discriminative information useful for classifying honey botanical and geographical sources. Results also show that the Random Forests (RF) classifier obtains the best performance on this dataset, achieving a cross-validation accuracy of 99.30% for classifying honey botanical origins and 98.01% for classifying honey geographical origins.
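
The classification step is standard; a minimal scikit-learn version on a stand-in feature matrix (the real dataset holds mineral element concentrations) might look like this.

```python
# Stand-in data: random features in place of the mineral element profiles.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))          # 15 mineral concentrations per sample
y = rng.integers(0, 6, size=200)        # 6 botanical sources

model = make_pipeline(StandardScaler(),  # normalization, as in the paper
                      RandomForestClassifier(n_estimators=300, random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print(f"cross-validation accuracy: {scores.mean():.3f}")
```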

[LG-3] Improving Generative Ad Text on Facebook using Reinforcement Learning

Link: https://arxiv.org/abs/2507.21983
Authors: Daniel R. Jiang, Alex Nikulkov, Yu-Chia Chen, Yang Bai, Zheqing Zhu
Subjects: Machine Learning (cs.LG)
Comments: D.J. and A.N. contributed equally, 41 pages, 6 figures

Abstract:Generative artificial intelligence (AI), in particular large language models (LLMs), is poised to drive transformative economic change. LLMs are pre-trained on vast text data to learn general language patterns, but a subsequent post-training phase is critical to align them for specific real-world tasks. Reinforcement learning (RL) is the leading post-training technique, yet its economic impact remains largely underexplored and unquantified. We examine this question through the lens of the first deployment of an RL-trained LLM for generative advertising on Facebook. Integrated into Meta’s Text Generation feature, our model, “AdLlama,” powers an AI tool that helps advertisers create new variations of human-written ad text. To train this model, we introduce reinforcement learning with performance feedback (RLPF), a post-training method that uses historical ad performance data as a reward signal. In a large-scale 10-week A/B test on Facebook spanning nearly 35,000 advertisers and 640,000 ad variations, we find that AdLlama improves click-through rates by 6.7% (p=0.0296) compared to a supervised imitation model trained on curated ads. This represents a substantial improvement in advertiser return on investment on Facebook. We also find that advertisers who used AdLlama generated more ad variations, indicating higher satisfaction with the model’s outputs. To our knowledge, this is the largest study to date on the use of generative AI in an ecologically valid setting, offering an important data point quantifying the tangible impact of RL post-training. Furthermore, the results show that RLPF is a promising and generalizable approach for metric-driven post-training that bridges the gap between highly capable language models and tangible outcomes.
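
The paper does not publish RLPF's exact objective; the following is a generic reward-weighted log-likelihood update of the flavor it describes, where `reward` stands in for a historical performance signal such as normalized click-through rate and `model` is any Hugging-Face-style causal LM.

```python
# A plausible reading, not the published algorithm.
import torch
import torch.nn.functional as F

def rlpf_step(model, optimizer, input_ids, labels, reward):
    """input_ids/labels: (batch, seq) token tensors; reward: (batch,) floats."""
    logits = model(input_ids).logits                # (batch, seq, vocab)
    logp = -F.cross_entropy(logits.transpose(1, 2), labels,
                            reduction="none").sum(dim=1)  # per-sequence log-prob
    baseline = reward.mean()                        # simple variance reduction
    loss = -((reward - baseline) * logp).mean()     # reward-weighted likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```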

[LG-4] SLA-Centric Automated Algorithm Selection Framework for Cloud Environments

Link: https://arxiv.org/abs/2507.21963
Authors: Siana Rizwan, Tasnim Ahmed, Salimur Choudhury
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Cloud computing offers on-demand resource access, regulated by Service-Level Agreements (SLAs) between consumers and Cloud Service Providers (CSPs). SLA violations can impact efficiency and CSP profitability. In this work, we propose an SLA-aware automated algorithm-selection framework for combinatorial optimization problems in resource-constrained cloud environments. The framework uses an ensemble of machine learning models to predict performance and rank algorithm-hardware pairs based on SLA constraints. We also apply our framework to the 0-1 knapsack problem. We curate a dataset comprising instance-specific features along with memory usage, runtime, and optimality gap for 6 algorithms. As an empirical benchmark, we evaluate the framework on both classification and regression tasks. Our ablation study explores the impact of hyperparameters and learning approaches, the effectiveness of large language models in regression, and SHAP-based interpretability.

[LG-5] DeepGo: Predictive Directed Greybox Fuzzing

Link: https://arxiv.org/abs/2507.21952
Authors: Peihong Lin, Pengfei Wang, Xu Zhou, Wei Xie, Gen Zhang, Kai Lu
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:The state-of-the-art DGF techniques redefine and optimize the fitness metric to reach the target sites precisely and quickly. However, optimizations for fitness metrics are mainly based on heuristic algorithms, which usually rely on historical execution information and lack foresight on paths that have not been exercised yet. Thus, those hard-to-execute paths with complex constraints would hinder DGF from reaching the targets, making DGF less efficient. In this paper, we propose DeepGo, a predictive directed grey-box fuzzer that can combine historical and predicted information to steer DGF to reach the target site via an optimal path. We first propose the path transition model, which models DGF as a process of reaching the target site through specific path transition sequences. The new seed generated by mutation would cause the path transition, and the path corresponding to the high-reward path transition sequence indicates a high likelihood of reaching the target site through it. Then, to predict the path transitions and the corresponding rewards, we use deep neural networks to construct a Virtual Ensemble Environment (VEE), which gradually imitates the path transition model and predicts the rewards of path transitions that have not been taken yet. To determine the optimal path, we develop a Reinforcement Learning for Fuzzing (RLF) model to generate the transition sequences with the highest sequence rewards. The RLF model can combine historical and predicted path transitions to generate the optimal path transition sequences, along with the policy to guide the mutation strategy of fuzzing. Finally, to exercise the high-reward path transition sequence, we propose the concept of an action group, which comprehensively optimizes the critical steps of fuzzing to realize the optimal path to reach the target efficiently.

[LG-6] Multi-state Protein Design with DynamicMPNN ICML2025

Link: https://arxiv.org/abs/2507.21938
Authors: Alex Abrudan, Sebastian Pujalte Ojeda, Chaitanya K. Joshi, Matthew Greenig, Felipe Engelberger, Alena Khmelinskaia, Jens Meiler, Michele Vendruscolo, Tuomas P. J. Knowles
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: ICML 2025 GenBio Workshop

Abstract:Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes - from enzyme catalysis to membrane transport - depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold initial guess, DynamicMPNN outperforms ProteinMPNN by up to 13% on structure-normalized RMSD across our challenging multi-state protein benchmark.

[LG-7] Cardiovascular Disease Prediction using Machine Learning: A Comparative Analysis

Link: https://arxiv.org/abs/2507.21898
Authors: Risshab Srinivas Ramesh, Roshani T S Udupa, Monisha J, Kushi K K S
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Cardiovascular diseases (CVDs) are a leading cause of mortality globally, accounting for 31% of all deaths. This study uses a cardiovascular disease (CVD) dataset comprising 68,119 records to explore the influence of numerical (age, height, weight, blood pressure, BMI) and categorical (gender, cholesterol, glucose, smoking, alcohol, activity) factors on CVD occurrence. We performed statistical analyses, including t-tests, Chi-square tests, and ANOVA, identifying strong associations between CVD and older age, hypertension, higher weight, and abnormal cholesterol levels, while physical activity emerges as a protective factor. A logistic regression model highlights age, blood pressure, and cholesterol as primary risk factors, with unexpected negative associations for smoking and alcohol, suggesting potential data issues. Model performance comparisons reveal CatBoost as the top performer with an accuracy of 0.734 and an ECE of 0.0064, excelling in probabilistic prediction (Brier score = 0.1824). Data challenges, including outliers and skewed distributions, indicate a need for improved preprocessing to enhance predictive reliability.

[LG-8] Discovering Interpretable Ordinary Differential Equations from Noisy Data

Link: https://arxiv.org/abs/2507.21841
Authors: Rahul Golder, M. M. Faruque Hasan
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 20 pages, 11 figures, 7 tables

Abstract:The data-driven discovery of interpretable models approximating the underlying dynamics of a physical system has gained traction in the past decade. Current approaches employ pre-specified functional forms or basis functions and often result in models that lack physical meaning and interpretability, let alone represent the true physics of the system. We propose an unsupervised parameter estimation methodology that first finds an approximate general solution, followed by a spline transformation to linearly estimate the coefficients of the governing ordinary differential equation (ODE). The approximate general solution is postulated using the same functional form as the analytical solution of a general homogeneous, linear, constant-coefficient ODE. An added advantage is its ability to produce a high-fidelity, smooth functional form even in the presence of noisy data. The spline approximation extracts gradient information from the functional form; these linearly independent gradients form the columns of the gradient matrix, which is then used in a linear system to find the coefficients of the ODE. From the case studies, we observed that our modeling approach discovers ODEs with high accuracy and also promotes sparsity in the solution without using any regularization techniques. The methodology is also robust to noisy data and thus allows the integration of data-driven techniques into real experimental settings for data-driven learning of physical phenomena.
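
A compressed illustration of the two-stage idea, with a toy signal of our own: smooth the noisy data with a spline, then estimate the coefficients of a homogeneous linear ODE by ordinary least squares on the spline derivatives.

```python
# Toy example: recover (a, b) in y'' + a*y' + b*y = 0 from noisy samples.
import numpy as np
from scipy.interpolate import UnivariateSpline

t = np.linspace(0, 5, 200)
y_true = np.exp(-0.5 * t) * np.cos(2 * t)   # solves y'' + y' + 4.25*y = 0
y_noisy = y_true + 0.01 * np.random.default_rng(0).normal(size=t.size)

spl = UnivariateSpline(t, y_noisy, k=5, s=len(t) * 0.01**2)  # smoothing spline
y, dy, d2y = spl(t), spl.derivative(1)(t), spl.derivative(2)(t)

# Solve d2y + a*dy + b*y = 0 for (a, b) in the least-squares sense.
A = np.column_stack([dy, y])
coeffs, *_ = np.linalg.lstsq(A, -d2y, rcond=None)
print("estimated (a, b):", coeffs)           # close to (1.0, 4.25)
```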

[LG-9] Bayesian Neural Network Surrogates for Bayesian Optimization of Carbon Capture and Storage Operations

Link: https://arxiv.org/abs/2507.21803
Authors: Sofianos Panagiotis Fotias, Vassilis Gaganis
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Carbon Capture and Storage (CCS) stands as a pivotal technology for fostering a sustainable future. The process, which involves injecting supercritical CO₂ into underground formations, a method already widely used for Enhanced Oil Recovery, serves a dual purpose: it not only curbs CO₂ emissions and addresses climate change but also extends the operational lifespan and sustainability of oil fields and platforms, easing the shift toward greener practices. This paper delivers a thorough comparative evaluation of strategies for optimizing decision variables in CCS project development, employing a derivative-free technique known as Bayesian Optimization (BO). In addition to Gaussian Processes (GPs), which usually serve as the gold standard in BO, various novel stochastic models were examined and compared within a BO framework. This research investigates the effectiveness of utilizing more exotic stochastic models than GPs for BO in environments where GPs have been shown to underperform, such as in cases with a large number of decision variables or multiple objective functions that are not similarly scaled. By incorporating Net Present Value (NPV) as a key objective function, the proposed framework demonstrates its potential to improve economic viability while ensuring the sustainable deployment of CCS technologies. Ultimately, this study represents the first application in the reservoir engineering industry of the growing body of BO research, specifically in the search for more appropriate stochastic models, highlighting its potential as a preferred method for enhancing sustainability in the energy sector.

[LG-10] TempRe: Template generation for single and direct multi-step retrosynthesis

Link: https://arxiv.org/abs/2507.21762
Authors: Nguyen Xuan-Vu, Daniel Armstrong, Zlatko Joncev, Philippe Schwaller
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Retrosynthesis planning remains a central challenge in molecular discovery due to the vast and complex chemical reaction space. While traditional template-based methods offer tractability, they suffer from poor scalability and limited generalization, and template-free generative approaches risk generating invalid reactions. In this work, we propose TempRe, a generative framework that reformulates template-based approaches as sequence generation, enabling scalable, flexible, and chemically plausible retrosynthesis. We evaluated TempRe across single-step and multi-step retrosynthesis tasks, demonstrating its superiority over both template classification and SMILES-based generation methods. On the PaRoutes multi-step benchmark, TempRe achieves strong top-k route accuracy. Furthermore, we extend TempRe to direct multi-step synthesis route generation, providing a lightweight and efficient alternative to conventional single-step and search-based approaches. These results highlight the potential of template generative modeling as a powerful paradigm in computer-aided synthesis planning.

[LG-11] Improving Neural Network Training using Dynamic Learning Rate Schedule for PINNs and Image Classification

Link: https://arxiv.org/abs/2507.21749
Authors: D. Veerababu, Ashwin A. Raikar, Prasanta K. Ghosh
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Training neural networks can be challenging, especially as the complexity of the problem increases. Even with wider or deeper networks, training can be a tedious process, especially if a poor choice of hyperparameters is made. The learning rate is one such crucial hyperparameter, and it is usually kept static during the training process. Learning dynamics in complex systems often requires a more adaptive approach to the learning rate. This adaptability becomes crucial for effectively navigating varying gradients and optimizing the learning process. In this paper, a dynamic learning rate scheduler (DLRS) algorithm is presented that adapts the learning rate based on the loss values calculated during the training process. Experiments are conducted on problems related to physics-informed neural networks (PINNs) and image classification using multilayer perceptrons and convolutional neural networks, respectively. The results demonstrate that the proposed DLRS accelerates training and improves stability.
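
The abstract describes DLRS only at a high level; one plausible minimal reading, sketched below, multiplies the learning rate up or down depending on whether the latest loss improved, clipped to a safe range.

```python
# One possible loss-driven scheduler; the paper's exact rule may differ.
class DynamicLR:
    def __init__(self, optimizer, grow=1.05, shrink=0.7,
                 lr_min=1e-6, lr_max=1e-1):
        self.opt, self.grow, self.shrink = optimizer, grow, shrink
        self.lr_min, self.lr_max = lr_min, lr_max
        self.prev_loss = None

    def step(self, loss):
        """Call once per epoch with the scalar training loss."""
        if self.prev_loss is not None:
            factor = self.grow if loss < self.prev_loss else self.shrink
            for group in self.opt.param_groups:
                group["lr"] = min(self.lr_max,
                                  max(self.lr_min, group["lr"] * factor))
        self.prev_loss = loss

# Usage: scheduler = DynamicLR(optimizer); scheduler.step(epoch_loss)
```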

[LG-12] evoxels: A differentiable physics framework for voxel-based microstructure simulations

Link: https://arxiv.org/abs/2507.21748
Authors: Simon Daubner, Alexander E. Cohen, Benjamin Dörich, Samuel J. Cooper
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: 9 pages, 3 figures, structure following JOSS style

Abstract:Materials science inherently spans disciplines: experimentalists use advanced microscopy to uncover micro- and nanoscale structure, while theorists and computational scientists develop models that link processing, structure, and properties. Bridging these domains is essential for inverse material design where you start from desired performance and work backwards to optimal microstructures and manufacturing routes. Integrating high-resolution imaging with predictive simulations and data-driven optimization accelerates discovery and deepens understanding of process-structure-property relationships. The differentiable physics framework evoxels is based on a fully Pythonic, unified voxel-based approach that integrates segmented 3D microscopy data, physical simulations, inverse modeling, and machine learning.

[LG-13] Generalized few-shot transfer learning architecture for modeling the EDFA gain spectrum

Link: https://arxiv.org/abs/2507.21728
Authors: Agastya Raj, Zehao Wang, Tingjun Chen, Daniel C Kilper, Marco Ruffini
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: This is a preprint of a paper accepted and published in the Journal of Optical Communications and Networking (JOCN). The final published version is available at: this https URL

Abstract:Accurate modeling of the gain spectrum in Erbium-Doped Fiber Amplifiers (EDFAs) is essential for optimizing optical network performance, particularly as networks evolve toward multi-vendor solutions. In this work, we propose a generalized few-shot transfer learning architecture based on a Semi-Supervised Self-Normalizing Neural Network (SS-NN) that leverages internal EDFA features - such as VOA input or output power and attenuation - to improve gain spectrum prediction. Our SS-NN model employs a two-phase training strategy comprising unsupervised pre-training with noise-augmented measurements and supervised fine-tuning with a custom weighted MSE loss. Furthermore, we extend the framework with transfer learning (TL) techniques that enable both homogeneous (same feature space) and heterogeneous (different feature sets) model adaptation across booster, preamplifier, and ILA EDFAs. To address feature mismatches in heterogeneous TL, we incorporate a covariance matching loss to align second-order feature statistics between source and target domains. Extensive experiments conducted across 26 EDFAs in the COSMOS and Open Ireland testbeds demonstrate that the proposed approach significantly reduces the number of measurements required on the system while achieving lower mean absolute errors and improved error distributions compared to benchmark methods.

[LG-14] Data-Driven Extended Corresponding State Approach for Residual Property Prediction of Hydrofluoroolefins

Link: https://arxiv.org/abs/2507.21720
Authors: Gang Wang, Peng Hu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Hydrofluoroolefins are considered the most promising next-generation refrigerants due to their extremely low global warming potential values, which can effectively mitigate the global warming effect. However, the lack of reliable thermodynamic data hinders the discovery and application of newer and superior hydrofluoroolefin refrigerants. In this work, integrating the strengths of theoretical and data-driven methods, we propose a neural network extended corresponding state model to predict the residual thermodynamic properties of hydrofluoroolefin refrigerants. The innovation is that fluids are characterized through their microscopic molecular structures by the inclusion of a graph neural network module and a specialized model architecture designed to enhance generalization ability. The proposed model is trained using highly accurate data for available known fluids and evaluated via the leave-one-out cross-validation method. Compared to conventional extended corresponding state models or cubic equations of state, the proposed model shows significantly improved accuracy for density and energy properties in liquid and supercritical regions, with average absolute deviations of 1.49% (liquid) and 2.42% (supercritical) for density, 3.37% and 2.50% for residual entropy, and 1.85% and 1.34% for residual enthalpy. These results demonstrate the effectiveness of embedding physics knowledge into machine learning models. The proposed neural network extended corresponding state model is expected to significantly accelerate the discovery of novel hydrofluoroolefin refrigerants.

[LG-15] PREIG: Physics-informed and Reinforcement-driven Interpretable GRU for Commodity Demand Forecasting

Link: https://arxiv.org/abs/2507.21710
Authors: Hongwei Ma, Junbin Gao, Minh-Ngoc Tran
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Accurately forecasting commodity demand remains a critical challenge due to volatile market dynamics, nonlinear dependencies, and the need for economically consistent predictions. This paper introduces PREIG, a novel deep learning framework tailored for commodity demand forecasting. The model uniquely integrates a Gated Recurrent Unit (GRU) architecture with physics-informed neural network (PINN) principles by embedding a domain-specific economic constraint: the negative elasticity between price and demand. This constraint is enforced through a customized loss function that penalizes violations of the physical rule, ensuring that model predictions remain interpretable and aligned with economic theory. To further enhance predictive performance and stability, PREIG incorporates a hybrid optimization strategy that couples NAdam and L-BFGS with Population-Based Training (POP). Experiments across multiple commodity datasets demonstrate that PREIG significantly outperforms traditional econometric models (ARIMA, GARCH) and deep learning baselines (BPNN, RNN) in both RMSE and MAPE. When compared with a plain GRU, PREIG maintains good explainability while still predicting accurately. By bridging domain knowledge, optimization theory, and deep learning, PREIG provides a robust, interpretable, and scalable solution for high-dimensional nonlinear time series forecasting in economics.
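
A sketch (assumptions ours) of how such an elasticity constraint can be written as a loss term in PyTorch: any positive gradient of predicted demand with respect to price is penalized alongside the usual MSE fit.

```python
# Assumption-labeled sketch: `model` maps [features, price] to demand; the
# penalty term enforces d(demand)/d(price) <= 0 via autograd.
import torch

def elasticity_constrained_loss(model, features, price, target, lam=1.0):
    price = price.clone().requires_grad_(True)
    pred = model(torch.cat([features, price.unsqueeze(-1)], dim=-1))
    mse = torch.mean((pred - target) ** 2)
    # d(demand)/d(price), obtained by differentiating through the inputs.
    grad_p = torch.autograd.grad(pred.sum(), price, create_graph=True)[0]
    penalty = torch.relu(grad_p).mean()   # positive slopes violate elasticity
    return mse + lam * penalty
```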

[LG-16] Probabilistic Consistency in Machine Learning and Its Connection to Uncertainty Quantification

Link: https://arxiv.org/abs/2507.21670
Authors: Paul Patrone, Anthony Kearsley
Subjects: Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:Machine learning (ML) is often viewed as a powerful data analysis tool that is easy to learn because of its black-box nature. Yet this very nature also makes it difficult to quantify confidence in predictions extracted from ML models, and more fundamentally, to understand how such models are mathematical abstractions of training data. The goal of this paper is to unravel these issues and their connections to uncertainty quantification (UQ) by pursuing a line of reasoning motivated by diagnostics. In such settings, prevalence - i.e. the fraction of elements in a class - is often of inherent interest. Here we analyze the many interpretations of prevalence to derive a level-set theory of classification, which shows that certain types of self-consistent ML models are equivalent to class-conditional probability distributions. We begin by studying the properties of binary Bayes optimal classifiers, recognizing that their boundary sets can be reinterpreted as level-sets of pairwise density ratios. By parameterizing Bayes classifiers in terms of the prevalence, we then show that they satisfy important monotonicity and class-switching properties that can be used to deduce the density ratios without direct access to the boundary sets. Moreover, this information is sufficient for tasks such as constructing the multiclass Bayes-optimal classifier and estimating inherent uncertainty in the class assignments. In the multiclass case, we use these results to deduce normalization and self-consistency conditions, the latter being equivalent to the law of total probability for classifiers. We also show that these are necessary conditions for arbitrary ML models to have valid probabilistic interpretations. Throughout we demonstrate how this analysis informs the broader task of UQ for ML via an uncertainty propagation framework.
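
A toy illustration of the starting point: with two Gaussian class-conditional densities, the binary Bayes-optimal classifier is exactly a level set of the density ratio, with the level fixed by the prevalence q.

```python
# Two-Gaussian demo of the density-ratio level-set view of Bayes classifiers.
import numpy as np
from scipy.stats import norm

p0, p1 = norm(loc=-1.0), norm(loc=1.0)   # class-conditional densities
q = 0.3                                  # prevalence of class 1

def bayes_classify(x):
    # Assign class 1 iff q*p1(x) > (1-q)*p0(x), i.e. the density ratio
    # p1(x)/p0(x) exceeds the level (1-q)/q.
    return (p1.pdf(x) / p0.pdf(x) > (1 - q) / q).astype(int)

x = np.linspace(-4, 4, 9)
print(dict(zip(np.round(x, 1), bayes_classify(x))))  # boundary near x ~ 0.42
```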

[LG-17] Hyperbolic Genome Embeddings ICLR2025

链接: https://arxiv.org/abs/2507.21648
作者: Raiyan R. Khan,Philippe Chlenski,Itsik Pe’er
类目: Machine Learning (cs.LG)
*备注: 30 pages, 16 figures, 10 tables. Camera-ready version for ICLR 2025

点击查看摘要

Abstract:Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets–the Transposable Elements Benchmark–which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning. Our code and benchmark datasets are available at this https URL.
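
For readers unfamiliar with hyperbolic representations, the sketch below shows two generic building blocks such models rely on: the exponential map that carries Euclidean features onto the Poincaré ball, and the geodesic distance on the ball (curvature fixed at -1). This is standard hyperbolic-geometry machinery for illustration, not the authors' exact CNN architecture.

```python
import torch

def expmap0(v, eps=1e-6):
    """Exponential map at the origin: tangent vectors -> Poincare ball (c = 1)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-6):
    """Geodesic distance on the unit Poincare ball."""
    diff2 = (x - y).pow(2).sum(-1)
    denom = ((1 - x.pow(2).sum(-1)) * (1 - y.pow(2).sum(-1))).clamp_min(eps)
    return torch.acosh(1 + 2 * diff2 / denom)

feats = 0.1 * torch.randn(8, 128)   # e.g. pooled CNN features of DNA windows
z = expmap0(feats)                  # all points now lie inside the unit ball
print(poincare_dist(z[0], z[1]).item())
```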

[LG-18] Whilter: A Whisper-based Data Filter for “In-the-Wild” Speech Corpora Using Utterance-level Multi-Task Classification INTERSPEECH2025

链接: https://arxiv.org/abs/2507.21642
作者: William Ravenscroft,George Close,Kit Bower-Morris,Jamie Stacey,Dmitry Sityaev,Kris Y. Hong
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted for Interspeech 2025

点击查看摘要

Abstract:Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.
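
A rough sketch of a Whilter-style model is given below, assuming the Hugging Face `transformers` Whisper encoder. The per-task attention pooling, the binary heads, and the checkpoint name are all assumptions for illustration; the paper's exact classifier may differ.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class MultiTaskFilter(nn.Module):
    """Frozen Whisper encoder with one attention-pooled head per subtask."""
    def __init__(self, n_tasks=5, ckpt="openai/whisper-small"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(ckpt).encoder
        self.encoder.requires_grad_(False)
        d = self.encoder.config.d_model
        self.attn = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_tasks)])
        self.heads = nn.ModuleList([nn.Linear(d, 2) for _ in range(n_tasks)])

    def forward(self, input_features):        # log-mel features: (B, 80, T)
        h = self.encoder(input_features).last_hidden_state   # (B, T', d)
        logits = []
        for attn, head in zip(self.attn, self.heads):
            w = torch.softmax(attn(h), dim=1)                # attention pooling
            logits.append(head((w * h).sum(dim=1)))          # per-task decision
        return logits
```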

[LG-19] Categorical Distributions are Effective Neural Network Outputs for Event Prediction

链接: https://arxiv.org/abs/2507.21616
作者: Kevin Doran,Tom Baden
类目: Machine Learning (cs.LG)
*备注: 32 pages, 26 figures

点击查看摘要

Abstract:We demonstrate the effectiveness of using a simple neural network output, a categorical probability distribution, for the task of next spike prediction. This case study motivates an investigation into why this simple output structure is not commonly used with neural temporal point process models. We find evidence that many existing datasets for evaluating temporal point process models do not reveal much information about the underlying event generating processes, and many existing models perform well due to regularization effects of model size and constraints on output structure. We extend existing datasets and create new ones in order to explore outside of this information-limited regime and find that outputting a simple categorical distribution is competitive across a wide range of datasets.
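
The paper's central point - that a plain categorical distribution over discretized event times is a strong output head - can be sketched in a few lines of PyTorch. The bin count, the network, and the uniform binning scheme below are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_BINS = 100   # discretize the time to the next event into 100 bins

net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, N_BINS))

def next_event_loss(history_emb, next_dt, dt_max=1.0):
    """Cross-entropy against the bin containing the true next interval."""
    logits = net(history_emb)
    target = (next_dt / dt_max * N_BINS).long().clamp(0, N_BINS - 1)
    return nn.functional.cross_entropy(logits, target)

emb = torch.randn(32, 64)   # some encoding of the event history
dt = torch.rand(32)         # true time until the next event
print(next_event_loss(emb, dt).item())
```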

[LG-20] Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation

链接: https://arxiv.org/abs/2507.21563
作者: Minh-Anh Nguyen,Bao Nguyen,Ha Lan N.T.,Tuan Anh Hoang,Duc-Trong Le,Dung D. Le
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recommendation systems often suffer from data sparsity caused by limited user-item interactions, which degrade their performance and amplify popularity bias in real-world scenarios. This paper proposes a novel data augmentation framework that leverages Large Language Models (LLMs) and item textual descriptions to enrich interaction data. By few-shot prompting LLMs multiple times to rerank items and aggregating the results via majority voting, we generate high-confidence synthetic user-item interactions, supported by theoretical guarantees based on the concentration of measure. To effectively leverage the augmented data in the context of a graph recommendation system, we integrate it into a graph contrastive learning framework to mitigate distributional shift and alleviate popularity bias. Extensive experiments show that our method improves accuracy and reduces popularity bias, outperforming strong baselines.
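
The aggregation step is straightforward to sketch: prompt the LLM several times to rerank candidates, then keep only items that land in the top-k of a majority of runs as synthetic interactions. The function below is a hypothetical illustration of that voting rule; `min_votes` and the top-k cutoff are assumptions.

```python
from collections import Counter

def aggregate_reranks(rankings, k=5, min_votes=None):
    """rankings: one list of item ids per LLM call; returns majority items."""
    min_votes = min_votes or (len(rankings) // 2 + 1)
    votes = Counter(item for r in rankings for item in r[:k])
    return [item for item, v in votes.most_common() if v >= min_votes]

runs = [["a", "b", "c", "d"], ["b", "a", "e", "c"], ["b", "c", "a", "f"]]
print(aggregate_reranks(runs, k=3))   # items in the top-3 of >= 2 of 3 runs
```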

[LG-21] Hierarchical Stochastic Differential Equation Models for Latent Manifold Learning in Neural Time Series

链接: https://arxiv.org/abs/2507.21531
作者: Pedram Rajaei,Maryam Ostadsharif Memar,Navid Ziaei,Behzad Nazari,Ali Yousefi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The manifold hypothesis suggests that high-dimensional neural time series lie on a low-dimensional manifold shaped by simpler underlying dynamics. To uncover this structure, latent dynamical variable models such as state-space models, recurrent neural networks, neural ordinary differential equations, and Gaussian Process Latent Variable Models are widely used. We propose a novel hierarchical stochastic differential equation (SDE) model that balances computational efficiency and interpretability, addressing key limitations of existing methods. Our model assumes the trajectory of a manifold can be reconstructed from a sparse set of samples from the manifold trajectory. The latent space is modeled using Brownian bridge SDEs, with points - specified in both time and value - sampled from a multivariate marked point process. These Brownian bridges define the drift of a second set of SDEs, which are then mapped to the observed data. This yields a continuous, differentiable latent process capable of modeling arbitrarily complex time series as the number of manifold points increases. We derive training and inference procedures and show that the computational cost of inference scales linearly with the length of the observation data. We then validate our model on both synthetic data and neural recordings to demonstrate that it accurately recovers the underlying manifold structure and scales effectively with data dimensionality.
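
As background, the latent construction rests on Brownian bridges pinned at sparse anchor points. A minimal NumPy sketch of sampling one bridge between two anchors is shown below; all parameter values are toy choices, not the paper's.

```python
import numpy as np

def brownian_bridge(x0, x1, t0, t1, n=100, sigma=1.0, seed=None):
    """Sample a Brownian bridge from (t0, x0) to (t1, x1) on n grid points."""
    rng = np.random.default_rng(seed)
    t = np.linspace(t0, t1, n)
    steps = sigma * np.sqrt(np.diff(t)) * rng.standard_normal(n - 1)
    w = np.concatenate([[x0], x0 + np.cumsum(steps)])   # free Brownian motion
    return w + (t - t0) / (t1 - t0) * (x1 - w[-1])      # pin the endpoint

path = brownian_bridge(0.0, 2.0, 0.0, 1.0, seed=0)
print(path[0], path[-1])   # starts at 0.0, ends exactly at 2.0
```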

[LG-22] Multifunctional physical reservoir computing in soft tensegrity robots

链接: https://arxiv.org/abs/2507.21496
作者: Ryo Terajima,Katsuma Inoue,Kohei Nakajima,Yasuo Kuniyoshi
类目: Robotics (cs.RO); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 25 pages, 12 figures. The following article has been accepted by Chaos: An Interdisciplinary Journal of Nonlinear Science

点击查看摘要

Abstract:Recent studies have demonstrated that the dynamics of physical systems can be utilized for the desired information processing under the framework of physical reservoir computing (PRC). Robots with soft bodies are examples of such physical systems, and their nonlinear body-environment dynamics can be used to compute and generate the motor signals necessary for the control of their own behavior. In this simulation study, we extend this approach to control and embed not only one but also multiple behaviors into a type of soft robot called a tensegrity robot. The resulting system, consisting of the robot and the environment, is a multistable dynamical system that converges to different attractors from varying initial conditions. Furthermore, attractor analysis reveals that there exist “untrained attractors” in the state space of the system outside the training data. These untrained attractors reflect the intrinsic properties and structures of the tensegrity robot and its interactions with the environment. The impacts of these recent findings in PRC remain unexplored in embodied AI research. We here illustrate their potential to understand various features of embodied cognition that have not been fully addressed to date.
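
In physical reservoir computing only the linear readout is trained; the body-environment dynamics do the rest. A minimal ridge-regression readout looks like the sketch below, with random numbers standing in for the tensegrity robot's sensed state signals.

```python
import numpy as np

def train_readout(states, targets, ridge=1e-6):
    """Ridge-regression readout: states (T, N) -> targets (T, M)."""
    X = np.hstack([states, np.ones((len(states), 1))])   # bias column
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ targets)

states = np.random.randn(500, 30)    # stand-in for the robot's sensed dynamics
targets = np.sin(np.linspace(0, 20, 500))[:, None]       # desired motor signal
W = train_readout(states, targets)
pred = np.hstack([states, np.ones((500, 1))]) @ W
```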

[LG-23] Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning ICCV2025

链接: https://arxiv.org/abs/2507.21494
作者: Wenxuan Bao,Ruxi Deng,Ruizhong Qiu,Tianxin Wei,Hanghang Tong,Jingrui He
类目: Machine Learning (cs.LG)
*备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their training-free nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client’s unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server’s coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs. Our code is available at this https URL .

[LG-24] Retrieve-Augmented Generation for Speeding up Diffusion Policy without Additional Training

链接: https://arxiv.org/abs/2507.21452
作者: Sodtavilan Odonchimed,Tatsuya Matsushima,Simon Holk,Yusuke Iwasawa,Yutaka Matsuo
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Diffusion Policies (DPs) have attracted attention for their ability to achieve significant accuracy improvements in various imitation learning tasks. However, DPs depend on Diffusion Models, which require multiple noise removal steps to generate a single action, resulting in long generation times. To solve this problem, knowledge distillation-based methods such as Consistency Policy (CP) have been proposed. However, these methods require a significant amount of training time, especially for difficult tasks. In this study, we propose RAGDP (Retrieve-Augmented Generation for Diffusion Policies) as a novel framework that eliminates the need for additional training using a knowledge base to expedite the inference of pre-trained DPs. Concretely, RAGDP encodes observation-action pairs through the DP encoder to construct a vector database of expert demonstrations. During inference, the current observation is embedded, and the most similar expert action is extracted. This extracted action is combined with an intermediate noise removal step to reduce the number of steps required compared to the original diffusion step. We show that by using RAGDP with the base model and existing acceleration methods, we improve the accuracy and speed trade-off with no additional training. Even when accelerating the models 20 times, RAGDP maintains an advantage in accuracy, with a 7% increase over distillation models such as CP.

[LG-25] PVD-ONet: A Multi-scale Neural Operator Method for Singularly Perturbed Boundary Layer Problems

链接: https://arxiv.org/abs/2507.21437
作者: Tiantian Sun,Jian Zu
类目: Machine Learning (cs.LG)
*备注: 34 pages, 14 figures

点击查看摘要

Abstract:Physics-informed neural networks and Physics-informed DeepONet excel in solving partial differential equations; however, they often fail to converge for singularly perturbed problems. To address this, we propose two novel frameworks, Prandtl-Van Dyke neural network (PVD-Net) and its operator learning extension Prandtl-Van Dyke Deep Operator Network (PVD-ONet), which rely solely on governing equations without data. To address varying task-specific requirements, both PVD-Net and PVD-ONet are developed in two distinct versions, tailored respectively for stability-focused and high-accuracy modeling. The leading-order PVD-Net adopts a two-network architecture combined with Prandtl’s matching condition, targeting stability-prioritized scenarios. The high-order PVD-Net employs a five-network design with Van Dyke’s matching principle to capture fine-scale boundary layer structures, making it ideal for high-accuracy scenarios. PVD-ONet generalizes PVD-Net to the operator learning setting by assembling multiple DeepONet modules, directly mapping initial conditions to solution operators and enabling instant predictions for an entire family of boundary layer problems without retraining. Numerical experiments on various models show that our proposed methods consistently outperform existing baselines under various error metrics, thereby offering a powerful new approach for multi-scale problems.

[LG-26] Torque-based Graph Surgery: Enhancing Graph Neural Networks with Hierarchical Rewiring

链接: https://arxiv.org/abs/2507.21422
作者: Sujia Huang,Lele Fu,Zhen Cui,Tong Zhang,Na Song,Bo Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for learning from graph-structured data, leveraging message passing to diffuse information and update node representations. However, most efforts have suggested that native interactions encoded in the graph may not be friendly for this process, motivating the development of graph rewiring methods. In this work, we propose a torque-driven hierarchical rewiring strategy, inspired by the notion of torque in classical mechanics, dynamically modulating message passing to improve representation learning in heterophilous graphs and enhance robustness against noisy graphs. Specifically, we define an interference-aware torque metric that integrates structural distance and energy scores to quantify the perturbation induced by edges, thereby encouraging each node to aggregate information from its nearest low-energy neighbors. We use the metric to hierarchically reconfigure the receptive field of each layer by judiciously pruning high-torque edges and adding low-torque links, suppressing propagation noise and boosting pertinent signals. Extensive evaluations on benchmark datasets show that our approach surpasses state-of-the-art methods on both heterophilous and homophilous graphs, and maintains high accuracy on noisy graphs.
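
The rewiring recipe - score each edge by a product of structural distance and energy, then drop the highest-torque edges - can be sketched as follows. The scoring and `keep_ratio` below are loose stand-ins for the paper's interference-aware metric, not its exact definition.

```python
import random

def prune_high_torque(edges, dist, energy, keep_ratio=0.8):
    """Keep the keep_ratio fraction of edges with the lowest torque score."""
    torque = {e: dist[e] * energy[e] for e in edges}   # distance x energy
    kept = sorted(edges, key=lambda e: torque[e])
    return kept[: int(len(kept) * keep_ratio)]

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
dist = {e: random.random() for e in edges}
energy = {e: random.random() for e in edges}
print(prune_high_torque(edges, dist, energy))
```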

[LG-27] Cascading and Proxy Membership Inference Attacks

链接: https://arxiv.org/abs/2507.21412
作者: Yuntao Du,Jiacheng Li,Yuetian Chen,Kaiyuan Zhang,Zhizhen Yuan,Hanshen Xiao,Bruno Ribeiro,Ninghui Li
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Our code is available at: this https URL

点击查看摘要

Abstract:A Membership Inference Attack (MIA) assesses how much a trained machine learning model reveals about its training data by determining whether specific query instances were included in the dataset. We classify existing MIAs into adaptive or non-adaptive, depending on whether the adversary is allowed to train shadow models on membership queries. In the adaptive setting, where the adversary can train shadow models after accessing query instances, we highlight the importance of exploiting membership dependencies between instances and propose an attack-agnostic framework called Cascading Membership Inference Attack (CMIA), which incorporates membership dependencies via conditional shadow training to boost membership inference performance. In the non-adaptive setting, where the adversary is restricted to training shadow models before obtaining membership queries, we introduce Proxy Membership Inference Attack (PMIA). PMIA employs a proxy selection strategy that identifies samples with similar behaviors to the query instance and uses their behaviors in shadow models to perform a membership posterior odds test for membership inference. We provide theoretical analyses for both attacks, and extensive experimental results demonstrate that CMIA and PMIA substantially outperform existing MIAs in both settings, particularly in the low false-positive regime, which is crucial for evaluating privacy risks.

[LG-28] Data Leakage and Redundancy in the LIT-PCBA Benchmark

链接: https://arxiv.org/abs/2507.21404
作者: Amber Huang,Ian Scott Knight,Slava Naprienko
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:LIT-PCBA is a widely used benchmark for virtual screening, but our audit reveals it is fundamentally compromised. The dataset suffers from egregious data leakage, rampant duplication, and pervasive analog redundancy – flaws that invalidate its use for fair model evaluation. Notably, we identify 2,491 inactives duplicated across training and validation sets, and thousands more repeated within individual data splits (2,945 in training, 789 in validation). Critically, three ligands in the query set – meant to represent unseen test cases – are leaked: two appear in the training set, one in validation. Structural redundancy compounds these issues: for some targets, over 80% of query ligands are near duplicates, with Tanimoto similarity ≥ 0.9. In ALDH1 alone, we find 323 highly similar active pairs between training and validation sets, invalidating claims of chemical diversity. These and other flaws collectively cause models trained on LIT-PCBA to memorize rather than generalize. To demonstrate the consequences of these data integrity failures, we implement a trivial memorization-based baseline – using no learning, no physics, and no modeling – that outperforms state-of-the-art models, including deep neural networks like CHEESE, on LIT-PCBA simply by exploiting these artifacts. Our findings render the benchmark unfit for its intended purpose and call into question previous results based on its use. We share this audit to raise awareness and provide tooling to help the community develop more rigorous and reliable datasets going forward. All scripts necessary to reproduce our audit and the baseline implementation are available at: this https URL

[LG-29] Enabling Pareto-Stationarity Exploration in Multi-Objective Reinforcement Learning: A Multi-Objective Weighted-Chebyshev Actor-Critic Approach

链接: https://arxiv.org/abs/2507.21397
作者: Fnu Hairi,Jiao Yang,Tianchen Zhou,Haibo Yang,Chaosheng Dong,Fan Yang,Michinari Momma,Yan Gao,Jia Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many multi-objective reinforcement learning (MORL) applications, being able to systematically explore the Pareto-stationary solutions under multiple non-convex reward objectives with a theoretical finite-time sample complexity guarantee is an important and yet under-explored problem. This motivates us to take the first step and fill the important gap in MORL. Specifically, in this paper, we propose a Multi-Objective weighted-CHebyshev Actor-critic (MOCHA) algorithm for MORL, which judiciously integrates the weighted-Chebyshev (WC) scalarization and actor-critic framework to enable systematic Pareto-stationarity exploration with a finite-time sample complexity guarantee. The sample complexity result of the MOCHA algorithm reveals an interesting dependency on $p_{\min}$ in finding an $\epsilon$-Pareto-stationary solution, where $p_{\min}$ denotes the minimum entry of a given weight vector $\mathbf{p}$ in the WC scalarization. By carefully choosing learning rates, the sample complexity for each exploration can be $\tilde{\mathcal{O}}(\epsilon^{-2})$. Furthermore, simulation studies on a large KuaiRand offline dataset show that the performance of the MOCHA algorithm significantly outperforms other baseline MORL approaches.
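
For reference, the weighted-Chebyshev scalarization that such methods build on is commonly written as below, with $z^*$ an ideal or reference point; the exact sign and normalization conventions used in the paper may differ.

$$
f_{\mathrm{WC}}(x;\mathbf{p}) \;=\; \max_{1 \le i \le M} \, p_i \left| f_i(x) - z_i^* \right|, \qquad p_i > 0, \;\; \sum_{i=1}^{M} p_i = 1.
$$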

[LG-30] Systolic Array-based Accelerator for State-Space Models

链接: https://arxiv.org/abs/2507.21394
作者: Shiva Raja,Cansu Demirkiran,Aakash Sarkar,Milos Popovic,Ajay Joshi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced in capturing long-range dependencies, they struggle with achieving high accuracy with very long sequences due to limited memory retention (fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory enabling lengthy context window and so they process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models like CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making training and inference both compute- and memory-intensive on conventional CPUs and GPUs. In this paper we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference of SSM-based models for long-range sequence tasks. Within the SA, we propose a versatile processing element (PE) called LIMA-PE to perform traditional and specialized MAC operations to support traditional DNNs and SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average 250x gains in performance and 45x improvement in energy efficiency, at the expense of a 2x increase in area cost over traditional SA-based accelerators, and around 2,000x improvement in latency/inference on LRA datasets compared to GPU kernel operations.

[LG-31] Reservoir Computation with Networks of Differentiating Neuron Ring Oscillators

链接: https://arxiv.org/abs/2507.21377
作者: Alexander Yeung,Peter DelMastro,Arjun Karuvally,Hava Siegelmann,Edward Rietman,Hananel Hazan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Reservoir Computing is a machine learning approach that uses the rich repertoire of complex system dynamics for function approximation. Current approaches to reservoir computing use a network of coupled integrating neurons that require a steady current to maintain activity. Here, we introduce a small world graph of differentiating neurons that are active only when there are changes in input as an alternative to integrating neurons as a reservoir computing substrate. We find the coupling strength and network topology that enable these small world networks to function as an effective reservoir. We demonstrate the efficacy of these networks in the MNIST digit recognition task, achieving a performance of 90.65%, comparable to existing reservoir computing approaches. The findings suggest that differentiating neurons can be a potential alternative to integrating neurons and can provide a sustainable future alternative for power-hungry AI applications.

[LG-32] Load Balancing for AI Training Workloads

链接: https://arxiv.org/abs/2507.21372
作者: Sarah McClure,Sylvia Ratnasamy,Scott Shenker
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.

[LG-33] A Contrastive Diffusion-based Network (CDNet) for Time Series Classification

链接: https://arxiv.org/abs/2507.21357
作者: Yaoyu Zhang,Chi-Guhn Lee
类目: Machine Learning (cs.LG)
*备注: 19 pages, conference

点击查看摘要

Abstract:Deep learning models are widely used for time series classification (TSC) due to their scalability and efficiency. However, their performance degrades under challenging data conditions such as class similarity, multimodal distributions, and noise. To address these limitations, we propose CDNet, a Contrastive Diffusion-based Network that enhances existing classifiers by generating informative positive and negative samples via a learned diffusion process. Unlike traditional diffusion models that denoise individual samples, CDNet learns transitions between samples–both within and across classes–through convolutional approximations of reverse diffusion steps. We introduce a theoretically grounded CNN-based mechanism to enable both denoising and mode coverage, and incorporate an uncertainty-weighted composite loss for robust training. Extensive experiments on the UCR Archive and simulated datasets demonstrate that CDNet significantly improves state-of-the-art (SOTA) deep learning classifiers, particularly under noisy, similar, and multimodal conditions.

[LG-34] DEM-NeRF: A Neuro-Symbolic Method for Scientific Discovery through Physics-Informed Simulation

链接: https://arxiv.org/abs/2507.21350
作者: Wenkai Tan,Alvaro Velasquez,Houbing Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks have emerged as a powerful tool for modeling physical systems, offering the ability to learn complex representations from limited data while integrating foundational scientific knowledge. In particular, neuro-symbolic approaches that combine data-driven learning, the neuro, with symbolic equations and rules, the symbolic, address the tension between methods that are purely empirical, which risk straying from established physical principles, and traditional numerical solvers that demand complete geometric knowledge and can be prohibitively expensive for high-fidelity simulations. In this work, we present a novel neuro-symbolic framework for reconstructing and simulating elastic objects directly from sparse multi-view image sequences, without requiring explicit geometric information. Specifically, we integrate a neural radiance field (NeRF) for object reconstruction with physics-informed neural networks (PINN) that incorporate the governing partial differential equations of elasticity. In doing so, our method learns a spatiotemporal representation of deforming objects that leverages both image supervision and symbolic physical constraints. To handle complex boundary and initial conditions, which are traditionally confronted using finite element methods, boundary element methods, or sensor-based measurements, we employ an energy-constrained Physics-Informed Neural Network architecture. This design enhances both simulation accuracy and the explainability of results.

[LG-35] Blending data and physics for reduced-order modeling of systems with spatiotemporal chaotic dynamics

链接: https://arxiv.org/abs/2507.21299
作者: Alex Guo,Michael D. Graham
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While data-driven techniques are powerful tools for reduced-order modeling of systems with chaotic dynamics, great potential remains for leveraging known physics (i.e. a full-order model (FOM)) to improve predictive capability. We develop a hybrid reduced order model (ROM), informed by both data and FOM, for evolving spatiotemporal chaotic dynamics on an invariant manifold whose coordinates are found using an autoencoder. This approach projects the vector field of the FOM onto the invariant manifold; then, this physics-derived vector field is either corrected using dynamic data, or used as a Bayesian prior that is updated with data. In both cases, the neural ordinary differential equation approach is used. We consider simulated data from the Kuramoto-Sivashinsky and complex Ginzburg-Landau equations. Relative to the data-only approach, for scenarios of abundant data, scarce data, and even an incorrect FOM (i.e. erroneous parameter values), the hybrid approach yields substantially improved time-series predictions.

[LG-36] Large Language Model-Enhanced Reinforcement Learning for Diverse and Novel Recommendations

链接: https://arxiv.org/abs/2507.21274
作者: Jiin Woo,Alireza Bagheri Garakani,Tianchen Zhou,Zhishen Huang,Yan Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recommendation systems, diversity and novelty are essential for capturing varied user preferences and encouraging exploration, yet many systems prioritize click relevance. While reinforcement learning (RL) has been explored to improve diversity, it often depends on random exploration that may not align with user interests. We propose LAAC (LLM-guided Adversarial Actor Critic), a novel method that leverages large language models (LLMs) as reference policies to suggest novel items, while training a lightweight policy to refine these suggestions using system-specific data. The method formulates training as a bilevel optimization between actor and critic networks, enabling the critic to selectively favor promising novel actions and the actor to improve its policy beyond LLM recommendations. To mitigate overestimation of unreliable LLM suggestions, we apply regularization that anchors critic values for unexplored items close to well-estimated dataset actions. Experiments on real-world datasets show that LAAC outperforms existing baselines in diversity, novelty, and accuracy, while remaining robust on imbalanced data, effectively integrating LLM knowledge without expensive fine-tuning.

[LG-37] Deep Polynomial Chaos Expansion UAI2025

链接: https://arxiv.org/abs/2507.21273
作者: Johannes Exenberger,Sascha Ranftl,Robert Peharz
类目: Machine Learning (cs.LG)
*备注: 8th Workshop on Tractable Probabilistic Modeling, UAI 2025

点击查看摘要

Abstract:Polynomial chaos expansion (PCE) is a classical and widely used surrogate modeling technique in physical simulation and uncertainty quantification. By taking a linear combination of a set of basis polynomials - orthonormal with respect to the distribution of uncertain input parameters - PCE enables tractable inference of key statistical quantities, such as (conditional) means, variances, covariances, and Sobol sensitivity indices, which are essential for understanding the modeled system and identifying influential parameters and their interactions. As the number of basis functions grows exponentially with the number of parameters, PCE does not scale well to high-dimensional problems. We address this challenge by combining PCE with ideas from probabilistic circuits, resulting in the deep polynomial chaos expansion (DeepPCE) - a deep generalization of PCE that scales effectively to high-dimensional input spaces. DeepPCE achieves predictive performance comparable to that of multi-layer perceptrons (MLPs), while retaining PCE’s ability to compute exact statistical inferences via simple forward passes.
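
For context, here is the classical (non-deep) PCE that DeepPCE generalizes: a least-squares fit in an orthogonal Hermite basis for a single Gaussian input, from which the mean and variance fall directly out of the coefficients. This is textbook PCE, not the paper's probabilistic-circuit extension.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermevander

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)          # uncertain input, xi ~ N(0, 1)
y = np.sin(x) + 0.5 * x**2             # "expensive" model output (toy)

P = 6                                   # polynomial order
Phi = hermevander(x, P)                 # probabilists' Hermite basis He_0..He_P
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Orthogonality (E[He_k^2] = k!) gives statistics directly from coefficients:
mean = coef[0]
var = sum(coef[k] ** 2 * math.factorial(k) for k in range(1, P + 1))
print(mean, var)
```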

[LG-38] Numerical PDE solvers outperform neural PDE solvers

链接: https://arxiv.org/abs/2507.21269
作者: Patrick Chatain,Michael Rizvi-Martel,Guillaume Rabusseau,Adam Oberman
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:We present DeepFDM, a differentiable finite-difference framework for learning spatially varying coefficients in time-dependent partial differential equations (PDEs). By embedding a classical forward-Euler discretization into a convolutional architecture, DeepFDM enforces stability and first-order convergence via CFL-compliant coefficient parameterizations. Model weights correspond directly to PDE coefficients, yielding an interpretable inverse-problem formulation. We evaluate DeepFDM on a benchmark suite of scalar PDEs - advection, diffusion, advection-diffusion, reaction-diffusion, and inhomogeneous Burgers’ equations - in one, two, and three spatial dimensions. In both in-distribution and out-of-distribution tests (quantified by the Hellinger distance between coefficient priors), DeepFDM attains normalized mean-squared errors one to two orders of magnitude smaller than Fourier Neural Operators, U-Nets and ResNets; requires 10-20X fewer training epochs; and uses 5-50X fewer parameters. Moreover, recovered coefficient fields accurately match ground-truth parameters. These results establish DeepFDM as a robust, efficient, and transparent baseline for data-driven solution and identification of parametric PDEs.
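
The core trick - writing a CFL-stable forward-Euler step as a differentiable stencil so the PDE coefficients become trainable weights - is easy to sketch in 1D. The sketch below assumes periodic boundaries and a heat equation; the paper covers more operators and dimensions.

```python
import torch

def euler_diffusion_step(u, kappa, dx=1.0, dt=0.1):
    """One forward-Euler heat-equation step; stable if dt*kappa/dx^2 <= 1/2."""
    lap = (torch.roll(u, 1, -1) - 2 * u + torch.roll(u, -1, -1)) / dx**2
    return u + dt * kappa * lap          # periodic boundaries via roll

u = torch.zeros(1, 64); u[0, 32] = 1.0                 # initial spike
kappa = torch.full((1, 64), 0.5, requires_grad=True)   # learnable coefficients
for _ in range(50):
    u = euler_diffusion_step(u, kappa)
u.sum().backward()                      # gradients flow back to the coefficients
print(kappa.grad.abs().sum() > 0)
```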

[LG-39] Diffusion Denoiser-Aided Gyrocompassing

链接: https://arxiv.org/abs/2507.21245
作者: Gershy Ben-Arie,Daniel Engelsman,Rotem Dror,Itzik Klein
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:An accurate initial heading angle is essential for efficient and safe navigation across diverse domains. Unlike magnetometers, gyroscopes can provide accurate heading reference independent of the magnetic disturbances in a process known as gyrocompassing. Yet, accurate and timely gyrocompassing, using low-cost gyroscopes, remains a significant challenge in scenarios where external navigation aids are unavailable. Such challenges are commonly addressed in real-world applications such as autonomous vehicles, where size, weight, and power limitations restrict sensor quality, and noisy measurements severely degrade gyrocompassing performance. To cope with this challenge, we propose a novel diffusion denoiser-aided gyrocompass approach. It integrates a diffusion-based denoising framework with an enhanced learning-based heading estimation model. The diffusion denoiser processes raw inertial sensor signals before input to the deep learning model, resulting in accurate gyrocompassing. Experiments using both simulated and real sensor data demonstrate that our proposed approach improves gyrocompassing accuracy by 26% compared to model-based gyrocompassing and by 15% compared to other learning-driven approaches. This advancement holds particular significance for ensuring accurate and robust navigation in autonomous platforms that incorporate low-cost gyroscopes within their navigation systems.

[LG-40] Fluidically Innervated Lattices Make Versatile and Durable Tactile Sensors

链接: https://arxiv.org/abs/2507.21225
作者: Annan Zhang,Miguel Flores-Acton,Andy Yu,Anshul Gupta,Maggie Yao,Daniela Rus
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted for publication in the proceedings of the 2025 International Symposium on Experimental Robotics (ISER)

点击查看摘要

Abstract:Tactile sensing plays a fundamental role in enabling robots to navigate dynamic and unstructured environments, particularly in applications such as delicate object manipulation, surface exploration, and human-robot interaction. In this paper, we introduce a passive soft robotic fingertip with integrated tactile sensing, fabricated using a 3D-printed elastomer lattice with embedded air channels. This sensorization approach, termed fluidic innervation, transforms the lattice into a tactile sensor by detecting pressure changes within sealed air channels, providing a simple yet robust solution to tactile sensing in robotics. Unlike conventional methods that rely on complex materials or designs, fluidic innervation offers a simple, scalable, single-material fabrication process. We characterize the sensors’ response, develop a geometric model to estimate tip displacement, and train a neural network to accurately predict contact location and contact force. Additionally, we integrate the fingertip with an admittance controller to emulate spring-like behavior, demonstrate its capability for environment exploration through tactile feedback, and validate its durability under high impact and cyclic loading conditions. This tactile sensing technique offers advantages in terms of simplicity, adaptability, and durability and opens up new opportunities for versatile robotic manipulation.

[LG-41] Combolutional Neural Networks

链接: https://arxiv.org/abs/2507.21202
作者: Cameron Churchwell,Minje Kim,Paris Smaragdis
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 4 pages, 3 figures, accepted to WASPAA 2025

点击查看摘要

Abstract:Selecting appropriate inductive biases is an essential step in the design of machine learning models, especially when working with audio, where even short clips may contain millions of samples. To this end, we propose the combolutional layer: a learned-delay IIR comb filter and fused envelope detector, which extracts harmonic features in the time domain. We demonstrate the efficacy of the combolutional layer on three information retrieval tasks, evaluate its computational cost relative to other audio frontends, and provide efficient implementations for training. We find that the combolutional layer is an effective replacement for convolutional layers in audio tasks where precise harmonic analysis is important, e.g., piano transcription, speaker classification, and key detection. Additionally, the combolutional layer has several other key benefits over existing frontends, namely: low parameter count, efficient CPU inference, strictly real-valued computations, and improved interpretability.
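
A naive, loop-based sketch of the comb-filter-plus-envelope computation is shown below, using a fixed integer delay instead of the paper's learned delay; `alpha` and the window length are arbitrary illustrative choices.

```python
import torch

def comb_envelope(x, delay, alpha=0.9, win=256):
    """IIR comb y[n] = x[n] + alpha*y[n-delay], then a moving-average envelope."""
    y = x.clone()
    for n in range(delay, x.shape[-1]):            # naive per-sample loop
        y[..., n] = x[..., n] + alpha * y[..., n - delay].clone()
    env = torch.nn.functional.avg_pool1d(y.abs().unsqueeze(1), win, stride=win)
    return env.squeeze(1)                          # coarse harmonic energy

sr = 16000
t = torch.arange(sr) / sr
x = torch.sin(2 * torch.pi * 220 * t).unsqueeze(0)    # a 220 Hz tone
print(comb_envelope(x, delay=round(sr / 220)).shape)  # delay tuned to the pitch
```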

[LG-42] AdaptHetero: Machine Learning Interpretation-Driven Subgroup Adaptation for EHR-Based Clinical Prediction

链接: https://arxiv.org/abs/2507.21197
作者: Ling Liao,Eva Aagaard
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Machine learning interpretation (MLI) has primarily been leveraged to build clinician trust and uncover actionable insights in EHRs. However, the intrinsic complexity and heterogeneity of EHR data limit its effectiveness in guiding subgroup-specific modeling. We propose AdaptHetero, a novel MLI-driven framework that transforms interpretability insights into actionable guidance for tailoring model training and evaluation across subpopulations within individual hospital systems. Evaluated on three large-scale EHR datasets - GOSSIS-1-eICU, WiDS, and MIMIC-IV - AdaptHetero consistently identifies heterogeneous model behaviors in predicting ICU mortality, in-hospital death, and hidden hypoxemia. By integrating SHAP-based interpretation and unsupervised clustering, the framework enhances the identification of clinically meaningful subgroup-specific characteristics, leading to improved predictive performance.

[LG-43] Interpretable Anomaly-Based DDoS Detection in AI-RAN with XAI and LLMs

链接: https://arxiv.org/abs/2507.21193
作者: Sotiris Chatzimiltis,Mohammad Shojafar,Mahdi Boloursaz Mashhadi,Rahim Tafazolli
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Next generation Radio Access Networks (RANs) introduce programmability, intelligence, and near real-time control through intelligent controllers, enabling enhanced security within the RAN and across broader 5G/6G infrastructures. This paper presents a comprehensive survey highlighting opportunities, challenges, and research gaps for Large Language Models (LLMs)-assisted explainable (XAI) intrusion detection (IDS) for secure future RAN environments. Motivated by this, we propose an LLM interpretable anomaly-based detection system for distributed denial-of-service (DDoS) attacks using multivariate time series key performance measures (KPMs), extracted from E2 nodes, within the Near Real-Time RAN Intelligent Controller (Near-RT RIC). An LSTM-based model is trained to identify malicious User Equipment (UE) behavior based on these KPMs. To enhance transparency, we apply post-hoc local explainability methods such as LIME and SHAP to interpret individual predictions. Furthermore, LLMs are employed to convert technical explanations into natural-language insights accessible to non-expert users. Experimental results on real 5G network KPMs demonstrate that our framework achieves high detection accuracy (F1-score 0.96) while delivering actionable and interpretable outputs.

[LG-44] Exploring Adaptive Structure Learning for Heterophilic Graphs ICLR2025

链接: https://arxiv.org/abs/2507.21191
作者: Garv Kaushik
类目: Machine Learning (cs.LG)
*备注: Initially submitted this draft at Tiny ICLR 2025

点击查看摘要

Abstract:Graph Convolutional Networks (GCNs) gained traction for graph representation learning, with recent attention on improving performance on heterophilic graphs for various real-world applications. The localized feature aggregation in a typical message-passing paradigm hinders the capturing of long-range dependencies between non-local nodes of the same class. The inherent connectivity structure in heterophilic graphs often conflicts with information sharing between distant nodes of same class. We propose structure learning to rewire edges in shallow GCNs itself to avoid performance degradation in downstream discriminative tasks due to oversmoothing. Parameterizing the adjacency matrix to learn connections between non-local nodes and extend the hop span of shallow GCNs facilitates the capturing of long-range dependencies. However, our method does not generalize across all heterophilic graphs and performs inconsistently on the node classification task, contingent on the graph structure.

[LG-45] Beyond Neural Networks: Symbolic Reasoning over Wavelet Logic Graph Signals

链接: https://arxiv.org/abs/2507.21190
作者: Andrew Kiruluta,Andreas Lemos,Priscilla Burity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a fully non-neural learning framework based on Graph Laplacian Wavelet Transforms (GLWT). Unlike traditional architectures that rely on convolutional, recurrent, or attention based neural networks, our model operates purely in the graph spectral domain using structured multiscale filtering, nonlinear shrinkage, and symbolic logic over wavelet coefficients. Signals defined on graph nodes are decomposed via GLWT, modulated with interpretable nonlinearities, and recombined for downstream tasks such as denoising and token classification. The system supports compositional reasoning through a symbolic domain-specific language (DSL) over graph wavelet activations. Experiments on synthetic graph denoising and linguistic token graphs demonstrate competitive performance against lightweight GNNs with far greater transparency and efficiency. This work proposes a principled, interpretable, and resource-efficient alternative to deep neural architectures for learning on graphs.

[LG-46] Operator-Based Machine Intelligence: A Hilbert Space Framework for Spectral Learning and Symbolic Reasoning

链接: https://arxiv.org/abs/2507.21189
作者: Andrew Kiruluta,Andreas Lemos,Priscilla Burity
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional machine learning models, particularly neural networks, are rooted in finite-dimensional parameter spaces and nonlinear function approximations. This report explores an alternative formulation where learning tasks are expressed as sampling and computation in infinite dimensional Hilbert spaces, leveraging tools from functional analysis, signal processing, and spectral theory. We review foundational concepts such as Reproducing Kernel Hilbert Spaces (RKHS), spectral operator learning, and wavelet-domain representations. We present a rigorous mathematical formulation of learning in Hilbert spaces, highlight recent models based on scattering transforms and Koopman operators, and discuss advantages and limitations relative to conventional neural architectures. The report concludes by outlining directions for scalable and interpretable machine learning grounded in Hilbertian signal processing.

[LG-47] FedBAP: Backdoor Defense via Benign Adversarial Perturbation in Federated Learning

链接: https://arxiv.org/abs/2507.21177
作者: Xinhai Yan,Libing Wu,Zhuangzhuang Zhang,Bingyi Liu,Lijuan Huo,Jing Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to ACM Multimedia 2025

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training while preserving data privacy, but it is highly vulnerable to backdoor attacks. Most existing defense methods in FL have limited effectiveness due to their neglect of the model’s over-reliance on backdoor triggers, particularly as the proportion of malicious clients increases. In this paper, we propose FedBAP, a novel defense framework for mitigating backdoor attacks in FL by reducing the model’s reliance on backdoor triggers. Specifically, first, we propose a perturbed trigger generation mechanism that creates perturbation triggers precisely matching backdoor triggers in location and size, ensuring strong influence on model outputs. Second, we utilize these perturbation triggers to generate benign adversarial perturbations that disrupt the model’s dependence on backdoor triggers while forcing it to learn more robust decision boundaries. Finally, we design an adaptive scaling mechanism to dynamically adjust perturbation intensity, effectively balancing defense strength and model performance. The experimental results demonstrate that FedBAP reduces the attack success rates by 0.22%-5.34%, 0.48%-6.34%, and 97.22%-97.6% under three types of backdoor attacks, respectively. In particular, FedBAP demonstrates outstanding performance against novel backdoor attacks.

[LG-48] SPADE-S: A Sparsity-Robust Foundational Forecaster

链接: https://arxiv.org/abs/2507.21155
作者: Malcolm Wolff,Matthew Li,Ravi Kiran Selvam,Hanjing Zhu,Kin G. Olivares,Ruijun Ma,Abhinav Katoch,Shankar Ramasubramanian,Mengfei Cao,Roberto Bandarra,Rahul Gopalsamy,Stefania La Vattiata,Sitan Yang,Michael M. Mahoney
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Despite significant advancements in time series forecasting, accurate modeling of time series with strong heterogeneity in magnitude and/or sparsity patterns remains challenging for state-of-the-art deep learning architectures. We identify several factors that lead existing models to systematically underperform on low-magnitude and sparse time series, including loss functions with implicit biases toward high-magnitude series, training-time sampling methods, and limitations of time series encoding methods. SPADE-S is a robust forecasting architecture that significantly reduces magnitude- and sparsity-based systematic biases and improves overall prediction accuracy. Empirical results demonstrate that SPADE-S outperforms existing state-of-the-art approaches across a diverse set of use cases in demand forecasting. In particular, we show that, depending on the quantile forecast and magnitude of the series, SPADE-S can improve forecast accuracy by up to 15%. This results in P90 overall forecast accuracy gains of 2.21%, 6.58%, and 4.28%, and P50 forecast accuracy gains of 0.92%, 0.77%, and 1.95%, respectively, for each of three distinct datasets, ranging from 3 million to 700 million series, from a large online retailer.

[LG-49] Quantum Geometry of Data

链接: https://arxiv.org/abs/2507.21135
作者: Alexander G. Abanov,Luca Candelori,Harold C. Steinacker,Martin T. Wells,Jerome R. Busemeyer,Cameron J. Hogan,Vahagn Kirakosyan,Nicola Marzari,Sunil Pinnamaneni,Dario Villani,Mengjia Xu,Kharen Musaelian
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
*备注: 27 pages, 14 figures, 1 table

点击查看摘要

Abstract:We demonstrate how Quantum Cognition Machine Learning (QCML) encodes data as quantum geometry. In QCML, features of the data are represented by learned Hermitian matrices, and data points are mapped to states in Hilbert space. The quantum geometry description endows the dataset with rich geometric and topological structure - including intrinsic dimension, quantum metric, and Berry curvature - derived directly from the data. QCML captures global properties of data, while avoiding the curse of dimensionality inherent in local methods. We illustrate this on a number of synthetic and real-world examples. Quantum geometric representation of QCML could advance our understanding of cognitive phenomena within the framework of quantum cognition.

[LG-50] Pre- In- and Post-Processing Class Imbalance Mitigation Techniques for Failure Detection in Optical Networks

链接: https://arxiv.org/abs/2507.21119
作者: Yousuf Moiz Ali,Jaroslaw E. Prilepsky,Nicola Sambo,João Pedro,Mohammad M. Hosseini,Antonio Napoli,Sergei K. Turitsyn,Pedro Freire
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optics (physics.optics)
*备注: 3 pages + 1 page for acknowledgement and references

点击查看摘要

Abstract:We compare pre-, in-, and post-processing techniques for class imbalance mitigation in optical network failure detection. Threshold Adjustment achieves the highest F1 gain (15.3%), while Random Under-sampling (RUS) offers the fastest inference, highlighting a key performance-complexity trade-off.
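
Of the compared techniques, threshold adjustment is the simplest to illustrate: leave the trained classifier untouched and tune the decision threshold on validation data. A generic scikit-learn sketch (synthetic imbalanced data, F1 as the selection criterion) follows; it is not tied to the paper's optical-network setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xva)[:, 1]

grid = np.linspace(0.05, 0.95, 19)
best = max(grid, key=lambda t: f1_score(yva, proba >= t))
print(f"F1 @ 0.50: {f1_score(yva, proba >= 0.5):.3f}  "
      f"F1 @ {best:.2f}: {f1_score(yva, proba >= best):.3f}")
```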

[LG-51] Higher-Order Kuramoto Oscillator Network for Dense Associative Memory

链接: https://arxiv.org/abs/2507.21984
作者: Jona Nagerl,Natalia G. Berloff
类目: Adaptation and Self-Organizing Systems (nlin.AO); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Networks of phase oscillators can serve as dense associative memories if they incorporate higher-order coupling beyond the classical Kuramoto model’s pairwise interactions. Here we introduce a generalized Kuramoto model with combined second-harmonic (pairwise) and fourth-harmonic (quartic) coupling, inspired by dense Hopfield memory theory. Using mean-field theory and its dynamical approximation, we obtain a phase diagram for the dense associative memory model that exhibits a tricritical point at which the continuous onset of memory retrieval is supplanted by a discontinuous, hysteretic transition. In the quartic-dominated regime, the system supports bistable phase-locked states corresponding to stored memory patterns, with a sizable energy barrier between memory and incoherent states. We analytically determine this bistable region and show that the escape time from a memory state (due to noise) grows exponentially with network size, indicating robust storage. Extending the theory to finite memory load, we show that higher-order couplings achieve superlinear scaling of memory capacity with system size, far exceeding the limit of pairwise-only oscillators. Large-scale simulations of the oscillator network confirm our theoretical predictions, demonstrating rapid pattern retrieval and robust storage of many phase patterns. These results bridge Kuramoto synchronization with modern Hopfield memories, pointing toward experimental realization of high-capacity, analog associative memory in oscillator systems.

[LG-52] Reducing Data Requirements for Sequence-Property Prediction in Copolymer Compatibilizers via Deep Neural Network Tuning

链接: https://arxiv.org/abs/2507.21902
作者: Md Mushfiqul Islam,Nishat N. Labiba,Lawrence O. Hall,David S. Simmons
类目: Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:Synthetic sequence-controlled polymers promise to transform polymer science by combining the chemical versatility of synthetic polymers with the precise sequence-mediated functionality of biological proteins. However, design of these materials has proven extraordinarily challenging, because they lack the massive datasets of closely related evolved molecules that accelerate design of proteins. Here we report on a new Artificial Intelligence strategy to dramatically reduce the amount of data necessary to accelerate these materials’ design. We focus on data connecting the repeat-unit-sequence of a compatibilizer molecule to its ability to reduce the interfacial tension between distinct polymer domains. The optimal sequence of these molecules, which are essential for applications such as mixed-waste polymer recycling, depends strongly on variables such as concentration and chemical details of the polymer. With current methods, this would demand an entirely distinct dataset to enable design at each condition. Here we show that a deep neural network trained on low-fidelity data for sequence/interfacial tension relations at one set of conditions can be rapidly tuned to make higher-fidelity predictions at a distinct set of conditions, requiring far less data than would ordinarily be needed. This priming-and-tuning approach should allow a single low-fidelity parent dataset to dramatically accelerate prediction and design in an entire constellation of related systems. In the long run, it may also provide an approach to bootstrapping quantitative atomistic design with AI insights from fast, coarse simulations.

[LG-53] Representations in vision and language converge in a shared multidimensional space of perceived similarities

链接: https://arxiv.org/abs/2507.21871
作者: Katerina Marie Simkova,Adrien Doerig,Clayton Hickey,Ian Charest
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 51 pages, 15 figures

点击查看摘要

Abstract:Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge. Emerging evidence suggests that human brain representations in both vision and language are well predicted by semantic feature spaces obtained from large language models (LLMs). This raises the possibility that sensory systems converge in their inherent ability to transform their inputs onto shared, embedding-like representational space. However, it remains unclear how such a space manifests in human behaviour. To investigate this, sixty-three participants performed behavioural similarity judgements separately on 100 natural scene images and 100 corresponding sentence captions from the Natural Scenes Dataset. We found that visual and linguistic similarity judgements not only converge at the behavioural level but also predict a remarkably similar network of fMRI brain responses evoked by viewing the natural scene images. Furthermore, computational models trained to map images onto LLM-embeddings outperformed both category-trained and AlexNet controls in explaining the behavioural similarity structure. These findings demonstrate that human visual and linguistic similarity judgements are grounded in a shared, modality-agnostic representational structure that mirrors how the visual system encodes experience. The convergence between sensory and artificial systems suggests a common principle of how conceptual representations are formed: not as arbitrary products of first-order, modality-specific input, but as structured representations that reflect the stable, relational properties of the external world.

[LG-54] MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation

链接: https://arxiv.org/abs/2507.21807
作者: Robert Kuchen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 2 algorithms, includes a simulation study

点击查看摘要

Abstract:Statistical learning methods for automated variable selection, such as LASSO, elastic nets, or gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which creates several completed datasets. However, there is an ongoing debate on how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches modify the regularization methods LASSO and elastic nets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets. Simulation studies suggest that our approach yields prediction performance comparable to that of these recently proposed methods.
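
The unified selection mechanism can be sketched as component-wise L2 boosting in which, at every step, each candidate variable is scored by its loss summed over all m imputed datasets, so a single coefficient path is shared across imputations. The sketch below is a simplified reading of that idea, assuming linear base learners and squared error; it is not the authors' exact algorithm.

```python
import numpy as np

def miboost(Xs, ys, n_steps=50, nu=0.1):
    """Xs: list of m (n, p) imputed designs; ys: list of m (n,) responses."""
    p = Xs[0].shape[1]
    beta = np.zeros(p)
    res = [y.astype(float).copy() for y in ys]
    for _ in range(n_steps):
        loss, coef = np.full(p, np.inf), np.zeros(p)
        for j in range(p):
            num = sum(X[:, j] @ r for X, r in zip(Xs, res))
            den = sum(X[:, j] @ X[:, j] for X in Xs)
            coef[j] = num / den
            loss[j] = sum(((r - coef[j] * X[:, j]) ** 2).sum()
                          for X, r in zip(Xs, res))
        j = int(np.argmin(loss))       # one shared variable for all imputations
        beta[j] += nu * coef[j]
        res = [r - nu * coef[j] * X[:, j] for X, r in zip(Xs, res)]
    return beta

rng = np.random.default_rng(1)
Xs = [rng.standard_normal((100, 5)) for _ in range(3)]
true = np.array([1.0, 0.0, 0.0, 0.5, 0.0])
ys = [X @ true + 0.1 * rng.standard_normal(100) for X in Xs]
print(miboost(Xs, ys).round(2))
```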

[LG-55] Domain Generalization and Adaptation in Intensive Care with Anchor Regression

Link: https://arxiv.org/abs/2507.21783
Authors: Malte Londschien, Manuel Burger, Gunnar Rätsch, Peter Bühlmann
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:The performance of predictive models in clinical settings often degrades when deployed in new hospitals due to distribution shifts. This paper presents a large-scale study of causality-inspired domain generalization on heterogeneous multi-center intensive care unit (ICU) data. We apply anchor regression and introduce anchor boosting, a novel, tree-based nonlinear extension, to a large dataset comprising 400,000 patients from nine distinct ICU databases. The anchor regularization consistently improves out-of-distribution performance, particularly for the most dissimilar target domains. The methods appear robust to violations of theoretical assumptions, such as anchor exogeneity. Furthermore, we propose a novel conceptual framework to quantify the utility of large external datasets. By evaluating performance as a function of available target-domain data, we identify three regimes: (i) a domain generalization regime, where only the external model should be used, (ii) a domain adaptation regime, where refitting the external model is optimal, and (iii) a data-rich regime, where external data provides no additional value.
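
For context, anchor regression has a convenient closed form: ordinary least squares on data transformed by W = I + (√γ − 1)Π_A, where Π_A projects onto the span of the anchor variables. The numpy sketch below encodes that known identity on synthetic placeholder data; the anchors, features, and γ value are illustrative only.

```python
import numpy as np

def anchor_regression(X, y, A, gamma):
    """Anchor regression via the transformation
    W = I + (sqrt(gamma) - 1) * P_A, followed by ordinary least squares."""
    P_A = A @ np.linalg.pinv(A)                   # projection onto the anchor space
    shift = np.sqrt(gamma) - 1.0
    X_t = X + shift * (P_A @ X)
    y_t = y + shift * (P_A @ y)
    beta, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    return beta

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 2))                     # anchors (e.g., database indicators)
X = A @ rng.normal(size=(2, 5)) + rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=500)
print(anchor_regression(X, y, A, gamma=5.0))      # gamma=1 recovers plain OLS
```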

[LG-56] Unified machine-learning framework for property prediction and time-evolution simulation of strained alloy microstructure

Link: https://arxiv.org/abs/2507.21760
Authors: Andrea Fantasia, Daniele Lanzoni, Niccolò Di Eugenio, Angelo Monteleone, Roberto Bergamaschini, Francesco Montalenti
Subjects: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 19 pages, 9 figures

Click to view abstract

Abstract:We introduce a unified machine-learning framework designed to conveniently tackle the temporal evolution of alloy microstructures under the influence of an elastic field. This approach allows for the simultaneous extraction of elastic parameters from a short trajectory and for the prediction of further microstructure evolution under their influence. This is demonstrated by focusing on spinodal decomposition in the presence of a lattice mismatch η, and by carrying out an extensive comparison between the ground-truth evolution supplied by phase field simulations and the predictions of suitable convolutional recurrent neural network architectures. The two tasks may then be performed sequentially in a cascade framework. Under a wide spectrum of misfit conditions, the here-presented cascade model accurately predicts η and the full corresponding microstructure evolution, also when approaching critical conditions for spinodal decomposition. Scalability to larger computational domain sizes and mild extrapolation errors in time (for time sequences five times longer than the sampled ones during training) are demonstrated. The proposed framework is general and can be applied beyond the specific, prototypical system considered here as an example. Intriguingly, experimental videos could be used to infer unknown external parameters, prior to simulating further temporal evolution.

[LG-57] Riemannian Optimization on Tree Tensor Networks with Application in Machine Learning

Link: https://arxiv.org/abs/2507.21726
Authors: Marius Willner, Marco Trenti, Dirk Lebiedz
Subjects: Optimization and Control (math.OC); Other Condensed Matter (cond-mat.other); Machine Learning (cs.LG)
Comments: 24 pages, 6 figures, 4 pseudo-code algorithms, 1 table

Click to view abstract

Abstract:Tree tensor networks (TTNs) are widely used in low-rank approximation and quantum many-body simulation. In this work, we present a formal analysis of the differential geometry underlying TTNs. Building on this foundation, we develop efficient first- and second-order optimization algorithms that exploit the intrinsic quotient structure of TTNs. Additionally, we devise a backpropagation algorithm for training TTNs in a kernel learning setting. We validate our methods through numerical experiments on a representative machine learning task.

[LG-58] An Equal-Probability Partition of the Sample Space: A Non-parametric Inference from Finite Samples

Link: https://arxiv.org/abs/2507.21712
Authors: Urban Eriksson
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:This paper investigates what can be inferred about an arbitrary continuous probability distribution from a finite sample of N observations drawn from it. The central finding is that the N sorted sample points partition the real line into N+1 segments, each carrying an expected probability mass of exactly 1/(N+1) . This non-parametric result, which follows from fundamental properties of order statistics, holds regardless of the underlying distribution’s shape. This equal-probability partition yields a discrete entropy of \log_2(N+1) bits, which quantifies the information gained from the sample and contrasts with Shannon’s results for continuous variables. I compare this partition-based framework to the conventional ECDF and discuss its implications for robust non-parametric inference, particularly in density and tail estimation.
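
The 1/(N+1) result is easy to check numerically: for any continuous distribution, the expected CDF mass between consecutive order statistics of an N-sample equals 1/(N+1). The following Monte Carlo check is my own illustration, not the paper's code; the lognormal choice is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, reps = 9, 20000
dist = stats.lognorm(s=1.0)                       # any continuous distribution works

masses = np.zeros((reps, N + 1))
for r in range(reps):
    x = np.sort(dist.rvs(size=N, random_state=rng))
    cdf = np.concatenate(([0.0], dist.cdf(x), [1.0]))
    masses[r] = np.diff(cdf)                      # probability mass of each segment

print(masses.mean(axis=0))                        # each close to 1/(N+1) = 0.1
print(np.log2(N + 1), "bits of discrete entropy") # the log2(N+1) quantity from the paper
```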

[LG-59] An em algorithm for quantum Boltzmann machines

Link: https://arxiv.org/abs/2507.21569
Authors: Takeshi Kimura, Kohtaro Kato, Masahito Hayashi
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Main text: 10 pages, 2 figures. Appendix: 3 pages, 1 figure

Click to view abstract

Abstract:We develop a quantum version of the em algorithm for training quantum Boltzmann machines. The em algorithm is an information-geometric extension of the well-known expectation-maximization (EM) algorithm, offering a structured alternative to gradient-based methods with potential advantages in stability and convergence. We implement the algorithm on a semi-quantum restricted Boltzmann machine, where quantum effects are confined to the hidden layer. This structure enables analytical update rules while preserving quantum expressivity. Numerical experiments on benchmark datasets show that the proposed method achieves stable learning and outperforms gradient-based training in several cases. These results demonstrate the potential of information-geometric optimization for quantum machine learning, particularly in settings where standard methods struggle due to non-commutativity or vanishing gradients.

[LG-60] On Policy Stochasticity in Mutual Information Optimal Control of Linear Systems

Link: https://arxiv.org/abs/2507.21543
Authors: Shoju Enami, Kenji Kashima
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 17 pages

Click to view abstract

Abstract:In recent years, mutual information optimal control has been proposed as an extension of maximum entropy optimal control. Both approaches introduce regularization terms to render the policy stochastic, and it is important to theoretically clarify the relationship between the temperature parameter (i.e., the coefficient of the regularization term) and the stochasticity of the policy. Unlike in maximum entropy optimal control, this relationship remains unexplored in mutual information optimal control. In this paper, we investigate this relationship for a mutual information optimal control problem (MIOCP) of discrete-time linear systems. After extending the result of a previous study of the MIOCP, we establish the existence of an optimal policy of the MIOCP, and then derive the respective conditions on the temperature parameter under which the optimal policy becomes stochastic and deterministic. Furthermore, we also derive the respective conditions on the temperature parameter under which the policy obtained by an alternating optimization algorithm becomes stochastic and deterministic. The validity of the theoretical results is demonstrated through numerical experiments.

[LG-61] Stochastic forest transition model dynamics and parameter estimation via deep learning

Link: https://arxiv.org/abs/2507.21486
Authors: Satoshi Kumabe, Tianyu Song, Ton Viet Ta
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Forest transitions, characterized by dynamic shifts between forest, agricultural, and abandoned lands, are complex phenomena. This study developed a stochastic differential equation model to capture the intricate dynamics of these transitions. We established the existence of global positive solutions for the model and conducted numerical analyses to assess the impact of model parameters on deforestation incentives. To address the challenge of parameter estimation, we proposed a novel deep learning approach that estimates all model parameters from a single sample containing time-series observations of forest and agricultural land proportions. This innovative approach enables us to understand forest transition dynamics and deforestation trends at any future time.
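
A generic way to simulate such a stochastic land-use model is the Euler-Maruyama scheme. The sketch below integrates a toy two-compartment SDE for (forest, agricultural) proportions; the drift and diffusion terms are placeholders of my own, not the paper's model.

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, T=50.0, n=5000, seed=0):
    """Generic Euler-Maruyama integrator for dX = f(X) dt + g(X) dW."""
    rng = np.random.default_rng(seed)
    dt = T / n
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n):
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + drift(x) * dt + diffusion(x) * dW
        x = np.clip(x, 0.0, 1.0)                  # land proportions stay in [0, 1]
        path.append(x.copy())
    return np.array(path)

# Toy drift/diffusion for (forest, agricultural) proportions; the remainder
# 1 - x.sum() plays the role of abandoned land.
drift = lambda x: np.array([0.05 * (1 - x.sum()) - 0.03 * x[0],
                            0.03 * x[0] - 0.02 * x[1]])
diffusion = lambda x: 0.02 * np.sqrt(np.maximum(x, 0.0))
path = euler_maruyama(drift, diffusion, x0=[0.6, 0.2])
print(path[-1])                                   # final (forest, agricultural) state
```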

[LG-62] From Global to Local: A Scalable Benchmark for Local Posterior Sampling

Link: https://arxiv.org/abs/2507.21449
Authors: Rohan Hitchcock, Jesse Hoogland
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 25 pages

Click to view abstract

Abstract:Degeneracy is an inherent feature of the loss landscape of neural networks, but it is not well understood how stochastic gradient MCMC (SGMCMC) algorithms interact with this degeneracy. In particular, current global convergence guarantees for common SGMCMC algorithms rely on assumptions which are likely incompatible with degenerate loss landscapes. In this paper, we argue that this gap requires a shift in focus from global to local posterior sampling, and, as a first step, we introduce a novel scalable benchmark for evaluating the local sampling performance of SGMCMC algorithms. We evaluate a number of common algorithms, and find that RMSProp-preconditioned SGLD is most effective at faithfully representing the local geometry of the posterior distribution. Although we lack theoretical guarantees about global sampler convergence, our empirical results show that we are able to extract non-trivial local information in models with up to O(100M) parameters.
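
For context, RMSProp-preconditioned SGLD scales both the gradient step and the injected noise by a diagonal preconditioner built from a running second-moment estimate. Below is a minimal sketch on a toy 2-D Gaussian posterior; the step size, decay, and the omission of pSGLD's small curvature-correction term are simplifying assumptions of mine.

```python
import numpy as np

def grad_log_post(theta):
    # Toy target: standard 2-D Gaussian posterior; gradient of the log-density.
    return -theta

rng = np.random.default_rng(0)
theta, v = np.zeros(2), np.zeros(2)
eps, beta, lam = 1e-2, 0.99, 1e-5
samples = []
for t in range(20000):
    g = grad_log_post(theta)                  # in practice: a minibatch estimate
    v = beta * v + (1 - beta) * g * g         # RMSProp second-moment estimate
    M = 1.0 / (np.sqrt(v) + lam)              # diagonal preconditioner
    theta = theta + 0.5 * eps * M * g \
            + np.sqrt(eps * M) * rng.normal(size=2)
    samples.append(theta.copy())

print(np.cov(np.array(samples)[5000:].T))     # roughly the identity for this toy target
```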

[LG-63] Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations INTERSPEECH2025

Link: https://arxiv.org/abs/2507.21448
Authors: Teng (Aleksandra) Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Subjects: Audio and Speech Processing (eess.AS); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted into Interspeech 2025

Click to view abstract

Abstract:Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.

[LG-64] Measuring Sample Quality with Copula Discrepancies

Link: https://arxiv.org/abs/2507.21434
Authors: Agnideep Aich, Ashit Baran Aich, Bruce Wade
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The scalable Markov chain Monte Carlo (MCMC) algorithms that underpin modern Bayesian machine learning, such as Stochastic Gradient Langevin Dynamics (SGLD), sacrifice asymptotic exactness for computational speed, creating a critical diagnostic gap: traditional sample quality measures fail catastrophically when applied to biased samplers. While powerful Stein-based diagnostics can detect distributional mismatches, they provide no direct assessment of dependence structure, often the primary inferential target in multivariate problems. We introduce the Copula Discrepancy (CD), a principled and computationally efficient diagnostic that leverages Sklar’s theorem to isolate and quantify the fidelity of a sample’s dependence structure independent of its marginals. Our theoretical framework provides the first structure-aware diagnostic specifically designed for the era of approximate inference. Empirically, we demonstrate that a moment-based CD dramatically outperforms standard diagnostics like effective sample size for hyperparameter selection in biased MCMC, correctly identifying optimal configurations where traditional methods fail. Furthermore, our robust MLE-based variant can detect subtle but critical mismatches in tail dependence that remain invisible to rank correlation-based approaches, distinguishing between samples with identical Kendall’s tau but fundamentally different extreme-event behavior. With computational overhead orders of magnitude lower than existing Stein discrepancies, the CD provides both immediate practical value for MCMC practitioners and a theoretical foundation for the next generation of structure-aware sample quality assessment.
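
The paper's exact estimator is not reproduced here, but a moment-based copula discrepancy in the spirit described could compare a dependence moment such as Kendall's tau of the sample against the target's. The function below is a hypothetical sketch of that idea, using the Gaussian-copula identity τ = (2/π)·arcsin(ρ) to set the reference value.

```python
import numpy as np
from scipy.stats import kendalltau

def copula_discrepancy_tau(sample, tau_target):
    """Hypothetical moment-based sketch: distance between the sample's
    Kendall tau and the target posterior's tau."""
    tau_hat, _ = kendalltau(sample[:, 0], sample[:, 1])
    return abs(tau_hat - tau_target)

rng = np.random.default_rng(0)
rho = 0.6
target_tau = 2.0 / np.pi * np.arcsin(rho)         # Gaussian copula: tau = (2/pi) asin(rho)
cov = np.array([[1.0, rho], [rho, 1.0]])
good = rng.multivariate_normal(np.zeros(2), cov, size=5000)
bad = rng.multivariate_normal(np.zeros(2), np.eye(2), size=5000)  # wrong dependence

print(copula_discrepancy_tau(good, target_tau))   # near 0
print(copula_discrepancy_tau(bad, target_tau))    # clearly larger
```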

[LG-65] From Sublinear to Linear: Fast Convergence in Deep Networks via Locally Polyak-Lojasiewicz Regions

Link: https://arxiv.org/abs/2507.21429
Authors: Agnideep Aich, Ashit Baran Aich, Bruce Wade
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The convergence of gradient descent (GD) on the non-convex loss landscapes of deep neural networks (DNNs) presents a fundamental theoretical challenge. While recent work has established that GD converges to a stationary point at a sublinear rate within locally quasi-convex regions (LQCRs), this fails to explain the exponential convergence rates consistently observed in practice. In this paper, we resolve this discrepancy by proving that under a mild assumption on Neural Tangent Kernel (NTK) stability, these same regions satisfy a local Polyak-Lojasiewicz (PL) condition. We introduce the concept of a Locally Polyak-Lojasiewicz Region (LPLR), where the squared gradient norm lower-bounds the suboptimality gap, prove that properly initialized finite-width networks admit such regions around initialization, and establish that GD achieves linear convergence within an LPLR, providing the first finite-width guarantee that matches empirically observed rates. We validate our theory across diverse settings, from controlled experiments on fully-connected networks to modern ResNet architectures trained with stochastic methods, demonstrating that LPLR structure emerges robustly in practical deep learning scenarios. By rigorously connecting local landscape geometry to fast optimization through the NTK framework, our work provides a definitive theoretical explanation for the remarkable efficiency of gradient-based optimization in deep learning.
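
The step from a local PL inequality to a linear rate is short and standard; for an ℓ-smooth loss and step size 1/ℓ it reads as follows (a textbook derivation, included here only for context, with μ and ℓ the PL and smoothness constants):

```latex
% Local PL inequality: the squared gradient norm lower-bounds the suboptimality gap
\|\nabla L(\theta)\|^{2} \;\ge\; 2\mu\,\bigl(L(\theta) - L^{*}\bigr)

% Descent lemma for an \ell-smooth loss with step size 1/\ell, combined with PL:
L(\theta_{t+1}) \;\le\; L(\theta_{t}) - \tfrac{1}{2\ell}\,\|\nabla L(\theta_{t})\|^{2}
               \;\le\; L(\theta_{t}) - \tfrac{\mu}{\ell}\,\bigl(L(\theta_{t}) - L^{*}\bigr)

% Unrolling gives the linear (geometric) rate observed in practice:
L(\theta_{t}) - L^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{\ell}\Bigr)^{t}\bigl(L(\theta_{0}) - L^{*}\bigr)
```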

[LG-66] Graph neural networks for residential location choice: connection to classical logit models

Link: https://arxiv.org/abs/2507.21334
Authors: Zhanhong Cheng, Lingqian Hu, Yuheng Bu, Yuqi Zhou, Shenhao Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Researchers have adopted deep learning for classical discrete choice analysis as it can capture complex feature relationships and achieve higher predictive performance. However, the existing deep learning approaches cannot explicitly capture the relationship among choice alternatives, which has been a long-lasting focus in classical discrete choice models. To address the gap, this paper introduces Graph Neural Network (GNN) as a novel framework to analyze residential location choice. The GNN-based discrete choice models (GNN-DCMs) offer a structured approach for neural networks to capture dependence among spatial alternatives, while maintaining clear connections to classical random utility theory. Theoretically, we demonstrate that the GNN-DCMs incorporate the nested logit (NL) model and the spatially correlated logit (SCL) model as two specific cases, yielding novel algorithmic interpretation through message passing among alternatives’ utilities. Empirically, the GNN-DCMs outperform benchmark MNL, SCL, and feedforward neural networks in predicting residential location choices among Chicago’s 77 community areas. Regarding model interpretation, the GNN-DCMs can capture individual heterogeneity and exhibit spatially-aware substitution patterns. Overall, these results highlight the potential of GNN-DCMs as a unified and expressive framework for synergizing discrete choice modeling and deep learning in the complex spatial choice contexts.
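
Schematically, a GNN-DCM layer lets each alternative's utility depend on its neighbors' features before a softmax produces logit-style choice probabilities. The sketch below is a one-layer toy version with random weights and adjacency; it illustrates the message-passing-over-utilities interpretation, not the authors' architecture.

```python
import numpy as np

def gnn_dcm_probs(X, A, W_self, W_nbr):
    """One message-passing round over choice alternatives, then a logit choice.
    X: (J, d) alternative features; A: (J, J) spatial adjacency."""
    deg = A.sum(axis=1, keepdims=True)
    A_norm = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    V = np.tanh(X @ W_self + A_norm @ X @ W_nbr)  # utilities with neighbor context
    u = V.sum(axis=1)                             # scalar utility per alternative
    e = np.exp(u - u.max())
    return e / e.sum()                            # choice probabilities (softmax/logit)

rng = np.random.default_rng(0)
J, d = 77, 8                                      # e.g., Chicago's 77 community areas
X = rng.normal(size=(J, d))
A = (rng.random((J, J)) < 0.05).astype(float)     # toy spatial adjacency
probs = gnn_dcm_probs(X, A, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(probs.sum())                                # 1.0
```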

[LG-67] Predicting VBAC Outcomes from U.S. Natality Data using Deep and Classical Machine Learning Models

Link: https://arxiv.org/abs/2507.21330
Authors: Ananya Anand
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures, 1 table

Click to view abstract

Abstract:Accurately predicting the outcome of a trial of labor after cesarean (TOLAC) is essential for guiding prenatal counseling and minimizing delivery-related risks. This study presents supervised machine learning models for predicting vaginal birth after cesarean (VBAC) using 643,029 TOLAC cases from the CDC WONDER Natality dataset (2017-2023). After filtering for singleton births with one or two prior cesareans and complete data across 47 prenatal-period features, three classifiers were trained: logistic regression, XGBoost, and a multilayer perceptron (MLP). The MLP achieved the highest performance with an AUC of 0.7287, followed closely by XGBoost (AUC = 0.727), both surpassing the logistic regression baseline (AUC = 0.709). To address class imbalance, class weighting was applied to the MLP, and a custom loss function was implemented in XGBoost. Evaluation metrics included ROC curves, confusion matrices, and precision-recall analysis. Logistic regression coefficients highlighted maternal BMI, education, parity, comorbidities, and prenatal care indicators as key predictors. Overall, the results demonstrate that routinely collected, early-pregnancy variables can support scalable and moderately high-performing VBAC prediction models. These models offer potential utility in clinical decision support, particularly in settings lacking access to specialized intrapartum data.
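
The class-weighting device mentioned above is standard; here is a scikit-learn sketch on synthetic stand-in data, with generated features replacing the study's 47 real prenatal variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 47 features, ~70% positive class, mimicking the imbalance.
X, y = make_classification(n_samples=20000, n_features=47, weights=[0.3, 0.7],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 'balanced' reweights classes inversely to their frequency during training.
clf = LogisticRegression(max_iter=2000, class_weight="balanced")
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```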

[LG-68] Generative imaging for radio interferometry with fast uncertainty quantification

Link: https://arxiv.org/abs/2507.21270
Authors: Matthijs Mars, Tobías I. Liaudat, Jessica J. Whitney, Marta M. Betcke, Jason D. McEwen
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With the rise of large radio interferometric telescopes, particularly the SKA, there is a growing demand for computationally efficient image reconstruction techniques. Existing reconstruction methods, such as the CLEAN algorithm or proximal optimisation approaches, are iterative in nature, necessitating a large amount of compute. These methods either provide no uncertainty quantification or require large computational overhead to do so. Learned reconstruction methods have shown promise in providing efficient and high quality reconstruction. In this article we explore the use of generative neural networks that enable efficient approximate sampling of the posterior distribution for high quality reconstructions with uncertainty quantification. Our RI-GAN framework builds on the regularised conditional generative adversarial network (rcGAN) framework by integrating a gradient U-Net (GU-Net) architecture, a hybrid reconstruction model that embeds the measurement operator directly into the network. This framework uses Wasserstein GANs to improve training stability in combination with regularisation terms that combat mode collapse, which are typical problems for conditional GANs. This approach takes as input the dirty image and the point spread function (PSF) of the observation; it provides efficient, high-quality image reconstructions that are robust to varying visibility coverages, generalises to images with an increased dynamic range, and provides informative uncertainty quantification. Our methods provide a significant step toward computationally efficient, scalable, and uncertainty-aware imaging for next-generation radio telescopes.

[LG-69] Multiscale geometrical and topological learning in the analysis of soft matter collective dynamics

Link: https://arxiv.org/abs/2507.21265
Authors: Tetiana Orlova, Amaranta Membrillo Solis, Hayley R. O. Sohn, Tristan Madeleine, Giampaolo D’Alessandro, Ivan I. Smalyukh, Malgosia Kaczmarek, Jacek Brodzki
Subjects: Soft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Understanding the behavior and evolution of a dynamical many-body system by analyzing patterns in their experimentally captured images is a promising method relevant for a variety of living and non-living self-assembled systems. The arrays of moving liquid crystal skyrmions studied here are a representative example of hierarchically organized materials that exhibit complex spatiotemporal dynamics driven by multiscale processes. Joint geometric and topological data analysis (TDA) offers a powerful framework for investigating such systems by capturing the underlying structure of the data at multiple scales. In the TDA approach, we introduce the Ψ-function, a robust numerical topological descriptor related to both the spatiotemporal changes in the size and shape of individual topological solitons and the emergence of regions with their different spatial organization. The geometric method based on the analysis of vector fields generated from images of skyrmion ensembles offers insights into the nonlinear physical mechanisms of the system’s response to external stimuli and provides a basis for comparison with theoretical predictions. The methodology presented here is very general and can provide a characterization of system behavior both at the level of individual pattern-forming agents and as a whole, allowing one to relate the results of image data analysis to processes occurring in a physical, chemical, or biological system in the real world.

[LG-70] Benchmarking a Tunable Quantum Neural Network on Trapped-Ion and Superconducting Hardware

Link: https://arxiv.org/abs/2507.21222
Authors: Djamil Lakhdar-Hamina, Xingxin Liu, Richard Barney, Sarah H. Miller, Alaina M. Green, Norbert M. Linke, Victor Galitski
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Click to view abstract

Abstract:We implement a quantum generalization of a neural network on trapped-ion and IBM superconducting quantum computers to classify MNIST images, a common benchmark in computer vision. The network feedforward involves qubit rotations whose angles depend on the results of measurements in the previous layer. The network is trained via simulation, but inference is performed experimentally on quantum hardware. The classical-to-quantum correspondence is controlled by an interpolation parameter, a, which is zero in the classical limit. Increasing a introduces quantum uncertainty into the measurements, which is shown to improve network performance at moderate values of the interpolation parameter. We then focus on particular images that fail to be classified by a classical neural network but are detected correctly in the quantum network. For such borderline cases, we observe strong deviations from the simulated behavior. We attribute this to physical noise, which causes the output to fluctuate between nearby minima of the classification energy landscape. Such strong sensitivity to physical noise is absent for clear images. We further benchmark physical noise by inserting additional single-qubit and two-qubit gate pairs into the neural network circuits. Our work provides a springboard toward more complex quantum neural networks on current devices: while the approach is rooted in standard classical machine learning, scaling up such networks may prove classically non-simulable and could offer a route to near-term quantum advantage.

[LG-71] An empirical comparison of some outlier detection methods with longitudinal data

Link: https://arxiv.org/abs/2507.21203
Authors: Marcello D’Orazio
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:This note investigates the problem of detecting outliers in longitudinal data. It compares well-known methods used in official statistics with proposals from the fields of data mining and machine learning that are based on the distance between observations or binary partitioning trees. This is achieved by applying the methods to panel survey data related to different types of statistical units. Traditional methods are quite simple, enabling the direct identification of potential outliers, but they require specific assumptions. In contrast, recent methods provide only a score whose magnitude is directly related to the likelihood of an outlier being present. All the methods require the user to set a number of tuning parameters. However, the most recent methods are more flexible and sometimes more effective than traditional methods. In addition, these methods can be applied to multidimensional data.
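
To make the contrast concrete, here is a toy panel example comparing a classical robust z-score on period-to-period ratios, which flags outliers directly, with an Isolation Forest, which only returns a score. The data, injected outliers, and thresholds are illustrative assumptions, not the note's survey panels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Toy panel: 500 units observed at two periods; the first 5 units jump abnormally.
t1 = rng.lognormal(mean=3.0, sigma=0.3, size=500)
t2 = t1 * rng.normal(1.02, 0.05, size=500)
t2[:5] *= 4.0                                     # injected outliers

# Classical: robust z-score of the period-to-period ratio (median/MAD based).
ratio = t2 / t1
mad = np.median(np.abs(ratio - np.median(ratio)))
robust_z = 0.6745 * (ratio - np.median(ratio)) / mad
print("flagged by robust z:", np.where(np.abs(robust_z) > 3.5)[0][:10])

# Machine learning: Isolation Forest returns a score rather than a direct flag.
pairs = np.column_stack([t1, t2])
iso = IsolationForest(random_state=0).fit(pairs)
scores = iso.score_samples(pairs)                 # lower = more anomalous
print("lowest (most anomalous) scores:", np.argsort(scores)[:5])
```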

Information Retrieval

[IR-0] Not Here Go There: Analyzing Redirection Patterns on the Web

Link: https://arxiv.org/abs/2507.22019
Authors: Kritika Garg, Sawood Alam, Dietrich Ayala, Michele C. Weigle, Michael L. Nelson
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
Comments: Extended version of the paper accepted at the 2025 ACM Web Science Conference (WebSci 2025)

Click to view abstract

Abstract:URI redirections are integral to web management, supporting structural changes, SEO optimization, and security. However, their complexities affect usability, SEO performance, and digital preservation. This study analyzed 11 million unique redirecting URIs, following redirections up to 10 hops per URI, to uncover patterns and implications of redirection practices. Our findings revealed that 50% of the URIs terminated successfully, while 50% resulted in errors, including 0.06% exceeding 10 hops. Canonical redirects, such as HTTP to HTTPS transitions, were prevalent, reflecting adherence to SEO best practices. Non-canonical redirects, often involving domain or path changes, highlighted significant web migrations, rebranding, and security risks. Notable patterns included “sink” URIs, where multiple redirects converged, ranging from traffic consolidation by global websites to deliberate “Rickrolling.” The study also identified 62,000 custom 404 URIs, almost half being soft 404s, which could compromise SEO and user experience. These findings underscore the critical role of URI redirects in shaping the web while exposing challenges such as outdated URIs, server instability, and improper error handling. This research offers a detailed analysis of URI redirection practices, providing insights into their prevalence, types, and outcomes. By examining a large dataset, we highlight inefficiencies in redirection chains and examine patterns such as the use of “sink” URIs and custom error pages. This information can help webmasters, researchers, and digital archivists improve web usability, optimize resource allocation, and safeguard valuable online content.
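
Following a redirection chain hop by hop, as in the methodology above, is straightforward once automatic redirects are disabled. Below is a sketch with the `requests` library; the study's actual crawler handled many more edge cases (retries, relative Location headers, loops).

```python
import requests

def follow_redirects(uri, max_hops=10):
    """Follow a redirection chain manually, up to max_hops, recording each hop."""
    chain = [uri]
    for _ in range(max_hops):
        resp = requests.get(uri, allow_redirects=False, timeout=10)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return chain, resp.status_code        # terminal status (200, 404, ...)
        loc = resp.headers.get("Location")
        if not loc:                               # malformed redirect: stop here
            return chain, resp.status_code
        uri = requests.compat.urljoin(uri, loc)   # resolve relative Location values
        chain.append(uri)
    return chain, "exceeded max hops"             # the 0.06% case in the study

chain, status = follow_redirects("http://example.com")
print(len(chain) - 1, "hop(s), terminal status:", status)
```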

[IR-1] Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

Link: https://arxiv.org/abs/2507.21989
Authors: Patrick Iff, Paul Bruegger, Marcin Chrapek, Maciej Besta, Torsten Hoefler
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, vehicle/person reidentification, and face recognition. Many applications in these domains require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item’s attributes, a problem known as Filtered Approximate Nearest Neighbor Search (FANNS). In this work, we present a comprehensive survey and taxonomy of FANNS methods and analyze how they are benchmarked in the literature. By doing so, we identify a key challenge in the current FANNS landscape: the lack of diverse and realistic datasets, particularly ones derived from the latest transformer-based text embedding models. To address this, we introduce a novel dataset consisting of embedding vectors for the abstracts of over 2.7 million research articles from the arXiv repository, accompanied by 11 real-world attributes such as authors and categories. We benchmark a wide range of FANNS methods on our novel dataset and find that each method has distinct strengths and limitations; no single approach performs best across all scenarios. ACORN, for example, supports various filter types and performs reliably across dataset scales but is often outperformed by more specialized methods. SeRF shows excellent performance for range filtering on ordered attributes but cannot handle categorical attributes. Filtered-DiskANN and UNG excel on the medium-scale dataset but fail on the large-scale dataset, highlighting the challenge posed by transformer-based embeddings, which are often more than an order of magnitude larger than earlier embeddings. We conclude that no universally best method exists.
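
The simplest FANNS baseline makes the problem statement concrete: apply the attribute predicate first, then run exact nearest-neighbor search over the survivors. Systems such as ACORN or Filtered-DiskANN integrate the filter into the index instead; the numpy toy below is only this pre-filtering baseline, with random stand-ins for the paper's abstract embeddings.

```python
import numpy as np

def filtered_knn(query, vectors, attrs, predicate, k=5):
    """Pre-filtering baseline: filter on the attribute predicate,
    then exact nearest-neighbor search over the surviving vectors."""
    mask = np.array([predicate(a) for a in attrs])
    idx = np.where(mask)[0]
    d = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10000, 64))            # stand-in for abstract embeddings
attrs = rng.choice(["cs.IR", "cs.DB", "cs.LG"], size=10000)  # a category attribute
hits = filtered_knn(rng.normal(size=64), vectors, attrs, lambda a: a == "cs.IR")
print(hits)                                       # indices of the k nearest matches
```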

[IR-2] he Curious Case of High-Dimensional Indexing as a File Structure: A Case Study of eCP-FS

Link: https://arxiv.org/abs/2507.21939
Authors: Omar Shahbaz Khan, Gylfi Þór Guðmundsson, Björn Þór Jónsson
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Modern analytical pipelines routinely deploy multiple deep learning and retrieval models that rely on approximate nearest-neighbor (ANN) indexes to support efficient similarity-based search. While many state-of-the-art ANN-indexes are memory-based (e.g., HNSW and IVF), using multiple ANN indexes creates a competition for limited GPU/CPU memory resources, which in turn necessitates disk-based index structures (e.g., DiskANN or eCP). In typical index implementations, the main component is a complex data structure that is serialized to disk and is read either fully at startup time, for memory-based indexes, or incrementally at query time, for disk-based indexes. To visualize the index structure, or analyze its quality, complex coding is needed that is either embedded in the index implementation or replicates the code that reads the data structure. In this paper, we consider an alternative approach that maps the data structure to a file structure, using a file library, making the index easily readable for any programming language and even human-readable. The disadvantage is that the serialized index is verbose, leading to overhead of searching through the index. The question addressed in this paper is how severe this performance penalty is. To that end, this paper presents eCP-FS, a file-based implementation of eCP, a well-known disk-based ANN index. A comparison with state-of-the-art indexes shows that while eCP-FS is slower, the implementation is nevertheless somewhat competitive even when memory is not constrained. In a memory-constrained scenario, eCP-FS offers a minimal memory footprint, making it ideal for resource-constrained or multi-index environments.

[IR-3] Exploration on Demand: From Algorithmic Control to User Empowerment

Link: https://arxiv.org/abs/2507.21884
Authors: Edoardo Bianchi
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Recommender systems often struggle with over-specialization, which severely limits users’ exposure to diverse content and creates filter bubbles that reduce serendipitous discovery. To address this fundamental limitation, this paper introduces an adaptive clustering framework with user-controlled exploration that effectively balances personalization and diversity in movie recommendations. Our approach leverages sentence-transformer embeddings to group items into semantically coherent clusters through an online algorithm with dynamic thresholding, thereby creating a structured representation of the content space. Building upon this clustering foundation, we propose a novel exploration mechanism that empowers users to control recommendation diversity by strategically sampling from less-engaged clusters, thus expanding their content horizons while preserving relevance. Experiments on the MovieLens dataset demonstrate the system’s effectiveness, showing that exploration significantly reduces intra-list similarity from 0.34 to 0.26 while simultaneously increasing unexpectedness to 0.73. Furthermore, our Large Language Model-based A/B testing methodology, conducted with 300 simulated users, reveals that 72.7% of long-term users prefer exploratory recommendations over purely exploitative ones, providing strong evidence for the system’s ability to promote meaningful content discovery without sacrificing user satisfaction.
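
The two mechanisms described, online clustering under a distance threshold and exploration by sampling from less-engaged clusters, can be sketched compactly. The fixed threshold, centroid update rule, and random embeddings below are simplifying assumptions of mine, standing in for the paper's dynamic thresholding and sentence-transformer vectors.

```python
import numpy as np

def online_cluster(embeddings, threshold=0.8):
    """Assign each item to the nearest centroid if within `threshold`
    (cosine distance); otherwise open a new cluster."""
    centroids, assign = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        if centroids:
            sims = np.array([c @ e for c in centroids])
            j = int(sims.argmax())
            if 1.0 - sims[j] < threshold:
                assign.append(j)
                centroids[j] = centroids[j] + 0.1 * (e - centroids[j])
                centroids[j] /= np.linalg.norm(centroids[j])
                continue
        centroids.append(e)
        assign.append(len(centroids) - 1)
    return np.array(assign), centroids

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384))                # stand-in for sentence-transformers
assign, cents = online_cluster(emb)

# Exploration: sample recommendations preferentially from less-engaged clusters.
engagement = rng.random(len(cents))               # toy per-cluster engagement scores
p = 1.0 - engagement
p /= p.sum()
explore_cluster = rng.choice(len(cents), p=p)
print("explore items from cluster", explore_cluster)
```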

[IR-4] Solution for Meta KDD Cup25: A Comprehensive Three-Step Framework for Vision Question Answering

Link: https://arxiv.org/abs/2507.21520
Authors: Zijian Zhang, Xiaocheng Zhang, Yang Zhou, Zhimin Lin, Peng Yan
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the BlackPearl team’s solutions to all three tasks of Meta KDD Cup’25. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solutions achieve automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and win second place in Task 3 after human evaluation.

[IR-5] Conversations over Clicks: Impact of Chatbots on Information Search in Interdisciplinary Learning

Link: https://arxiv.org/abs/2507.21490
Authors: Hannah Kim, Sergei L. Kosakovsky Pond, Stephen MacNeil
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 9 pages, 2 tables, 3 figures, 2025 ASEE/IEEE Frontiers in Education (FIE) Conference preprint

Click to view abstract

Abstract:This full research paper investigates the impact of generative AI (GenAI) on the learner experience, with a focus on how learners engage with and utilize the information it provides. In e-learning environments, learners often need to navigate a complex information space on their own. This challenge is further compounded in interdisciplinary fields like bioinformatics, due to the varied prior knowledge and backgrounds. In this paper, we studied how GenAI influences information search in bioinformatics research: (1) How do interactions with a GenAI chatbot influence learner orienteering behaviors?; and (2) How do learners identify information scent in GenAI chatbot responses? We adopted an autoethnographic approach to investigate these questions. GenAI was found to support orienteering once a learning plan was established, but it was counterproductive prior to that. Moreover, traditionally value-rich information sources such as bullet points and related terms proved less effective when applied to GenAI responses. Information scents were primarily recognized through the presence or absence of prior knowledge of the domain. These findings suggest that GenAI should be adopted into e-learning environments with caution, particularly in interdisciplinary learning contexts.

[IR-6] Efficient Data Retrieval and Comparative Bias Analysis of Recommendation Algorithms for YouTube Shorts and Long-Form Videos

Link: https://arxiv.org/abs/2507.21467
Authors: Selimhan Dagtas, Mert Can Cakmak, Nitin Agarwal
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:The growing popularity of short-form video content, such as YouTube Shorts, has transformed user engagement on digital platforms, raising critical questions about the role of recommendation algorithms in shaping user experiences. These algorithms significantly influence content consumption, yet concerns about biases, echo chambers, and content diversity persist. This study develops an efficient data collection framework to analyze YouTube’s recommendation algorithms for both short-form and long-form videos, employing parallel computing and advanced scraping techniques to overcome limitations of YouTube’s API. The analysis uncovers distinct behavioral patterns in recommendation algorithms across the two formats, with short-form videos showing a more immediate shift toward engaging yet less diverse content compared to long-form videos. Furthermore, a novel investigation into biases in politically sensitive topics, such as the South China Sea dispute, highlights the role of these algorithms in shaping narratives and amplifying specific viewpoints. By providing actionable insights for designing equitable and transparent recommendation systems, this research underscores the importance of responsible AI practices in the evolving digital media landscape.

Attachment Download

Click to download today's full paper list