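This post lists the latest papers retrieved from Arxiv.org on 2025-08-01. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.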

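Overview (2025-08-01)

447 papers were updated today, including:

  • Natural language processing: 87 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 126 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 125 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 114 papers (Machine Learning (cs.LG))

Natural Language Processing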

[NLP-0] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

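[Quick read]: This paper addresses the indirectness of evaluating large language models (LLMs) with question-answering (QA) benchmarks: the standard QA paradigm does not directly measure a model's underlying problem-solving ability and may overestimate performance differences between models. The key to the solution is a holistic evaluation framework based on cascaded question disclosure, which collects responses in stages, each stage revealing partial information about the question to elicit more generalized reasoning, so that problem-solving ability is measured more faithfully while the evaluation stays automatic and scalable. Empirically, the approach gives a better comparison between models, induces better intermediate reasoning traces, and narrows the performance gap observed under standard QA.

Link: https://arxiv.org/abs/2507.23776
Authors: Yunxiang Yan, Tomohiro Sawada, Kartik Goyal
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Under review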

Abstract:While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.
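The stagewise collection described in this entry can be pictured with a minimal sketch. It assumes the question has already been split into pieces and that the model is any callable; `cascaded_disclosure` and `echo_model` are hypothetical names introduced here, not the paper's code, and the paper's actual stage design is not specified in the abstract.

```python
from typing import Callable, List


def cascaded_disclosure(stages: List[str],
                        model: Callable[[str, List[str]], str]) -> List[str]:
    """Query the model once per stage, revealing one more piece each time."""
    disclosed: List[str] = []   # question pieces revealed so far
    responses: List[str] = []   # model output collected at each stage
    for piece in stages:
        disclosed.append(piece)
        prompt = "\n".join(disclosed)
        # The model sees everything disclosed so far plus its earlier responses.
        responses.append(model(prompt, list(responses)))
    return responses


def echo_model(prompt: str, history: List[str]) -> str:
    """Stand-in for an LLM call (assumption): reports how much it was shown."""
    return f"stage {len(history) + 1}: {len(prompt.splitlines())} piece(s) visible"
```

In a real evaluation the responses from every stage, not only the last, would be scored; how they are scored is left open here.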

[NLP-1] SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

[Quick read]: This paper addresses the poor scalability, weak generality, and autoregressive-reasoning limits of the one-task-one-agent approach that dominates current AI agents built on large language models (LLMs). The core of the solution is SimuRA, a goal-oriented architecture for generalized agentic reasoning that introduces a world model for planning via simulation, overcoming the limits of autoregressive reasoning. The world model is implemented with an LLM and plans flexibly across environments using the concept-rich latent space of natural language. On difficult web browsing tasks, success on flight search rises from 0% to 32.2%, and world-model-based planning shows an advantage of up to 124% over autoregressive planning, supporting simulation as a reasoning paradigm.

Link: https://arxiv.org/abs/2507.23773
Authors: Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig, Zhiting Hu, Eric Xing
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University; Samsung Research; Halıcıoğlu Data Science Institute, UC San Diego
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: none

Abstract:AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make a web-browsing agent built on SimuRA with pretrained LLMs available as a research demo for public testing.
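Planning via simulation can be illustrated with a toy sketch: each candidate action is run through a world model, the imagined outcome is scored against the goal, and only the best action is executed. Everything below is an assumption for illustration. The integer environment, `ACTIONS`, `world_model`, and `score` are hypothetical, and in the paper the world model is an LLM operating on natural language rather than an exact function.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy environment: the state is an integer, the goal is to reach GOAL.
GOAL = 10
ACTIONS: Dict[str, Callable[[int], int]] = {
    "inc": lambda s: s + 1,
    "double": lambda s: s * 2,
    "dec": lambda s: s - 1,
}


def world_model(state: int, action: str) -> int:
    """Predict the next state without acting (exact here, learned in practice)."""
    return ACTIONS[action](state)


def score(state: int) -> int:
    """Higher is better: negative distance to the goal."""
    return -abs(GOAL - state)


def plan_by_simulation(state: int) -> str:
    """Simulate every action and return the one with the best imagined outcome."""
    return max(ACTIONS, key=lambda action: score(world_model(state, action)))


def run_agent(start: int, max_steps: int = 10) -> Tuple[int, List[str]]:
    """Alternate between planning in imagination and executing the chosen action."""
    state, trace = start, []
    while state != GOAL and len(trace) < max_steps:
        action = plan_by_simulation(state)
        state = world_model(state, action)  # the toy environment equals the model
        trace.append(action)
    return state, trace
```

This sketch looks only one step ahead; deeper lookahead would simulate sequences of actions before committing to the first one.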

[NLP-2] CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

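[Quick read]: This paper addresses the scarcity and uneven quality of training data for large language models (LLMs), particularly for tasks that need complex reasoning. The key to the solution is CoT-Self-Instruct, a synthetic data generation method: the model first reasons and plans via Chain-of-Thought (CoT) over the given seed tasks, then generates a new synthetic prompt of similar quality and complexity, and the generated data are filtered with automatic metrics. On verifiable reasoning (MATH500, AMC23, AIME24, and GPQA-Diamond) the synthetic data significantly outperform existing training datasets; on non-verifiable instruction following (AlpacaEval 2.0 and Arena-Hard) the method surpasses human-written or standard Self-Instruct prompts.

Link: https://arxiv.org/abs/2507.23751
Authors: Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
Affiliations: FAIR at Meta; NYU
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none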

Abstract:We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

[NLP-3] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

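[Quick read]: This paper addresses the difficulty humans have in understanding logical rules in knowledge graphs (KGs), which stems from the complexity of the rules and the labeling conventions unique to each KG. The key to the solution is to use large language models (LLMs) to generate natural language explanations of the rules. Rules are extracted with the AMIE 3.5.1 algorithm from FB15k-237 and two large-scale datasets, and several prompting strategies are examined: zero-shot and few-shot prompting, including variable entity types, and Chain-of-Thought reasoning. The generated explanations are evaluated by humans for correctness, clarity, and hallucination, and the use of LLMs as automatic judges is also assessed.

Link: https://arxiv.org/abs/2507.23740
Authors: Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none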

Abstract:Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, including variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at this https URL.

[NLP-4] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

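[Quick read]: This paper addresses the bottleneck large language models (LLMs) face in theorem proving, caused by the lack of clear supervision signals when reasoning only in natural language, which is especially visible on International Mathematical Olympiad (IMO) level problems. Formal languages such as Lean provide clear supervision through formal verification, but combining them with complex mathematical reasoning remains challenging. The key to the solution is Seed-Prover, a lemma-style whole-proof reasoning model that iteratively refines its proof using Lean feedback, proved lemmas, and self-summarization, together with three test-time inference strategies that enable both deep and broad reasoning. To compensate for Lean's lack of geometry support, the authors also introduce the geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. The system surpasses the previous state of the art on several benchmarks and fully proves 5 of the 6 problems at IMO 2025.

Link: https://arxiv.org/abs/2507.23726
Authors: Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none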

Abstract:LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
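To make "lemma-style" concrete, here is a deliberately tiny Lean 4 sketch (core library only, no Mathlib assumed): a helper statement is proved and checked on its own, then reused to close the main goal. The theorem names are hypothetical and the statements are trivial; they only illustrate the shape of a proof split into separately verified lemmas, not anything Seed-Prover produced.

```lean
-- Helper lemma, verified independently so it can be reused later.
theorem zero_add_helper (n : Nat) : 0 + n = n := by
  simp

-- Main statement, closed by rewriting with the already-proved helper.
theorem main_goal (a b : Nat) : 0 + a + b = a + b := by
  rw [zero_add_helper]
```

Because each lemma is checked by Lean separately, a failed main proof does not invalidate the helpers that already passed.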

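[NLP-5] TextQuests: How Good are LLMs at Text-Based Video Games?

[Quick read]: This paper addresses a gap in current AI agent benchmarks: they assess tool use or performance on structured tasks but do not fully capture an agent's ability to reason autonomously over a long and growing context in exploratory, interactive environments. The key to the solution is TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text adventures can take human players over 30 hours and hundreds of precise actions to solve, which makes them a suitable setting for evaluating self-contained problem solving by LLM agents. External tools are precluded, so the model must rely on intrinsic reasoning and state tracking, with trial-and-error learning and sustained problem solving within a single interactive session.

Link: https://arxiv.org/abs/2507.23701
Authors: Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
Affiliations: Center for AI Safety; Carnegie Mellon University; Gray Swan AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none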

Abstract:Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at this https URL.

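[NLP-6] TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

[Quick read]: This paper addresses the high latency and cost of serving large language models (LLMs) under heavy query volume. Response caching is a natural optimization, but keeping cached content relevant to user queries is difficult because chatbot interactions are personalized and semantic similarity search has limited accuracy. The key to the solution is TweakLLM, a routing architecture that uses a lightweight LLM to dynamically adapt cached responses to incoming prompts, improving cache effectiveness while keeping response quality comparable to frontier models.

Link: https://arxiv.org/abs/2507.23674
Authors: Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi
Affiliations: Lahore University of Management Sciences
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 13 pages, 9 figures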

Abstract:Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
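The routing idea can be sketched as follows: look up the most similar cached prompt, and if it is close enough, let a small model adapt the cached response; otherwise fall back to the large model and cache its answer. This is a rough sketch under stated assumptions, not the paper's implementation. The bag-of-words cosine similarity stands in for a real semantic search, the `threshold` of 0.6 is arbitrary, and `route`, `large_model`, and `small_model` are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, Dict, Tuple


def similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumption: a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def route(prompt: str,
          cache: Dict[str, str],
          large_model: Callable[[str], str],
          small_model: Callable[[str, str], str],
          threshold: float = 0.6) -> Tuple[str, str]:
    """Return (response, path) where path names the branch that produced it."""
    best = max(cache, key=lambda cached: similarity(prompt, cached), default=None)
    if best is not None and similarity(prompt, best) >= threshold:
        # Cache hit: the lightweight model tailors the cached response.
        return small_model(prompt, cache[best]), "tweaked-cache"
    # Cache miss: pay for the large model once and remember the answer.
    response = large_model(prompt)
    cache[prompt] = response
    return response, "large-model"
```

The threshold controls the trade-off: a lower value serves more requests from the cache but asks the small model to bridge larger gaps between the cached and the incoming prompt.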

[NLP-7] Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

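[Quick read]: This paper addresses the identification and cleaning of hate speech in Arabic social media text. For identification, deep learning models and Transformer architectures are compared to find the best F1 score. For cleaning, the task is framed as machine translation: the input is a sentence containing hate speech and the output is the same sentence with each offensive word replaced by stars. The best detection model reaches a 92% macro F1 score and 95% accuracy, and the best masking model reaches a BLEU score of 0.3 with 1-grams, which the authors consider a good result compared with state-of-the-art machine translation systems.

Link: https://arxiv.org/abs/2507.23661
Authors: Salam Thabet Doghmash, Motaz Saad
Affiliations: Islamic University of Gaza
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 5 figures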

Abstract:Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. The meaning of cleaning here is replacing each bad word with stars based on the number of letters for each word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The presented methods achieve the best model in hate speech detection with a 92% Macro F1 score and 95% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with the state of the art machine translation systems.
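The target format of the cleaning task, each bad word replaced by as many stars as it has letters, can be shown with a short rule-based sketch. Note the swap: the paper learns this mapping as a translation model, whereas the function below uses a fixed word list, so it only illustrates what a source/target pair looks like. `mask_offensive_words` and the placeholder lexicon are hypothetical.

```python
from typing import Set


def mask_offensive_words(text: str, lexicon: Set[str]) -> str:
    """Replace every whitespace-separated token found in `lexicon` with stars."""
    masked = []
    for token in text.split():
        if token.lower() in lexicon:
            masked.append("*" * len(token))  # one star per character
        else:
            masked.append(token)
    return " ".join(masked)
```

For example, with the placeholder lexicon `{"fool"}`, the input "you are a fool" becomes "you are a ****". Tokens with attached punctuation are not matched by this simple split, which is one reason a learned model may be preferable.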

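[NLP-8] Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

[Quick read]: This paper addresses inaccurate prediction of patient enrollment in clinical trials, a key factor in trial success. The core of the solution is a deep learning method that uses pre-trained language models (PLMs) to extract semantic representations of clinical documents and combines them with encoded tabular features through an attention mechanism. A probabilistic layer based on the Gamma distribution quantifies predictive uncertainty, which enables range estimates of the number of patients enrolled at each site of a multi-site trial.

Link: https://arxiv.org/abs/2507.23607
Authors: Tien Huu Do, Antoine Masquelier, Nae Eoun Lee, Jonathan Crowther
Affiliations: Pfizer; Merck
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none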

Abstract:Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
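The Poisson-Gamma assumption mentioned in the abstract has simple closed-form moments, which the sketch below computes. It assumes a site's enrollment rate is Gamma-distributed with the given shape and rate parameters and that, given the rate, the count over a period is Poisson; the parameterization and the function name `poisson_gamma_moments` are choices made here, not taken from the paper.

```python
from typing import Tuple


def poisson_gamma_moments(shape: float, rate: float, duration: float) -> Tuple[float, float]:
    """Mean and variance of a count N with N | lam ~ Poisson(lam * duration)
    and lam ~ Gamma(shape, rate), using the laws of total expectation/variance."""
    mean_lam = shape / rate          # E[lam]
    var_lam = shape / rate ** 2      # Var[lam]
    mean = duration * mean_lam
    # Var[N] = E[Var[N | lam]] + Var[E[N | lam]]
    variance = duration * mean_lam + duration ** 2 * var_lam
    return mean, variance
```

The variance always exceeds the mean, so a range estimate built on this model is wider than one from a plain Poisson model with the same mean.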

[NLP-9] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

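[Quick read]: This paper asks how the differential attention mechanism, which cancels out attention noise, can be brought to parameter-efficient fine-tuning without a significant increase in parameters. The key to the solution is DiffLoRA, which places low-rank adapters (Low-Rank Adapter, LoRA) on both the positive and the negative attention terms, keeping the efficiency of LoRA while aiming for the performance gains of differential attention. Experiments show that DiffLoRA falls short of other parameter-efficient fine-tuning methods on most tasks but gains +11 points over LoRA on HumanEval, suggesting potential in certain domains.

Link: https://arxiv.org/abs/2507.23588
Authors: Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Affiliations: EPFL; NAVER LABS Europe
Categories: Computation and Language (cs.CL)
Comments: none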

Abstract:Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
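For orientation, the two ingredients can be written side by side. The first line is the differential attention of the Differential Transformer, where a second softmax map scaled by a scalar λ is subtracted from the first; the second line is the standard LoRA update of a frozen weight by a rank-r product. How exactly DiffLoRA attaches the adapters to the positive (i = 1) and negative (i = 2) terms is not stated in the abstract, so the third line is an assumption made here for illustration.

```latex
\begin{align}
\mathrm{DiffAttn}(X) &= \Bigl(\operatorname{softmax}\!\Bigl(\tfrac{Q_1 K_1^{\top}}{\sqrt{d}}\Bigr)
  - \lambda\,\operatorname{softmax}\!\Bigl(\tfrac{Q_2 K_2^{\top}}{\sqrt{d}}\Bigr)\Bigr) V \\
W' &= W + BA, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}},\; r \ll \min(d_{\text{in}}, d_{\text{out}}) \\
Q_i &= X\,\bigl(W^{Q} + B^{Q}_i A^{Q}_i\bigr), \qquad
K_i = X\,\bigl(W^{K} + B^{K}_i A^{K}_i\bigr), \qquad i \in \{1, 2\} \quad \text{(assumed)}
\end{align}
```

Only the low-rank factors would be trained, which is what keeps the parameter count close to that of plain LoRA.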

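[NLP-10] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

[Quick read]: This paper addresses the failure of existing zero-shot detectors of machine-generated text under adversarial perturbations, which stems from their implicit assumption of Gaussian distributions. The key to the solution is to redesign the statistical core of curvature-based detectors: standard Gaussian normalization is replaced with a heavy-tailed discrepancy score derived from the Student's t-distribution, which better models the significant leptokurtosis of adversarial text and is more robust to statistical outliers.

Link: https://arxiv.org/abs/2507.23577
Authors: Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng, Yue Zhang
Affiliations: Westlake University
Categories: Computation and Language (cs.CL)
Comments: none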

Abstract:The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at this https URL.
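Leptokurtosis means positive excess kurtosis, that is, heavier tails than a Gaussian (whose excess kurtosis is 0). The sketch below measures it on a sample and gives the closed-form value for a Student's t-distribution, which is why a t-distribution is a natural fit for heavy-tailed scores. It does not reproduce the T-Detect score itself, whose exact formula is not given in the abstract; the function names are introduced here for illustration.

```python
from typing import Sequence


def excess_kurtosis(values: Sequence[float]) -> float:
    """Population excess kurtosis: m4 / m2^2 - 3 (0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def t_excess_kurtosis(nu: float) -> float:
    """Excess kurtosis of a Student's t-distribution, defined for nu > 4."""
    if nu <= 4:
        raise ValueError("excess kurtosis is finite only for nu > 4")
    return 6.0 / (nu - 4.0)
```

A sample with a few extreme outliers yields a large positive value, while the t-distribution's value shrinks toward the Gaussian's 0 as the degrees of freedom grow.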

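[NLP-11] Med-R3: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

[Quick read]: This paper addresses three problems of retrieval-augmented reasoning in medical scenarios: retrieval and reasoning are optimized in isolation, supervised fine-tuning limits generalization, and reward designs do not capture the specific demands of the medical domain. The key to the solution is Med-R³, a medical retrieval-augmented reasoning framework driven by progressive reinforcement learning. It first develops the model's logical reasoning over medical problems, then adaptively optimizes retrieval to match the characteristics of the knowledge corpus and the use of external information during reasoning, and finally jointly optimizes the coordination of retrieval and reasoning.

Link: https://arxiv.org/abs/2507.23541
Authors: Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: none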

Abstract:In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R³, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R³ could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R³ surpassing closed-sourced GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R³ shows a more substantial gain of 13.53%.

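[NLP-12] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

[Quick read]: This paper addresses the difficulty of evaluating the safety and validating the effectiveness of large language models (LLMs) in clinical decision support. The key to the solution is the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 criteria, covering areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended QA items spanning 26 clinical departments to simulate real-world scenarios. The benchmark provides a standardized way to evaluate safety and effectiveness, supporting comparative analysis, identification of risk exposure, and directions for improvement.

Link: https://arxiv.org/abs/2507.23486
Authors: Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: none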

Abstract:Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended QA items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

[NLP-13] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

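[Quick read]: This paper addresses how to enforce role-based access control when large language models (LLMs) are deployed in enterprise settings. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, ignoring role-specific access constraints. The key to the solution is to explore three modeling strategies, a BERT-based classifier, an LLM-based classifier, and role-conditioned generation, so that responses reflect the access privileges of a given role. Two datasets are constructed: one adapted from existing instruction-tuning corpora through clustering and role labeling, the other synthetically generated to reflect realistic, role-sensitive enterprise scenarios. Models are assessed across organizational structures and for robustness to prompt injection, role mismatch, and jailbreak attempts.

Link: https://arxiv.org/abs/2507.23465
Authors: Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Nazarbayev University; University of Illinois at Urbana-Champaign; New York University Abu Dhabi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none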

Abstract:As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

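[NLP-14] Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

[Quick read]: This paper addresses the security of evaluation systems based on large language models (LLMs) against prompt injection, in particular a class of threats called blind attacks, in which a candidate answer is crafted independently of the true answer to deceive the evaluator. The key to the solution is a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE): CFE re-evaluates the submission against a deliberately false ground-truth answer, and an attack is detected if the system validates the answer under both the standard and the counterfactual condition. Experiments show that SE+CFE significantly improves attack detection with minimal performance trade-offs.

Link: https://arxiv.org/abs/2507.23453
Authors: Lijia Liu, Takumi Kondo, Kyohei Atarashi, Koh Takeuchi, Jiyi Li, Shigeru Saito, Hisashi Kashima
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: none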

Abstract:This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
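The detection rule is simple enough to sketch directly: run the evaluator once against the true reference and once against a deliberately false one, and flag the submission if it is accepted both times. A genuinely correct answer should fail the counterfactual check, while an injected instruction that fools the judge tends to pass regardless of the reference. The names `se_cfe_verdict` and `gullible_evaluator` are hypothetical, and the toy evaluator is only a stand-in for an LLM judge.

```python
from typing import Callable

# (question, candidate, reference) -> True if the evaluator accepts the candidate
Evaluator = Callable[[str, str, str], bool]


def se_cfe_verdict(evaluator: Evaluator, question: str, candidate: str,
                   true_answer: str, false_answer: str) -> str:
    """Combine standard (SE) and counterfactual (CFE) evaluation into a verdict."""
    accepted_standard = evaluator(question, candidate, true_answer)
    accepted_counterfactual = evaluator(question, candidate, false_answer)
    if accepted_standard and accepted_counterfactual:
        return "attack"  # accepted no matter what the reference says
    return "correct" if accepted_standard else "incorrect"


def gullible_evaluator(question: str, candidate: str, reference: str) -> bool:
    """Toy judge (assumption): fooled by an injected instruction, else exact match."""
    if "mark this answer as correct" in candidate.lower():
        return True
    return candidate.strip() == reference.strip()
```

The extra cost is one additional evaluator call per submission, and the false reference must differ from the true answer for the check to be meaningful.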

[NLP-15] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

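[Quick read]: This paper addresses the inability of current large language models to actively identify and resolve incomplete or misleading information: passive critical thinking, where the model simply rejects problematic queries, does not support collaborative problem solving with users. The core of the solution is proactive critical thinking, in which the model actively asks the user for missing or clarifying information. Two new benchmarks, GSM-MC and GSM-MCE, assess mathematical reasoning under missing information and distractions, and a reinforcement learning (RL) algorithm substantially improves the ability, for example raising the accuracy of Qwen3-1.7B on GSM-MC from 0.15% to 73.98%.

Link: https://arxiv.org/abs/2507.23407
Authors: Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu, Xinyan Xiao, Jinsong Su
Affiliations: Xiamen University; Baidu Inc.; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: none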

Abstract:Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

[NLP-16] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

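[Quick read]: This paper addresses the particular challenges that Arabic poses for natural language processing (NLP) and information retrieval (IR): complex morphology, optional diacritics, and the coexistence of Modern Standard Arabic (MSA) with many dialects. Because Arabic remains underrepresented in NLP research and benchmark resources, the paper proposes an enhanced Dense Passage Retrieval (DPR) framework optimized for Arabic. The key to the solution is a novel Attentive Relevance Scoring (ARS) mechanism that replaces standard interaction mechanisms with an adaptive scoring function, modeling the semantic relevance between questions and passages more effectively. Combined with pre-trained Arabic language models and architectural refinements, it significantly improves ranking accuracy when answering Arabic questions.

Link: https://arxiv.org/abs/2507.23404
Authors: Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: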

Abstract:Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available on GitHub at this https URL.

[NLP-17] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

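[Quick read]: This paper addresses the drop in summary quality in multi-document summarization caused by complex relationships among documents and information redundancy. Existing methods usually rely on a single-relational graph and require a predefined number of clusters, which makes it hard to fully model semantic and discourse relations and to reduce redundancy adaptively. The key to the solution is the MRGSEM-Sum framework: it builds a multi-relational graph that integrates semantic and discourse relations, uses a two-dimensional structural entropy minimization algorithm to determine the optimal number of clusters automatically, and applies a position-aware compression mechanism to distill each cluster, producing unsupervised summaries with high consistency and coverage.

Link: https://arxiv.org/abs/2507.23400
Authors: Yongbing Zhang, Fang Nan, Shengxiang Gao, Yuxin Huang, Kaiwen Tan, Zhengtao Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: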

Abstract:The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.
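To make the clustering objective concrete, here is a minimal sketch of two-dimensional structural entropy for a plain undirected, unweighted graph, using the standard definition from structural information theory. It is not the authors' implementation: how MRGSEM-Sum weights its multi-relational sentence graph and how it searches for the minimizing partition are not specified in the abstract, and the function name and toy graph are illustrative assumptions.

```python
import math
from collections import defaultdict


def two_dim_structural_entropy(edges, partition):
    """Two-dimensional structural entropy of an undirected graph.

    edges: list of (u, v) pairs (unweighted, no self-loops).
    partition: list of disjoint node sets that together cover every node.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total_volume = 2 * len(edges)  # sum of all node degrees

    module_of = {n: idx for idx, module in enumerate(partition) for n in module}

    entropy = 0.0
    for idx, module in enumerate(partition):
        volume = sum(degree[n] for n in module)
        if volume == 0:
            continue
        # Cut size: edges with exactly one endpoint inside this module.
        cut = sum(
            1 for u, v in edges
            if (module_of[u] == idx) != (module_of[v] == idx)
        )
        # Uncertainty of locating a node inside its module.
        for n in module:
            if degree[n] > 0:
                entropy -= (degree[n] / total_volume) * math.log2(degree[n] / volume)
        # Uncertainty of entering the module from outside.
        entropy -= (cut / total_volume) * math.log2(volume / total_volume)
    return entropy


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
h_good = two_dim_structural_entropy(edges, [{0, 1, 2}, {3, 4, 5}])
h_bad = two_dim_structural_entropy(edges, [{0, 3}, {1, 4}, {2, 5}])
print(h_good < h_bad)  # True: the partition matching the triangles scores lower
```

A minimization procedure would compare candidate partitions by this score and keep the lowest, which is how the number of clusters can fall out of the optimization instead of being fixed in advance.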

[NLP-18] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

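[Quick read]: This paper addresses the data privacy, security, and equitable access concerns raised by the translation field's reliance on commercial, cloud-based large language models (LLMs). The key to the solution is to examine the feasibility and performance of locally deployed open-source language models as an alternative: three free, open models are run on CPU-based platforms and their functional performance is compared with commercial online chatbots. The study finds that local deployment offers better data control, stronger privacy, and less dependence on cloud services, giving individual translators and small businesses a more controllable and accessible path to AI translation tools.

Link: https://arxiv.org/abs/2507.23399
Authors: Peter Sandrini
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: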

Abstract:The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

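[NLP-19] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

[Quick read]: This paper addresses two core problems in text embedding methods built on decoder-only large language models (LLMs). First, existing methods remove the causal attention mask to enable bidirectional attention, which may weaken the semantic extraction ability the model acquired during pretraining. Second, leading unidirectional methods rely on extra input text to compensate for the limits of causal attention, which significantly increases computational cost. The key to the solution is Causal2Vec, which uses a lightweight BERT-style model to pre-encode the input text into a single Contextual token and prepends it to the LLM's input sequence, so that every token can access global context even under causal attention. To mitigate the recency bias introduced by last-token pooling, it concatenates the last hidden states of the Contextual token and the EOS token as the final text embedding, improving embedding quality and efficiency without modifying the LLM architecture or adding significant computational overhead.

Link: https://arxiv.org/abs/2507.23386
Authors: Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi
Affiliations: Institute of Science Tokyo; Sun Yat-sen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: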

Abstract:Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
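The pooling step described in the abstract can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes the Contextual token sits at position 0 and EOS at the last position, and the function names and the toy hidden states are hypothetical.

```python
def last_token_pooling(hidden_states):
    """Baseline pooling: use only the final position's hidden state."""
    return list(hidden_states[-1])


def contextual_eos_embedding(hidden_states):
    """Concatenate the hidden states at the Contextual and EOS positions.

    Assumes position 0 holds the prepended Contextual token and the last
    position holds EOS (an assumed layout for illustration).
    """
    return list(hidden_states[0]) + list(hidden_states[-1])


# Hypothetical last-layer hidden states for [Contextual, "hello", "world", EOS].
hidden_states = [
    [0.1, 0.2],
    [0.5, 0.5],
    [0.3, -0.1],
    [0.9, -0.3],
]
embedding = contextual_eos_embedding(hidden_states)
print(embedding)  # [0.1, 0.2, 0.9, -0.3]
```

The resulting vector is twice the hidden size, so a downstream index would need to be built for that dimensionality.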

[NLP-20] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

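[Quick read]: This paper addresses two challenges that current multimodal large language models (MLLMs) face in complex task planning: existing benchmarks cannot directly assess multimodal planning in real-world scenarios, and they lack explicit or implicit constraints across modalities. To address them, the authors propose Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate how MLLMs plan under multimodal constraints. Its key innovations are (1) a focus on three real-world tasks (flight planning, calendar planning, and meeting planning), and (2) complex constraints (such as budget, temporal, and spatial constraints) with graded difficulty levels (easy, medium, hard) that separate constraint complexity from search space expansion, so that reasoning under multiple constraints can be measured precisely. Experiments show that current mainstream MLLMs produce few feasible plans that satisfy complex constraints, which exposes a weakness in constraint-aware reasoning and gives a clear direction for future work.

Link: https://arxiv.org/abs/2507.23382
Authors: Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che
Affiliations: Harbin Institute of Technology; The University of Hong Kong; Central South University School of Computer Science and Engineering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2025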

Abstract:Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

[NLP-21] Holistic Evaluations of Topic Models

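[Quick read]: This paper addresses the risk that topic models become a "black box" in practice: users feed in data and accept the output as an accurate summary without scrutinizing its reliability or interpretability. To address this, the paper takes a database perspective and, drawing on 1140 BERTopic model runs, systematically evaluates performance under different parameter configurations. The key to the solution is to identify and quantify the trade-offs in optimization, giving users a practical basis for interpreting and responsibly using topic models.

Link: https://arxiv.org/abs/2507.23364
Authors: Thomas Compton
Affiliations: University of York
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 10 pages, 6 tables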

Abstract:Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a ‘black box’, where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models.

[NLP-22] SWE-Exp: Experience-Driven Software Issue Resolution

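[Quick read]: This paper addresses the lack of experience accumulation and reuse in current large language model (LLM) agents for software issue resolution. Agents act as memoryless explorers that handle each issue independently and cannot draw on knowledge from earlier repair trajectories, so they repeat failed paths and miss chances to transfer successful strategies to similar issues. The key to the solution is SWE-Exp, which builds a multi-faceted experience bank that distills reusable repair knowledge at different levels of abstraction (from problem comprehension to specific code changes) and enables continuous learning across issues, shifting automated software engineering agents from trial-and-error exploration to strategic, experience-driven repair.

Link: https://arxiv.org/abs/2507.23361
Authors: Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Our code and data are available at this https URL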

Abstract:Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

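[NLP-23] Text-to-SQL Task-oriented Dialogue Ontology Construction

[Quick read]: This paper addresses the limited explainability and trustworthiness of large language models (LLMs) in task-oriented dialogue (TOD) systems that results from their reliance on parametric knowledge. Traditional TOD systems achieve controllability and explainability through an external database and an explicit ontology, but building that ontology requires manual annotation or supervised training, which is costly and scales poorly. The proposed solution is TeQoDO, an unsupervised Text-to-SQL method for constructing task-oriented dialogue ontologies. The key is to use the LLM's inherent SQL programming ability together with dialogue theory supplied in the prompt, so that the model can build a structured TOD ontology from scratch on its own. Experiments show that TeQoDO outperforms transfer learning approaches, that its ontology performs competitively on a downstream dialogue state tracking task, and that it scales to large datasets such as Wikipedia and ArXiv.

Link: https://arxiv.org/abs/2507.23358
Authors: Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic
Affiliations: Heinrich Heine University Düsseldorf
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: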

Abstract:Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

[NLP-24] SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

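[Quick read]: This paper addresses a weakness of existing agent-based issue localization methods: because agents explore independently, they struggle to identify fault patterns that span multiple modules of a codebase. The key to the solution is the SWE-Debate framework, which builds multiple fault propagation traces as localization proposals and organizes a three-round competitive debate in which agents with different reasoning perspectives converge, through structured competition, on a consolidated fix plan. That plan is then passed to a code modification agent based on Monte Carlo Tree Search (MCTS) to generate a patch. The mechanism improves both the breadth and the accuracy of issue localization.

Link: https://arxiv.org/abs/2507.23348
Authors: Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Our code and data are available at this https URL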

Abstract:Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.
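The first stage, generating fault propagation traces by traversing a code dependency graph, might look roughly like the sketch below. The abstract does not say how SWE-Debate builds or ranks its traces, so the bounded depth-first enumeration, the function name, and the module names in the toy graph are all assumptions made for illustration.

```python
def propagation_traces(graph, start, max_depth):
    """Enumerate simple paths of at most max_depth edges starting at `start`.

    graph: dict mapping a code entity to the entities it depends on.
    Each returned path is one candidate fault propagation trace.
    """
    traces = []

    def dfs(node, path):
        successors = [n for n in graph.get(node, []) if n not in path]
        # Stop when the depth limit is reached or there is nowhere left to go.
        if len(path) - 1 == max_depth or not successors:
            traces.append(path)
            return
        for nxt in successors:
            dfs(nxt, path + [nxt])

    dfs(start, [start])
    return traces


# Hypothetical dependency graph: entry point -> callees.
graph = {
    "api.handle_request": ["service.validate", "service.save"],
    "service.validate": ["utils.parse_date"],
    "service.save": ["db.write"],
}
traces = propagation_traces(graph, "api.handle_request", max_depth=2)
for trace in traces:
    print(" -> ".join(trace))
```

Each trace could then serve as one localization proposal for the debating agents to argue over.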

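[NLP-25] DSBC: Data Science task Benchmarking with Context engineering

[Quick read]: This paper addresses the lack of systematic benchmarks for evaluating data science agents in practice, which makes it hard to measure their efficacy and limitations accurately. The key to the solution is a comprehensive benchmark built from real user interactions. It covers eight typical categories of data science tasks, analyzes sensitivity to common prompting issues (such as data leakage and ambiguous instructions) and to the temperature parameter, and compares three mainstream large language models (LLMs) under zero-shot, multi-step, and SmolAgent-based approaches, providing a reproducible evaluation basis for developing more robust and effective automated data analysis agents.

Link: https://arxiv.org/abs/2507.23336
Authors: Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Suman Debnath, Hamza Farooq
Affiliations: Traversaal.ai; Cohere Labs Community; IIT Roorkee; Amazon; Stanford
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 32 pages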

Abstract:Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

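[NLP-26] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

[Quick read]: This paper addresses the limited performance of large language models (LLMs) on music-related tasks, whose root cause is the relatively small share of music domain knowledge in their training data. To adapt LLMs to text-only music question answering (MQA), the authors propose the MusT-RAG framework. Its core ideas are, first, to build MusWikiDB, a vector database specialized for the music domain, to strengthen knowledge acquisition at the retrieval stage, and second, to use the retrieved context during both inference and fine-tuning, effectively turning a general-purpose LLM into a music-specific model. Experiments show that the method significantly outperforms traditional fine-tuning and delivers consistent gains on both in-domain and out-of-domain MQA benchmarks.

Link: https://arxiv.org/abs/2507.23334
Authors: Daeyong Kwon, SeungHeon Doh, Juhan Nam
Affiliations: KAIST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures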

Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs’ effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs’ music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
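The retrieval step that RAG relies on can be illustrated with a toy vector store. This is a generic sketch, not MusT-RAG or MusWikiDB: the two-dimensional vectors, the passage texts, and the prompt template are hypothetical placeholders, and a real system would obtain vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=1):
    """Return the texts of the k stored passages most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:k]]


def build_prompt(question, passages):
    """Put retrieved passages in front of the question (hypothetical template)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical store: placeholder passages with made-up 2-d embeddings.
store = [
    {"text": "passage about fugues", "vector": [0.9, 0.1]},
    {"text": "passage about drum machines", "vector": [0.1, 0.9]},
]
passages = retrieve([1.0, 0.0], store, k=1)
prompt = build_prompt("What is a fugue?", passages)
print(prompt)
```

The same prompt construction can be applied to training examples, which is one plausible reading of using context during fine-tuning as well as inference.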

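[NLP-27] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

[Quick read]: This paper asks whether large language models (LLMs) implicitly sanitize sensitive content when given no explicit instruction to do so (implicit content moderation). Earlier work has focused on explicit training for detoxification and compliance, with little systematic analysis of the semantic shifts models produce spontaneously in unsupervised or zero-shot settings. The key to the solution is an empirical evaluation of GPT-4o-mini's implicit moderation when paraphrasing sensitive text: the model tends to shift the original content toward less sensitive classes, substantially reducing derogatory and taboo language. The paper also evaluates the zero-shot ability of LLMs to classify sentence sensitivity, providing empirical evidence about the implicit content governance built into these models.

Link: https://arxiv.org/abs/2507.23319
Authors: Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri
Affiliations: Università degli Studi di Milano
Categories: Computation and Language (cs.CL)
Comments: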

Abstract:Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.

[NLP-28] SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

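[Quick read]: This paper addresses the complexity and bugs that arise in sequence modeling when a model executes differently in training and inference, in particular the consistency bugs common in streaming and parallel processing. The core solution is SequenceLayers, a neural network layer API and library. The key is that each layer explicitly defines how its state changes over time (for example a Transformer KV cache or an RNN hidden state) and provides a step method that evolves that state, with results that exactly match a stateless layer-wise invocation. This design makes complex models immediately streamable, improves correctness and composability, and can be implemented in any deep learning framework.

Link: https://arxiv.org/abs/2507.23292
Authors: RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby
Affiliations: Google DeepMind
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE); Audio and Speech Processing (eess.AS)
Comments: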

Abstract:We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at this https URL.
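The layer-wise versus step-wise contract can be illustrated with a toy layer in plain Python. This is not the SequenceLayers API: the class, its method names, and the moving-sum operation are hypothetical stand-ins chosen to show an explicit state (here a small input buffer, loosely analogous to a convolution buffer) and a step method whose outputs match the whole-sequence call.

```python
class CausalMovingSum:
    """Toy streamable layer: y[t] is the sum of the last `width` inputs up to t."""

    def __init__(self, width):
        self.width = width

    def initial_state(self):
        # Explicit state: the previous width-1 inputs, zero-padded at the start.
        return [0.0] * (self.width - 1)

    def layer(self, xs):
        """Process the whole sequence at once, with no carried state."""
        padded = [0.0] * (self.width - 1) + list(xs)
        return [sum(padded[t:t + self.width]) for t in range(len(xs))]

    def step(self, x, state):
        """Process one timestep; returns (output, new_state)."""
        window = state + [x]
        return sum(window), window[1:]


layer = CausalMovingSum(width=3)
xs = [1.0, 2.0, 3.0, 4.0]

# Step-by-step execution, threading the state through each call.
state = layer.initial_state()
stepwise = []
for x in xs:
    y, state = layer.step(x, state)
    stepwise.append(y)

print(stepwise == layer.layer(xs))  # True: both modes agree
```

Checking this equality in a unit test for every layer is the kind of guarantee the abstract describes, and it is what lets a model trained layer-by-layer be streamed without a separate inference implementation.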

[NLP-29] Unveiling Super Experts in Mixture-of-Experts Large Language Models

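[Quick read]: This paper addresses the poorly understood heterogeneity of expert importance in sparsely activated Mixture-of-Experts (MoE) models, in particular the lack of systematic identification and mechanistic analysis of critical experts. Existing methods mostly rely on empirical criteria to select experts for compression and do not reveal which experts play a central role in the forward pass. The key contribution is the first discovery and validation of a class of decisive "Super Experts" (SEs). These experts are very few in number, but they are characterized by rare, extreme activation outliers in the output of the down projection (down_proj), which in turn produce massive activations in the hidden states between decoder layers. SEs are also essential for keeping the attention score distribution stable: pruning them severely degrades the model (pruning just three SEs in Qwen3-30B-A3B yields repetitive, uninformative output), which indicates that MoE LLMs rely on SEs to induce attention sinks.

Link: https://arxiv.org/abs/2507.23279
Authors: Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan
Affiliations: Tsinghua University; Meituan
Categories: Computation and Language (cs.CL)
Comments: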

Abstract:Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at this https URL.

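[NLP-30] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

[Quick read]: This paper addresses the performance bottleneck in natural language processing (NLP) research on Bengali caused by the lack of standardized evaluation benchmarks. The key to the solution is a systematic evaluation of 10 open-source large language models (LLMs) on 8 translated datasets, with a detailed error analysis that identifies the models' main failure modes. Compared with English, Bengali shows clear performance gaps for smaller models and for specific model families such as Mistral, while some architectures such as DeepSeek remain more stable across languages. The study also finds that excessive tokenization lowers accuracy, whereas efficient, compact tokenization improves it. The work provides empirical grounding for better dataset quality and evaluation methods in multilingual settings and supports NLP development for low-resource languages.

Link: https://arxiv.org/abs/2507.23248
Authors: Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat
Affiliations: Rochester Institute of Technology; Islamic University of Technology; Stanford University; Bengali.AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: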

Abstract:Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient & concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at this https URL.
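One common way to quantify how heavily a tokenizer fragments text is the average number of tokens per word. The abstract does not state which measure the authors use, so the metric below is an assumption, and the two toy tokenizers are hypothetical stand-ins for real LLM tokenizers.

```python
def tokens_per_word(text, tokenize):
    """Average number of tokens per whitespace-separated word (assumed metric)."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


# Two hypothetical toy tokenizers standing in for real LLM tokenizers.
def word_tokenizer(text):
    """One token per word: the most compact case."""
    return text.split()


def char_tokenizer(text):
    """One token per non-space character: heavy fragmentation."""
    return [ch for ch in text if not ch.isspace()]


sample = "ab cd ef"
compact = tokens_per_word(sample, word_tokenizer)
fragmented = tokens_per_word(sample, char_tokenizer)
print(compact, fragmented)  # 1.0 2.0
```

Running the same function with a model's actual tokenizer on parallel English and Bengali sentences would show how much more a given model splits Bengali text.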

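[NLP-31] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

[Quick read]: This paper addresses the limited research on reasoning for explainability and on dialogue pragmatics in personalized mental health chatbots, in particular the lack of systematic modeling and evaluation of pragmatic phenomena such as implicature and presupposition. The key to the solution is P-ReMe, the first dataset for the mental health setting, together with modified definitions of implicature and presupposition for this domain and three tasks that measure the pragmatic reasoning ability of large language models (LLMs). The paper also introduces StiPRompts to evaluate how mainstream LLMs handle mental health stigma; experiments indicate that Claude-3.5-haiku deals with stigma more responsibly.

Link: https://arxiv.org/abs/2507.23247
Authors: Sneha Oram, Pushpak Bhattacharyya
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: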

Abstract:There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.

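[NLP-32] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

[Quick read]: This paper addresses the difficulty of optimizing query formulation in Retrieval-Augmented Generation (RAG) systems, especially over diverse, unstructured real-world documents. The core challenge is that conventional methods depend on human-annotated data to train query rewriters and are hard to apply to both multi-modal and text-only databases. The key to the solution is RL-QR, a reinforcement learning framework for retriever-specific query rewriting: it needs no human-annotated data, synthesizes scenario-question pairs, and uses Generalized Reward Policy Optimization (GRPO) to train a rewriter tailored to a specific retriever. On industrial in-house data, RL-QR achieves an 11% relative gain in NDCG@3 for multi-modal RAG and a 9% gain for lexical retrievers, while the rewriters did not improve semantic or hybrid retrievers.

Link: https://arxiv.org/abs/2507.23242
Authors: Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: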

Abstract:Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with RL-QR_multi-modal achieving an 11% relative gain in NDCG@3 for multi-modal RAG and RL-QR_lexical yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
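
The gains above are reported in NDCG@3. As a reading aid, here is a minimal sketch of how NDCG@k and a relative gain can be computed. The relevance labels are made up for illustration and are not data from the paper, and the ideal ranking is taken from the retrieved list only, which is a simplification.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (log2 rank discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=3):
    """NDCG@k: DCG of the ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (0.11 means 11%)."""
    return (improved - baseline) / baseline

# Hypothetical relevance labels for one query's ranked results,
# before and after the query is rewritten.
original_ranking = [0, 1, 0, 1]
rewritten_ranking = [1, 0, 1, 0]

before = ndcg_at_k(original_ranking)
after = ndcg_at_k(rewritten_ranking)
print(f"NDCG@3 before={before:.3f} after={after:.3f} gain={relative_gain(before, after):.1%}")
```

In an RL setup like the one described, a metric of this kind could plausibly serve as the reward for a rewritten query, but the abstract does not spell out the reward definition.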

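[NLP-33] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

[Quick read]: This paper addresses limited prediction accuracy in the early diagnosis of Alzheimer's disease (AD), where heterogeneous biomarkers (neuroimaging, genetic risk factors, cognitive tests, cerebrospinal fluid proteins) are stored in tabular form and sample sizes are small. The key to the solution is a new framework, TAP-GPT, which adapts TableGPT2, a multimodal tabular LLM originally developed for business intelligence tasks. It builds few-shot tabular prompts from in-context learning examples and finetunes TableGPT2 with the parameter-efficient qLoRA method for the clinical binary classification of AD versus cognitively normal (CN). By combining TableGPT2's tabular understanding with the prior knowledge encoded in pretrained LLMs, it outperforms more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks.

Link: https://arxiv.org/abs/2507.23227
Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen
Affiliations: University of Pennsylvania; United States Naval Academy; University of North Carolina at Chapel Hill; Cedars Sinai Medical Center
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: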

Abstract:Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

[NLP-34] Model Directions Not Words: Mechanistic Topic Models Using Sparse Autoencoders

[Quick read]: This paper addresses the limits of traditional and neural topic models in capturing semantically abstract features. These limits stem from reliance on bag-of-words representations and from expressing topics as word lists, which makes it hard to articulate complex topics. The key to the solution is Mechanistic Topic Models (MTMs), which define topics over the interpretable feature space learned by sparse autoencoders (SAEs), so that topics come with expressive feature descriptions. MTMs also use topic-based steering vectors to enable controllable text generation, which the authors describe as unique among topic models.

Link: https://arxiv.org/abs/2507.23220
Authors: Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei
Affiliations: Columbia University; Google Research; Independent
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

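[NLP-35] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

[Quick read]: This paper addresses the sensitivity of few-shot in-context learning (ICL) in large language models (LLMs) to the choice of examples, and in particular the fact that existing methods emphasize Positive samples while overlooking the additional information in Negative samples. The solution works in three steps. First, Positive and Negative sample corpora are built based on Zero-Shot-CoT. Second, at inference time, the examples most similar to the query are selected from both corpora by semantic similarity. Third, further Positive examples are retrieved by their similarity to the selected Negative examples, and the two groups of Positive examples are concatenated as the ICL demonstrations. Experiments show that this outperforms methods relying only on the most similar positive examples.

Link: https://arxiv.org/abs/2507.23211
Authors: Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Zhe Cui
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: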

Abstract:Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-CoT. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, and concatenate them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar positive examples for context, validating that the additional information in negative samples aids in enhancing ICL performance through improved Positive sample selection.
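
The retrieval procedure described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `jaccard` is a simple token-overlap stand-in for whatever semantic similarity model the paper uses, and the function names and toy corpora are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for an embedding-based semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def top_k(query, corpus, k):
    """Return the k corpus items most similar to the query."""
    return sorted(corpus, key=lambda item: jaccard(query, item), reverse=True)[:k]

def select_demonstrations(query, positives, negatives, k=1):
    # Step 1: nearest Positive and Negative examples for the query.
    direct_pos = top_k(query, positives, k)
    nearest_neg = top_k(query, negatives, k)
    # Step 2: Positive examples that resemble the retrieved Negative examples.
    via_neg = []
    for neg in nearest_neg:
        for pos in top_k(neg, positives, k):
            if pos not in direct_pos and pos not in via_neg:
                via_neg.append(pos)
    # Step 3: concatenate both groups to form the ICL demonstrations.
    return direct_pos + via_neg

# Toy corpora (hypothetical).
positives = [
    "add two apples and three apples",
    "multiply four boxes by six pens",
    "subtract two pens from nine pens",
]
negatives = ["multiply three boxes by six apples"]

demos = select_demonstrations("add three apples and four apples", positives, negatives)
print(demos)
```

The second Positive example is reached only through the Negative sample, which is the extra signal the method is built around.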

[NLP-36] Geak: Introducing Triton Kernel AI Agent Evaluation Benchmarks

[Quick read]: This paper addresses the performance and correctness challenges of AI-generated GPU kernels, in particular reaching near-expert performance on AMD GPU hardware. Hand-written kernels struggle to keep up with the scalability and hardware-adaptation demands of increasingly complex deep learning workloads. The key to the solution is GEAK (Generating Efficient AI-centric GPU Kernels), a framework that targets the Triton language and uses frontier LLMs with inference-time compute scaling and a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks it clearly outperforms both direct prompting of frontier LLMs and Reflexion-based generation pipelines, reaching up to 63% correctness and up to 2.59x execution speedup.

Link: https://arxiv.org/abs/2507.23194
Authors: Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum
Affiliations: Advanced Micro Devices, Inc. (AMD)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to 63% and execution speed up of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.

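[NLP-37] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

[Quick read]: This paper addresses the limited performance of existing ensemble methods for large language models (LLMs), which ignore how each model's confidence varies with context. Strategies such as voting or logits ensembling do not account for changes in a model's reliability across tasks or inputs, which hurts robustness and accuracy. The key to the solution is LENS (Learning ENsemble confidence from Neural States), which learns to estimate confidence from internal representations. For each LLM, a lightweight linear confidence predictor is trained on layer-wise hidden states and normalized probabilities, allowing context-dependent weighting of predictions without modifying model parameters and with negligible extra computation. Experiments on multiple-choice and boolean question-answering tasks show that it outperforms traditional ensemble methods by a substantial margin.

Link: https://arxiv.org/abs/2507.23167
Authors: Jizhou Guo
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: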

Abstract:Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
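
To make the idea of confidence-weighted ensembling concrete, here is a small sketch. It assumes a linear predictor followed by a sigmoid and a sum-of-confidences vote; the feature vectors, weights, and the aggregation rule are hypothetical placeholders, since the abstract does not give these details.

```python
import math
from collections import Counter, defaultdict

def confidence(features, weights, bias):
    """Linear predictor followed by a sigmoid, giving a confidence in (0, 1)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def weighted_vote(predictions):
    """predictions: list of (answer, confidence). Pick the answer with the largest total confidence."""
    totals = defaultdict(float)
    for answer, conf in predictions:
        totals[answer] += conf
    return max(totals, key=totals.get)

def majority_vote(predictions):
    """Plain voting baseline that ignores confidence."""
    return Counter(answer for answer, _ in predictions).most_common(1)[0][0]

# Hypothetical per-model inputs: (answer, features, predictor weights, bias).
# In the paper the features come from hidden states and normalized probabilities;
# here they are two made-up numbers per model.
models = [
    ("A", (0.1, 0.3), (2.0, 1.0), -1.0),
    ("A", (0.2, 0.4), (1.5, 0.5), -1.0),
    ("B", (0.5, 0.9), (1.0, 2.0), 0.0),
]
predictions = [(ans, confidence(f, w, b)) for ans, f, w, b in models]

print("majority:", majority_vote(predictions))
print("weighted:", weighted_vote(predictions))
```

In this toy case two low-confidence models are outvoted by one high-confidence model, which is the kind of situation where the two schemes diverge.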

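[NLP-38] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal ICML2025

[Quick read]: This paper addresses how deployed language models (LMs) can keep improving through long-term interaction with users, and in particular how to harvest implicit user feedback from user-LM interaction logs instead of asking for direct feedback, which can be disruptive. First, the authors analyze implicit feedback in user-LLM conversation trajectories (WildChat and LMSYS) to understand when and why it occurs. Second, they study harvesting learning signals from that feedback. They find that the content of the feedback (e.g., the user wanted clarification), not just its polarity (e.g., the user was unhappy with the previous response), can improve model performance on short human-designed questions (MTBench) but not on longer, more complex questions (WildBench), and that the usefulness of feedback is largely tied to the quality of the user's initial prompt.

Link: https://arxiv.org/abs/2507.23158
Authors: Yuhan Liu, Michael J.Q. Zhang, Eunsol Choi
Affiliations: New York University
Categories: Computation and Language (cs.CL)
Comments: Earlier version of this paper was presented at 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA), ICML 2025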

Abstract:Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user’s initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

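[NLP-39] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

[Quick read]: This paper addresses the challenge of understanding causal relationships across modalities for multimodal models operating in real-world environments, specifically inferring causal dependencies between visual observations and procedural text. The key to the solution is ISO-Bench, a benchmark in which each example pairs an image of a task step with a text snippet from a plan, and the model must decide whether the visual step occurs before or after the referenced text step. Across ten frontier vision-language models, the best zero-shot F1 is 0.57 and chain-of-thought reasoning reaches at most 0.62, far behind humans at 0.98. The analysis also points to concrete directions for improvement.

Link: https://arxiv.org/abs/2507.23135
Authors: Ananya Sadana, Yash Kumar Lal, Jiawei Zhou
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: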

Abstract:Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

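[NLP-40] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity KDD KDD’25

[Quick read]: This paper addresses the trustworthiness of large language models (LLMs) when they handle ambiguous text, with a focus on Chinese textual ambiguity. Its core finding is that current LLMs are notably fragile here: they cannot reliably distinguish ambiguous from unambiguous text, they are overconfident in assigning a single interpretation to ambiguous text, and they overthink when trying to understand the possible meanings. The key to the solution is a benchmark dataset of ambiguous sentences with context and their corresponding disambiguated pairs, systematically organized into 3 main categories and 9 subcategories, which gives a reproducible basis for evaluating and improving how LLMs handle uncertainty in language understanding.

Link: https://arxiv.org/abs/2507.23121
Authors: Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang
Affiliations: Chalmers University of Technology; Boeing; Salesforce
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models (Agentic GenAI Evaluation Workshop KDD '25)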

Abstract:In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: this https URL.

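[NLP-41] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

[Quick read]: This paper addresses the scalability of large language model (LLM)-based natural language interfaces for databases when they face enterprise-level data catalogs. Existing methods rely on domain-specific fine-tuning, which complicates deployment, and fail to use the semantic context contained in database metadata. The key to the solution is a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each indexed separately for targeted retrieval. The approach prioritizes accurate table identification while using column-level information, and keeps the number of retrieved tables within a manageable context budget. This maintains high recall and accuracy and allows text-to-SQL systems to be deployed across diverse enterprise settings without specialized fine-tuning.

Link: https://arxiv.org/abs/2507.23104
Authors: Jeffrey Eben, Aitzaz Ahmad, Stephen Lau
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: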

Abstract:Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
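
A rough sketch of what component-based schema retrieval could look like is given below. It is only an illustration under assumptions: the catalog is invented, token overlap stands in for a real retriever, and summing unit scores per table is one possible aggregation rule, not necessarily the one used in the paper.

```python
from collections import defaultdict

# Hypothetical catalog: each table has a description and column names.
CATALOG = {
    "orders": {"description": "customer purchase orders",
               "columns": ["order_id", "customer_id", "total_amount"]},
    "employees": {"description": "staff records",
                  "columns": ["employee_id", "salary", "hire_date"]},
    "shipments": {"description": "delivery tracking for orders",
                  "columns": ["shipment_id", "order_id", "carrier"]},
}

def tokenize(text):
    """Lowercase, split snake_case, and return a set of tokens."""
    return set(text.lower().replace("_", " ").split())

def build_units(catalog):
    """Decompose each table into separately indexed units: (table, kind, tokens)."""
    units = []
    for table, meta in catalog.items():
        units.append((table, "name", tokenize(table)))
        units.append((table, "description", tokenize(meta["description"])))
        for column in meta["columns"]:
            units.append((table, "column", tokenize(column)))
    return units

def retrieve_tables(question, units, table_budget=2):
    """Score every unit against the question, roll scores up to tables, keep the top few."""
    q = tokenize(question)
    table_scores = defaultdict(float)
    for table, _kind, tokens in units:
        overlap = len(q & tokens)
        if overlap:
            table_scores[table] += overlap
    ranked = sorted(table_scores, key=table_scores.get, reverse=True)
    return ranked[:table_budget]

units = build_units(CATALOG)
tables = retrieve_tables("total amount of orders per customer", units)
print(tables)
```

The `table_budget` parameter plays the role of the context budget mentioned in the abstract: column-level matches contribute evidence, but only a bounded number of tables is passed on to the SQL generation step.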

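[NLP-42] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

[Quick read]: This paper addresses the difficulty of keeping global semantic consistency and visual coherence when editing layout and content across structured (posters, websites) and unstructured (natural images) domains. The key to the solution is the SMART-Editor framework, built on two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach that uses reward-aligned layout pairs. Together they improve global coherence in multi-domain, cascading edit scenarios.

Link: https://arxiv.org/abs/2507.23095
Authors: Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Submission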

Abstract:We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

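[NLP-43] Context-aware Rotary Position Embedding

[Quick read]: This paper addresses a limitation of Rotary Positional Embeddings (RoPE): they rely on static, input-independent sinusoidal frequency patterns, which restricts their ability to model context-sensitive relationships. The key to the solution is CARoPE (Context-Aware Rotary Positional Embedding), a generalization of RoPE that dynamically generates head-specific frequency patterns through a bounded transformation of token embeddings. This yields token- and context-sensitive positional representations while keeping RoPE's efficiency and architectural simplicity. With GPT-2 variants trained on the FineWeb-Edu-10B dataset, CARoPE achieves significantly lower perplexity than RoPE and other common baselines, even at longer context lengths, and enables faster training throughput.

Link: https://arxiv.org/abs/2507.23083
Authors: Ali Veisi, Delaram Fartoot, Hamidreza Amirzadeh
Affiliations: Axiom Lab
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 1 table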

Abstract:Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
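
For readers unfamiliar with the baseline, the sketch below implements standard RoPE with static frequencies and checks its relative-position property. It does not implement CARoPE: the abstract says the frequencies become input-dependent via a bounded transformation of token embeddings but does not give the formula, so that part is deliberately left out. Function names are illustrative.

```python
import math

def static_freqs(dim, base=10000.0):
    """Standard RoPE frequencies: base ** (-2i / dim) for each pair index i."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_rotate(vec, position, freqs):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by the angle position * freqs[i]."""
    out = []
    for i, freq in enumerate(freqs):
        x, y = vec[2 * i], vec[2 * i + 1]
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

freqs = static_freqs(4)
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# With static frequencies the attention score depends only on the position offset:
# positions (5, 3) and (12, 10) share the offset 2.
score_a = dot(rope_rotate(q, 5, freqs), rope_rotate(k, 3, freqs))
score_b = dot(rope_rotate(q, 12, freqs), rope_rotate(k, 10, freqs))
print(abs(score_a - score_b) < 1e-9)
```

In this picture, a context-aware variant would replace the fixed `freqs` list with values computed per token and per head, which is the point where CARoPE departs from the baseline.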

[NLP-44] Exploring In-Context Learning for Frame-Semantic Parsing

[Quick read]: This paper addresses the dependence of Frame Semantic Parsing (FSP) on model fine-tuning, focusing on a specific domain (frames related to violent events). The key to the solution is In-Context Learning (ICL) with Large Language Models (LLMs): task-specific prompts are generated automatically from the FrameNet database alone, so that the two subtasks, Frame Identification (FI) and Frame Semantic Role Labeling (FSRL), can be performed without fine-tuning. The prompts are built from frame definitions and annotated examples and guide the LLMs without any parameter updates. The method reaches F1 scores of 94.3% for FI and 77.4% for FSRL, which suggests that ICL is a practical alternative to fine-tuning for domain-specific FSP.

Link: https://arxiv.org/abs/2507.23082
Authors: Diego Garat, Guillermo Moncecchi, Dina Wonsever
Affiliations: Facultad de Ingeniería, Universidad de la República
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.

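[NLP-45] Math Natural Language Inference: this should be easy!

[Quick read]: This paper asks whether contemporary LLMs can perform natural language inference (NLI) on mathematical texts, which the authors call the Math NLI problem. The core of the work is a Math NLI corpus whose premises come from existing mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and NLI. The authors also examine the quality of corpora whose hypotheses are generated by LLMs, and evaluate both the performance and the inter-group consistency of a diverse set of LLMs. Key findings: in some settings, a majority vote of LLMs is approximately equivalent to human-labeled data; current models are less prone to hypothesis-only "inference" than the previous generation; yet they still struggle with mathematical language and occasionally fail at even basic inferences.

Link: https://arxiv.org/abs/2507.23063
Authors: Valeria de Paiva, Qiyue Gao, Hai Hu, Pavel Kovalev, Yikang Liu, Lawrence S. Moss, Zhiheng Qian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages plus appendices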

Abstract:We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only “inference” in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
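
The majority-vote finding can be illustrated with a short sketch. The votes and gold labels below are invented, and treating ties as "no label" is an assumption made here, not a rule taken from the paper.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label, or None when first place is tied."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def agreement_with_gold(model_votes, gold):
    """Fraction of items where the majority label equals the gold label."""
    hits = sum(majority_label(v) == g for v, g in zip(model_votes, gold))
    return hits / len(gold)

# Hypothetical votes from three LLMs on three premise-hypothesis pairs.
model_votes = [
    ["entailment", "entailment", "neutral"],
    ["contradiction", "neutral", "contradiction"],
    ["neutral", "entailment", "contradiction"],  # three-way tie
]
gold = ["entailment", "contradiction", "neutral"]

print(agreement_with_gold(model_votes, gold))
```

Comparing this agreement rate with the agreement among human annotators is one way to read the claim that a majority vote is "approximately equivalent" to human labels in some settings.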

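[NLP-46] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

[Quick read]: This paper addresses the lack of research on how well Spoken Dialogue Models (SDMs) comprehend and emulate human conversation in practice, especially compared with text-based Large Language Models (LLMs), which benefit from extensive benchmarking. The key to the solution is a bilingual (English and Chinese) benchmark dataset of 1,079 instances, accompanied by an LLM-based evaluation method that closely aligns with human judgment. It measures how SDMs cope with ambiguity, both semantic (polysemy) and phonological (heterographs, heteronyms, stress patterns), and with context dependency such as omission, coreference, and multi-turn interaction.

Link: https://arxiv.org/abs/2507.22968
Authors: Chengqian Ma, Wei Tao, Yiwen Guo
Affiliations: Peking University; LIGHTSPEED; Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: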

Abstract:Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

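[NLP-47] ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

[Quick read]: This paper addresses the lack of a unified, quantifiable, scenario-appropriate evaluation system for Large Language Models (LLMs) in education. Existing benchmarks mainly measure general intelligence rather than pedagogical capabilities, and evaluation metrics vary widely across educational scenarios or are missing altogether. The key to the solution is ELMES, an open-source automated evaluation framework with two main features: (1) a modular architecture that lets researchers create dynamic multi-agent dialogues through simple configuration files, without extensive programming expertise; (2) a hybrid evaluation engine that uses an LLM-as-a-Judge methodology to quantify traditionally subjective pedagogical metrics. The framework is benchmarked on four educational scenarios (Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation) and reveals distinct, context-specific capability distributions among models.

Link: https://arxiv.org/abs/2507.22947
Authors: Shou’ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: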

Abstract:The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at this https URL.

[NLP-48] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

【速读】: 该论文试图解决传统理论将任意性(arbitrariness)简单归结为规范性缺陷或权力支配的体现这一问题,提出其本质应被视为一种结构功能机制,支撑人类系统(如语言、法律和社会互动)的有效运作。解决方案的关键在于引入“动机—可验证性—可争辩性”链条(Motivation - Constatability - Contestability chain),指出当该链条因“去动机化”(immotivization)或“冲突横向转移”(Conflict Lateralization)等机制断裂时,行为虽具约束力却隐藏其逻辑依据,从而规避司法审查与正当性检验;作者进一步以香农熵模型形式化任意性为条件熵 A = H(L|M),揭示其作为中立操作符在控制与关怀之间平衡的核心作用,为理解人际互动及解释人工智能系统的可解释性提供新范式。

Link: https://arxiv.org/abs/2507.22944
Authors: Naomi Omeonga wa Kayembe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation - Constatability - Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
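
The conditional-entropy formalization A = H(L|M) can be illustrated with a small worked example. The sketch below assumes M is an observed motivation and L is the resulting act; that reading, the toy observations, and the function name `conditional_entropy` are our assumptions, since the abstract does not define the symbols further.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Return H(L|M) in bits, estimated from a list of (m, l) observations."""
    joint = Counter(pairs)                   # counts of each (m, l) pair
    marginal = Counter(m for m, _ in pairs)  # counts of each m
    n = len(pairs)
    h = 0.0
    for (m, _l), count in joint.items():
        p_joint = count / n                  # p(m, l)
        p_l_given_m = count / marginal[m]    # p(l | m)
        h -= p_joint * math.log2(p_l_given_m)
    return h


# Hypothetical data: in `motivated`, the motivation fully determines the act;
# in `arbitrary`, the motivation says nothing about the act.
motivated = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
arbitrary = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]

print(conditional_entropy(motivated))
print(conditional_entropy(arbitrary))
```

Under this reading, a fully motivated system has zero conditional entropy, while a system whose acts cannot be predicted from stated motivations has maximal conditional entropy for its alphabet.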

[NLP-49] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

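Quick read: This paper addresses the inefficiency of validating code-based algorithms that identify health outcomes in large healthcare databases, in particular the time and resources consumed when reviewers manually read free-text notes from electronic health records (EHR) to create reference-standard labels. The solution rests on two mechanisms: natural language processing (NLP) to reduce the time a human reviewer spends on each chart, and a multi-wave adaptive sampling strategy with a pre-defined stopping rule that ends the validation study once measurement characteristics are estimated with sufficient precision, which substantially reduces the number of charts that need review.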

Link: https://arxiv.org/abs/2507.22943
Authors: Shirley V Wang,Georg Hahn,Sushama Kattinakere Sreedhara,Mufaddal Mahesri,Haritha S. Pillai,Rajendra Aldis,Joyce Lii,Sarah K. Dutcher,Rhoda Eniafe,Jamal T. Jones,Keewan Kim,Jiwei He,Hana Lee,Sengwee Toh,Rishi J Desai,Jie Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Methodology (stat.ME)
Comments:

Abstract:Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
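
The idea of reviewing charts in waves and stopping once precision is sufficient can be sketched as follows. This is a minimal illustration, not the authors' procedure: the Wilson interval, the wave size, the half-width threshold, and the names `wilson_interval` and `multi_wave_review` are all assumptions chosen for the example.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% when z=1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def multi_wave_review(charts, wave_size=50, max_half_width=0.05, seed=0):
    """Review charts wave by wave; stop once the interval is narrow enough.

    `charts` is a list of booleans: True means chart review confirmed the
    outcome flagged by the code-based algorithm (a true positive).
    Returns (charts reviewed, estimated PPV, confidence interval).
    """
    if not charts:
        raise ValueError("no charts to review")
    order = list(charts)
    random.Random(seed).shuffle(order)  # random review order
    reviewed = confirmed = 0
    while reviewed < len(order):
        wave = order[reviewed:reviewed + wave_size]
        confirmed += sum(wave)
        reviewed += len(wave)
        lo, hi = wilson_interval(confirmed, reviewed)
        if (hi - lo) / 2 <= max_half_width:  # pre-defined stopping rule
            break
    return reviewed, confirmed / reviewed, (lo, hi)


# Hypothetical cohort of 1,000 flagged charts, 80% of them true positives.
charts = [True] * 800 + [False] * 200
reviewed, ppv, (lo, hi) = multi_wave_review(charts)
print(reviewed, round(ppv, 3), round(lo, 3), round(hi, 3))
```

With these illustrative settings the loop stops well before all 1,000 charts are reviewed, which is the kind of saving the abstract reports; the specific threshold would in practice come from the study's precision requirements.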

[NLP-50] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology KDD2025 ECML

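Quick read: This paper addresses the difficulty existing survival analysis methods have in handling the large volume of sequential text in electronic health records (EHR) and in capturing its complex temporal dynamics. The proposed SigBERT framework converts timestamped clinical reports into sentence embeddings and applies signature extraction from rough path theory to the resulting time series, deriving geometric features that model how a patient's clinical trajectory evolves over time. These features are fed into a LASSO-penalized Cox proportional hazards model to produce patient-specific risk scores, which the authors report improves survival prediction (C-index = 0.75).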

Link: https://arxiv.org/abs/2507.22941
Authors: Paul Minchella,Loïc Verlingue,Stéphane Chrétien,Rémi Vaucher,Guillaume Metzler
Affiliations: SHAPE-Med@Lyon; Centre de Recherche en Cancérologie de Lyon; Laboratoire ERIC, 5 avenue Pierre Mendes-France, 69676 Bron
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 12 pages, 2 figures, accepted for ECML PKDD 2025

Abstract:Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
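
To make "signature extraction" concrete, the sketch below computes the depth-2 signature of a piecewise-linear path in plain Python, using Chen's identity to append one segment at a time. It is only an illustration: the 2-D toy path stands in for a sequence of sentence embeddings, and the truncation depth and the name `signature_depth2` are assumptions, not details taken from SigBERT.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points.

    Returns (s1, s2): s1[i] is the total increment in coordinate i, and
    s2[i][j] is the iterated integral of coordinate i against coordinate j.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        # Chen's identity for appending a straight segment:
        # new level 2 = old level 2 + (old level 1) x delta + 0.5 * delta x delta
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


# Toy 2-D path: move right, then up.
s1, s2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
levy_area = 0.5 * (s2[0][1] - s2[1][0])  # signed area between path and chord
print(s1, s2, levy_area)
```

The off-diagonal level-2 terms depend on the order in which coordinates change, which is why signature features can capture temporal dynamics that a simple average of embeddings would lose.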

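[NLP-51] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Quick read: This paper addresses the factual errors that large language models (LLMs) commonly make while reasoning: even when the final answer is correct, intermediate steps can contain inaccurate or contradictory information, which can have serious consequences in high-stakes domains such as healthcare, law, and scientific research. The proposed RELIANCE framework has three core components: (1) a fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies in reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module that analyzes how factuality improvements show up in model activations during reasoning. Experiments show that the framework substantially improves factual robustness (by up to 49.90%) while maintaining or improving performance on mainstream benchmarks.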

Link: https://arxiv.org/abs/2507.22940
Authors: Rui Jiao,Yue Zhang,Jinku Li
Affiliations: Xidian University; Shandong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
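
The group-relative part of GRPO can be sketched in a few lines: several responses to the same prompt are scored, and each response's advantage is its reward standardized within that group. The weights in `combined_reward`, the use of the population standard deviation, and both function names are assumptions made for illustration; the abstract does not give RELIANCE's actual reward weighting.

```python
import statistics


def combined_reward(factuality, coherence, structure, weights=(0.6, 0.2, 0.2)):
    """Blend three reward dimensions into one scalar (weights are hypothetical)."""
    w_fact, w_coh, w_struct = weights
    return w_fact * factuality + w_coh * coherence + w_struct * structure


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical responses to one prompt, already scored per dimension.
group = [
    combined_reward(1.0, 1.0, 1.0),
    combined_reward(0.0, 1.0, 1.0),
    combined_reward(0.0, 1.0, 1.0),
    combined_reward(1.0, 1.0, 1.0),
]
advantages = group_relative_advantages(group)
print(advantages)
```

Because advantages are relative to the group mean, responses with factual reasoning are pushed up and the others pushed down without needing a separately trained value model.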

[NLP-52] PARROT: An Open Multilingual Radiology Reports Dataset

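Quick read: This paper addresses the lack of open datasets for natural language processing (NLP) in radiology that are multilingual, multicentric, large, and free of privacy risk. The solution is to build and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a dataset of 2,658 fictional radiology reports from 21 countries in 13 languages, covering multiple imaging modalities (CT, MRI, radiography, ultrasound) and anatomical regions, with ICD-10 codes and metadata attached. A human vs. AI report differentiation study with 154 participants (radiologists, healthcare professionals, and others) accompanies the dataset, which is intended for evaluating NLP models across languages and regions and for supporting trustworthy development and validation of generative AI for radiology text analysis.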

Link: https://arxiv.org/abs/2507.22939
Authors: Bastien Le Guellec,Kokou Adambounou,Lisa C Adams,Thibault Agripnidis,Sung Soo Ahn,Radhia Ait Chalal,Tugba Akinci D Antonoli,Philippe Amouyel,Henrik Andersson,Raphael Bentegeac,Claudio Benzoni,Antonino Andrea Blandino,Felix Busch,Elif Can,Riccardo Cau,Armando Ugo Cavallo,Christelle Chavihot,Erwin Chiquete,Renato Cuocolo,Eugen Divjak,Gordana Ivanac,Barbara Dziadkowiec Macek,Armel Elogne,Salvatore Claudio Fanni,Carlos Ferrarotti,Claudia Fossataro,Federica Fossataro,Katarzyna Fulek,Michal Fulek,Pawel Gac,Martyna Gachowska,Ignacio Garcia Juarez,Marco Gatti,Natalia Gorelik,Alexia Maria Goulianou,Aghiles Hamroun,Nicolas Herinirina,Krzysztof Kraik,Dominik Krupka,Quentin Holay,Felipe Kitamura,Michail E Klontzas,Anna Kompanowska,Rafal Kompanowski,Alexandre Lefevre,Tristan Lemke,Maximilian Lindholz,Lukas Muller,Piotr Macek,Marcus Makowski,Luigi Mannacio,Aymen Meddeb,Antonio Natale,Beatrice Nguema Edzang,Adriana Ojeda,Yae Won Park,Federica Piccione,Andrea Ponsiglione,Malgorzata Poreba,Rafal Poreba,Philipp Prucker,Jean Pierre Pruvo,Rosa Alba Pugliesi,Feno Hasina Rabemanorintsoa,Vasileios Rafailidis,Katarzyna Resler,Jan Rotkegel,Luca Saba,Ezann Siebert,Arnaldo Stanzione,Ali Fuat Tekin,Liz Toapanta Yanchapaxi,Matthaios Triantafyllou,Ekaterini Tsaoulia,Evangelia Vassalou,Federica Vernuccio,Johan Wasselius,Weilang Wang,Szymon Urban,Adrian Wlodarczak,Szymon Wlodarczak,Andrzej Wysocki,Lina Xu,Tomasz Zatonski,Shuhang Zhang,Sebastian Ziegelmayer,Gregory Kuchcinski,Keno K Bressem
Affiliations: Lille University Hospital, Salengro Hospital, Lille, France; Université Lille, Lille, France; Campus University Hospital Centre, Department of Radiology & Medical Imaging, Lomé, Togo; Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany; University Hospital Timone (AP-HM), Marseille, France; Yonsei University College of Medicine, Seoul, South Korea; Bab El-Oued University Hospital, Algiers, Algeria; University Hospital Basel, Basel, Switzerland; University Children’s Hospital Basel, Basel, Switzerland; Pasteur Institute of Lille, Inserm, Lille University, Lille, France; Skåne University Hospital, Lund, Sweden; Institute of AI & Informatics in Medicine (AIIM), TUM University Hospital, Technical University of Munich, Munich, Germany; University of Palermo, Palermo, Italy; Medical Center – University of Freiburg, Faculty of Medicine, Freiburg, Germany; Azienda Ospedaliero-Universitaria (A.O.U.) di Cagliari, Monserrato, Cagliari, Italy; Istituto Dermopatico dell’Immacolata (IDI) IRCCS, Rome, Italy; Hôpital Instruction des Armées, Libreville, Gabon; Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico; University of Salerno, Baronissi, Italy; University Hospital Dubrava, Zagreb, Croatia; Wroclaw Medical University, Wroclaw, Poland; Hôpital Militaire d’Abidjan, Abidjan, Côte d’Ivoire; University of Pisa, Pisa, Italy; CEMIC “Norberto Quirno”, Buenos Aires, Argentina; Catholic University “Sacro Cuore”, Rome, Italy; ASST Fatebenefratelli Sacco, Milan, Italy; Department and Clinic of Otolaryngology, Head & Neck Surgery, Wroclaw Medical University, Wroclaw, Poland; Department and Clinic of Diabetology, Hypertension & Internal Diseases, Institute of Internal Diseases, Wroclaw Medical University, Wroclaw, Poland; 4th Military Hospital, Wroclaw, Poland; Department of Pediatrics, Klodzko County Hospital, Klodzko, Poland; Orthopedics & Traumatology Department, Specialist Medical Centre, Polanica-Zdrój, Poland; Faculty of Medicine, Wroclaw Medical University, Wroclaw, Poland; Department of Radiology, Charité University Hospital Berlin, Berlin, Germany; University Medical Center – Johannes Gutenberg-University Mainz, Mainz, Germany; Department of Surgical Sciences, Radiology Unit, University of Turin, Turin, Italy; University of Naples Federico II, Naples, Italy; Department of Medical Imaging, University Hospital of Heraklion, Heraklion, Greece; AI & Translational Imaging Lab, Department of Radiology, University of Crete, Heraklion, Greece; Department of Radiology, McGill University Health Center, Montreal, Quebec, Canada; Department of Advanced Biomedical Sciences, University of Naples Federico II, Naples, Italy; Tanambao University Hospital, Antsiranana, Madagascar; Department of Radiology, Morafeno Toamasina University Hospital, Toamasina, Madagascar; Başakşehir Çam & Sakura City Hospital, Istanbul, Turkey; Department of Radiology, Aristotle University of Thessaloniki, AHEPA University General Hospital, Thessaloniki, Greece; Sir Charles Gairdner Hospital, Perth, Australia; Zhongda Hospital, Southeast University, Nanjing, China; The Copper Health Center, Lubin, Poland; Department of Cardiology, The Copper Health Center, Lubin, Poland; Department of Paralympic Sport, Wroclaw University of Health and Sport Sciences, Wroclaw, Poland; Dresden International University, Dresden, Germany; Department of Neuroradiology, Lille University Hospital, Salengro Hospital, Lille, France; U1172 LilNCog, Lille Neuroscience & Cognition, Université Lille, Lille, France; Bunkerhill Health, San Francisco, CA, USA; Universidade Federal de São Paulo (UNIFESP), São Paulo, Brazil; Department of Radiology, Başakşehir Çam & Sakura City Hospital, Istanbul, Turkey; Department of Radiology, Zhongda Hospital, Southeast University, Nanjing, China; Department of Diagnostic & Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

[NLP-53] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents KDD2025

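Quick read: This paper addresses question answering (QA) over technical documents when the answer lies in a figure such as a flowchart, a case that text-based Retrieval Augmented Generation (RAG) systems cannot handle well. The key idea is to use Visual Large Language Models (VLMs) to interpret flowchart images and produce structured graph representations, then feed those representations into the text embedding pipeline so that retrieval can draw on both text and figures. Experiments show that graph representations from a fine-tuned VLM have a lower edit distance to the ground truth, that retrieval performs well on a telecom-specific QA dataset, and that no VLM is needed at inference time, which lowers deployment cost.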

Link: https://arxiv.org/abs/2507.22938
Authors: Sumit Soman,H. G. Ranjani,Sujoy Roychowdhury,Venkata Dharma Surya Narayana Sastry,Akshat Jain,Pranav Gangrade,Ayaaz Khan
Affiliations: Ericsson
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the KDD 2025 Workshop on Structured Knowledge for Large Language Models

Abstract:Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
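
One way to compare a predicted flowchart graph with its ground truth is a Levenshtein edit distance over serialized edges. The sketch below is an assumption about how such a comparison could look; the abstract does not state which edit distance or which serialization the authors use, and the edge strings and the name `edit_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


# Hypothetical edge lists extracted from one flowchart image.
predicted = ["Start->Check", "Check->Retry", "Check->Done"]
truth = ["Start->Check", "Check->Done"]
print(edit_distance(predicted, truth))
```

Treating each edge as one token means a spurious or missing edge costs one edit, regardless of how long its label is; a character-level comparison of the serialized text would weight long labels more heavily.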

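[NLP-54] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Quick read: This paper addresses the problem that a single model in AIOps (AI for IT operations), limited by domain-specific knowledge, can handle only one kind of task, such as log parsing or root cause analysis. The solution is a Collaboration-of-Expert framework (CoE-Ops) that adds a general-purpose large language model task classifier and a Retrieval-Augmented Generation (RAG) mechanism, so that both high-level tasks (Code, Build, Test) and low-level tasks (fault analysis, anomaly detection) can be handled in one system. Experiments show a 72% improvement in routing accuracy for high-level tasks over existing CoE methods, up to 8% higher accuracy than single AIOps models on DevOps problem resolution, and up to 14% higher accuracy than larger Mixture-of-Experts (MoE) models.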

Link: https://arxiv.org/abs/2507.22937
Authors: Jinkun Zhao,Yuanshuai Wang,Xingjian Zhang,Ruibo Chen,Xingchuang Liao,Junle Wang,Lei Huang,Kui Zhang,Wenjun Wu
Affiliations: SKLCCSE, Institute of Artificial Intelligence, Beihang University; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirement of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, which has been proved in both previous ensemble learning and the recent LLM training domain. Inspired by these works, to address the similar challenges in AIOps, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework’s capability in handling both high-level (Code, Build, Test, etc.) and low-level (fault analysis, anomaly detection, etc.) Question-Answering tasks. Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.

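[NLP-55] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Quick read: This paper addresses the lack of systematic comparisons among widely used large language models (LLMs) in Financial Natural Language Processing (FinNLP). The approach is to build a set of domain-specific prompts and apply three complementary evaluation methods: human annotation, automated lexical-semantic metrics (ROUGE, cosine similarity, Jaccard similarity), and model behavior diagnostics (prompt-level variance and across-model similarity). These are used to assess five LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) on 10-K filings from the "Magnificent Seven" technology companies, revealing differences in coherence, semantic alignment, contextual relevance, and output stability.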

Link: https://arxiv.org/abs/2507.22936
Authors: Md Talha Mohsin
Affiliations: University of Tulsa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Computational Finance (q-fin.CP)
Comments: 22 Pages, 6 Tables, 7 Figures

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the ‘Magnificent Seven’ technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
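
Two of the automated metrics named above, Jaccard and cosine similarity, can be sketched with the standard library alone. This is a bag-of-words illustration under our own assumptions: whitespace tokenization and raw term counts are simplifications, the paper may well compute cosine similarity on embeddings instead, and the two example answers are invented.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between term-count vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical answers from two models to the same 10-K prompt.
answer_a = "revenue grew in cloud services"
answer_b = "cloud revenue grew strongly"
print(jaccard(answer_a, answer_b), cosine(answer_a, answer_b))
```

Jaccard ignores how often a word appears, while cosine on counts does not, so the two can disagree on repetitive answers; reporting both, as the study does, gives a fuller picture of lexical overlap.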

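[NLP-56] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Quick read: This paper addresses two core challenges in deriving operational intelligence from organizational data repositories: the tension between data confidentiality and data integration objectives, and the limits of general Natural Language Processing (NLP) tools when applied to the knowledge structure of specific domains such as operations and maintenance. The approach builds a Knowledge Graph and breaks Knowledge Extraction into four functional components: Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction. On that basis the authors evaluate the zero-shot performance of sixteen NLP tools alongside Large Language Models (LLMs), focusing on trusted aircraft-industry use cases and on tools that run in a controlled environment without sending data to third parties. Using a baseline dataset derived from public US Federal Aviation Administration (FAA) data, the study finds significant performance limitations, discusses the tools' Technical Readiness Level, offers recommendations for improving trust, and releases an open-source curated dataset for further baseline testing and evaluation.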

Link: https://arxiv.org/abs/2507.22935
Authors: Kathleen Mealey,Jonathan A. Karr Jr.,Priscila Saboia Moreira,Paul R. Brenner,Charles F. Vardeman II
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

[NLP-57] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

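Quick read: This survey addresses the problem that traditional intent recognition, which relies on text alone, struggles to capture user intent fully as demand for natural human-computer interaction grows. It traces the progression of deep learning methods from unimodal to multimodal approaches, with particular attention to the breakthroughs that Transformer-based models have brought to fusing audio, visual, and physiological signals, and it offers researchers an overview of multimodal intent recognition (MIR) together with directions for future work.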

Link: https://arxiv.org/abs/2507.22934
Authors: Jingwei Zhao,Yuhua Wen,Qifei Li,Minchi Hu,Yingying Zhou,Jingyao Xue,Junyang Wu,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to ACM Computing Surveys

Abstract:Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

[NLP-58] Augmented Vision-Language Models: A Systematic Review

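Quick read: This review addresses the limitations of current Vision-Language Models (VLMs) in interpretability, in integrating new knowledge efficiently, and in logical reasoning. The central approach it examines is to combine pre-trained VLMs with external symbolic information systems, forming neural-symbolic systems that strengthen reasoning and memory, absorb new knowledge without extensive retraining, and produce more interpretable outputs.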

Link: https://arxiv.org/abs/2507.22933
Authors: Anthony C Davis,Burhan Sadiq,Tianmin Shu,Chien-Ming Huang
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

[NLP-59] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

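Quick read: This paper addresses the difficulty traditional portfolio optimization has in combining heterogeneous information sources, such as financial news sentiment and market indicators, in order to make returns more stable. The solution is a hierarchical reinforcement learning framework: lightweight Large Language Models (LLMs) extract sentiment signals from news, and these are fused with traditional market indicators. On top of this sits a three-tier agent structure in which base Reinforcement Learning (RL) agents process the hybrid data, meta-agents aggregate their decisions, and a super-agent makes the final allocation based on market data and sentiment analysis, giving a scalable and stable decision mechanism.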

Link: https://arxiv.org/abs/2507.22932
Authors: Baptiste Lefort,Eric Benhamou,Beatrice Guez,Jean-Jacques Ohana,Ethan Setrouk,Alban Etienne
Affiliations: Ai For Alpha; Centrale Supélec; Paris Dauphine PSL
Categories: Computation and Language (cs.CL); General Finance (q-fin.GN)
Comments: 8 pages

Abstract:This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
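
For readers less familiar with the headline metric, the sketch below computes an annualized Sharpe ratio from a series of periodic returns. The 252-trading-day convention, the zero risk-free rate, the sample standard deviation, and the toy return series are assumptions for illustration; the abstract does not say how the reported Sharpe ratio of 1.2 was calculated.

```python
import math


def annualized_sharpe(returns, periods_per_year=252, risk_free=0.0):
    """Annualized Sharpe ratio from per-period simple returns.

    `risk_free` is an annual rate, spread evenly across periods.
    """
    n = len(returns)
    excess = [r - risk_free / periods_per_year for r in returns]
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)  # sample variance
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


# Hypothetical daily returns, far too few for a meaningful estimate.
daily_returns = [0.01, -0.005, 0.01, -0.005]
sharpe = annualized_sharpe(daily_returns)
print(round(sharpe, 2))
```

Scaling by the square root of the number of periods assumes returns are independent across days, which is a common simplification rather than a property of real markets.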

[NLP-60] Enhancing RAG Efficiency with Adaptive Context Compression

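Quick read: This paper addresses the high inference cost that retrieval-augmented generation (RAG) systems incur from lengthy retrieved contexts. Existing context compression methods use a fixed compression rate that cannot adapt to input complexity, so they over-compress simple queries or under-compress complex ones. The proposed framework, Adaptive Context Compression for RAG (ACC-RAG), combines a hierarchical compressor (for multi-granular embeddings) with a context selector and adjusts the compression rate dynamically to keep only the minimal sufficient information. It delivers more than 4 times faster inference than standard RAG while maintaining or improving accuracy.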

Link: https://arxiv.org/abs/2507.22931
Authors: Shuyu Guo,Zhaochun Ren
Affiliations: Shandong University; Leiden University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.

[NLP-61] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

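Quick read: This paper addresses the privacy risks and online harms that arise when users self-disclose Personal Information Identifiers (PIIs) on social platforms, and in particular the lack of open labeled datasets that makes such research hard to conduct and reproduce. The key contribution is a new synthetic data generation method that uses three Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to produce synthetic text resembling original Reddit posts, together with a taxonomy of 19 PII-revealing categories and a multi-text span labeled dataset. The synthetic data is evaluated on three criteria: reproducibility equivalence (models trained on it perform comparably to models trained on the original posts), unlinkability (it cannot be traced to the original users through common mechanisms such as Google Search), and indistinguishability (trained humans cannot tell it apart from the original). The result is a safe, reproducible data foundation for research on PII privacy risks.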

链接: https://arxiv.org/abs/2507.22930
作者: Shalini Jangra,Suparna De,Nishanth Sastry,Saeed Fadaei
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 15 pages, 4 Figures, Accepted in “The 17th International Conference on Advances in Social Networks Analysis and Mining -ASONAM-2025”

点击查看摘要

Abstract:Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at this https URL to foster reproducible research into PII privacy risks in online social media.

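[NLP-62] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Quick read: This paper addresses the limited accuracy of Medical Large Language Models (MLLMs) in ophthalmic diagnosis caused by hallucinations. The hallucinations stem from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which together impede precise lesion detection and disease diagnosis. In response, the authors introduce EH-Benchmark, a new ophthalmology benchmark for systematically evaluating hallucination types in MLLMs, and design an agent-centric, three-phase framework: Knowledge-Level Retrieval, Task-Level Case Studies, and Result-Level Validation. The key is structured, stage-by-stage intervention that strengthens the model's understanding of visual information and its logical reasoning, which significantly reduces both Visual Understanding and Logical Composition hallucinations and improves diagnostic accuracy, interpretability, and reliability.

Link: https://arxiv.org/abs/2507.22929
Authors: Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Affiliations: Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR); Centre for Innovation and Precision Eye Health; and Department of Ophthalmology, NUHS Tower Block, Level 7, 1E Kent Ridge Road, Singapore, 119228; Singapore Eye Research Institute, Singapore National Eye Centre, 20 College Road, Singapore, 169856
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 9 figures, 5 tables. submit/6621751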

Abstract:Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs’ hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at this https URL.

[NLP-63] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

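Quick read: This paper asks whether chain-of-thought (CoT) prompting truly reflects the internal reasoning of Large Language Models (LLMs), that is, whether the “thoughts” produced under CoT are causally faithful. To answer this, the authors propose a feature-level causal analysis whose key idea is to combine sparse autoencoders with activation patching: monosemantic features are extracted from Pythia-70M and Pythia-2.8B, and internal activation patterns are compared under CoT and plain (noCoT) prompting on GSM8K math problems. Swapping a small set of CoT-reasoning features into a noCoT run significantly raises answer log-probabilities in the 2.8B model but has no reliable effect in the 70M model, revealing a clear scale threshold. CoT also yields higher activation sparsity and feature interpretability scores in the larger model, indicating that it induces more modular internal computation. These results support CoT as an effective structured prompting method and provide the first feature-level empirical evidence on its causal faithfulness.

Link: https://arxiv.org/abs/2507.22928
Authors: Xi Chen, Aske Plaat, Niki van Stein
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: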

Abstract:Chain-of-thought (CoT) prompting boosts Large Language Models accuracy on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

[NLP-64] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

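Quick read: This paper addresses the lack of fine-grained analysis of LLM-specific capabilities in current evaluations of Retrieval-Augmented Generation (RAG) systems, in particular the absence of systematic assessment of document utilization. Existing benchmarks mostly target overall performance and noise robustness, and do not characterize how LLMs perform at multi-level filtering, document combination, and reference reasoning. The key to the solution is Placeholder-RAG-Benchmark, a multi-level fine-grained evaluation framework, together with a placeholder-based approach that decouples the contribution of the LLM's parametric knowledge from that of external knowledge. This makes it possible to quantify the LLM's role and limitations in RAG generation, exposing weaknesses of representative LLMs in error resilience and context faithfulness, and provides a reproducible benchmark for building more reliable and efficient RAG systems.

Link: https://arxiv.org/abs/2507.22927
Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: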

Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM’s ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs’ roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM’s parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system’s generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in this https URL.

[NLP-65] Multi-Relation Extraction in Entity Pairs using Global Context

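Quick read: This paper addresses insufficient global-context modeling in document-level relation extraction, where entities appear multiple times in a document and their relationships can shift between contexts. Earlier methods look only at the sentences in which entities are mentioned and therefore miss the cross-sentence, cross-paragraph document context, which hurts relation prediction accuracy. The key to the solution is a new input embedding approach that represents entities as standalone segments, independent of their positions within the document, so that each entity's distribution across the document and its global relationships are modeled explicitly and multi-sentence reasoning is strengthened. Experiments on three benchmark datasets, DocRED, Re-DocRED, and REBEL, show that the method accurately predicts entity relationships in a document-level setting.

Link: https://arxiv.org/abs/2507.22926
Authors: Nilesh, Atul Gupta, Avinash C Panday
Affiliations: PDPM IIITDM Jabalpur, India
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 9 figures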

Abstract:In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.

[NLP-66] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

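Quick read: This paper addresses the weak structured organization and inefficient retrieval that LLM Agents face in long-term memory management. Existing methods, such as similarity search over dense vectors or graph-structured knowledge organization, have advanced memory storage and retrieval but fall short in structuring memory by level of semantic abstraction and in locating entries quickly. The key to the solution is a Hierarchical Memory (H-MEM) architecture that organizes memory across multiple levels of semantic abstraction and embeds in each memory vector a positional index encoding that points to its semantically related sub-memories in the next layer. At reasoning time, an index-based routing mechanism retrieves layer by layer without exhaustive similarity computation, improving decision coherence and context awareness.

Link: https://arxiv.org/abs/2507.22925
Authors: Haoran Sun, Shaoning Zeng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: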

Abstract:Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
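
As a rough illustration of the index-based routing described in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the `MemoryNode` class, the `route` function, cosine scoring, and the toy vectors are all assumptions made for illustration. The point it shows is that only the children of the nodes kept at one layer are scored at the next, so no layer is searched exhaustively below the top.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class MemoryNode:
    text: str
    vector: list
    # Hypothetical "positional index": positions of related nodes in the next layer.
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(layers, query, top_k=1):
    """Walk the hierarchy top-down, scoring only children of the kept nodes."""
    if not layers:
        return []
    candidates = list(range(len(layers[0])))
    for depth, layer in enumerate(layers):
        ranked = sorted(candidates,
                        key=lambda i: cosine(layer[i].vector, query),
                        reverse=True)
        kept = ranked[:top_k]
        if depth == len(layers) - 1:
            return [layer[i].text for i in kept]
        # Follow the stored indices instead of scanning the whole next layer.
        candidates = [c for i in kept for c in layer[i].children]
        if not candidates:
            return []
    return []


layers = [
    [MemoryNode("travel", [1.0, 0.0], [0, 1]),
     MemoryNode("cooking", [0.0, 1.0], [2])],
    [MemoryNode("trip to Kyoto in May", [0.9, 0.1]),
     MemoryNode("flight delayed in Oslo", [0.7, 0.3]),
     MemoryNode("sourdough recipe", [0.1, 0.9])],
]
print(route(layers, [1.0, 0.0]))
```

With `top_k=1` the walk is greedy; a larger `top_k` trades extra similarity computations for a lower chance of pruning the right branch early.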

[NLP-67] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

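Quick read: This paper examines how native versus non-native English speaker status affects international students' experience of peer feedback in online computer science (CS) courses taught in English. Native English speakers rate peer feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. After controlling for sex and age, significant interactions emerge between language background and feedback experience, suggesting that language background plays a modest but complex role. The key to the approach is a Twitter-roBERTa-based sentiment analysis model, used to quantify and compare the sentiment expressed and received in peer feedback by students of different language backgrounds.

Link: https://arxiv.org/abs/2507.22924
Authors: Brittney Exline, Melanie Duffin, Brittany Harbison, Chrissa da Gomez, David Joyner
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: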

Abstract:Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master’s degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students’ language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.

[NLP-68] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting KDD’25

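Quick read: This paper addresses language mismatch in multilingual Retrieval-Augmented Generation (RAG) systems: knowledge bases (KBs) are often shared from high-resource languages such as English, so retrieved information is in a different language from the user's query and the rest of the context, which degrades performance. The key to the approach is a systematic evaluation of how different prompt translation strategies affect classification tasks. The authors find that an optimized prompting strategy (choosing between cross-lingual prompting and pre-translation into a monolingual prompt) can significantly improve knowledge sharing into low-resource languages and thus downstream task performance. The study argues for broader use of cross-lingual prompt optimization and multilingual resource sharing for non-English languages.

Link: https://arxiv.org/abs/2507.22923
Authors: Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, Anurag Beniwal
Affiliations: Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Prompt Optimization KDD '25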

Abstract:Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages, therefore improve the performance on the downstream classification task. The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.

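[NLP-69] Predicting stock prices with ChatGPT-annotated Reddit sentiment

Quick read: This paper asks whether sentiment on social media can meaningfully predict stock market movements, especially on platforms where retail investors are active, such as Reddit's r/wallstreetbets. The study focuses on two companies, GameStop (GME) and AMC Entertainment (AMC), and evaluates how well different text-based sentiment analysis methods capture the link between online sentiment and price changes. The key to the approach is a ChatGPT-annotated, fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis common on social media, compared against two existing text-based sentiment analysis methods. The results show that sentiment has only a weak correlation with stock prices, whereas simpler metrics such as comment volume and Google search trends carry stronger predictive signals, highlighting the complexity of retail investor behavior and the limits of traditional sentiment analysis in capturing what moves markets.

Link: https://arxiv.org/abs/2507.22922
Authors: Mateusz Kmak, Kamil Chmurzyński, Kamil Matejuk, Paweł Kotzbach, Jan Kocoń
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: International Conference on Computational Science 2025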

Abstract:The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit’s r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment’s role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models’ predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.

[NLP-70] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

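Quick read: This paper addresses the high computational cost of language models in knowledge extraction and their tendency to produce information that does not exist (hallucinations), which wastes resources and reduces reliability. The key to the solution is the Language Model Chain (LMC) algorithm: a model's response about a given text counts as correct only if it appears in the collection of candidate answers; text whose response is incorrect is passed to a more predictive but slower language model, forming a multi-stage cascade that repeats until the models are exhausted or all predictions are correct. Applied to extracting patient dates of birth from medical documents, LMC significantly improved prediction speed and accuracy over individual language models while greatly reducing hallucinations.

Link: https://arxiv.org/abs/2507.22921
Authors: Lee Harris
Affiliations: The University of Kent; TMLEP & Elysium Web Services, Ashford, Kent, UK
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: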

Abstract:Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model’s response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
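
The cascade described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's code: `fast_model` and `slow_model` are hypothetical stand-ins for real language models, candidate answers are assumed to be date strings found in the text by a regular expression, and the sketch handles one document at a time.

```python
import re

DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def extract_candidates(text):
    """Assumed candidate set: every date-like string present in the text."""
    return set(DATE.findall(text))


def run_chain(models, text, candidates):
    """Try models from fastest to slowest; accept the first answer that is a candidate."""
    for name, model in models:
        answer = model(text)
        if answer in candidates:
            return answer, name
    return None, None  # no model produced an answer from the candidate set


def fast_model(text):
    # Hypothetical cheap model that returns a date not present in the text.
    return "01/01/1900"


def slow_model(text):
    # Hypothetical slower model, stubbed here with a pattern match.
    match = re.search(r"DOB[:\s]+(\d{2}/\d{2}/\d{4})", text)
    return match.group(1) if match else ""


doc = "Patient seen 03/04/2021. DOB: 12/05/1980."
chain = [("fast", fast_model), ("slow", slow_model)]
print(run_chain(chain, doc, extract_candidates(doc)))
```

Because an answer outside the candidate set is never accepted, a fabricated value can only cost an extra model call; it cannot reach the output. Note that the check cannot tell apart two candidates that both appear in the text, such as the visit date and the date of birth here.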

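[NLP-71] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Quick read: This paper addresses the lack of efficient, compatible discrete representations when Large Language Models (LLMs) process continuous multimodal data. The core challenge is how to turn continuous features from different modalities (such as images and audio) into discrete tokens suitable for LLMs via vector quantization (VQ), gaining computational efficiency while staying compatible with LLM architectures. The key contribution is the first structured taxonomy and analysis of discrete tokenization methods for LLMs: it categorizes 8 representative VQ variants spanning classical and modern paradigms and analyzes their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. It also identifies key challenges, including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints, and discusses emerging directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning, serving as a foundational reference for efficient and generalizable multimodal systems.

Link: https://arxiv.org/abs/2507.22920
Authors: Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji, Menglin Yang, Irwin King, Ming-Hsuan Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: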

Abstract:The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: this https URL.
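
For readers new to vector quantization, the basic lookup step that underlies discrete tokenization can be sketched as follows. This is only the textbook nearest-codeword assignment, not any specific variant from the survey; the `quantize` and `tokenize` names and the toy codebook are assumptions for illustration.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared Euclidean distance)."""
    def dist2(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def tokenize(features, codebook):
    """Map a sequence of continuous feature vectors to discrete token ids."""
    return [quantize(f, codebook) for f in features]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]]
print(tokenize(features, codebook))  # [0, 1, 2]
```

The resulting ids can be treated like ordinary vocabulary entries by a language model. Codebook collapse, mentioned in the abstract, corresponds to most inputs mapping to only a few of these indices.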

[NLP-72] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

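Quick read: This paper addresses the lack of accurate predictions of serious adverse event (SAE) results in clinical trials, which can lead to poorly designed trials, exposure of participants to unnecessary risks, and early termination. The key to the solution is to use only text from trial registrations: features are extracted with pretrained language models (e.g., ClinicalT5, BioBERT) through transfer learning, and a sliding window method handles texts that exceed model input limits. On top of these features, a classifier predicts whether the experimental arm has a higher SAE rate than the control arm (77.6% AUC), and a regression model predicts the proportion of SAEs in the control arm (RMSE of 18.6%). The sliding window approach consistently outperformed methods without it.

Link: https://arxiv.org/abs/2507.22919
Authors: Qixuan Hu, Xumou Zhang, Jinman Kim, Florence Bourgeois, Adam G. Dunn
Affiliations: University of Sydney; Boston Children’s Hospital; Harvard Medical School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: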

Abstract:Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit exposing participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from this http URL with structured summary results. Two prediction models were developed: a classifier predicting will experimental arm have higher SAE rates (area under the receiver operating characteristic curve; AUC) than control arm, and a regression model to predict the proportion of SAEs in control arms (root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with downstream model for prediction. To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC predicting which trial arm has a higher proportion of patients with SAEs. When predicting proportion of participants experiencing SAE in the control arm, the same model achieved RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at this http URL remains underutilised. The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
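
The sliding window idea for texts longer than a model's input limit can be sketched as below. The abstract does not say how window embeddings are combined, so the mean pooling here is an assumption, as are the function names, the window size and stride, and the toy `embed` callable standing in for a pretrained encoder.

```python
def sliding_windows(tokens, size, stride):
    """Split `tokens` into overlapping windows of at most `size` items that cover the whole text."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(tokens) <= size:
        return [tokens]
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return windows


def pooled_embedding(tokens, embed, size, stride):
    """Embed each window separately, then average the vectors (pooling choice is assumed)."""
    vectors = [embed(w) for w in sliding_windows(tokens, size, stride)]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy encoder: a real pipeline would call a pretrained language model here.
toy_embed = lambda window: [float(len(window)), float(sum(window))]
print(sliding_windows(list(range(10)), size=4, stride=3))
print(pooled_embedding(list(range(10)), toy_embed, size=4, stride=3))
```

A stride smaller than the window size makes consecutive windows overlap, so a sentence cut at one window boundary still appears whole in a neighboring window.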

[NLP-73] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs ACL2025

Quick read: This paper asks whether language models of different scales still converge on similar internal concepts, that is, whether feature spaces remain universal across model sizes. The key to the approach is dictionary learning with Sparse Autoencoders (SAEs) on the residual-stream activations of Gemma-2 models: monosemantic features are aligned via activation correlation, and the matched feature spaces are compared using singular vector canonical correlation analysis (SVCCA) and representational similarity analysis (RSA). Middle layers show the strongest feature overlap, supporting substantial feature universality at the semantic level across model scales.

Link: https://arxiv.org/abs/2507.22918
Authors: Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O’Brien
Affiliations: University of Virginia; University of California, Los Angeles; Columbia University; University of California, San Diego; Algoverse AI Research; Algoverse Academy
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to ACL 2025 Student Research Workshop (poster)

Abstract:We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model’s residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.

[NLP-74] Reading Between the Timelines: RAG for Answering Diachronic Questions

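Quick read: This paper addresses a limitation of Retrieval-Augmented Generation (RAG) systems on longitudinal queries: they cannot effectively track how entities and phenomena evolve over time. Conventional retrieval based on semantic matching struggles to gather evidence that is both topically relevant and temporally coherent, so generated answers lack an accurate model of the time dimension. The key to the solution is to redesign the RAG pipeline around temporal logic: the user's query is first disentangled into its core subject and its temporal window; a specialized retriever then calibrates semantic matching against temporal relevance to collect a contiguous evidence set spanning the entire queried period. This substantially improves answer accuracy on complex questions that require analysis of long-term evolution.

Link: https://arxiv.org/abs/2507.22917
Authors: Kwun Hang Lau, Ruiyuan Zhang, Weijie Shi, Xiaofang Zhou, Xiaojun Cheng
Affiliations: Hong Kong University of Science and Technology; China Unicom (Hong Kong) Operation Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: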

Abstract:While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user’s query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at this https URL.
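
One simple way to picture retrieval that is both topically relevant and temporally coherent is to bucket documents by period and take the best semantic match from each bucket. The sketch below does exactly that; it is a stand-in for the idea, not the paper's retriever. The yearly buckets, dot-product scoring, the dictionary layout of `docs`, and the function names are all assumptions.

```python
from datetime import date


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_over_window(docs, query_vec, start, end, per_year=1):
    """Keep documents dated within [start, end], then pick the top matches for each year."""
    in_window = [d for d in docs if start <= d["date"] <= end]
    by_year = {}
    for d in in_window:
        by_year.setdefault(d["date"].year, []).append(d)
    evidence = []
    for year in sorted(by_year):  # chronological order
        ranked = sorted(by_year[year],
                        key=lambda d: dot(d["vec"], query_vec),
                        reverse=True)
        evidence.extend(ranked[:per_year])
    return evidence


docs = [
    {"date": date(2019, 3, 1), "vec": [1.0, 0.0], "text": "revenue report 2019"},
    {"date": date(2020, 6, 1), "vec": [0.9, 0.1], "text": "revenue report 2020"},
    {"date": date(2020, 7, 1), "vec": [0.1, 0.9], "text": "unrelated 2020 story"},
    {"date": date(2021, 2, 1), "vec": [0.8, 0.2], "text": "revenue report 2021"},
    {"date": date(2023, 1, 1), "vec": [1.0, 0.0], "text": "outside the window"},
]
hits = retrieve_over_window(docs, [1.0, 0.0], date(2019, 1, 1), date(2021, 12, 31))
print([h["text"] for h in hits])
```

A plain top-k search over the same corpus could return several documents from a single year; bucketing guarantees that every year with any in-window document contributes evidence.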

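[NLP-75] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Quick read: This paper addresses hallucination in Large Language Models (LLMs), the generation of content that is not faithful to the input or to real-world facts. The key to the solution is a systematic framework at both the theoretical and practical levels. On the theoretical side, it distinguishes intrinsic from extrinsic hallucinations and derives bounds on hallucination risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). On the practical side, it proposes a unified detection and mitigation workflow that combines detection strategies (token-level uncertainty estimation, confidence calibration, attention alignment checks) with mitigation methods (retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and fact-verification modules), along with evaluation protocols for quantifying hallucination.

Link: https://arxiv.org/abs/2507.22915
Authors: Esmail Gumaan
Affiliations: University of Sana’a
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages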

Abstract:Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.

[NLP-76] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

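Quick read: This paper addresses insufficient context matching in knowledge graph (KG) integration, that is, how to align entities and triples across KGs that differ in source, size, and information density. Current methods concentrate on schema and identity matching and neglect the complexity of context, which limits performance in real-world scenarios. The key to the solution is a method that combines label matching with triple matching: entity and predicate labels are first aligned using string manipulation, fuzzy matching, and vector similarity; mappings between triples that convey comparable information are then identified and used to improve entity-matching accuracy. The method is competitive with leading systems in the OAEI competition and with supervised methods, and the authors introduce a new dataset for evaluating the triple-matching step more comprehensively.

Link: https://arxiv.org/abs/2507.22914
Authors: Victor Eiti Yamamoto, Hideaki Takeda
Affiliations: National Institute of Informatics; Graduate University for Advanced Studies, SOKENDAI
Categories: Computation and Language (cs.CL)
Comments: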

Abstract:Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
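
The label-matching step mentioned in the abstract (string manipulation plus fuzzy matching) might look roughly like the sketch below. It uses Python's standard `difflib` as the fuzzy matcher and leaves out vector similarity entirely. The normalization rules, the 0.85 threshold, and the function names are assumptions, not details taken from the paper.

```python
import re
from difflib import SequenceMatcher


def normalize(label):
    """String manipulation: split camelCase, unify separators, lowercase."""
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    label = re.sub(r"[_\-]+", " ", label)
    return " ".join(label.lower().split())


def match_labels(source, target, threshold=0.85):
    """For each source label, keep the most similar target label if it clears the threshold."""
    matches = {}
    for s in source:
        best, best_score = None, 0.0
        for t in target:
            score = SequenceMatcher(None, normalize(s), normalize(t)).ratio()
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            matches[s] = best
    return matches


print(match_labels(["birthPlace", "date-of-birth", "spouse"],
                   ["birth_place", "dateOfBirth"]))
```

Labels that match no target above the threshold (here `spouse`) are left unaligned; in a pipeline like the one the abstract describes, such cases are where triple-level evidence would be expected to help.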

[NLP-77] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

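Quick read: This paper addresses the tendency of large language models (LLMs) to over-generate and hallucinate in subject analysis of books, as well as the difficulty traditional machine learning (ML) models have with unseen cases. The key to the solution is a hybrid framework that combines embedding-based ML models with LLMs: the ML models first predict the optimal number of Library of Congress Subject Headings (LCSH) labels to guide LLM generation, and a post-editing step then replaces the LLM's predicted terms with actual LCSH terms, which mitigates hallucinations and keeps the output aligned with the controlled vocabulary.

Link: https://arxiv.org/abs/2507.22913
Authors: Jinyu Liu, Xiaoying Song, Diana Zhang, Jason Thomale, Daqing He, Lingzi Hong
Institutions: University of North Texas; University of Pittsburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures, accepted by ASIST 2025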

Abstract:Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
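
The post-editing idea, snapping each LLM-predicted term to the nearest term in a controlled vocabulary and keeping at most the predicted number of labels, can be sketched as follows. This is only an illustration under stated assumptions: the character-trigram `embed` function is a toy stand-in for a real embedding model, the three-term vocabulary is invented, and `k` is assumed to come from a separate predictor.

```python
from collections import Counter
from math import sqrt


def embed(term: str) -> Counter:
    """Toy embedding: character-trigram counts (a stand-in for a real embedding model)."""
    text = f"  {term.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def post_edit(llm_terms, vocabulary, k):
    """Replace each LLM term with its nearest vocabulary term, deduplicate, keep at most k."""
    vocab_vecs = {term: embed(term) for term in vocabulary}
    edited = []
    for term in llm_terms:
        vec = embed(term)
        nearest = max(vocabulary, key=lambda t: cosine(vec, vocab_vecs[t]))
        if nearest not in edited:
            edited.append(nearest)
    return edited[:k]


if __name__ == "__main__":
    vocab = ["Machine learning", "Libraries", "Cooking"]  # hypothetical mini-vocabulary
    raw = ["machine-learning methods", "library science", "Machine Learning"]
    print(post_edit(raw, vocab, k=2))
```

Because every output term is drawn from the vocabulary, the result cannot contain a heading that does not exist, which is the sense in which post-editing constrains hallucination.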

[NLP-78] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

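Quick read: This paper addresses the detection and classification of illicit marketplace content on the deep and dark web and on platforms such as Telegram and Reddit, which is hard because labeled data is scarce, illicit language keeps evolving, and sources are structurally heterogeneous. The key to the solution is a hierarchical classification framework that combines a fine-tuned language model (ModernBERT) with a semi-supervised ensemble learning strategy. First, a transformer fine-tuned on multi-source illicit content extracts semantic representations of long documents, capturing specialized jargon and ambiguous linguistic patterns. Second, manually engineered features (document structure, embedded patterns such as Bitcoin addresses, emails, and IP addresses, and metadata) complement the embeddings. Finally, classification runs in two stages: the first uses an ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents, and the second classifies them into drug, weapon, or credential sales. On three datasets the method outperforms baselines including BERT and DarkBERT and remains robust under limited supervision.

Link: https://arxiv.org/abs/2507.22912
Authors: Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela, Fang Chen, Amir H. Gandomi
Institutions: University of Technology Sydney; Ghent University; Khazar University; Singapore; Óbuda University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, 9 tables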

Abstract:Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
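
The abstract mentions entropy-based weighted voting but does not give the formula. One plausible reading, sketched below, weights each classifier by how confident its predicted distribution is, using one minus the normalized Shannon entropy. The weighting rule and the function names are assumptions, not the paper's definition.

```python
from math import log


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]. Assumes n >= 2."""
    h = -sum(p * log(p) for p in probs if p > 0)
    return h / log(len(probs))


def entropy_weighted_vote(predictions):
    """Combine per-classifier class-probability lists.

    Each classifier's weight is 1 - normalized entropy (assumed rule):
    confident, low-entropy predictions count for more.
    Returns (winning class index, combined distribution).
    """
    weights = [max(0.0, 1.0 - normalized_entropy(p)) for p in predictions]
    total = sum(weights)
    if total == 0:  # every classifier is maximally uncertain: fall back to a plain average
        weights, total = [1.0] * len(predictions), float(len(predictions))
    n_classes = len(predictions[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined


if __name__ == "__main__":
    # One confident vote for class 0 against two near-coin-flip votes for class 1.
    label, dist = entropy_weighted_vote([[0.9, 0.1], [0.45, 0.55], [0.4, 0.6]])
    print(label, dist)
```

In the usage example a simple majority vote would pick class 1, while the entropy weighting lets the single confident classifier dominate and picks class 0.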

[NLP-79] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

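Quick read: This paper addresses slow responses, inflexible procedures, and limited accuracy on domain-specific tasks in existing electric power marketing customer service systems, along with the lack of domain expertise and empathy in general-purpose large language models (LLMs). The key to the solution is ElectriQ, the first benchmark for power marketing scenarios, which contains a dialogue dataset covering six service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. In addition, a domain-specific knowledge base and a knowledge augmentation method allow smaller models such as LLama3-8B, once fine-tuned and augmented, to surpass GPT-4o in professionalism and user-friendliness.

Link: https://arxiv.org/abs/2507.22911
Authors: Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou
Institutions: Xi’an Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none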

Abstract:Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China’s 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

[NLP-80] Large Language Models in the Travel Domain: An Industrial Experience

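Quick read: This paper addresses low-quality accommodation descriptions on online property booking platforms, caused by incomplete or inconsistent data from third-party providers, which can frustrate users and cost market share. The key to the solution is integrating large language models (LLMs) into the CALEIDOHOTELS booking platform, using fine-tuning and prompt engineering to generate more consistent and accurate descriptions with fewer hallucinations. The study compares two LLMs: Mistral 7B fine-tuned with QLoRA, and Mixtral 8x7B with a refined system prompt. Mixtral does better on completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), but at a much higher computational cost (50GB VRAM vs. 5GB). The results offer practical guidance on the trade-off between model quality and resource efficiency when deploying LLMs in production.

Link: https://arxiv.org/abs/2507.22910
Authors: Sergio Di Meglio, Aniello Somma, Luigi Libero Lucio Starace, Fabio Scippacercola, Giancarlo Sperlì, Sergio Di Martino
Institutions: University of Naples Federico II; Fervento srl
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Manuscript accepted to the International Conference on Software Engineering and Knowledge Engineering (SEKE) 2025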

Abstract:Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, utilized with a refined system prompt. Both models were assessed based on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), producing shorter yet more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.

[NLP-81] Toward the Autonomous AI Doctor: Quantitative Benchmarking of an Autonomous Agentic AI Versus Board-Certified Clinicians in a Real World Setting

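Quick read: This paper is motivated by a projected global shortage of 11 million healthcare practitioners by 2030 and by administrative burden consuming 50% of clinical time. The key to the solution is Doctronic, an autonomous AI system built on a multi-agent large language model (LLM) framework, evaluated on its ability to independently produce diagnoses and treatment plans in a virtual urgent care setting. In a retrospective comparison with board-certified clinicians across 500 consecutive telehealth encounters, the top diagnosis matched in 81% of cases and the treatment plan aligned in 99.2%, with no clinical hallucinations. In expert review of discordant cases, the AI was superior in 36.1% and the clinician in 9.3%, with equivalent diagnoses in the rest. The authors present this as a potential, scalable response to healthcare workforce shortages.

Link: https://arxiv.org/abs/2507.22902
Authors: Hashim Hayat, Maksim Kudrautsau, Evgeniy Makarov, Vlad Melnichenko, Tim Tsykunou, Piotr Varaksin, Matt Pavelle, Adam Z. Oskowitz
Institutions: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: none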

Abstract:Background: Globally we face a projected shortage of 11 million healthcare practitioners by 2030, and administrative burden consumes 50% of clinical time. Artificial intelligence (AI) has the potential to help alleviate these problems. However, no end-to-end autonomous large language model (LLM)-based AI system has been rigorously evaluated in real-world clinical practice. In this study, we evaluated whether a multi-agent LLM-based AI framework can function autonomously as an AI doctor in a virtual urgent care setting. Methods: We retrospectively compared the performance of the multi-agent AI system Doctronic and board-certified clinicians across 500 consecutive urgent-care telehealth encounters. The primary end points: diagnostic concordance, treatment plan consistency, and safety metrics, were assessed by blinded LLM-based adjudication and expert human review. Results: The top diagnosis of Doctronic and clinician matched in 81% of cases, and the treatment plan aligned in 99.2% of cases. No clinical hallucinations occurred (e.g., diagnosis or treatment not supported by clinical findings). In an expert review of discordant cases, AI performance was superior in 36.1%, and human performance was superior in 9.3%; the diagnoses were equivalent in the remaining cases. Conclusions: In this first large-scale validation of an autonomous AI doctor, we demonstrated strong diagnostic and treatment plan concordance with human clinicians, with AI performance matching and in some cases exceeding that of practicing clinicians. These findings indicate that multi-agent AI systems achieve comparable clinical decision-making to human providers and offer a potential solution to healthcare workforce shortages.

[NLP-82] Voice-guided Orchestrated Intelligence for Clinical Evaluation (VOICE): A Voice AI Agent System for Prehospital Stroke Assessment

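Quick read: This paper addresses inconsistent and often inaccurate stroke recognition in emergency care, where first responders' sensitivity for stroke detection can be as low as 58%, delaying treatment. The key to the solution is a voice-driven artificial intelligence (AI) system that guides non-experts through expert-level stroke assessments in natural conversation and captures smartphone video of key examination steps for documentation and expert review. In simulated scenarios the system correctly identified 84% of individual stroke signs and detected 75% of likely large vessel occlusions (LVOs), and users reported high confidence and ease of use. It also flagged 2 of 3 non-stroke cases as strokes, so human review is still needed to correct errors in the AI-generated reports.

Link: https://arxiv.org/abs/2507.22898
Authors: Julian Acosta, Scott Adams, Julius Kernbach, Romain Hardy, Sung Eun Kim, Luyang Luo, Xiaoman Zhang, Shreya Johri, Mohammed Baharoon, Pranav Rajpurkar
Institutions: Harvard Medical School; University of Saskatchewan; University Hospital Heidelberg (UKHD)
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: none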

Abstract:We developed a voice-driven artificial intelligence (AI) system that guides anyone - from paramedics to family members - through expert-level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life-threatening delays in treatment. Three non-medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke-like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI-generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non-stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system’s limitations necessitate human oversight, ongoing rapid advancements in speech-to-speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human-level voice interaction could transform emergency medical care, putting expert-informed assessment capabilities in everyone’s hands.

[NLP-83] Hybrid EEG–Driven Brain–Computer Interface: A Large Language Model Framework for Personalized Language Rehabilitation

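Quick read: This paper addresses the failure of conventional augmentative and alternative communication (AAC) systems and language-learning platforms to adapt in real time to users' cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis. The key to the solution is a hybrid framework in which real-time electroencephalography (EEG) signals drive a language rehabilitation assistant powered by large language models (LLMs): neural intent recognition enables low-fatigue interaction, vocabulary, sentence-construction exercises, and feedback are personalized dynamically, and neural markers of cognitive effort are monitored to adjust task difficulty on the fly.

Link: https://arxiv.org/abs/2507.22892
Authors: Ismail Hossain, Mridul Banik
Institutions: George Mason University; Colorado State University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: none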

Abstract:Conventional augmentative and alternative communication (AAC) systems and language-learning platforms often fail to adapt in real time to the user’s cognitive and linguistic needs, especially in neurological conditions such as post-stroke aphasia or amyotrophic lateral sclerosis. Recent advances in noninvasive electroencephalography (EEG)–based brain-computer interfaces (BCIs) and transformer–based large language models (LLMs) offer complementary strengths: BCIs capture users’ neural intent with low fatigue, while LLMs generate contextually tailored language content. We propose and evaluate a novel hybrid framework that leverages real-time EEG signals to drive an LLM-powered language rehabilitation assistant. This system aims to: (1) enable users with severe speech or motor impairments to navigate language-learning modules via mental commands; (2) dynamically personalize vocabulary, sentence-construction exercises, and corrective feedback; and (3) monitor neural markers of cognitive effort to adjust task difficulty on the fly.

[NLP-84] OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

Quick read: This paper addresses the hallucinations that large language models (LLMs) often produce in ontology matching (OM). The key to the solution is a new benchmark dataset, OAEI-LLM-T, derived from seven TBox datasets in the Ontology Alignment Evaluation Initiative (OAEI). It captures the OM-specific hallucinations of ten different LLMs and organizes them into two primary categories and six sub-categories. The dataset can be used to build an LLM leaderboard for OM tasks and to fine-tune LLMs used in OM.

Link: https://arxiv.org/abs/2503.21813
Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 14 pages, 4 figures, 4 tables, 2 prompt templates

Abstract:Hallucinations are often inevitable in downstream tasks using large language models (LLMs). To tackle the substantial challenge of addressing hallucinations for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset OAEI-LLM-T. The dataset evolves from seven TBox datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of ten different LLMs performing OM tasks. These OM-specific hallucinations are organised into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing an LLM leaderboard for OM tasks and for fine-tuning LLMs used in OM tasks.

[NLP-85] MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

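Quick read: This paper addresses the gap between large audio-language models and nuanced, human-level audio comprehension, which persists largely because current benchmarks, limited by their data annotations and evaluation metrics, cannot reliably distinguish generic outputs from highly detailed ones. The key to the solution is MECAT (Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks), built with a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning to produce multi-perspective, fine-grained captions and open-set question-answering pairs. It is paired with a new metric, DATE (Discriminative-Enhanced Audio Text Evaluation), which combines single-sample semantic similarity with cross-sample discriminability to penalize generic terms and reward detailed descriptions.

Link: https://arxiv.org/abs/2507.23511
Authors: Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
Institutions: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 9 main pages, 5 figures, 3 tables, and 14 appendix pages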

Abstract:While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at this https URL
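
To see why combining single-sample similarity with cross-sample discriminability penalizes generic captions, consider the toy score below. It is not the DATE formula, which the abstract does not spell out. The harmonic-mean combination, the rank-based discriminability term, and the name `date_like_scores` are assumptions, and similarities are assumed to lie in [0, 1].

```python
def date_like_scores(sim):
    """Toy discriminability-aware score (illustrative only, not the paper's DATE metric).

    sim[i][j] is the similarity between candidate caption i and reference j,
    given as a square matrix with values assumed to lie in [0, 1].
    """
    n = len(sim)
    scores = []
    for i in range(n):
        own = sim[i][i]  # single-sample semantic similarity
        # Cross-sample discriminability: share of the other references that the
        # candidate matches less well than its own reference.
        others = [sim[i][j] for j in range(n) if j != i]
        disc = sum(own > o for o in others) / len(others) if others else 1.0
        # Harmonic mean: both terms must be high for the score to be high.
        scores.append(2 * own * disc / (own + disc) if own + disc > 0 else 0.0)
    return scores


if __name__ == "__main__":
    sim = [
        [0.9, 0.2, 0.1],  # specific caption: matches only its own reference
        [0.6, 0.6, 0.6],  # generic caption: matches every reference equally
        [0.1, 0.3, 0.8],
    ]
    print(date_like_scores(sim))
```

The generic caption in the second row has a reasonable similarity to its own reference, yet it scores zero because it fits every other clip equally well.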

[NLP-86] Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR

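Quick read: This paper addresses the limited use of dynamic speech features in automatic speech recognition (ASR), studied here for Vietnamese. The key to the solution is to characterize acoustic transitions with polar parameters in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs), which captures the dynamic characteristics of speech while minimizing spectral variation. SSCF0 serves as a pseudo-feature for the fundamental frequency (F0) to describe tonal information robustly, and the dynamic parameters are combined with Mel-Frequency Cepstral Coefficients (MFCCs) to capture more detailed spectral information. Experiments show a significant reduction in word error rate (WER) and greater gender independence than baseline MFCCs.

Link: https://arxiv.org/abs/2507.22964
Authors: Sotheara Leang (CADT, M-PSI), Éric Castelli (M-PSI), Dominique Vaufreydaz (M-PSI), Sethserey Sam (CADT)
Institutions: unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD); Signal Processing (eess.SP)
Comments: none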

Abstract:The dynamic characteristics of the speech signal provide temporal information and play an important role in enhancing Automatic Speech Recognition (ASR). In this work, we characterized the acoustic transitions in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs) using polar parameters to capture the dynamic characteristics of the speech and minimize spectral variation. These dynamic parameters were combined with Mel-Frequency Cepstral Coefficients (MFCCs) in Vietnamese ASR to capture more detailed spectral information. The SSCF0 was used as a pseudo-feature for the fundamental frequency (F0) to describe the tonal information robustly. The findings showed that the proposed parameters significantly reduce word error rates and exhibit greater gender independence than the baseline MFCCs.
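
The abstract does not define the ratio plane or the polar parameters precisely, so the following is only a sketch under assumptions: each frame is mapped to a 2D point of consecutive centroid ratios (SSCF1/SSCF2, SSCF2/SSCF3), and a transition between two frames is described by the magnitude and angle of the displacement between their points. The function names and the choice of ratios are hypothetical.

```python
from math import atan2, hypot


def ratio_point(sscf1: float, sscf2: float, sscf3: float):
    """Map three subband centroid frequencies (Hz) to a point in an assumed ratio plane."""
    return (sscf1 / sscf2, sscf2 / sscf3)


def polar_transition(prev, curr):
    """Polar parameters (magnitude, angle in radians) of the move from prev to curr."""
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    return hypot(dx, dy), atan2(dy, dx)


if __name__ == "__main__":
    frame_a = ratio_point(500.0, 1500.0, 3000.0)
    frame_b = ratio_point(700.0, 1400.0, 2800.0)
    print(polar_transition(frame_a, frame_b))
```

One general property of ratios is worth noting: scaling all centroids by the same factor leaves the ratio point unchanged, so `ratio_point(500, 1500, 3000)` equals `ratio_point(600, 1800, 3600)`. This may be part of why such features vary less across speakers, though the abstract itself only reports the empirical gain in gender independence.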

Computer Vision

[CV-0] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis ICCV2025

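Quick read: This paper addresses the generation of high-quality dynamic 3D (4D) content from a single video. Modeling 4D diffusion directly is hard because data construction is costly and jointly representing 3D shape, appearance, and motion is high-dimensional. The key to the solution is a Direct 4DMesh-to-GS Variation Field VAE that encodes canonical Gaussian Splats (GS) and their temporal variations directly from 3D animation data, without per-instance fitting, and compresses high-dimensional animations into a compact latent space. On top of this representation, a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer is trained, conditioned on the input video and the canonical GS.

Link: https://arxiv.org/abs/2507.23785
Authors: Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo
Institutions: University of Science and Technology of China; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL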

Abstract:In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: this https URL.

[CV-1] SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions ICCV2025

Quick read: This paper addresses the difficulty Concept Bottleneck Models (CBMs) have in reliably identifying the correct concepts under distribution shifts, with the aim of making interpretable AI models more robust. The key to the solution is SUB, a fine-grained image and concept benchmark of 38,400 synthetic images based on the CUB bird dataset, covering 33 bird classes and 45 concepts (such as wing color or belly pattern). Images are generated with a novel **Tied Diffusion Guidance (TDG)** method that gives precise control over the bird class and the substituted attribute: two parallel denoising processes share noise, so that the generated image shows both the correct class and the correct attribute. The benchmark enables rigorous evaluation of CBMs and similar interpretable models.

Link: https://arxiv.org/abs/2507.23784
Authors: Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata
Institutions: Technical University of Munich; Helmholtz Munich; Munich Center for Machine Learning; LTCI; Télécom Paris; Institut Polytechnique de Paris; France
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICCV 2025

Abstract:Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Our code is available at this https URL and the dataset at this http URL.

[CV-2] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion ICCV2025

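Quick read: This paper addresses dynamic scene reconstruction from sparse-view videos. Prior methods depend on dense multi-view capture (for example, the hundreds of calibrated cameras in Panoptic Studio), which is expensive and cannot capture diverse in-the-wild scenes. The core of the solution is to carefully align the independent monocular reconstructions from each camera into a time- and view-consistent dynamic scene reconstruction. With only a few cameras (for example, four equidistant inward-facing static cameras), the method produces higher-quality reconstructions than prior work, particularly when rendering novel views.

Link: https://arxiv.org/abs/2507.23782
Authors: Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan
Institutions: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project Page: this https URL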

Abstract:We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on this https URL.

[CV-3] Phi-Ground Tech Report: Advancing Perception in GUI Grounding

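Quick read: This paper addresses the accuracy of GUI grounding for Computer Use Agents (CUAs), that is, how an agent locates the right target for actions such as clicking and typing in a complex interface. Current end-to-end grounding models achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, which is far from deployable. The key to the solution is a systematic empirical study of the whole pipeline from data collection to model training, resulting in the Phi-Ground model family. For models under 10B parameters in agent settings, it achieves state-of-the-art performance on all five grounding benchmarks; in the end-to-end setting it scores 43.2 on ScreenSpot-pro and 27.2 on UI-Vision.

Link: https://arxiv.org/abs/2507.23779
Authors: Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo
Institutions: Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: none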

Abstract:With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: this https URL

[CV-4] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions

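Quick read: This paper addresses the inability of current general-purpose 3D human models (such as SMPL-X) to interact physically with their environment because they are purely kinematic, which leads to interpenetration and unrealistic object dynamics. The key to the solution is a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation: it keeps kinematic control over SMPL-X poses while ensuring physically plausible interactions with scenes and objects. The method is learning-free, generalizes to any body shape and motion, runs in real time, and preserves the fidelity of the original kinematic motion.

Link: https://arxiv.org/abs/2507.23778
Authors: Li Siyao, Yao Feng, Omid Tehari, Chen Change Loy, Michael J. Black
Institutions: Max Planck Institute for Intelligent Systems; S-Lab, Nanyang Technological University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none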

Abstract:While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a “half-physics” mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.

[CV-5] XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding

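Quick read: This paper addresses the high inference latency of auto-regressive mesh generation models, which need thousands or even tens of thousands of next-token predictions. The core of the solution is XSpecMesh, a quality-preserving acceleration method. It uses a lightweight multi-head speculative decoding scheme to predict multiple tokens in parallel in a single forward pass. A verification and resampling strategy has the backbone model check each predicted token and resample those that fail the quality criteria, and a distillation strategy trains the lightweight decoding heads to align their prediction distributions with the backbone, raising the success rate of speculative predictions. Experiments show a 1.7x speedup without loss of generation quality.

Link: https://arxiv.org/abs/2507.23777
Authors: Dian Chen, Yansong Qu, Xinyang Li, Ming Li, Shengchuan Zhang
Institutions: Xiamen University; Shandong Inspur Database Technology Co., Ltd
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: none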

Abstract:Current auto-regressive models can generate high-quality, topologically precise meshes; however, they necessitate thousands, or even tens of thousands, of next-token predictions during inference, resulting in substantial latency. We introduce XSpecMesh, a quality-preserving acceleration method for auto-regressive mesh generation models. XSpecMesh employs a lightweight, multi-head speculative decoding scheme to predict multiple tokens in parallel within a single forward pass, thereby accelerating inference. We further propose a verification and resampling strategy: the backbone model verifies each predicted token and resamples any tokens that do not meet the quality criteria. In addition, we propose a distillation strategy that trains the lightweight decoding heads by distilling from the backbone model, encouraging their prediction distributions to align and improving the success rate of speculative predictions. Extensive experiments demonstrate that our method achieves a 1.7x speedup without sacrificing generation quality. Our code will be released.
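
The verify-and-resample loop at the heart of speculative decoding can be sketched in a few lines. This is a generic greedy variant, not XSpecMesh itself: the paper's quality criteria are replaced here by exact agreement with the backbone's greedy choice, `backbone_next` is a hypothetical callable, and the verification is written as sequential calls for clarity, whereas real systems check all drafted positions in one batched forward pass.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft: List[int],
    backbone_next: Callable[[List[int]], int],
) -> List[int]:
    """Verify drafted tokens against the backbone and return the tokens to append.

    Drafted tokens are accepted while they agree with the backbone's own next
    token; at the first mismatch the backbone's token is used instead and the
    rest of the draft is discarded.
    """
    accepted: List[int] = []
    for token in draft:
        expected = backbone_next(prefix + accepted)
        if token == expected:
            accepted.append(token)      # draft verified
        else:
            accepted.append(expected)   # resample from the backbone, then stop
            break
    return accepted


if __name__ == "__main__":
    # Toy backbone: the next token is the current sequence length modulo 5.
    toy_backbone = lambda seq: len(seq) % 5
    print(speculative_step([0, 1], [2, 3, 9, 9], toy_backbone))
```

Every call returns at least one token that the backbone itself would have produced, so the output sequence matches plain greedy decoding. The speedup comes from the steps where several drafted tokens are accepted at once.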

[CV-6] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting

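[Quick Read]: This paper addresses the limitations of current 3D Gaussian Splatting (3DGS) based affordance reasoning methods on long-horizon, multi-object tasks: existing methods support only single-object, single-step interactions and cannot meet the need for sequential affordance understanding in complex real-world scenes. The key to the solution is a new task paradigm, Sequential 3D Gaussian Affordance Reasoning, together with SeqAffordSplat, a large-scale benchmark of 1800+ scenes. The paper also designs SeqSplatNet, an end-to-end framework in which a large language model autoregressively generates text interleaved with segmentation tokens that guide a conditional decoder to output 3D affordance masks. A Conditional Geometric Reconstruction pre-training strategy builds a robust geometric prior, and a multi-scale feature injection mechanism fuses semantic information from 2D Vision Foundation Models (VFM), which significantly improves the accuracy and robustness of multi-step affordance reasoning in complex scenes.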

Link: https://arxiv.org/abs/2507.23772
Authors: Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Yuhui Zheng, Mingtao Feng, Guangming Shi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.

[CV-7] Consensus-Driven Active Model Selection ICCV2025

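[Quick Read]: This paper addresses how to efficiently choose, among many off-the-shelf machine learning models, the one best suited to a given data task. Traditional approaches rely on building and annotating a validation dataset, which is costly and time-consuming. The key to the solution is Consensus-Driven Active Model Selection (CODA), which models the relationships between classifiers, categories, and data points in a probabilistic framework, uses the agreement and disagreement among candidate models' predictions to guide label acquisition, and applies Bayesian inference to update its belief about the best model. Experiments show that CODA significantly outperforms existing methods, reducing the annotation effort needed to find the best model by more than 70%.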

Link: https://arxiv.org/abs/2507.23771
Authors: Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery
Affiliations: MIT; UMass Amherst
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Highlight. 16 pages, 8 figures

Abstract:The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset – a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at this https URL.

[CV-8] Slot Attention with Re-Initialization and Self-Distillation ACM-MM2025

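[Quick Read]: This paper addresses two problems in Object-Centric Learning (OCL): redundant competition caused by naive reuse of slots, and the fact that mainstream methods take supervision only from the decoding reconstruction signal while ignoring supervision from internal information. The key solutions are: (1) a slot re-initialization mechanism that reduces redundant slots and re-aggregates to update the remaining ones; and (2) a self-distillation strategy that drives the poor attention map of the first aggregation to approximate the good attention map of the last aggregation, using internal structural information for self-supervision. The method achieves state-of-the-art performance on tasks such as object discovery and recognition, and improves visual prediction and reasoning.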

Link: https://arxiv.org/abs/2507.23755
Authors: Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen
Affiliations: Aalto University; University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025

Abstract:Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on this https URL.

[CV-9] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping ICCV2025

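[Quick Read]: This paper addresses the weak generalization of current general robotic grasping systems in open-world scenarios, caused by the lack of large-scale, reasoning-based object affordance perception data. The key to the solution is RAGNet, a large-scale, grasping-oriented affordance segmentation benchmark containing 273k images, 180 object categories, and 26k human-like instructions; the difficulty of the language instructions is raised substantially by removing category names and giving only functional descriptions. The paper also proposes AffordanceNet, an affordance-based grasping framework composed of a vision-language model (VLM) pre-trained on RAGNet and a grasping network conditioned on the affordance map, which yields stronger open-world generalization.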

Link: https://arxiv.org/abs/2507.23734
Authors: Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen
Affiliations: The Chinese University of Hong Kong; Institute of Computing Technology, Chinese Academy of Sciences; Dexmal; Mohamed bin Zayed University of Artificial Intelligence; SKL-IOTSC, CIS, University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICCV 2025. The code is at this https URL

Abstract:General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at this https URL.

[CV-10] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching ICCV2025

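[Quick Read]: This paper addresses the fact that current deep functional map methods for non-rigid shape correspondence still rely on axiomatic modeling for the training loss and for functional map regularization, which limits their accuracy and generalization in real scenarios. The key to the solution is to replace, for the first time, both in-network regularization and functional map training with fully data-driven methods: a generative model is learned from a collection of high-quality functional maps using score-based generative modeling in the spectral domain, and is then used to promote the structural properties of ground-truth functional maps on new shape collections. The approach does not depend on the usual Laplacian commutativity or orthogonality constraints, and experiments show that it outperforms traditional axiomatic strategies on zero-shot non-rigid shape matching. The core technical contribution is a novel distillation strategy from diffusion models in the spectral domain.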

Link: https://arxiv.org/abs/2507.23715
Authors: Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov
Affiliations: LIX, Ecole Polytechnique; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at ICCV 2025

Abstract:Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: this https URL

[CV-11] Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation

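[Quick Read]: This paper addresses the explainability of deep learning models for intraoperative tissue characterisation, in particular the unreliable attributions that existing pixel attribution (PA) methods produce because of model overconfidence. The key to the solution is to incorporate risk estimation into a pixel attribution method for the first time: a classification model and a PA method are applied iteratively to generate a series of PA maps, from which a pixel-wise distribution of PA values is built. The expectation of this distribution gives an enhanced PA map, and the coefficient of variation (CV) quantifies the risk at each pixel, so the method improves the trustworthiness of the attribution while also providing an estimate of the uncertainty in the output.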

Link: https://arxiv.org/abs/2507.23709
Authors: Alfie Roddan, Chi Xu, Serine Ajlouni, Irini Kakaletri, Patra Charalampaki, Stamatia Giannarou
Affiliations: The Hamlyn Centre for Robotic Surgery, Imperial College London, UK; Medical Faculty, University Witten Herdecke, Germany; Medical Faculty, Rheinische Friedrich Wilhelms University of Bonn, Germany; Department of Neurosurgery, Cologne Medical Center, Cologne, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution methods are popular to infer explainability. However, overconfidence in a deep learning model’s predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.
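
As a rough illustration of the two per-pixel statistics named in the abstract, the sketch below collapses a stack of PA maps into an expectation map and a coefficient-of-variation map. It is a hedged example rather than the paper's procedure: the function name `enhance_pa_maps`, the use of the sample mean as the expectation estimate, the population standard deviation, and the handling of zero-mean pixels are all assumptions, and the stack of maps is simply taken as given.

```python
from statistics import mean, pstdev


def enhance_pa_maps(volume):
    """Collapse a stack of PA maps into an expectation map and a CV risk map.

    `volume` is a list of equally sized 2D maps (lists of rows). How the
    stack is produced is outside this sketch; it is taken as input.
    """
    rows, cols = len(volume[0]), len(volume[0][0])
    expectation = [[0.0] * cols for _ in range(rows)]
    risk = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            samples = [pa_map[r][c] for pa_map in volume]
            mu = mean(samples)
            sigma = pstdev(samples)
            expectation[r][c] = mu
            # Coefficient of variation: spread relative to the mean magnitude.
            if mu != 0:
                risk[r][c] = sigma / abs(mu)
            else:
                # Assumed convention for zero-mean pixels.
                risk[r][c] = 0.0 if sigma == 0 else float("inf")
    return expectation, risk


# Two 1x2 maps: the first pixel varies across maps, the second does not.
expectation, risk = enhance_pa_maps([[[1.0, 2.0]], [[3.0, 2.0]]])
print(expectation, risk)
```

A pixel whose attribution is stable across the stack gets a risk near zero, while a pixel whose attribution fluctuates relative to its mean gets a high risk.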

[CV-12] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

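[Quick Read]: This paper addresses problems in high-fidelity 3D video reconstruction of dynamic scenes, especially under complex motion and significant scale variation: conventional deformation-field 3D Gaussian splatting methods tend to overfit to irregular Gaussian trajectories, degrading visual quality, and the gradient-based densification strategy designed for static scenes cannot handle missing dynamic content. The key to the solution is FlowGaussian-VR, a flow-empowered velocity field modeling framework for Gaussian video reconstruction with two core components: a Velocity Field Rendering (VFR) pipeline, which uses optical-flow-guided optimization to model dynamic regions more accurately, and a Flow-assisted Adaptive Densification (FAD) strategy, which adjusts the number and size of Gaussians according to the characteristics of dynamic regions. Together they give more stable, trackable dynamic reconstructions with higher visual quality.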

Link: https://arxiv.org/abs/2507.23704
Authors: Zhenyang Li, Xiaoyang Bai, Tongchen Zhang, Pengfei Shen, Weiwei Xu, Yifan Peng
Affiliations: The University of Hong Kong; State Key Laboratory of Computer-aided Design & Computer Graphics, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures

Abstract:High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model’s effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and less blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.

[CV-13] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

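[Quick Read]: This paper addresses the core challenge of All-in-One Image Restoration (AiOIR): handling multiple degradation types (such as blur, noise, and low resolution) within a unified framework while preserving high-quality detail recovery. The key points of the solution are: 1) a unified framework based on Latent Diffusion Models (LDMs) that structurally integrates low-quality visual priors into the diffusion process, exploiting the generative capacity of diffusion models to adapt to diverse degradations; 2) a Degradation-Aware Feature Fusion (DAFF) module for adaptive handling of different degradation types; and 3) a **Detail-Aware Expert Module (DAEM)** in the decoder that mitigates the detail loss caused by high compression and iterative sampling, markedly improving the recovery of texture and fine structure.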

Link: https://arxiv.org/abs/2507.23685
Authors: Zihan Cheng, Liangtai Zhou, Dian Chen, Ni Tang, Xiaotong Luo, Yanyun Qu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

[CV-14] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation

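[Quick Read]: This paper addresses the high cost and low efficiency of acquiring high-quality driving data for end-to-end autonomous driving systems. The core challenge is to synthesize realistic vehicle-view data efficiently from sparse infrastructure-view images. The key to the solution is the I2V-GS framework, which maps the infrastructure view to the vehicle view with Gaussian Splatting: an adaptive depth warp generates dense training views, a cascade inpainting strategy extends the range of views while keeping content consistent across views, and a confidence-guided optimization driven by cross-view information improves the reliability of the diffusion model's outputs. The authors also introduce RoadSight, a multi-modality, multi-view dataset captured from real infrastructure-view scenarios. I2V-GS clearly improves vehicle-view synthesis quality, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.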

Link: https://arxiv.org/abs/2507.23683
Authors: Jialei Chen, Wuhao Xu, Sipeng He, Baoru Huang, Dongchun Ren
Affiliations: Yootta; Soochow University; Southeast University; University of Liverpool
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.

[CV-15] DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

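[Quick Read]: This paper addresses the difficulty of accurate imputation in microbiome data caused by sparsity and noise, which seriously hinders downstream analyses such as biomarker discovery. The key to the solution is DepMicroDiff, a new framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both pairwise interdependencies among microbial taxa and autoregressive structure. VAE-based pretraining across multiple cancer datasets improves generalization, and patient metadata encoded by a large language model (LLM) is used for conditioning. On TCGA microbiome datasets this yields higher Pearson correlation (up to 0.712) and cosine similarity (up to 0.812), and lower root mean square error (RMSE) and mean absolute error (MAE).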

Link: https://arxiv.org/abs/2507.23676
Authors: Rabeya Tus Sadia, Qiang Cheng
Affiliations: University of Kentucky
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.

[CV-16] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

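[Quick Read]: This paper addresses two core problems of Hyperspectral Imaging (HSI) in medical image segmentation: weak generalization due to data scarcity, and unstable segmentation performance due to hardware-induced differences in the number and resolution of spectral bands. The key to the solution is SAMSA, an interactive segmentation framework that combines an RGB foundation model with spectral analysis: user clicks guide both RGB segmentation and spectral similarity computation, and a feature fusion strategy that is independent of spectral band count and resolution enables effective segmentation in few-shot and even zero-shot settings. The method improves segmentation accuracy and compatibility across datasets with different spectral characteristics, and its robustness and flexibility are validated on neurosurgical and intraoperative porcine tissue data.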

Link: https://arxiv.org/abs/2507.23673
Authors: Alfie Roddan, Tobias Czempiel, Chi Xu, Daniel S. Elson, Stamatia Giannarou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA’s effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
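
The abstract does not spell out how spectral similarity is computed, but the title points to spectral angles. A minimal sketch of the standard spectral angle between two spectra is shown below, assuming a clicked pixel's spectrum is compared against other pixels; the helper names `spectral_angle` and `similarity_to_click` are hypothetical, and how SAMSA fuses these values with RGB features is not covered here.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra with the same number of bands."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


def similarity_to_click(clicked_spectrum, pixel_spectra):
    """Spectral angle from a clicked pixel's spectrum to each candidate pixel."""
    return [spectral_angle(clicked_spectrum, s) for s in pixel_spectra]


# A brighter copy of the same spectrum gives an angle near zero.
print(similarity_to_click([1.0, 2.0, 3.0], [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]))
```

Each comparison reduces to a single scalar whatever the number of bands, and the angle ignores overall brightness scaling, which is one plausible reason a spectral-angle feature can be used across cameras with different band counts.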

[CV-17] OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction

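[Quick Read]: This paper addresses the challenge of zero-shot transfer of large-scale pre-trained human trajectory prediction models to unseen datasets, especially when the new dataset has different temporal dynamics (such as frame rate or observation horizon); existing models usually need fine-tuning to adapt, which limits scalability and practicality. The key to the solution is to feed temporal metadata into the model explicitly as a conditioning input, so that a simple explicit conditioning mechanism gives robust generalization across temporal setups. Building on this insight, the authors propose OmniTraj, a Transformer-based model pre-trained on a large-scale heterogeneous dataset, which reduces prediction error by over 70% in zero-shot transfer and, after fine-tuning, achieves state-of-the-art results on four datasets: NBA, JTA, WorldPose, and ETH-UCY.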

Link: https://arxiv.org/abs/2507.23657
Authors: Yang Gao, Po-Chien Luan, Kaouther Messaoud, Lan Feng, Alexandre Alahi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: this https URL

[CV-18] Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis MICCAI2025

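[Quick Read]: This paper addresses two challenges in training medical image segmentation models: privacy, since medical data contains sensitive patient information and large-scale annotated data is hard to obtain, and the high cost of manual annotation, which limits model performance. The authors propose the **Adaptively Distilled ControlNet** framework, whose core innovation is a dual-model distillation mechanism. During training, a teacher model conditioned on mask-image pairs guides a mask-only student model through predicted-noise alignment in parameter space, with an adaptive regularization strategy based on lesion-background ratios to improve localization accuracy. During inference only the student model is used, enabling privacy-preserving image generation without the original data. The method improves the generalization and performance of segmentation models: on the KiTS19 and Polyps datasets it raises mDice/mIoU by 2.4%/4.2% for TransUNet and 2.6%/3.5% for SANet, respectively.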

Link: https://arxiv.org/abs/2507.23652
Authors: Kunpeng Qiu, Zhiying Zhou, Yongxin Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI2025

Abstract:Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.
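
For readers unfamiliar with the mDice/mIoU figures quoted above, the following sketch computes the per-mask Dice coefficient and intersection-over-union that such means are averaged from. The function name `dice_and_iou`, the flat 0/1 list representation, and the convention of scoring two empty masks as 1.0 are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat lists of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + target_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


# One overlapping pixel out of two predicted and two target pixels.
print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

Dice is never lower than IoU for the same pair of masks, so gains reported on the two metrics are not directly comparable in size.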

[CV-19] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks

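[Quick Read]: This paper addresses the difficulty of gradient approximation when training Spiking Neural Networks (SNNs), which arises from their non-differentiability, as well as the high computational complexity and resource consumption that existing backpropagation-based methods face when deployed on edge devices. The key to the solution is a gradient-approximation-free training framework based on Forward-Forward (FF), which treats spiking activations as black-box modules, avoiding the accuracy loss of conventional gradient approximation and significantly reducing computational complexity. A class-aware complexity adaptation mechanism additionally adjusts the loss function according to inter-class difficulty, allocating network resources efficiently across categories.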

Link: https://arxiv.org/abs/2507.23643
Authors: Changqing Xu, Ziqiang Yang, Yi Liu, Xinfang Liao, Guiqi Mo, Hao Zeng, Yintang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging due to their non-differentiability. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity. Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories. Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.
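
To give a feel for the Forward-Forward idea the framework builds on, the sketch below shows the layer-local objective from the original Forward-Forward formulation: a layer's "goodness" is the sum of its squared activations, pushed above a threshold for positive data and below it for negative data. This is only the generic FF objective under assumed names (`goodness`, `ff_layer_loss`, `theta`); the paper's SNN-specific treatment of spiking activations and its class-aware loss adaptation are not reproduced here.

```python
import math


def goodness(activations):
    """Sum of squared activations of one layer (e.g. spike counts or rates)."""
    return sum(a * a for a in activations)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ff_layer_loss(pos_activations, neg_activations, theta=2.0):
    """Layer-local loss: high goodness for positive data, low for negative."""
    pos_loss = softplus(theta - goodness(pos_activations))
    neg_loss = softplus(goodness(neg_activations) - theta)
    return pos_loss + neg_loss


# A layer that responds strongly to positive data and weakly to negative data
# gets a lower loss than one that does the opposite.
print(ff_layer_loss([2.0, 1.0], [0.1, 0.1]))
print(ff_layer_loss([0.1, 0.1], [2.0, 1.0]))
```

Because the loss depends only on one layer's own activations, each layer can be updated without propagating error signals backward through the rest of the network.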

[CV-20] Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

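[Quick Read]: This paper addresses the low accuracy on small objects in Few-shot Classification and Segmentation (FS-CS). The current state of the art performs well overall but still falls short on small objects. The key to the solution is the Efficient Masked Attention Transformer (EMAT), which improves performance through three modifications: a novel memory-efficient masked attention mechanism that strengthens feature capture for small objects; a learnable downscaling strategy that improves multi-scale feature representation; and parameter-efficiency enhancements that cut trainable parameters by at least a factor of four while matching or exceeding the accuracy of existing methods. The paper also proposes two new evaluation settings that better reflect practical scenarios in which available annotations are used.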

Link: https://arxiv.org/abs/2507.23642
Authors: Dustin Carrión-Ojeda, Stefan Roth, Simone Schaub-Meyer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for GCPR 2025. Project page: this https URL

Abstract:Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5^i and COCO-20^i datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

[CV-21] DivControl: Knowledge Diversion for Controllable Image Generation

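[Quick Read]: This paper addresses the poor generalization and high adaptation cost of current image-to-image (I2I) generation models when handling multiple condition inputs: existing methods either train a separate model per condition or rely on unified architectures with entangled representations, so adapting to new conditions is inefficient. The key to the solution is the DivControl framework, which factorizes ControlNet via singular value decomposition (SVD) into basic components, namely pairs of singular vectors, and uses knowledge diversion during multi-condition pretraining to separate these components into condition-agnostic "learngenes" and condition-specific "tailors". A dynamic gate performs soft routing based on the semantics of the condition, supporting zero-shot generalization and parameter-efficient adaptation to new conditions. A representation alignment loss further improves the consistency between condition embeddings and early diffusion features, which significantly lowers training cost and improves generation quality and scalability.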

Link: https://arxiv.org/abs/2507.23620
Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
Affiliations: School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Lenovo Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4× less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.

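[CV-22] LLM-Based Identification of Infostealer Infection Vectors from Screenshots: The Case of Aurora

[Quick read]: This paper addresses the inefficiency of current infostealer analysis at scale and the difficulty of identifying infection paths. Traditional approaches rely on manual analysis of log files, which is no longer feasible given the more than 29 million stealer logs reported in 2024; meanwhile, most existing research focuses on proactive detection and overlooks systematic analysis of post-infection artifacts such as screenshots. The key to the solution is to use Large Language Models (LLMs), specifically gpt-4o-mini, to parse infection screenshots automatically, extract actionable Indicators of Compromise (IoCs), map infection vectors, and track campaigns. From 1000 screenshots, the method extracted 337 actionable URLs and 246 relevant files and uncovered three distinct malware distribution campaigns, supporting a reactive, artifact-driven analysis paradigm for more precise and scalable threat intelligence.

Link: https://arxiv.org/abs/2507.23611
Authors: Estelle Ruellan,Eric Clay,Nicholas Ascoli
Affiliations: Flare Systems
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: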

Abstract:Infostealers exfiltrate credentials, session cookies, and sensitive data from infected systems. With over 29 million stealer logs reported in 2024, manual analysis and mitigation at scale are virtually infeasible. While most research focuses on proactive malware detection, a significant gap remains in leveraging reactive analysis of stealer logs and their associated artifacts. Specifically, infection artifacts such as screenshots, images captured at the point of compromise, are largely overlooked by the current literature. This paper introduces a novel approach leveraging Large Language Models (LLMs), more specifically gpt-4o-mini, to analyze infection screenshots to extract potential Indicators of Compromise (IoCs), map infection vectors, and track campaigns. Focusing on the Aurora infostealer, we demonstrate how LLMs can process screenshots to identify infection vectors, such as malicious URLs, installer files, and exploited software themes. Our method extracted 337 actionable URLs and 246 relevant files from 1000 screenshots, revealing key malware distribution methods and social engineering tactics. By correlating extracted filenames, URLs, and infection themes, we identified three distinct malware campaigns, demonstrating the potential of LLM-driven analysis for uncovering infection workflows and enhancing threat intelligence. By shifting malware analysis from traditional log-based detection methods to a reactive, artifact-driven approach that leverages infection screenshots, this research presents a scalable method for identifying infection vectors and enabling early intervention.

[CV-23] Consistent Point Matching

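[Quick read]: This paper addresses the robustness of matching anatomical locations across pairs of medical images, evaluated on longitudinal datasets spanning CT and MRI. The key to the solution is to incorporate a consistency heuristic into a point-matching algorithm, which improves the robustness of matching and enables high-precision navigation and localization between images without training data or a machine learning model.

Link: https://arxiv.org/abs/2507.23609
Authors: Halid Ziya Yerebakan,Gerardo Hermosillo Valadez
Affiliations: Siemens Medical Solutions
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: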

Abstract:This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

[CV-24] Medical Image De-Identification Benchmark Challenge

[Quick read]: This paper addresses the de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) in medical images, so that images shared through public repositories comply with patient privacy regulations (such as the HIPAA Safe Harbor provision) while preserving the non-PHI metadata that downstream medical AI research depends on. The key to the solution is a standardized benchmarking platform, the MIDI-B Challenge, whose rules follow HIPAA Safe Harbor, the DICOM Attribute Confidentiality Profiles, and best practices defined by TCIA. The challenge ran in three phases (training, validation, and test), with validation and test using real multi-center, multi-modality radiology images containing synthetic PHI/PII. Participating teams used open-source or proprietary tools, Large Language Models (LLMs), and Optical Character Recognition (OCR); final scores ranged from 97.91% to 99.93%, indicating that rule-based de-identification can stay compliant while retaining research-critical metadata.

Link: https://arxiv.org/abs/2507.23608
Authors: Linmin Pei,Granger Sutton,Michael Rutherford,Ulrike Wagner,Tracy Nolan,Kirk Smith,Phillip Farmer,Peter Gu,Ambar Rana,Kailing Chen,Thomas Ferleman,Brian Park,Ye Wu,Jordan Kojouharov,Gargi Singh,Jon Lemon,Tyler Willis,Milos Vukadinovic,Grant Duffy,Bryan He,David Ouyang,Marco Pereanez,Daniel Samber,Derek A. Smith,Christopher Cannistraci,Zahi Fayad,David S. Mendelson,Michele Bufano,Elmar Kotter,Hamideh Haghiri,Rajesh Baidya,Stefan Dvoretskii,Klaus H. Maier-Hein,Marco Nolden,Christopher Ablett,Silvia Siggillino,Sandeep Kaushik,Hongzhu Jiang,Sihan Xie,Zhiyu Wan,Alex Michie,Simon J Doran,Angeline Aurelia Waly,Felix A. Nathaniel Liang,Humam Arshad Mustagfirin,Michelle Grace Felicia,Kuo Po Chih,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Nikolas Koutsoubis,Kyle Naddeo,Kartik Pandit,Tony O’Sullivan,Raj Krish,Qinyan Pan,Scott Gustafson,Benjamin Kopchick,Laura Opsahl-Ong,Andrea Olvera-Morales,Jonathan Pinney,Kathryn Johnson,Theresa Do,Juergen Klenk,Maria Diaz,Arti Singh,Rong Chai,David A. Clunie,Fred Prior,Keyvan Farahani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 19 pages

Abstract:The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge’s design, implementation, results, and lessons learned. 
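
The scoring rule in this abstract (correct actions as a percentage of required actions) is simple enough to sketch. The example below is only an illustration: the action labels (`remove`, `replace`, `shift`, `keep`) and the dictionary layout are assumptions, not the challenge's actual answer-key format, and `deid_score` is a hypothetical helper name.

```python
def deid_score(required_actions, performed_actions):
    """Percentage of required de-identification actions performed correctly.

    Both arguments map an attribute name to an action label; a missing or
    different label in performed_actions counts as incorrect.
    """
    if not required_actions:
        raise ValueError("at least one required action is needed")
    correct = sum(
        1
        for attribute, expected in required_actions.items()
        if performed_actions.get(attribute) == expected
    )
    return 100.0 * correct / len(required_actions)

# Hypothetical answer key and tool output for four DICOM attributes.
required = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "shift",
    "Modality": "keep",
}
performed = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "keep",  # wrong: the date should have been shifted
    "Modality": "keep",
}
score = deid_score(required, performed)
```

Here three of the four required actions match, so the score is 75.0. Preserving `Modality` counts as a correct action, which reflects the abstract's point that retaining research-critical metadata is part of the task.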

[CV-25] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

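[Quick read]: This paper addresses the limited discriminability in Video Camouflaged Object Detection (VCOD) caused by the high similarity between foreground and background in spatial appearance features (such as color and texture), which restricts detection accuracy and completeness. The key to the solution is Vcamba, a new architecture based on spatio-frequency motion perception. Its core contributions are: 1) a receptive field visual state space (RFVSS) module that extracts multi-scale spatial features after sequence modeling; 2) an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy that maintains semantic consistency; 3) a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module that model spatio-temporal and frequency-temporal sequences, respectively; 4) a space and frequency motion fusion module (SFMF) that integrates dual-domain features into a unified motion representation. The method combines the sensitivity of frequency features to dynamic energy variations with Mamba's linear-time long-sequence modeling, and is reported to outperform prior methods at lower computational cost.

Link: https://arxiv.org/abs/2507.23601
Authors: Xin Li,Keren Fu,Qijun Zhao
Affiliations: Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 11 figures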

Abstract:Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: this https URL.

[CV-26] DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

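[Quick read]: This paper addresses the difficulty of achieving both accuracy and inference speed in 3D occupancy prediction for autonomous driving (AD) systems. Existing methods typically pursue high accuracy at the expense of real-time performance and cannot meet edge-deployment requirements. The key to the solution is a directional pure 2D approach: 3D voxel features are sliced so that complete vertical (height) information is preserved, which compensates for the loss of height cues in Bird's-Eye View (BEV) representations and maintains the integrity of the 3D geometric structure; a directional attention mechanism then extracts geometric features from different orientations efficiently. On Occ3D-nuScenes the method reaches 39.3% mIoU at 27.7 FPS, and 14.8 FPS in simulations on edge devices, supporting its use in resource-constrained settings.

Link: https://arxiv.org/abs/2507.23599
Authors: Yuchen Zhou,Yan Luo,Xiangang Wang,Xingjian Gu,Mingzhou Lu
Affiliations: Nanjing Agricultural University; Desay SV Automotive co.,LTD.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: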

Abstract:Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.
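
For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of mean intersection-over-union on flattened voxel labels. It assumes every voxel has one integer class label and that classes absent from both prediction and ground truth are skipped; the benchmark's own protocol (for example, how unobserved voxels are masked) may differ, and `mean_iou` is a hypothetical helper name.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither list
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six toy voxels with three classes (0 = free, 1 = car, 2 = road; labels hypothetical).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.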

[CV-27] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction ICCV2025

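[Quick read]: This paper addresses the reconstruction of high-fidelity 3D Gaussian avatars from a single image. The core challenge is inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Previous methods rely on 2D diffusion models to synthesize unseen views, but those views are sparse and inconsistent, which leads to 3D artifacts and blurred appearance. The key to the solution is MoGA, which uses a generative avatar model as a prior: input images are projected into its latent space and additional 3D appearance and geometric constraints are enforced, so that avatar creation becomes a model inversion process fitted to synthetic views from 2D diffusion models. The generative model also provides a meaningful initialization, 3D regularization, and help in refining pose estimation, which improves reconstruction quality and generalization.

Link: https://arxiv.org/abs/2507.23597
Authors: Zijian Dong,Longteng Duan,Jie Song,Michael J. Black,Andreas Geiger
Affiliations: ETH Zürich; University of Tübingen; HKUST(GZ); HKUST; Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 (Highlight), Project Page: this https URL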

Abstract:We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model, that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable

[CV-28] MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model ICCV25

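[Quick read]: This paper addresses large-scale, high-precision calibration of roadside cameras in vehicle-to-everything (V2X) settings, where traditional manual calibration is time-consuming, labor-intensive, and may require road closures. The key to the solution is MamV2XCalib, which uses vehicle-side LiDAR to calibrate infrastructure cameras automatically, without manual intervention or specific reference objects. Its core technical contributions are: 1) a targetless LiDAR-camera calibration method that combines multi-scale features with a 4D correlation volume to estimate the correlation between point clouds and images; 2) a Mamba model that captures temporal information and estimates rotation angles, mitigating calibration failures caused by defects in vehicle-side data (such as occlusions) and large viewpoint differences. Experiments on real-world V2X datasets show better and more stable calibration with fewer parameters.

Link: https://arxiv.org/abs/2507.23595
Authors: Yaoye Zhu,Zhe Wang,Yan Wang
Affiliations: Institute for AI Industry Research (AIR), Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV25 poster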

Abstract:As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at this https URL.

[CV-29] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation BMVC2025

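[Quick read]: This paper addresses two challenges in Sign Language Translation (SLT): bridging the modality gap and capturing subtle variations in hand shapes and movements. The key to the solution is BeyondGloss, a novel gloss-free framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs) and adds the following: 1) a method for generating fine-grained, temporally-aware textual descriptions of hand motion, compensating for the difficulty VideoLLMs have in modeling long videos in detail; 2) a contrastive alignment module that aligns these descriptions with video features during pre-training, focusing the model on hand-centric temporal dynamics; 3) distillation of fine-grained hand features from HaMeR to enrich hand-specific representations; 4) a contrastive loss between sign video representations and target language embeddings during pre-training to reduce the modality gap. The method achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks.

Link: https://arxiv.org/abs/2507.23575
Authors: Sobhan Asasi,Mohamed Ilyas Lakhal,Ozge Mercanoglu Sincan,Richard Bowden
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025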

Abstract:Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
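
The contrastive objectives mentioned in this abstract are not specified in detail. As a rough illustration, the sketch below uses a one-directional InfoNCE loss with cosine similarity, which is a common choice but an assumption here; the function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(video_embs, text_embs, temperature=0.07):
    """Average video-to-text InfoNCE loss; pair i is the positive for row i."""
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(video_embs[i], text_embs[j]) / temperature for j in range(n)]
        peak = max(logits)
        # log-sum-exp computed stably, then subtract the positive logit
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy 2-D embeddings: aligned pairs should give a much lower loss.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce(video, text_aligned)
loss_shuffled = info_nce(video, text_shuffled)
```

Minimizing this loss pulls each video embedding toward its paired text embedding and away from the other texts in the batch, which is the sense in which such a loss narrows a modality gap.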

[CV-30] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization CVPR2025

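[Quick read]: This paper addresses accurate and privacy-preserving camera pose estimation in visual localization. The key to the solution is Gaussian Splatting Feature Fields (GSFFs), a scene representation that combines an explicit geometry model (3D Gaussian Splatting, 3DGS) with an implicit feature field and aligns a 3D scale-aware feature field with a 2D feature encoder in a common embedding space through a contrastive framework. A 3D structure-informed clustering procedure further regularizes representation learning and converts the features into segmentations that can be used for privacy-preserving localization. Localization is achieved by pose refinement, which aligns feature maps or segmentations from a query image with those rendered from the GSFFs; the resulting pipelines achieve state-of-the-art performance on multiple real-world datasets.

Link: https://arxiv.org/abs/2507.23569
Authors: Maxime Pietrantoni,Gabriela Csurka,Torsten Sattler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025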

Abstract:Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.

[CV-31] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection ICCV2025

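[Quick read]: This paper addresses the generalization of monocular 3D object detection to open-set settings, where performance drops when the model meets new environments and novel object categories at test time. Existing methods are confined to closed-set settings and cannot adapt to the new scenes and unknown categories that appear in the real world. The key to the solution is the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). Its core contributions are: a 3D bounding box head that lifts open-set 2D detections into 3D space, enabling joint training of the 2D and 3D tasks; conditioning of object queries on a geometry prior to improve the generalization of 3D estimation across scenes; and a canonical image space for more efficient cross-dataset training. The method achieves state-of-the-art results in both closed-set (Omni3D) and open-set (Omni3D to Argoverse 2, ScanNet) settings.

Link: https://arxiv.org/abs/2507.23567
Authors: Yung-Hsu Yang,Luigi Piccinelli,Mattia Segu,Siyuan Li,Rui Huang,Yuqian Fu,Marc Pollefeys,Hermann Blum,Zuria Bauer
Affiliations: ETH Zürich; Tsinghua University; INSAIT; Microsoft; University of Bonn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025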

Abstract:Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at this http URL.

[CV-32] User Experience Estimation in Human-Robot Interaction Via Multi-Instance Learning of Multimodal Social Signals IROS2025

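[Quick read]: This paper addresses the difficulty social robots have in accurately assessing user experience (UX) in human-robot interaction (HRI), especially when behavior must adapt to the user's state. Existing methods usually focus on a single aspect of UX (such as sentiment or engagement) and rely on momentary observations rather than modeling temporal dynamics. The key to the solution is to construct a UX dataset and to develop a Transformer-based model that uses facial expressions and voice within a multi-instance learning framework, capturing both short- and long-term interaction patterns and giving a more holistic representation of how UX evolves over time. Experiments show that the method outperforms third-party human evaluators in UX estimation.

Link: https://arxiv.org/abs/2507.23544
Authors: Ryo Miyoshi,Yuki Okafuji,Takuya Iwamoto,Junya Nakanishi,Jun Baba
Affiliations: CyberAgent; Osaka University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: This paper has been accepted for presentation at IEEE/RSJ International Conference on Intelligent Robots and Systems 2025 (IROS 2025)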

Abstract:In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users’ states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.
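
Multi-instance learning treats a whole interaction as a "bag" of short windows and predicts one label for the bag. The sketch below shows one common way to aggregate instances, attention-weighted pooling; the abstract does not say which aggregation the authors use, so this is an assumption, and `attention_pool` with its inputs is hypothetical.

```python
import math

def attention_pool(instance_features, attention_scores):
    """Combine per-window features into one bag feature.

    Weights are a softmax over attention_scores, so windows with higher
    scores contribute more to the bag-level representation.
    """
    peak = max(attention_scores)
    exps = [math.exp(s - peak) for s in attention_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instance_features[0])
    bag = [
        sum(w * feats[d] for w, feats in zip(weights, instance_features))
        for d in range(dim)
    ]
    return bag, weights

# Three short interaction windows, each with a toy 2-D feature vector.
window_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bag_uniform, weights_uniform = attention_pool(window_features, [0.0, 0.0, 0.0])
bag_focused, weights_focused = attention_pool(window_features, [0.0, 0.0, 10.0])
```

Equal scores reduce to a plain average over windows, while a single large score makes the bag feature follow that window, which is how a bag-level model can weigh brief but informative moments in a long interaction.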

[CV-33] ART: Adaptive Relation Tuning for Generalized Relation Prediction ICCV2025

[Quick read]: This paper addresses the poor generalization of Visual Relation Detection (VRD) models to relations not seen in training, and in particular the limitation that prompt tuning relies on handcrafted prompts and struggles with novel or complex relations. The key to the solution is Adaptive Relation Tuning (ART), a framework that converts VRD datasets into an instruction tuning format and applies an adaptive sampling algorithm, directing the Vision-Language Model (VLM) to focus on informative relations. This improves inference of unseen relation concepts while maintaining generalization across datasets.

Link: https://arxiv.org/abs/2507.23543
Authors: Gopika Sudhakaran,Hikaru Shindo,Patrick Schramowski,Simone Schaub-Meyer,Kristian Kersting,Stefan Roth
Affiliations: TU Darmstadt; hessian.AI; DFKI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in ICCV 2025

Abstract:Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART’s practical value by using the predicted relations for segmenting complex scenes.

[CV-34] A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving

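[Quick read]: This paper addresses the insufficient adaptability, robustness, and interpretability of autonomous driving systems in complex open-world environments, which stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. The key to the solution is a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, using GPT-4.1 as the reasoning core. The framework tightly couples low-level perception with high-level semantic reasoning to support context-aware, explainable, and safety-bounded decision-making. In an urban intersection scenario with a construction zone, it is reported to perform better in trajectory tracking, speed prediction, and adaptive planning.

Link: https://arxiv.org/abs/2507.23540
Authors: Yi Zhang,Erik Leo Haß,Kuo-Yi Chao,Nenad Petrovic,Yinglei Song,Chengdong Wu,Alois Knoll
Affiliations: Technical University of Munich (TUM)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: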

Abstract:Autonomous driving systems face significant challenges in achieving human-like adaptability, robustness, and interpretability in complex, open-world environments. These challenges stem from fragmented architectures, limited generalization to novel scenarios, and insufficient semantic extraction from perception. To address these limitations, we propose a unified Perception-Language-Action (PLA) framework that integrates multi-sensor fusion (cameras, LiDAR, radar) with a large language model (LLM)-augmented Vision-Language-Action (VLA) architecture, specifically a GPT-4.1-powered reasoning core. This framework unifies low-level sensory processing with high-level contextual reasoning, tightly coupling perception with natural language-based semantic understanding and decision-making to enable context-aware, explainable, and safety-bounded autonomous driving. Evaluations on an urban intersection scenario with a construction zone demonstrate superior performance in trajectory tracking, speed prediction, and adaptive planning. The results highlight the potential of language-augmented cognitive frameworks for advancing the safety, interpretability, and scalability of autonomous driving systems.

[CV-35] Continual Learning with Synthetic Boundary Experience Blending

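[Quick read]: This paper addresses the performance degradation caused by catastrophic forgetting when a model learns multiple tasks sequentially in Continual Learning (CL). Existing methods such as experience replay are often limited by the sparse distribution of stored key samples, which leads to overly simplified decision boundaries. The key to the solution is a new training framework called Experience Blending, with two core components: (1) a multivariate Differential Privacy (DP) noise mechanism that injects batch-wise noise into low-dimensional feature representations to generate Synthetic Boundary Data (SBD) near the decision boundary; (2) an end-to-end training strategy that jointly uses stored key samples and SBD, which acts as an implicit regularizer that stabilizes decision boundaries and mitigates forgetting. On CIFAR-10, CIFAR-100, and Tiny ImageNet, the method improves accuracy by 10%, 6%, and 13%, respectively, over nine CL baselines.

Link: https://arxiv.org/abs/2507.23534
Authors: Chih-Fan Hsu,Ming-Ching Chang,Wei-Chao Chen
Affiliations: Inventec; University of Albany, SUNY
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: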

Abstract:Continual learning (CL) aims to address catastrophic forgetting in models trained sequentially on multiple tasks. While experience replay has shown promise, its effectiveness is often limited by the sparse distribution of stored key samples, leading to overly simplified decision boundaries. We hypothesize that introducing synthetic data near the decision boundary (Synthetic Boundary Data, or SBD) during training serves as an implicit regularizer, improving boundary stability and mitigating forgetting. To validate this hypothesis, we propose a novel training framework, Experience Blending, which integrates knowledge from both stored key samples and synthetic, boundary-adjacent data. Experience blending consists of two core components: (1) a multivariate Differential Privacy (DP) noise mechanism that injects batch-wise noise into low-dimensional feature representations, generating SBD; and (2) an end-to-end training strategy that jointly leverages both stored key samples and SBD. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate that our method outperforms nine CL baselines, achieving accuracy improvements of 10%, 6%, and 13%, respectively.
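
The first component above perturbs low-dimensional features with noise to obtain synthetic samples. The sketch below swaps in plain isotropic Gaussian noise as a stand-in for the paper's multivariate DP mechanism, whose exact form the abstract does not give. It illustrates only the perturbation step: `sigma` is a hypothetical scale that is not calibrated to any privacy budget, and nothing here shows why the perturbed features would land near a decision boundary.

```python
import random

def make_boundary_samples(features, sigma, seed=0):
    """Return a noisy copy of a batch of feature vectors.

    Each coordinate receives independent Gaussian noise with standard
    deviation sigma (assumption: isotropic noise, no privacy accounting).
    """
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, sigma) for value in row] for row in features]

# A toy batch of three 2-D feature vectors from stored key samples.
features = [[0.2, -0.1], [0.5, 0.4], [-0.3, 0.9]]
synthetic = make_boundary_samples(features, sigma=0.1, seed=42)
```

In a replay setup along the lines the abstract describes, `features` and `synthetic` would be fed to the classifier head together, so the model sees both the stored samples and their perturbed neighbors.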

[CV-36] H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

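[Quick read]: This paper addresses the limits that scarce high-quality demonstration data place on imitation learning for robotic manipulation. Existing approaches pre-train on cross-embodiment robot data to increase scale, but differences in morphology and action space across robots make unified training difficult. The key to the solution is to use large-scale egocentric human manipulation videos with paired 3D hand pose annotations as behavioral priors for robot policy learning. Training proceeds in two stages: pre-training on human manipulation data, then cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a 2B-parameter diffusion transformer that uses flow matching to model complex action distributions, H-RDT improves over training from scratch by 13.9% in simulation and 40.5% in real-world experiments, supporting the use of human manipulation data as a foundation for learning bimanual robotic manipulation policies.

Link: https://arxiv.org/abs/2507.23523
Authors: Hongzhe Bi,Lingxuan Wu,Tianwei Lin,Hengkai Tan,Zhizhong Su,Hang Su,Jun Zhu
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: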

Abstract:Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, while they face significant limitations as the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including Pi0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.
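
The abstract names flow matching without giving its formulation. The sketch below uses the common linear-interpolation variant, in which a sample on the path is `(1 - t) * noise + t * action` and the regression target is the constant velocity `action - noise`. That choice is an assumption, and `flow_matching_pair`, `flow_matching_loss`, and `predict_velocity` are hypothetical names rather than H-RDT's API.

```python
def flow_matching_pair(noise, action, t):
    """Return the interpolated point x_t and its target velocity.

    Assumes a straight-line path from noise (t = 0) to action (t = 1).
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, velocity

def flow_matching_loss(predict_velocity, noise, action, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t, target = flow_matching_pair(noise, action, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)

# Toy 2-D action and noise sample.
noise = [0.5, -1.0]
action = [1.5, 2.0]
start, velocity = flow_matching_pair(noise, action, 0.0)
end, _ = flow_matching_pair(noise, action, 1.0)

# A stand-in "model" that always predicts zero velocity.
zero_loss = flow_matching_loss(lambda x, t: [0.0, 0.0], noise, action, 0.3)
# An oracle that returns the true velocity has zero loss.
oracle_loss = flow_matching_loss(lambda x, t: velocity, noise, action, 0.3)
```

At inference time a trained velocity model would be integrated from a noise sample at `t = 0` to an action at `t = 1`; that integration loop is omitted here.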

[CV-37] I Am Big You Are Little; I Am Right You Are Wrong ICCV2025

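[Quick Read]: This paper addresses the opacity of decision-making in vision models for image classification, namely how to quantify and understand which image features models of different architectures attend to. The core idea is to use minimal sufficient pixel sets to measure a model's "concentration": the smallest set of pixels that captures the essence of an image as seen by the model. By comparing the position, overlap, and size of these pixel sets across models, the authors find that different architectures have statistically different concentration, with ConvNext and EVA models differing markedly from the others. Misclassified images are also associated with larger pixel sets than correctly classified ones, suggesting a link between model uncertainty and how widely its attention is spread.

Link: https://arxiv.org/abs/2507.23509
Authors: David A. Kelly, Akchunya Chanchal, Nathan Blake
Affiliations: King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, International Conference on Computer Vision, ICCV 2025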

Abstract:Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model’s classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model’s ‘concentration’: the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixel sets than correct classifications.

[CV-38] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

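[Quick Read]: This paper addresses the degradation of multi-modal image fusion caused by inaccurate cross-modal registration. Existing registration methods, often based on image translation in Euclidean space, handle cross-modal misalignment poorly, which leads to suboptimal alignment. The key to the solution is a registration framework in hyperbolic space, the Hyperbolic Cycle Alignment Network (Hy-CycleAlign), which the authors describe as the first image registration method based on hyperbolic space. It uses a dual-path cyclic registration structure: a forward network aligns the cross-modal inputs and a backward network reconstructs the original image, forming a closed-loop constraint. A Hyperbolic Hierarchy Contrastive Alignment (H²CA) module maps images into hyperbolic space and imposes registration constraints, reducing interference from modality discrepancies. The authors report that hyperbolic space yields more sensitive and effective multi-modal registration, which in turn improves alignment and fusion quality.

Link: https://arxiv.org/abs/2507.23508
Authors: Timing Li, Bing Cao, Jiahe Feng, Haifang Cao, Qinghau Hu, Pengfei Zhu
Affiliations: Tianjin University; Chongqing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: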

Abstract:Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H²CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.
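
To give a feel for why distances in hyperbolic space can be "more sensitive" than Euclidean ones, here is a small sketch of the geodesic distance in the Poincaré ball model with curvature -1. The abstract does not say which model of hyperbolic space Hy-CycleAlign uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


if __name__ == "__main__":
    # Same Euclidean gap (0.1), very different hyperbolic distances.
    print(poincare_distance([0.0, 0.0], [0.1, 0.0]))
    print(poincare_distance([0.85, 0.0], [0.95, 0.0]))
```

The same Euclidean offset costs far more near the boundary of the ball than near the origin, which is one way small embedding differences can be amplified.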

[CV-39] Causal Identification of Sufficient Contrastive and Complete Feature Sets in Image Classification

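[Quick Read]: This paper addresses the lack of formal rigor in explanations of image classifier outputs. Existing explanation algorithms rely on a variety of approaches and produce explanations without formal guarantees, while logic-based explanations are rigorously defined but computable only under strict assumptions about the model that do not hold for image classifiers. The key to the solution is causal explanations, which are formally defined, share the formal properties of logic-based explanations, work with black-box algorithms, and fit image classification naturally. The authors also introduce contrastive causal explanations and, by adding confidence awareness, complete causal explanations, which are classified with exactly the same confidence as the original image. Experiments show the algorithms are efficient, taking on average 6 s per image on a ResNet50 model to compute all explanation types, and fully black-box: they need no access to model internals or gradients and assume no model properties such as monotonicity.

Link: https://arxiv.org/abs/2507.23497
Authors: David A Kelly, Hana Chockler
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 13 figures, appendix included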

Abstract:Existing algorithms for explaining the outputs of image classifiers are based on a variety of approaches and produce explanations that lack formal rigor. On the other hand, logic-based explanations are formally and rigorously defined but their computability relies on strict assumptions about the model that do not hold on image classifiers. In this paper, we show that causal explanations, in addition to being formally and rigorously defined, enjoy the same formal properties as logic-based ones, while still lending themselves to black-box algorithms and being a natural fit for image classifiers. We prove formal properties of causal explanations and introduce contrastive causal explanations for image classifiers. Moreover, we augment the definition of explanation with confidence awareness and introduce complete causal explanations: explanations that are classified with exactly the same confidence as the original image. We implement our definitions, and our experimental results demonstrate that different models have different patterns of sufficiency, contrastiveness, and completeness. Our algorithms are efficiently computable, taking on average 6s per image on a ResNet50 model to compute all types of explanations, and are totally black-box, needing no knowledge of the model, no access to model internals, no access to gradient, nor requiring any properties, such as monotonicity, of the model.

[CV-40] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions IROS2025

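[Quick Read]: This paper addresses accurate mass estimation of table-top grown strawberries under field conditions, where frequent occlusions and pose variations are the main challenges. The key to the solution is a vision pipeline that combines RGB-D sensing with deep learning: YOLOv8-Seg performs instance segmentation, CycleGAN completes occluded regions (outperforming the LaMa model on pixel area ratio (PAR) and intersection over union (IoU)), tilt-angle correction refines the frontal projection area, and a polynomial regression model maps the geometric features to mass. The method gives non-destructive, real-time, online mass estimation, with mean errors of 8.11% for isolated strawberries and 10.47% for occluded ones, and offers a robust basis for automated harvesting and yield monitoring.

Link: https://arxiv.org/abs/2507.23487
Authors: Jinshan Zhen, Yuanyue Ge, Tianxiao Zhu, Hui Zhao, Ya Xiong
Affiliations: Beijing Academy of Agriculture and Forestry Sciences; Tianjin University of Technology; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IROS 2025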

Abstract:Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, Cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
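
The abstract compares occlusion completion using pixel area ratio (PAR) and intersection over union (IoU). The sketch below shows how these two mask metrics might be computed for binary masks stored as nested lists. IoU is the standard definition; treating PAR as completed area divided by ground-truth area is an assumption (consistent with the reported means near 1.0), and the function names are illustrative.

```python
def pixel_area_ratio(pred_mask, true_mask):
    """Area of the completed mask divided by the area of the ground-truth mask."""
    pred_area = sum(1 for row in pred_mask for p in row if p)
    true_area = sum(1 for row in true_mask for t in row if t)
    if true_area == 0:
        raise ValueError("ground-truth mask is empty")
    return pred_area / true_area


def mask_iou(pred_mask, true_mask):
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    completed = [[1, 1, 0], [1, 1, 0]]
    reference = [[0, 1, 1], [0, 1, 1]]
    print(pixel_area_ratio(completed, reference), mask_iou(completed, reference))
```

The toy example has a perfect area ratio but poor overlap, which is why reporting both metrics is informative: PAR checks that the recovered size is right, IoU checks that it is in the right place.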

[CV-41] Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion ICCV2025

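[Quick Read]: This paper addresses the gap between synthetic and real-captured data in 3D data simulation (the Sim2Real problem), a fundamental problem for real-world 3D vision tasks. Most simulation methods inject predefined physical priors and struggle to capture the complexity of real data, while data-driven learning of a synthetic-to-real mapping is promising but has stalled in recent work. The proposed Stable-Sim2Real is built on a novel two-stage depth diffusion model. The first stage fine-tunes Stable-Diffusion to generate the residual between paired real and synthetic depth, producing a stable but coarse depth map. The second stage feeds both the synthetic depth and the first-stage output into another diffusion model, with the diffusion loss adjusted to prioritize the local regions that a 3D discriminator identifies as deviating from realistic patterns. Experiments show that training with the resulting simulated data significantly improves performance on real-world 3D vision tasks, and that the simulated data closely resembles real-captured patterns.

Link: https://arxiv.org/abs/2507.23483
Authors: Mutian Xu, Chongjie Ye, Haolin Liu, Yushuang Wu, Jiahao Chang, Xiaoguang Han
Affiliations: SSE, CUHKSZ; FNii-Shenzhen; Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHKSZ; Tencent Hunyuan3D; ByteDance Games
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 (Highlight). Project page: this https URL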

Abstract:3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: this https URL.

[CV-42] FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction ICCV2025

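[Quick Read]: This paper addresses the inefficiency of deep neural networks on large, irregular point clouds, in particular the high cost of farthest point sampling (FPS) and neighbor search. The key to the solution is FastPoint, a software-based acceleration technique that exploits the predictable trend in the distance between sampled points during FPS: by predicting the distance curve, it identifies subsequent sample points without exhaustively computing all pairwise distances. The method substantially speeds up FPS and neighbor search while preserving sampling quality and model performance, giving a 2.55x end-to-end speedup on an NVIDIA RTX 3090 GPU.

Link: https://arxiv.org/abs/2507.23480
Authors: Donghyun Lee, Dawoon Jeong, Jae W. Lee, Hongil Yoon
Affiliations: Seoul National University; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025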

Abstract:Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
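
For context, the sketch below is the plain baseline farthest point sampling that FastPoint sets out to accelerate, not FastPoint itself. It also records the distance at which each new point is picked; in standard FPS this sequence never increases, which is presumably the "predictable distance trend" the abstract refers to. The function name and return format are illustrative.

```python
def farthest_point_sampling(points, k, start=0):
    """Baseline FPS: repeatedly pick the point farthest from the selected set.

    Returns the selected indices and the distance at which each new point
    (after the first) was picked.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_sq_dist[i]: squared distance from point i to its nearest selected point.
    min_sq_dist = [float("inf")] * n
    picked_distances = []
    for _ in range(k - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_sq_dist[i]:
                min_sq_dist[i] = d
        nxt = max(range(n), key=min_sq_dist.__getitem__)
        selected.append(nxt)
        picked_distances.append(min_sq_dist[nxt] ** 0.5)
    return selected, picked_distances


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
    print(farthest_point_sampling(cloud, k=3))
```

Each iteration scans every point, so sampling k points from n costs O(nk) distance evaluations; that full scan is the work a predicted distance curve would let an accelerated method skip.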

[CV-43] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning MICCAI2025

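[Quick Read]: This paper addresses the short battery lifetime of video capsule endoscopy devices used to examine the small intestine. On-device AI could save energy through intelligent real-time decision-making, but the limited resources of such compact sensor edge devices restrict model size, and data sparsity makes training harder. The key to the solution is a multi-task neural network with only 1 million parameters that combines precise self-localization within the gastrointestinal tract and anomaly detection in the small intestine in a single model. It integrates established multi-task learning methods with Viterbi decoding for time-series analysis, and on the Galar dataset reaches 93.63% accuracy on localization and 87.48% on anomaly detection, outperforming current single-task models at low computational cost.

Link: https://arxiv.org/abs/2507.23479
Authors: Julia Werner, Oliver Bause, Julius Oexle, Maxime Le Floch, Franz Brinkmann, Jochen Hampe, Oliver Bringmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Applications of Medical AI (AMAI workshop) at MICCAI 2025 (submitted version)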

Abstract:Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility to deploy such model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
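
The abstract mentions Viterbi decoding for time-series analysis of the per-frame predictions. A minimal sketch of how that could smooth a localization sequence is shown below. The two states, the transition matrix that forbids moving backwards, and the per-frame probabilities are all hypothetical values chosen for illustration; the paper's actual state set and parameters are not given in the abstract.

```python
import math


def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(log_emissions, log_transitions, log_prior):
    """Most likely state sequence.

    log_emissions[t][s]: log-probability of state s at frame t (e.g. classifier output).
    log_transitions[prev][cur]: log-probability of moving from prev to cur.
    log_prior[s]: log-probability of starting in state s.
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    back_pointers = []
    for t in range(1, len(log_emissions)):
        new_score, pointers = [], []
        for cur in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda prev: score[prev] + log_transitions[prev][cur],
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_transitions[best_prev][cur] + log_emissions[t][cur]
            )
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Hypothetical: state 0 = stomach, state 1 = small intestine.
    frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
    transitions = [[0.9, 0.1], [0.0, 1.0]]  # the capsule never moves backwards
    prior = [0.5, 0.5]
    path = viterbi(
        [[safe_log(p) for p in row] for row in frame_probs],
        [[safe_log(p) for p in row] for row in transitions],
        [safe_log(p) for p in prior],
    )
    print(path)
```

Taking the per-frame argmax of these probabilities would flicker between the two organs (0, 1, 0, 1, 1); the decoded path instead commits to a single transition point, which is the kind of temporal consistency a forward-moving capsule should satisfy.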

[CV-44] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

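[Quick Read]: This paper addresses the weak reasoning and poor generalization of current 3D vision-language models (3D VLMs) in complex scene understanding, which stem from limited high-quality spatial data and static viewpoint assumptions. The solution has three parts. First, a high-quality synthetic dataset with chain-of-thought (CoT) annotations, Scene-30K, is built with a data engine based on Gemini 2.5 Pro and used as cold-start initialization data. Second, reinforcement learning with GRPO (which the abstract describes as an RLHF policy) is combined with three reward functions, a perception reward, a semantic similarity reward, and a format reward, to maintain detection accuracy and answer semantic precision. Third, a dynamic view selection strategy adaptively picks the most informative viewpoints. The authors report an average improvement of 10% across 3D scene benchmarks.

Link: https://arxiv.org/abs/2507.23478
Authors: Ting Huang, Zeyu Zhang, Hao Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: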

Abstract:Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: this https URL. Website: this https URL.
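
The abstract names GRPO and three reward terms without saying how they are combined. The sketch below illustrates the group-relative advantage that GRPO is generally built on: several answers to the same prompt are scored, and each score is normalized by the group's mean and standard deviation. Summing the three rewards with equal weights, and using the population standard deviation, are assumptions for illustration and not details from the paper.

```python
import statistics


def total_reward(perception, semantic, fmt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (equal weights are an assumption)."""
    return weights[0] * perception + weights[1] * semantic + weights[2] * fmt


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical scores for four sampled answers to one 3D scene question.
    group = [
        total_reward(0.8, 0.7, 1.0),
        total_reward(0.2, 0.4, 1.0),
        total_reward(0.5, 0.5, 0.0),
        total_reward(0.9, 0.9, 1.0),
    ]
    print(group_relative_advantages(group))
```

Because advantages are relative to the group, no separate value network is needed; answers better than their siblings get a positive signal and worse ones a negative signal.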

[CV-45] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes ICCV

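[Quick Read]: This paper addresses the limited real-world applicability of existing Unmanned Aerial Vehicle (UAV) tracking datasets, which mostly feature conspicuous targets and lack diversity in scene complexity and attribute annotation, so trackers are poorly tested on tiny UAVs in complex environments. The key contribution is CST Anti-UAV, a thermal infrared dataset for Single Object Tracking (SOT) of tiny UAVs in complex scenes. It contains 220 video sequences with over 240k high-quality bounding box annotations and, according to the authors, is the first dataset with complete manual frame-level attribute annotations, which allows precise evaluation under varied challenges. An evaluation of 20 existing SOT methods shows that the state-of-the-art method reaches only 35.92% state accuracy on this dataset, far below the 67.69% observed on Anti-UAV410, which highlights the limits of current methods and benchmarks.

Link: https://arxiv.org/abs/2507.23473
Authors: Bin Xie, Congxuan Zhang, Fagan Wang, Peng Liu, Feng Lu, Zhen Chen, Weiming Hu
Affiliations: Nanchang Hangkong University; Beihang University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCVW2025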

Abstract:The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.

[CV-46] Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection

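[Quick Read]: This paper addresses the significant performance drop of Federated Learning (FL) on non-classification tasks such as human pose estimation when image resolution varies across clients, a problem the authors call "resolution-drift". It shows that resolution is another axis of non-IID data, distinct from class-level heterogeneity. The key to the solution is Resolution-Adaptive Federated Learning (RAF), which relies on heatmap-based knowledge distillation: multi-resolution distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students) improves robustness to resolution changes without overfitting. Experiments and theoretical analysis indicate that RAF mitigates resolution drift, improves performance, and integrates seamlessly into existing FL frameworks.

Link: https://arxiv.org/abs/2507.23461
Authors: Taeheon Lim, Joohyung Lee, Kyungjae Lee, Jungchan Cho
Affiliations: Chung-Ang University; Gachon University; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: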

Abstract:The Federated Learning (FL) approach enables effective learning across distributed systems, while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. This paper identifies and investigates a critical issue termed “resolution-drift,” where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of not independent or identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but also can be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.
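
To make the idea of heatmap-based distillation across resolutions concrete, here is a minimal sketch of one possible loss: the higher-resolution teacher heatmap is average-pooled down to the student's size and compared with the student heatmap by mean squared error. The abstract does not specify how RAF matches resolutions or which loss it uses, so the pooling step, the MSE choice, and the function names are all assumptions.

```python
def average_pool(heatmap, factor):
    """Downsample a 2D heatmap by averaging non-overlapping factor x factor blocks."""
    height, width = len(heatmap), len(heatmap[0])
    if factor < 1 or height % factor or width % factor:
        raise ValueError("heatmap size must be divisible by the pooling factor")
    pooled = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [
                heatmap[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def heatmap_distillation_loss(teacher_heatmap, student_heatmap):
    """MSE between the pooled teacher heatmap and the student heatmap."""
    factor = len(teacher_heatmap) // len(student_heatmap)
    target = average_pool(teacher_heatmap, factor)
    squared = [
        (t - s) ** 2
        for target_row, student_row in zip(target, student_heatmap)
        for t, s in zip(target_row, student_row)
    ]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    teacher = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
    student = [[0.5, 0.0], [0.0, 2.0]]
    print(heatmap_distillation_loss(teacher, student))
```

In a pose estimation setting there would be one such heatmap per keypoint, and the loss would be averaged over keypoints and added to the usual supervised loss.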

[CV-47] Machine learning and machine learned prediction in chest X-ray images

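[Quick Read]: This paper addresses disease prediction from medical images as a classification problem, specifically using machine learning to identify from chest X-ray images whether a patient has an ailment. The approach trains two deep learning models, a baseline convolutional neural network (CNN) and a DenseNet-121, on 5824 chest X-ray images, and uses gradient-weighted class activation mapping (Grad-CAM) to inspect which image regions drive each model's decisions. Both models perform very well on the binary classification task, and Grad-CAM shows that DenseNet-121 focuses on the essential parts of the input images more than the baseline CNN does.

Link: https://arxiv.org/abs/2507.23455
Authors: Shereiff Garrett, Abhinav Adhikari, Sarina Gautam, DaShawn Marquis Morris, Chandra Mani Adhikari
Affiliations: Fayetteville State University; University of Nebraska Omaha; Jack Britt High School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures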

Abstract:Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. This approach helps to solve seemingly intricate problems with significant accuracy without explicit programming by recognizing complex relationships in data. Taking an example of 5824 chest X-ray images, we implement two machine learning algorithms, namely, a baseline convolutional neural network (CNN) and a DenseNet-121, and present our analysis in making machine-learned predictions in predicting patients with ailments. Both baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 correctly focuses on essential parts of the input chest X-ray images in its decision-making more than the baseline CNN.

[CV-48] Adjustable Spatio-Spectral Hyperspectral Image Compression Network

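[Quick Read]: This paper addresses a gap in learning-based hyperspectral image (HSI) compression for remote sensing (RS): the individual and joint effects of spectral and spatial compression have not been studied systematically, so it is unclear how exploiting spectral, spatial, and joint spatio-spectral redundancies affects compression at different compression ratios (CRs). The authors propose the Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a model for adjustable compression in both spectral and spatial dimensions. It consists of six modules: a spectral encoder, a spatial encoder, a CR adapter encoder, a CR adapter decoder, a spatial decoder, and a spectral decoder, which use convolutional layers and transformer blocks to capture short-range and long-range redundancies. Experiments on two HSI benchmark datasets show that the model is effective compared with existing learning-based compression models, and the results yield a guideline for balancing spectral and spatial compression across CRs that takes the spatial resolution of the HSIs into account.

Link: https://arxiv.org/abs/2507.23447
Authors: Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir
Affiliations: Technische Universität Berlin; Berlin Institute for the Foundations of Learning and Data (BIFOLD)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: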

Abstract:With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at this https URL .

[CV-49] Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification

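[Quick Read]: This paper addresses the difficulty of art style classification in computational aesthetics, which comes from the scarcity of expertly labeled data and the intricate, nonlinear interplay of stylistic elements. The key to the solution is to improve the dual-teacher self-supervised knowledge distillation framework by replacing the conventional multilayer perceptron (MLP) projection and prediction heads with Kolmogorov-Arnold Networks (KANs), whose spline-based activations model global compositional context and nonlinear style-feature correlations more precisely. On WikiArt and Pandora18k, the approach outperforms the base dual-teacher architecture in Top-1 accuracy and achieves better linear probe accuracy than MLP projections.

Link: https://arxiv.org/abs/2507.23436
Authors: Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante, Abdelmalik Taleb-Ahmed
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: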

Abstract:Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies while leveraging KANs’ spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.

[CV-50] Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning

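[Quick Read]: This paper addresses honey adulteration, specifically adulteration with sugar syrup, an important quality control problem in the food industry. The key to the solution is an automatic machine learning system that works on hyperspectral imaging data and consists of two subsystems. The first extracts features with Linear Discriminant Analysis (LDA) and uses a K-Nearest Neighbors (KNN) model to identify the botanical origin of the honey. The second uses the same LDA feature extraction and a KNN classifier to detect sugar syrup adulteration and quantify its concentration. On a public dataset the system reaches an overall cross-validation accuracy of 96.39%, which the authors present as an appropriate alternative to current chemical-based detection methods.

Link: https://arxiv.org/abs/2507.23416
Authors: Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: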

Abstract:This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.
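
The second step of both subsystems is a K-Nearest Neighbors classifier. A minimal pure-Python sketch of that step is shown below; it assumes the LDA feature extraction has already produced low-dimensional feature vectors, and the feature values and the labels `"pure"` and `"adulterated"` are made up for illustration rather than taken from the paper's dataset.

```python
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    if not 0 < k <= len(train_features):
        raise ValueError("k must be between 1 and the number of training samples")
    # Squared Euclidean distance preserves the neighbor ordering.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(features, query)), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical 2D features, as if produced by an LDA projection.
    features = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
    labels = ["pure", "pure", "pure", "adulterated", "adulterated", "adulterated"]
    print(knn_predict(features, labels, (0.05, 0.1)))
    print(knn_predict(features, labels, (1.0, 0.95)))
```

The same function could serve both subsystems by swapping the label set: botanical origins in the first, adulteration levels in the second.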

[CV-51] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories MICCAI2025

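[Quick Read]: This paper addresses unsupervised out-of-distribution (OOD) detection in medical imaging for pathologies with very low incidence, where existing generative methods are computationally expensive, unreliable, and need retraining when the inlier data changes. The key to the solution is a reconstruction-free OOD detection method that uses the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM): the estimated Stein score captures trajectory curvature, which gives accurate anomaly scores with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes across multiple Near-OOD and Far-OOD benchmarks, with relative improvements of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection respectively, while greatly reducing inference cost.

Link: https://arxiv.org/abs/2507.23411
Authors: Lemar Abdi, Francisco Caetano, Amaan Valiuddin, Christiaan Viviers, Hamdi Joudeh, Fons van der Sommen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, MICCAI 2025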

Abstract:In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.

[CV-52] AGA: An adaptive group alignment framework for structured medical cross-modal representation learning

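[Quick Read]: This paper addresses two problems in medical vision-language pretraining: existing methods reduce clinical reports to single entities or fragmented tokens and ignore their inherent structure, and contrastive learning frameworks depend on large numbers of hard negative samples, which is impractical for small medical datasets. The core of the solution is the Adaptive Grouped Alignment (AGA) framework, built around a bidirectional grouping mechanism based on a sparse similarity matrix: each text token selects its top-matching image patches to form a visual group, and each image patch selects its most related tokens to form a language group. Two modules, the Language Grouped Threshold Gate and the Vision Grouped Threshold Gate, learn the grouping thresholds dynamically, and group representations are computed as similarity-weighted averages. An Instance Aware Group Alignment loss aligns each token with its group representation within each image-text pair, with no external negatives, and a Bidirectional Cross-modal Grouped Alignment module further strengthens fine-grained alignment between visual and language group representations.

Link: https://arxiv.org/abs/2507.23402
Authors: Wei Li, Xun Gong, Jiao Li, Xiaobin Sun
Affiliations: Southwest Jiaotong University; The Third People’s Hospital of Chengdu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: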

Abstract:Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
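
The grouping step described in the abstract can be sketched for a single text token as follows: patches whose similarity to the token passes a threshold form the visual group, and the group representation is their similarity-weighted average. In AGA the threshold is learned by a gating module; here it is a fixed argument, and the fallback to the single best patch when nothing passes, the positive-similarity requirement, and the function name are assumptions added to keep the sketch well defined.

```python
def visual_group_representation(token_sims, patch_features, threshold):
    """Similarity-weighted average of the patches selected for one text token.

    token_sims[j] is the similarity between the token and image patch j.
    Patches with similarity >= threshold form the group; if none pass,
    the single most similar patch is used (an assumption of this sketch).
    """
    chosen = [
        (sim, feat)
        for sim, feat in zip(token_sims, patch_features)
        if sim >= threshold
    ]
    if not chosen:
        best = max(range(len(token_sims)), key=token_sims.__getitem__)
        chosen = [(token_sims[best], patch_features[best])]
    total = sum(sim for sim, _ in chosen)
    if total <= 0.0:
        raise ValueError("selected similarities must sum to a positive value")
    dim = len(patch_features[0])
    return [sum(sim * feat[d] for sim, feat in chosen) / total for d in range(dim)]


if __name__ == "__main__":
    sims = [0.8, 0.1, 0.2]  # hypothetical token-to-patch similarities
    patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    print(visual_group_representation(sims, patches, threshold=0.15))
```

The language group for an image patch would be built the same way with the roles of tokens and patches swapped, which is what makes the mechanism bidirectional.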

[CV-53] NeRF Is a Valuable Assistant for 3D Gaussian Splatting ICCV

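【Quick Read】: This paper addresses three limitations of 3D Gaussian Splatting (3DGS) in practice: sensitivity to Gaussian initialization, limited spatial awareness, and weak modeling of correlations between Gaussians. The authors propose NeRF-GS, a framework that jointly optimizes Neural Radiance Fields (NeRF) and 3DGS, using NeRF's continuous spatial representation to improve the geometric consistency and local detail of 3DGS. The key ideas are: (1) redesigning the spatial features of 3DGS so that they progressively align with NeRF, which lets the two representations be optimized together over shared 3D spatial information; and (2) optimizing residual vectors for both the implicit features and the Gaussian positions to strengthen the personalized modeling capability of 3DGS. Experiments on benchmark datasets report state-of-the-art performance, supporting the view that NeRF and 3DGS are complementary rather than competing and pointing to hybrid modeling for efficient 3D scene representation.

Link: https://arxiv.org/abs/2507.23374
Authors: Shuangkang Fang,I-Chao Shen,Takeo Igarashi,Yufeng Wang,ZeSheng Wang,Yi Yang,Wenrui Ding,Shuchang Zhou
Affiliations: Beihang University; The University of Tokyo; StepFun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV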

Abstract:We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

[CV-54] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

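【Quick Read】: This paper targets unstable feature alignment in CLIP (Contrastive Language-Image Pretraining) based Unsupervised Domain Adaptation (UDA) under the multi-source setting, caused by pseudo-label noise and varying sample difficulty. Existing methods typically use a "one-shot" alignment strategy that aligns domains with all pseudo-labeled data at once, which is easily disturbed by hard-to-classify samples and noisy labels and leads to error propagation, especially in Multi-Source UDA (MS-UDA). The key to the solution is a progressive alignment strategy (MP²A): the model is first trained on high-confidence target-domain samples to build a reliable initial representation, and more challenging samples are then introduced gradually so that the model keeps refining its features without being overwhelmed by initial noise. This mitigates confirmation bias, promotes more robust convergence, and enables learning of genuinely domain-invariant features.

Link: https://arxiv.org/abs/2507.23373
Authors: Haoran Chen,Zexiao Wang,Haidong Cao,Zuxuan Wu,Yu-Gang Jiang
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: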

Abstract:Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is further amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments show that MP^2A achieves state-of-the-art performance when compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
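
The progressive selection idea in this abstract can be sketched as a confidence threshold that relaxes over training rounds. The linear schedule, the default values, and the name `progressive_subsets` are assumptions made for illustration; the abstract does not say how MP^2A schedules the inclusion of harder samples.

```python
def progressive_subsets(confidences, start=0.9, end=0.5, rounds=3):
    """Return, for each round, the indices of pseudo-labeled samples to train on."""
    subsets = []
    for r in range(rounds):
        # Relax the threshold linearly from `start` to `end` (assumed schedule).
        frac = r / (rounds - 1) if rounds > 1 else 0.0
        threshold = start + (end - start) * frac
        # Keep only the samples whose pseudo-label confidence clears the threshold.
        subsets.append([i for i, c in enumerate(confidences) if c >= threshold])
    return subsets

confidences = [0.95, 0.75, 0.55, 0.30]
subsets = progressive_subsets(confidences)
```

Early rounds see only the most reliable samples, and the lowest-confidence sample is never admitted under these settings.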

[CV-55] UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

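【Quick Read】: This paper addresses the limitations of treating emotional understanding and emotional generation as separate tasks, and proposes UniEmo, a unified framework in which the two reinforce each other. The key is a hierarchical emotional understanding chain that uses learnable expert queries to progressively extract multi-scale emotional features as the basis for unified modeling. These queries are fused with emotional representations to guide a diffusion model in generating emotion-evoking images, and an emotional correlation coefficient and an emotional condition loss are introduced to improve the diversity and fidelity of the generated images. In addition, a generation-driven dual feedback mechanism, in which the generation module gives implicit feedback to the understanding module and filtered high-quality generated images give explicit feedback, improves the model's overall emotional understanding.

Link: https://arxiv.org/abs/2507.23372
Authors: Yijie Zhu,Lingsen Zhang,Zitong Yu,Rui Shao,Tao Tan,Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen; Great Bay University; Macao Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: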

Abstract:Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model’s understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at this https URL.

[CV-56] VMatcher: State-Space Semi-Dense Local Feature Matching

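【Quick Read】: This paper addresses the computational bottleneck of learning-based feature matching, in particular the high cost that comes from the quadratic complexity of Transformer attention. Existing methods perform well but struggle to meet the fast-inference requirements of real-time applications. The key is VMatcher, a hybrid Mamba-Transformer architecture that uses the linear-complexity Selective State-Space Model (SSM) to replace or complement Transformer attention modules, substantially reducing computational cost while maintaining or improving matching accuracy and thereby enabling efficient, robust semi-dense feature matching.

Link: https://arxiv.org/abs/2507.23371
Authors: Ali Youssef
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: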

Abstract:This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer’s attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba’s highly efficient long-sequence processing with the Transformer’s attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: this https URL

[CV-57] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers ACM-MM25

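【Quick Read】: This paper addresses the problem that Large Vision-Language Models (LVLMs) are limited in practice by their large parameter counts and high computational cost. Layer pruning methods from Natural Language Processing (NLP) work for text models but perform poorly when applied directly to LVLMs because of the differences between the vision and language modalities. The study finds that non-essential vision-language (VL) tokens and inter-layer feature gaps are the main obstacles to layer pruning in LVLMs. The authors therefore propose Short-LVLM (SVL), a training-free, model-agnostic, and highly compatible framework whose core is to identify and retain important VL tokens and to mitigate inter-layer feature gaps, substantially improving efficiency without sacrificing performance.

Link: https://arxiv.org/abs/2507.23362
Authors: Ji Ma,Wei Suo,Peng Wang,Yanning Zhang
Affiliations: Northwestern Polytechnical University; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted By ACM MM 25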

Abstract:Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at this https URL.

[CV-58] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report Summer 2025

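【Quick Read】: This report surveys the evolution of key design patterns in computer vision, asking how architectural innovation improves performance and efficiency in image recognition, generation, and self-supervised learning. It covers, in turn: residual connections (ResNet), which alleviate the vanishing gradient problem and allow deeper convolutional networks to be trained; the attention-based Vision Transformer (ViT), which applies the Transformer to sequences of image patches; Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs) for high-quality image generation, where LDMs denoise step by step in a perceptually compressed latent space to keep high fidelity at much lower computational cost; and self-supervised methods such as DINO and Masked Autoencoders (MAE), which reduce reliance on large amounts of labeled data, with MAE using an asymmetric encoder-decoder to reconstruct heavily masked inputs as an efficient, scalable pre-training paradigm for large vision models.

Link: https://arxiv.org/abs/2507.23357
Authors: Radu-Andrei Bourceanu,Neil De La Fuente,Jan Grimm,Andrei Jardan,Andriy Manucharyan,Cornelius Weiss,Roman Pflugfelder
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: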

Abstract:This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
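
The "momentum-updated teacher" that the abstract mentions for DINO is an exponential moving average (EMA) of the student's parameters. Below is a minimal sketch of that update on flat parameter lists; the function name `ema_update` and the momentum value in the example are illustrative assumptions, not taken from the report.

```python
def ema_update(teacher, student, momentum=0.99):
    """One EMA step: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# Toy parameters; a real model would apply the same rule tensor by tensor.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, momentum=0.5)
```

A momentum close to 1 makes the teacher change slowly, which is what gives the student a stable target to match.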

[CV-59] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

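【Quick Read】: This paper addresses the lack of systematic datasets and objective evaluation methods for assessing the quality of AI-Generated Talking Heads (AGTHs). Existing research leaves clear gaps in analyzing generation quality, generalization, and distortion types, which holds back progress in the field. The key contribution is THQA-10K, the largest AGTH quality assessment dataset to date, which covers 12 mainstream Text-to-Image (T2I) models and 14 advanced speech-driven face generation methods (Talkers) and contains 10,457 AGTH samples with subjective ratings and distortion categories. On this basis, the authors propose an objective quality assessment method based on the first frame, Y-T slices, and tone-lip consistency; experiments show that it achieves state-of-the-art (SOTA) performance in AGTH quality prediction.

Link: https://arxiv.org/abs/2507.23343
Authors: Yingjie Zhou,Jiezhang Cao,Zicheng Zhang,Farong Wen,Yanwei Jiang,Jun Jia,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; PengCheng Laboratory; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: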

Abstract:Speech-driven methods for portraits are figuratively known as “Talkers” because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis for subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at this https URL.

[CV-60] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

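【Quick Read】: This paper addresses the degradation of face detection performance on low-resolution images, which matters in applications such as surveillance, biometric authentication, and human-computer interaction. It systematically evaluates the accuracy and robustness of three widely used deep learning face detectors (YOLOv11, YOLOv12, and MTCNN) at input resolutions of 160×160, 320×320, and 640×640, using precision, recall, mAP50, mAP50-95, and inference time. The analysis quantifies each model's sensitivity to resolution and the resulting trade-offs, giving empirical guidance for choosing a face detector under different compute budgets and real-time requirements.

Link: https://arxiv.org/abs/2507.23341
Authors: Ahmet Can Ömercikoğlu(1),Mustafa Mansur Yönügül(1),Pakize Erdoğmuş(1) ((1) Düzce University, Department of Computer Engineering, Düzce, Türkiye)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, 4 tables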

Abstract:Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model’s performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.
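
The detection metrics named in this abstract rest on intersection-over-union (IoU) between predicted and ground-truth boxes; mAP50, for instance, counts a detection as correct at IoU of at least 0.5. The sketch below computes IoU and a simple precision/recall pair. The greedy one-to-one matching and the function names are illustrative assumptions, not the paper's evaluation code.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, truths, thresh=0.5):
    """Greedy matching; `preds` is assumed to be sorted by confidence."""
    matched, true_pos = set(), 0
    for p in preds:
        best, best_j = 0.0, None
        for j, t in enumerate(truths):
            if j not in matched and iou(p, t) > best:
                best, best_j = iou(p, t), j
        # A prediction counts as correct if its best unmatched truth clears the threshold.
        if best_j is not None and best >= thresh:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
truths = [(1, 1, 10, 10), (20, 20, 30, 30)]
result = precision_recall(preds, truths)
```

Here one of the two predictions matches a face, so precision and recall are both one half.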

[CV-61] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

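【Quick Read】: This paper addresses robust road surface reconstruction in complex urban environments, in particular occlusion by dynamic objects, clutter from static obstacles, and appearance degradation caused by lighting and weather changes. The key is a framework that combines occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement: a planar-adapted Gaussian representation enables efficient large-scale modeling; segmentation-guided video inpainting removes dynamic and static foreground objects; and semantic-aware correction in HSV space enforces color consistency, yielding visually coherent and geometrically accurate road surface reconstructions.

Link: https://arxiv.org/abs/2507.23340
Authors: Xingyue Peng,Yuandong Lyu,Lang Zhang,Jian Zhu,Songtao Wang,Jiaxin Deng,Songxin Lu,Weiliang Ma,Dangen She,Peng Jia,XianPeng Lang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: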

Abstract:Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.

[CV-62] Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision

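【Quick Read】: This paper addresses two key problems in Traffic Sign Recognition (TSR): datasets follow a pronounced long-tail distribution, so conventional convolutional networks perform much worse on low-frequency and out-of-distribution classes; and traffic signs in real scenes are mostly small targets with large scale variation, which makes multi-scale feature extraction difficult. The key is a two-stage framework that combines open-vocabulary detection with cross-modal learning. The first stage uses the NanoVerse YOLO model, which strengthens feature extraction for small, multi-scale targets through a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module. The second stage is a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL), which contrasts image features from a Vision Transformer with semantic features from a rule-based BERT to learn robust, frequency-independent representations and reduce the class confusion caused by data imbalance. On the TT100K dataset the method reaches 78.4% mAP on the long-tail detection task, along with 91.8% accuracy and 88.9% recall, outperforming mainstream algorithms and showing stronger generalization in complex open-world scenes.

Link: https://arxiv.org/abs/2507.23331
Authors: Qiang Lu,Waikit Xiu,Xiying Li,Shenyu Hu,Shengbo Sun
Affiliations: Sun Yat-sen University; Guangdong Provincial Key Laboratory of Intelligent Transportation System
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures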

Abstract:Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.

[CV-63] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

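【Quick Read】: This paper addresses the performance degradation of medical image segmentation under domain shift in cross-domain deployment, which stems from differences in imaging conditions, scanners, and acquisition protocols. The key is a Domain Generalization (DG) framework for medical image segmentation that improves robustness to domain-specific variation through implicit feature perturbations guided by domain statistics. Specifically, a learnable Semantic Direction Selector and a Covariance-based Semantic Intensity Sampler modulate domain-variant features while preserving task-relevant anatomical consistency. An adaptive consistency constraint is applied only when feature adjustment degrades segmentation performance; it pushes the adjusted features to agree with the original predictions, which stabilizes feature selection and improves segmentation reliability.

Link: https://arxiv.org/abs/2507.23326
Authors: Yingkai Wang,Yaoyao Zhu,Xiuding Cai,Yuhao Xiao,Haotian Wu,Yu Yao
Affiliations: Chengdu Institute of Computer Application, Chinese Academy of Sciences; China Zhenhua Research Institute Co., Ltd.; School of Computer Science and Technology, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: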

Abstract:Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

[CV-64] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

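【Quick Read】: This paper addresses the shortcomings of existing lane segment topology reasoning methods in using temporal information to improve detection and reasoning, in particular the over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation of stream-based temporal propagation methods. The key is FASTopoWM, a framework that augments a fast-slow dual-pathway system with latent world models. First, historical queries and newly initialized queries are supervised in parallel, so that the fast and slow systems reinforce each other and the impact of pose estimation errors is reduced. Second, latent query and BEV world models conditioned on an action latent propagate state representations from past observations to the current timestep, which substantially improves temporal perception in the slow pipeline.

Link: https://arxiv.org/abs/2507.23325
Authors: Yiming Yang,Hongbin Lin,Yueru Luo,Suzhong Fu,Chao Zheng,Xinrui Yan,Shuqi Mei,Kun Tang,Shuguang Cui,Zhen Li
Affiliations: 1. Tsinghua University; 2. Institute for AI Industry Research, Tsinghua University; 3. Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: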

Abstract:Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% v.s. 33.6% on mAP) and centerline perception (46.3% v.s. 41.5% on OLS).

[CV-65] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

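【Quick Read】: This paper addresses the high computational cost of Vision-Language-Action (VLA) models in autonomous driving caused by long sequences of visual tokens. Existing visual token pruning methods rely on visual token similarity or visual-text attention and perform poorly in driving scenarios because they fail to retain the foreground information that matters for decision-making. The key is FastDriveVLA, a reconstruction-based visual token pruning framework. Its core component, ReconPruner, prioritizes visual tokens from foreground regions through MAE-style pixel reconstruction and is trained with an adversarial foreground-background reconstruction strategy, giving efficient and robust token compression. Once trained, it can be applied without retraining to different VLA models that share the same visual encoder, and it achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.

Link: https://arxiv.org/abs/2507.23318
Authors: Jiajun Cao,Qizhe Zhang,Peidong Jia,Xuhui Zhao,Bo Lan,Xiaoan Zhang,Xiaobao Wei,Sixiang Chen,Zhuo Li,Yang Wang,Liyun Li,Xianming Liu,Ming Lu,Shanghang Zhang
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures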

Abstract:Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.
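
Whatever the scoring model, the pruning step itself reduces to keeping the highest-scoring visual tokens. The sketch below shows only that selection step. The per-token scores are hypothetical stand-ins for what a pruner such as ReconPruner would produce, and `prune_tokens` is an illustrative name, not part of FastDriveVLA.

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first, and keep the best n_keep.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore positional order for the model
    return [tokens[i] for i in keep]

# Strings stand in for token embeddings; the scores are made-up foreground saliencies.
tokens = ["sky", "car", "road", "tree", "pedestrian", "building"]
scores = [0.1, 0.9, 0.6, 0.2, 0.8, 0.3]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Restoring positional order after ranking matters because downstream attention layers typically rely on the tokens' original sequence positions.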

[CV-66] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

【Quick Read】: This paper addresses efficient real-time image classification in resource-constrained settings such as embedded systems and edge devices, focusing on how hyperparameter adjustment affects the accuracy and convergence of seven lightweight convolutional and Transformer models (EfficientNetV2-S, ConvNeXt-T, MobileViT v2, MobileNetV3-L, TinyViT-21M, and RepVGG-A2). The key is a systematic ablation study that isolates the effect of individual hyperparameters. It finds that cosine learning rate decay and an adjustable batch size can markedly improve accuracy and convergence speed while keeping latency and memory overhead low. RepVGG-A2 reaches over 80% Top-1 accuracy with efficient inference, offering a cost-effective deployment option among VGG-style models.

Link: https://arxiv.org/abs/2507.23315
Authors: Vineet Kumar Rakesh,Soumya Mazumdar,Tapas Samanta,Sarbajit Pal,Amitabha Das
Affiliations: Homi Bhabha National Institute; Variable Energy Cyclotron Centre; Gargi Memorial Institute of Technology; Jadavpur University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 4 tables. Includes ablation study and evaluation on 7 lightweight deep learning models. Code and logs available at this https URL

Abstract:Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at this https URL.
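
For reference, cosine learning rate decay is commonly written as lr(t) = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t / T)). The sketch below implements that standard formula. The peak and floor rates are placeholder values, and any warmup phase the paper may use is omitted, since the abstract does not give the exact schedule.

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Standard cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end of training.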

[CV-67] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

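【Quick Read】: This paper asks how text-to-image diffusion models internally represent the complex concepts of artistic content and style, given that training provides no explicit guidance to separate the two. The key is to use cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, separating the image regions influenced by content-describing tokens from those influenced by style-describing tokens. The study finds that diffusion models show varying degrees of content-style separation across artistic prompts: content tokens usually dominate object regions while style tokens act on background and texture regions, suggesting that an understanding of the content-style distinction emerges without supervision.

Link: https://arxiv.org/abs/2507.23313
Authors: Alfio Ferrara,Sergio Picascia,Elisabetta Rocchetti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025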

点击查看摘要

Abstract:Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at this https URL.

[CV-68] Forgetting of task-specific knowledge in model merging-based continual learning

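Quick read: This paper studies knowledge retention and degradation when model merging is used for continual learning (CL), that is, how well shared knowledge and task-specific knowledge survive when several task models are merged. Its key findings, from computer vision experiments with controlled visual cues, are that linear merging largely preserves or enhances shared knowledge while task-specific knowledge degrades rapidly, and that merging models obtained from an incremental training process outperforms merging models trained in parallel, which indicates that how the models are trained strongly affects the merged result.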

Link: https://arxiv.org/abs/2507.23311
Authors: Timm Hess,Gido M van de Ven,Tinne Tuytelaars
Affiliations: KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
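To make "linear merging" concrete, here is a minimal sketch that forms a weighted sum of parameters that share a name across models. The representation of a model as a dictionary of flat float lists and the name `merge_linear` are assumptions made for illustration; the paper's actual merging code is not shown in the abstract.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_linear(models: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Return the element-wise weighted sum of the given models' parameters."""
    if not models or len(models) != len(coeffs):
        raise ValueError("need one coefficient per model")
    merged: Weights = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(c * m[name][i] for m, c in zip(models, coeffs))
            for i in range(len(reference))
        ]
    return merged
```

With coefficients that sum to one (for example `[0.5, 0.5]`), this is plain parameter averaging of two task models.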

[CV-69] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving

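Quick read: This paper addresses inaccurate and irregular perception of road elements by autonomous vehicles in complex environments without high-definition map support. Existing methods underuse the structured priors inherent in road elements, which leads to geometrically distorted and semantically ambiguous predictions. The key to the solution is the PriorFusion framework, which fuses semantic, geometric, and generative priors: it introduces an instance-aware attention mechanism guided by shape-prior features, builds a data-driven shape template space that yields low-dimensional representations, and clusters them into anchors that serve as reference priors. A diffusion-based generative framework then uses these prior anchors to predict road elements more accurately and completely, improving robustness and accuracy in challenging scenes.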

Link: https://arxiv.org/abs/2507.23309
Authors: Xuewei Tang,Mengmeng Yang,Tuopu Wen,Peijin Jia,Le Cui,Mingshang Luo,Kehua Sheng,Bo Zhang,Diange Yang,Kun Jiang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.

[CV-70] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection ACM-MM2025

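Quick read: This paper addresses the severe prediction bias and error propagation caused by scarce annotations in semi-supervised camouflaged object detection (SSCOD), as well as the high computational overhead and poor scalability of existing Teacher-Student multi-network architectures. The key to the solution is ST-SAM, a concise and highly annotation-efficient single-model framework: a self-training strategy dynamically filters and expands high-confidence pseudo-labels, which avoids prediction bias between models, and the pseudo-labels are converted into hybrid prompts carrying domain-specific knowledge so that the Segment Anything Model (SAM) can be exploited for this specialized task and error accumulation during self-training is suppressed. Experiments show that ST-SAM reaches state-of-the-art performance with only 1% labeled data, even matching fully supervised methods, while training a single network and without depending on specific models or loss functions.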

Link: https://arxiv.org/abs/2507.23307
Authors: Xihang Hu,Fuming Sun,Jiazhe Liu,Feilong Xu,Xiaoli Zhang
Affiliations: Jilin University; Dalian Minzu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, ACM MM 2025

Abstract:Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model’s potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at this https URL.

[CV-71] Training-free Geometric Image Editing on Diffusion Models ICCV

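Quick read: This paper tackles geometric image editing: repositioning, reorienting, or reshaping an object in an image while preserving overall scene coherence. Earlier diffusion-based editing methods usually try to handle all subtasks in one step, which makes precise control difficult when transformations are large or structurally complex. The key to the solution is a decoupled pipeline that performs object transformation, source region inpainting, and target region refinement separately; both inpainting and refinement use the training-free diffusion method FreeFine, which improves edit precision and image fidelity, especially under complex geometric transformations.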

Link: https://arxiv.org/abs/2507.23300
Authors: Hanshen Zhu,Zhen Zhu,Kaile Zhang,Yiming Gong,Yuliang Liu,Xiang Bai
Affiliations: Huazhong University of Science and Technology; University of Illinois at Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 22 figures, ICCV

Abstract:We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, which proves difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: this https URL

[CV-72] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis

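Quick read: This paper addresses the fact that structural errors in Document Layout Analysis (DLA) produced by models such as large language models and multimodal models are hard to detect with traditional metrics such as IoU and mAP. These errors include region merging, splitting, and missing content, whereas existing metrics only measure spatial overlap and cannot capture such structural layout errors. The key to the solution is a new benchmark, Layout Error Detection (LED), which defines eight standardized structural error types and three complementary tasks (error existence detection, error type classification, and element-wise error type classification), and builds LED-Dataset, a synthetic dataset created by injecting structural errors according to the empirical error distributions of real DLA models. Experiments show that LED distinguishes the structural understanding of different models and exposes modality biases and performance trade-offs that traditional metrics hide.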

Link: https://arxiv.org/abs/2507.23295
Authors: Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.

[CV-73] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval ICCV2025

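Quick read: This paper studies candidate prior bias in text-video retrieval, which appears when a multi-modal large language model (MLLM) is used directly to score candidates: the model favors candidates with a high prior probability over those that are actually relevant to the query. The key to the solution is Bidirectional Likelihood Estimation with MLLM (BLiM), which models the query-candidate relationship in both directions by training the model to generate text from a video and video features from a text. It also introduces Candidate Prior Normalization (CPN), a training-free module that calibrates candidate likelihood scores to remove the prior preference. BLiM with CPN improves R@1 by 6.4 points on average across four benchmarks, and CPN proves broadly applicable to other multi-modal tasks, where it strengthens visual understanding and reduces reliance on textual priors.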

Link: https://arxiv.org/abs/2507.23284
Authors: Dohwan Ko,Ji Soo Lee,Minhyuk Choi,Zihang Meng,Hyunwoo J. Kim
Affiliations: Korea University; Meta GenAI; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Highlight

Abstract:Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at this https URL.
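The abstract does not give the exact form of Candidate Prior Normalization, so the following is only a sketch of the general idea of prior-based score calibration, under the assumption that a candidate's score is its conditional log-likelihood minus a scaled estimate of its prior log-likelihood. The names `prior_normalized_scores`, `rank`, and the weight `alpha` are hypothetical.

```python
from typing import List, Sequence


def prior_normalized_scores(
    cond_logp: Sequence[float],
    prior_logp: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Calibrate candidate scores as log p(c | q) - alpha * log p(c).

    Assumed formulation for illustration: `cond_logp[i]` is the likelihood of
    candidate i given the query, `prior_logp[i]` its likelihood without it.
    """
    if len(cond_logp) != len(prior_logp):
        raise ValueError("score lists must have the same length")
    return [c - alpha * p for c, p in zip(cond_logp, prior_logp)]


def rank(scores: Sequence[float]) -> List[int]:
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy formulation a candidate that is likely regardless of the query is penalized, so a candidate with a lower raw likelihood but a much lower prior can move to the top of the ranking.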

[CV-74] UniLiP: Adapting CLIP for Unified Multimodal Understanding Generation and Editing

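Quick read: This paper addresses the limited performance of existing unified models based on CLIP (Contrastive Language-Image Pretraining) on image reconstruction, generation, and editing, in particular the degraded comprehension or inconsistent reconstruction that results from adding diffusion decoders or quantization. The key to the solution is a two-stage training scheme with self-distillation that progressively integrates reconstruction capability into CLIP, achieving effective image reconstruction while preserving the original image-text comprehension performance. In addition, a dual-condition architecture uses learnable queries together with multimodal hidden states as joint conditions for the diffusion transformer, combining the reasoning ability of the multimodal large language model (MLLM) with the rich information in UniLIP features and improving text-to-image generation and image editing.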

Link: https://arxiv.org/abs/2507.23278
Authors: Hao Tang,Chenwei Xie,Xiaoyi Bao,Tingyu Weng,Pandeng Li,Yun Zheng,Liwei Wang
Affiliations: Peking University; Alibaba Group; CASIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM’s strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.

[CV-75] iLRM: An Iterative Large 3D Reconstruction Model

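Quick read: This paper addresses the severe scalability problem of current transformer-based feed-forward 3D reconstruction methods with multi-view input: full attention makes computational cost grow sharply as the number of views or the image resolution increases. The key to the solution is the iterative Large 3D Reconstruction Model (iLRM), built on three core principles: (1) decoupling the scene representation from input-view images to obtain a compact 3D representation; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational cost; and (3) injecting high-resolution information at every layer for high-fidelity reconstruction. Results on datasets such as RE10K and DL3DV confirm its reconstruction quality and efficiency; in particular, at comparable computational cost it makes more effective use of a larger number of input views.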

Link: https://arxiv.org/abs/2507.23277
Authors: Gyeongjin Kang,Seungtae Nam,Xiangyu Sun,Sameh Khamis,Abdelrahman Mohamed,Eunbyung Park
Affiliations: Sungkyunkwan University; Yonsei University; Rembrand
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

[CV-76] GSFusion: Globally Optimized LiDAR-Inertial-Visual Mapping for Gaussian Splatting

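Quick read: This paper addresses the poor performance of conventional camera-based (e.g., RGB-D) 3D Gaussian Splatting (3DGS) methods in texture-poor, badly lit, or long-range scenes, as well as the low global alignment accuracy and long optimization times that arise when LiDAR is combined with 3DGS. The key to the solution is the GSFusion system, which enforces map consistency through a surfel-to-surfel constraint in the global pose-graph optimization and handles sparse LiDAR data with a pixel-aware Gaussian initialization strategy and a bounded sigmoid constraint, improving rendering quality and map-building efficiency.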

Link: https://arxiv.org/abs/2507.23273
Authors: Jaeseok Park,Chanoh Park,Minsu Kim,Soohwan Kim
Affiliations: RovifyLab; Kwangwoon University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While 3D Gaussian Splatting (3DGS) has revolutionized photorealistic mapping, conventional approaches based on camera sensors, even RGB-D, suffer from fundamental limitations such as high computational load, failure in environments with poor texture or illumination, and short operational ranges. LiDAR emerges as a robust alternative, but its integration with 3DGS introduces new challenges, such as the need for exceptional global alignment for photorealistic quality and prolonged optimization times caused by sparse data. To address these challenges, we propose GSFusion, an online LiDAR-Inertial-Visual mapping system that ensures high-precision map consistency through a surfel-to-surfel constraint in the global pose-graph optimization. To handle sparse data, our system employs a pixel-aware Gaussian initialization strategy for efficient representation and a bounded sigmoid constraint to prevent uncontrolled Gaussian growth. Experiments on public datasets and our own demonstrate that our system outperforms existing 3DGS SLAM systems in terms of rendering quality and map-building efficiency.

[CV-77] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2 MICCAI

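Quick read: This paper addresses the reliance of 3D tumor segmentation in breast MRI on manual annotation, which is time-consuming and subjective, and the barriers that low- and middle-income countries face in adopting commercial medical AI products: high costs, proprietary software, and demanding infrastructure. The key to the solution is to use the general-purpose foundation model Segment Anything Model 2 (SAM2) without task-specific training: starting from a single bounding box annotation on one slice, segmentation is propagated through the whole 3D volume with one of three slice-wise strategies (top-to-bottom, bottom-to-top, and center-outward). Experiments show that center-outward propagation performs best, indicating that SAM2 can deliver consistent and accurate 3D medical image segmentation under minimal supervision and offers a low-cost, easily deployed alternative for resource-constrained settings.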

Link: https://arxiv.org/abs/2507.23272
Authors: Solha Kang,Eugene Kim,Joris Vankerschaver,Utku Ozbulak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2nd Deep Breast Workshop on AI and Imaging for Diagnostic and Treatment Challenges in Breast Care (DeepBreath), 2025

Abstract:Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
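The center-outward strategy can be pictured as two tracking passes that both start at the annotated slice. The sketch below only computes the slice visiting order; it does not call SAM2, and the function name `center_outward_order` and its arguments are hypothetical.

```python
from typing import List, Tuple


def center_outward_order(num_slices: int, start: int) -> Tuple[List[int], List[int]]:
    """Return the slice indices for two passes that begin at the prompted slice.

    The first list walks from `start` to the last slice, the second from
    `start` back to slice 0; the prompted slice opens both passes.
    """
    if not 0 <= start < num_slices:
        raise ValueError("start must be a valid slice index")
    forward = list(range(start, num_slices))
    backward = list(range(start, -1, -1))
    return forward, backward
```

Under this reading, top-to-bottom and bottom-to-top are the special cases where the prompted slice is the first or last slice of the volume, so one of the two passes contains only that slice.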

[CV-78] PixNerd: Pixel Neural Field Diffusion

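Quick read: This paper addresses the accumulated errors and decoding artifacts that arise when diffusion models rely on the compressed latent space of a pre-trained variational autoencoder (VAE), a consequence of the two-stage training paradigm. To overcome this limitation, the authors propose Pixel Neural Field Diffusion (PixNerd), a single-scale, single-stage, efficient, end-to-end model whose core idea is to model patch-wise decoding with a neural field, so that generation happens directly in pixel space without a VAE or a complex cascade pipeline. The method achieves 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512.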

Link: https://arxiv.org/abs/2507.23268
Authors: Shuai Wang,Ziteng Gao,Chenhui Zhu,Weilin Huang,Limin Wang
Affiliations: Nanjing University; ByteDance Seed; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: a single-scale, single-stage, efficient, end-to-end pixel space diffusion model

Abstract:The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

[CV-79] Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels

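Quick read: This paper addresses inaccurate and incomplete pseudo-labels in multi-label image recognition with partial labels (MLR-PL), which occur because traditional pseudo-labeling ignores how score distributions differ across categories. The key to the solution is the Semantic-Aware Threshold Learning (SATL) algorithm: it dynamically computes the score distributions of positive and negative samples for each category and learns category-specific thresholds from them, and it adds a differential ranking loss that widens the gap between the positive and negative score distributions, which makes the thresholds more discriminative and the pseudo-labels more reliable. On large-scale multi-label datasets such as Microsoft COCO and VG-200, the method significantly improves performance when labels are limited.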

Link: https://arxiv.org/abs/2507.23263
Authors: Haoxian Ruan,Zhihua Xu,Zhijing Yang,Guang Ma,Jieming Xie,Changxiang Fan,Tianshui Chen
Affiliations: Guangdong University of Technology; University of York; Guangdong Academy of Agricultural Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 13 figures, publish to ESWA (Expert Systems With Applications)

Abstract:Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.
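The abstract does not specify how a threshold is derived from the two score distributions, so the sketch below uses a deliberately simple stand-in: the midpoint between the mean positive score and the mean negative score of each category. That rule, and the names `class_thresholds` and `pseudo_label`, are assumptions for illustration and not the SATL algorithm itself.

```python
from statistics import mean
from typing import Dict, List


def class_thresholds(
    pos_scores: Dict[str, List[float]],
    neg_scores: Dict[str, List[float]],
) -> Dict[str, float]:
    """Per-category threshold: midpoint of mean positive and mean negative score.

    Assumed rule for illustration; each category needs at least one known
    positive and one known negative score.
    """
    return {
        category: 0.5 * (mean(pos_scores[category]) + mean(neg_scores[category]))
        for category in pos_scores
    }


def pseudo_label(scores: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, int]:
    """Assign 1 to a category whose score reaches its own threshold, else 0."""
    return {category: int(score >= thresholds[category]) for category, score in scores.items()}
```

The point of the example is that the same score can yield different pseudo-labels in different categories, which a single preset threshold cannot express. Recomputing the thresholds as the scores change during training would mirror the dynamic updating described in the abstract.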

[CV-80] Towards Measuring and Modeling Geometric Structures in Time Series Forecasting via Image Modality

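Quick read: This paper addresses the fact that traditional time series forecasting metrics such as mean squared error (MSE) measure only point-wise accuracy and cannot evaluate the geometric structure of a series, so they miss its temporal dynamics. The key to the solution is a new evaluation metric, the Time Series Geometric Structure Index (TGSI), which converts time series into images to exploit their two-dimensional geometric representation, together with a differentiable multi-component training loss, the Shape-Aware Temporal Loss (SATL). SATL consists of a first-order difference loss, a frequency domain loss, and a perceptual feature loss; it improves structure modeling during training and adds no inference cost.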

Link: https://arxiv.org/abs/2507.23253
Authors: Mingyang Yu,Xiahui Guo,Peng chen,Zhenkai Li,Yang Shu
Affiliations: East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Time Series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
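Of the three SATL components, the first-order difference loss is described precisely enough to sketch: it is the MSE between the first-order differences of prediction and target. The helper names below are hypothetical, and the frequency domain and perceptual feature terms are not covered.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equally long")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def first_diff(x: Sequence[float]) -> List[float]:
    """First-order differences x[t + 1] - x[t]."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def first_order_difference_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """MSE between the first-order differences of prediction and target."""
    return mse(first_diff(pred), first_diff(target))
```

A forecast that is shifted by a constant has the same shape as the target and a zero difference loss, while an oscillating forecast with the same point-wise MSE is penalized; this is the kind of structural distinction plain MSE cannot make.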

[CV-81] A Deep Dive into Generic Object Tracking: A Survey

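Quick read: This paper addresses the challenges of generic object tracking under complex spatio-temporal dynamics, especially occlusions, similar distractors, and appearance variations. Its contribution is a comprehensive, systematic survey of three mainstream families of trackers: Siamese-based trackers, discriminative trackers, and the more recent transformer-based approaches, with particular focus on the rapid evolution of the latter. Through qualitative and quantitative comparisons, the paper analyzes the core design principles, innovations, and limitations of each family, proposes a new categorization, and provides unified visual and tabular comparisons that show how the spatio-temporal modeling capability of transformers has driven rapid progress in the field.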

Link: https://arxiv.org/abs/2507.23251
Authors: Fereshteh Aghaee Meibodi,Shadi Alijani,Homayoun Najjaran
Affiliations: University of Victoria
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 55 pages, 29 figures, 9 tables

Abstract:Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.

[CV-82] Automated Mapping the Pathways of Cranial Nerve II III V and VII/VIII: A Multi-Parametric Multi-Stage Diffusion Tractography Atlas

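Quick read: This paper addresses the difficulty of mapping cranial nerve (CN) pathways in the human brain, in particular automated, accurate tracing and visualization of multiple CN pairs to provide preoperative detail on the spatial relationships between CNs and surrounding key tissues. The key to the solution is a new multi-stage fiber clustering strategy based on multi-parametric fiber tractography, used to build what the authors describe as the first comprehensive diffusion MRI (dMRI) tractography atlas of the cranial nerves. By processing approximately 1,000,000 streamlines generated from 50 Human Connectome Project (HCP) subjects, the method automatically identifies 8 fiber bundles associated with 5 pairs of CNs (optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V, and facial-vestibulocochlear nerve CN VII/VIII), and its spatial correspondence and robustness are validated on multiple datasets.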

Link: https://arxiv.org/abs/2507.23245
Authors: Lei Xie,Jiahao Huang,Jiawei Zhang,Jianzhong He,Yiang Pan,Guoqiang Xie,Mengjun Li,Qingrun Zeng,Mingchu Li,Yuanjing Feng
Affiliations: Zhejiang University of Technology; Central South University; Capital Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cranial nerves (CNs) play a crucial role in various essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative insights into the spatial relationships between individual CNs and key tissues. However, mapping a comprehensive and detailed CN atlas is challenging because of the unique anatomical structures of each CN pair and the complexity of the skull base. In this work, we present what we believe to be the first study to develop a comprehensive diffusion tractography atlas for automated mapping of CN pathways in the human brain. The CN atlas is generated by fiber clustering, using the streamlines generated by multi-parametric fiber tractography for each pair of CNs. Instead of one-off clustering, we explore a new strategy of multi-stage fiber clustering for repeated analysis of approximately 1,000,000 streamlines generated from the 50 subjects from the Human Connectome Project (HCP). Quantitative and visual experiments demonstrate that our CN atlas achieves high spatial correspondence with expert manual annotations on multiple acquisition sites, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset and two clinical cases of pituitary adenoma patients. The proposed CN atlas can automatically identify 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V and facial-vestibulocochlear nerve CN VII/VIII, and its robustness is demonstrated experimentally. This work contributes to the field of diffusion imaging by facilitating more efficient and automated mapping of the pathways of multiple pairs of CNs, thereby enhancing the analysis and understanding of complex brain structures through visualization of their spatial relationships with nearby anatomy.

[CV-83] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning

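Quick read: This paper addresses the poor fit to realistic scenarios of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL), which stems from a restrictive assumption about where unlabeled data comes from. Existing methods assume that unlabeled samples belong only to the novel classes of the current session, ignoring that real data may mix in previously seen classes (base classes and earlier novel classes); models then struggle to tell unlabeled samples of different classes apart, which causes feature distribution shift and catastrophic forgetting. To meet this challenge, the authors propose Ambiguity-guided Learnable Distribution Calibration (ALDC), whose core idea is to use abundant base-class samples to dynamically correct the biased feature distributions of few-shot novel classes. Experiments on three benchmark datasets show that the method outperforms existing approaches and sets new state-of-the-art results.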

链接: https://arxiv.org/abs/2507.23237
作者: Fan Lyu,Linglan Zhao,Chengyan Liu,Yinying Mei,Zhang Zhang,Jian Zhang,Fuyuan Hu,Liang Wang
机构: New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所模式识别新实验室); Shanghai Jiao Tong University (上海交通大学); School of Electronics and Information Engineering, Suzhou University of Science and Technology (苏州科技大学电子与信息工程学院); Jiangsu Industrial Intelligent Low Carbon Technology Engineering Center (江苏省工业智能低碳技术工程中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures

点击查看摘要

Abstract:Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.
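
To make the idea of correcting a few-shot class's feature statistics with abundant base-class statistics concrete, here is a minimal sketch. It is not the authors' ALDC, which the abstract describes as learnable and ambiguity-guided; it is a fixed, hand-written rule that blends the few-shot class mean with the means of its nearest base classes. The function name `calibrate_mean`, the blend weight `alpha`, and the choice of `k` nearest classes are assumptions made for illustration.

```python
import math

def calibrate_mean(novel_feats, base_means, k=2, alpha=0.5):
    """Blend a few-shot class mean with the means of its k nearest base classes.

    novel_feats: feature vectors of the few labeled samples of one novel class
    base_means:  base-class mean vectors (estimated from abundant data)
    alpha:       weight kept on the few-shot estimate (hypothetical knob)
    """
    dim = len(novel_feats[0])
    # Few-shot estimate of the class mean (noisy when samples are scarce).
    few_shot_mean = [sum(f[d] for f in novel_feats) / len(novel_feats) for d in range(dim)]
    # Pick the k base classes whose means are closest in Euclidean distance.
    nearest = sorted(base_means, key=lambda m: math.dist(m, few_shot_mean))[:k]
    base_avg = [sum(m[d] for m in nearest) / len(nearest) for d in range(dim)]
    # Convex combination of the two estimates.
    return [alpha * few_shot_mean[d] + (1 - alpha) * base_avg[d] for d in range(dim)]

novel = [[1.0, 1.0], [3.0, 1.0]]                 # two shots, mean = [2, 1]
bases = [[2.0, 3.0], [4.0, 1.0], [10.0, 10.0]]   # the third base class is far away
calibrated = calibrate_mean(novel, bases, k=2, alpha=0.5)
print(calibrated)
```

The far-away base class is ignored, and the calibrated mean is pulled halfway toward the average of the two nearby base classes.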

[CV-84] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience

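[Quick Read]: This paper addresses the safety risks that virtual content can pose in Augmented Reality (AR) environments, in particular task-detrimental AR content that obstructs critical information or subtly manipulates user perception. The key to the solution is the use of Vision-Language Models (VLMs) and multimodal reasoning modules to build two systems, ViDDAR and VIM-Sense, that detect such attacks. The paper further proposes three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and lightweight, user-centered deployment of VLMs on AR devices, with the aim of establishing a scalable, human-aligned framework for safeguarding AR content.

Link: https://arxiv.org/abs/2507.23226
Authors: Yanming Xiu
Affiliations: Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 4 figures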

Abstract:As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.

[CV-85] YOLO-ROC: A High-Precision and Ultra-Lightweight Model for Real-Time Road Damage Detection

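[Quick Read]: This paper targets two core problems in road damage detection. First, existing deep learning models have inadequate multi-scale feature extraction for targets of different scales (such as cracks and potholes), leading to high miss rates for small-scale damage. Second, mainstream models have large parameter counts and high computational complexity, which makes efficient real-time detection difficult in practice. The key to the solution is YOLO-ROC, a high-precision, lightweight model. Its core contributions are a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module, which uses a bidirectional spatial-channel attention mechanism to improve small-target detection, and a hierarchical channel compression strategy that reduces the parameter count from 3.01M to 0.89M and the computational cost from 8.1 to 2.6 GFLOPs while preserving detection performance. The final model occupies only 2.0 MB, improving detection efficiency and generalization.

Link: https://arxiv.org/abs/2507.23225
Authors: Zicheng Lin, Weichao Pan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: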

Abstract:Road damage detection is a critical task for ensuring traffic safety and maintaining infrastructure integrity. While deep learning-based detection methods are now widely adopted, they still face two core challenges: first, the inadequate multi-scale feature extraction capabilities of existing networks for diverse targets like cracks and potholes, leading to high miss rates for small-scale damage; and second, the substantial parameter counts and computational demands of mainstream models, which hinder their deployment for efficient, real-time detection in practical applications. To address these issues, this paper proposes a high-precision and lightweight model, YOLO - Road Orthogonal Compact (YOLO-ROC). We designed a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and implemented a hierarchical channel compression strategy to reduce computational complexity. The BMS-SPPF module leverages a bidirectional spatial-channel attention mechanism to improve the detection of small targets. Concurrently, the channel compression strategy reduces the parameter count from 3.01M to 0.89M and GFLOPs from 8.1 to 2.6. Experiments on the RDD2022_China_Drone dataset demonstrate that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%. Notably, the mAP50 for the small-target D40 category improved by 16.8%, and the final model size is only 2.0 MB. Furthermore, the model exhibits excellent generalization performance on the RDD2022_China_Motorbike dataset.
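
The compression figures quoted in the abstract are easier to compare as relative reductions. The short calculation below does only that arithmetic on the reported numbers; the helper name `relative_reduction` is introduced here for illustration and does not come from the paper.

```python
def relative_reduction(before, after):
    """Percentage by which a quantity shrinks when going from `before` to `after`."""
    return 100.0 * (before - after) / before

param_cut = relative_reduction(3.01, 0.89)   # parameters, in millions
flops_cut = relative_reduction(8.1, 2.6)     # GFLOPs
print(f"parameters: {param_cut:.1f}% fewer")
print(f"GFLOPs:     {flops_cut:.1f}% fewer")
```

Both the parameter count and the compute cost drop by roughly two thirds or more relative to the starting figures.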

[CV-86] Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images

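[Quick Read]: This paper addresses the difficulty of accurately annotating food crystal agglomeration in microscopic images, in particular the ambiguous boundaries caused by the transparency of water bonding and the limited perspective of a single slide. The key to the solution is a two-stage method that combines a supervised baseline model with an instance classification model: the supervised model first generates pixel-level pseudo-labels to augment the coarsely labeled dataset, and a model that performs pixel-wise segmentation and instance classification simultaneously is then trained. In addition, a post-processing module is designed to preserve crystal properties, and both models are combined at inference to exploit their respective strengths, improving agglomeration classification accuracy and size distribution prediction.

Link: https://arxiv.org/abs/2507.23206
Authors: Xiaoyu Ji, Ali Shakouri, Fengqing Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: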

Abstract:Food crystal agglomeration is a phenomenon that occurs during crystallization, trapping water between crystals and affecting food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post-processing module is designed and included in both steps. Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.

[CV-87] Adversarial-Guided Diffusion for Multimodal LLM Attacks

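[Quick Read]: This paper addresses the problem of using a diffusion model to generate adversarial images that deceive Multimodal Large Language Models (MLLMs) into producing targeted responses while avoiding significant distortion of the original image. The core challenge is to attack effectively while preserving visual quality, and to make the adversarial examples robust to common defenses such as low-pass filtering. The key to the solution is Adversarial-Guided Diffusion (AGD): rather than embedding high-frequency perturbations directly into the image, it injects the target semantics into the noise component of the diffusion process. Because the noise in a diffusion model spans the entire frequency spectrum, the embedded adversarial signal is also full-spectrum. During reverse diffusion the adversarial image is formed as a linear combination of the clean image and the noise, so frequency-filtering defenses have difficulty suppressing the adversarial signal carried in the noise, which improves the stealth and robustness of the attack.

Link: https://arxiv.org/abs/2507.23202
Authors: Chengwei Xia, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang
Affiliations: Lanzhou University; Zhejiang University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: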

Abstract:This paper addresses the challenge of generating adversarial images using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial signal within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in robustness to some defenses.
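
The frequency argument can be illustrated with a toy 1-D signal. This is not AGD and involves no diffusion model; it only compares how much energy a crude low-pass filter (a two-tap moving average, chosen here as a stand-in for "simple low-pass filtering") leaves in a purely high-frequency perturbation versus a broadband one of equal energy. All names and constants are illustrative assumptions.

```python
import random

def moving_average(signal, width=2):
    """Crude low-pass filter: mean over a sliding window (valid positions only)."""
    return [sum(signal[i:i + width]) / width for i in range(len(signal) - width + 1)]

def energy(signal):
    return sum(x * x for x in signal)

n = 256
random.seed(0)
# Perturbation at the highest representable frequency: +eps, -eps, +eps, ...
high_freq = [0.1 * (-1) ** i for i in range(n)]
# Broadband perturbation: white Gaussian noise rescaled to the same energy.
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
scale = (energy(high_freq) / energy(noise)) ** 0.5
broadband = [scale * x for x in noise]

kept_high = energy(moving_average(high_freq)) / energy(high_freq)
kept_broad = energy(moving_average(broadband)) / energy(broadband)
print(f"energy kept after filtering: high-frequency {kept_high:.3f}, broadband {kept_broad:.3f}")
```

The alternating perturbation is cancelled by the filter, while a substantial share of the broadband perturbation passes through, which is the intuition the abstract appeals to.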

[CV-88] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery

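[Quick Read]: This paper addresses the lack of high-quality, standardized datasets for flooded area segmentation in satellite remote sensing imagery. A review of 77 existing benchmarks found a shortage of datasets suitable for this task. The key to the solution is a new satellite imagery dataset with uniform resolution, covering ten locations in each of five states, with 10 satellite images per location that contain both flooded and non-flooded areas; the dataset is to be released publicly. It provides a basis for deep-learning semantic segmentation models. Tests of state-of-the-art computer vision models and an ablation over window sizes show that current methods achieve only modest performance, pointing to the need for multimodal and temporal learning strategies.

Link: https://arxiv.org/abs/2507.23193
Authors: Youngsun Jang, Dongyoun Kim, Chulwoo Pack, Kwanghee Won
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures. Presented at ACM RACS 2024 (Pompei, Italy, Nov 5-8, 2024)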

Abstract:This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on this https URL.

[CV-89] Accessibility Scout: Personalized Accessibility Scans of Built Environments

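[Quick Read]: This paper addresses two difficulties in accessibility assessment: traditional manual assessment is laborious and hard to scale, and existing automatic machine learning methods do not account for an individual user's unique needs. The key to the solution is Accessibility Scout, an adaptive scanning system built on Large Language Models (LLMs). It identifies accessibility concerns from photos of built environments and, through collaborative human-AI assessment, keeps learning the user's mobility level, preferences, and specific environmental interests, producing personalized accessibility assessments that go beyond traditional Americans with Disabilities Act (ADA) considerations.

Link: https://arxiv.org/abs/2507.23190
Authors: William Huang, Xia Su, Jon E. Froehlich, Yang Zhang
Affiliations: University of California, Los Angeles; University of Washington
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 18 pages, 16 figures. Presented at ACM UIST 2025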

Abstract:Assessing the accessibility of unfamiliar built environments is critical for people with disabilities. However, manual assessments, performed by users or their personal health professionals, are laborious and unscalable, while automatic machine learning methods often neglect an individual user’s unique needs. Recent advances in Large Language Models (LLMs) enable novel approaches to this problem, balancing personalization with scalability to enable more adaptive and context-aware assessments of accessibility. We present Accessibility Scout, an LLM-based accessibility scanning system that identifies accessibility concerns from photos of built environments. With use, Accessibility Scout becomes an increasingly capable “accessibility scout”, tailoring accessibility scans to an individual’s mobility level, preferences, and specific environmental interests through collaborative Human-AI assessments. We present findings from three studies: a formative study with six participants to inform the design of Accessibility Scout, a technical evaluation of 500 images of built environments, and a user study with 10 participants of varying mobility. Results from our technical evaluation and user study show that Accessibility Scout can generate personalized accessibility scans that extend beyond traditional ADA considerations. Finally, we conclude with a discussion on the implications of our work and future steps for building more scalable and personalized accessibility assessments of the physical world.

[CV-90] Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

[Quick Read]: This paper addresses two shortcomings of existing motion retrieval methods: interaction modes that are not intuitive or user-friendly enough, and insufficient use of the sequential features of multiple modalities. The key to the solution is a framework that aligns four modalities (text, audio, video, and motion) in a fine-grained joint embedding space. It introduces the audio modality into motion retrieval for the first time to improve immersion and convenience, and uses sequence-level contrastive learning to capture key temporal details across modalities, giving more precise cross-modal alignment and better retrieval performance.

Link: https://arxiv.org/abs/2507.23188
Authors: Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou
Affiliations: Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; ShanghaiTech University; Nanyang Technological University; Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TMM 2025

Abstract:Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities (text, audio, video, and motion) within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
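
The R@1 and R@10 figures above are recall-at-k scores. As a reminder of what that metric measures, here is a minimal sketch under the common convention that query `i` has exactly one correct gallery item, stored at index `i`. The function name and the toy similarity matrix are illustrative and are not taken from the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k.

    similarity[i][j] is the score between query i and gallery item j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

sim = [
    [0.9, 0.1, 0.3],   # query 0: correct item ranked 1st
    [0.8, 0.5, 0.6],   # query 1: correct item ranked 3rd
    [0.2, 0.7, 0.4],   # query 2: correct item ranked 2nd
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2), recall_at_k(sim, 3))
```

Recall grows with `k` because more queries find their correct item somewhere in the longer shortlist.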

[CV-91] Single Image Rain Streak Removal Using Harris Corner Loss and R-CBAM Network

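[Quick Read]: This paper addresses rain streak removal from a single image, a problem that involves not only noise suppression but also preserving structural information such as object boundaries and fine textures during restoration in order to maintain overall visual quality. The key to the solution is a Corner Loss that constrains the restoration process so that key structural information is not lost, together with a Residual Convolutional Block Attention Module (R-CBAM) embedded in the encoder and decoder. R-CBAM dynamically adjusts the importance of features along the spatial and channel dimensions so that the network focuses on regions heavily affected by rain streaks, improving restoration accuracy and detail fidelity.

Link: https://arxiv.org/abs/2507.23185
Authors: Jongwook Si, Sungyoung Kim
Affiliations: Kumoh National Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 21 pages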

Abstract:The problem of single-image rain streak removal goes beyond simple noise suppression, requiring the simultaneous preservation of fine structural details and overall visual quality. In this study, we propose a novel image restoration network that effectively constrains the restoration process by introducing a Corner Loss, which prevents the loss of object boundaries and detailed texture information during restoration. Furthermore, we integrate a Residual Convolutional Block Attention Module (R-CBAM) block into the encoder and decoder to dynamically adjust the importance of features in both spatial and channel dimensions, enabling the network to focus more effectively on regions heavily affected by rain streaks. Quantitative evaluations conducted on the Rain100L and Rain100H datasets demonstrate that the proposed method significantly outperforms previous approaches, achieving a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H.
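
The results above are reported as PSNR in dB. For readers who want the definition at hand, the sketch below computes the standard peak signal-to-noise ratio, 10·log10(MAX² / MSE), on flat lists of pixel values. It assumes 8-bit images (`max_value=255`); the paper's exact evaluation protocol (color space, cropping) is not stated in the abstract and is not reproduced here.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

clean = [100, 120, 140, 160]
derained = [101, 119, 141, 159]   # every pixel off by one grey level, so MSE = 1
print(round(psnr(clean, derained), 2))
```

Because the scale is logarithmic, each halving of the mean squared error adds about 3 dB.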

[CV-92] CNN-based solution for mango classification in agricultural environments

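[Quick Read]: This paper addresses automatic detection and classification of mango fruit in agricultural environments, with the goal of automating fruit quality assessment for farm inventory management. The key to the solution is the combination of Convolutional Neural Networks (CNNs) with a cascade detector: ResNet-18 serves as the classification model for accuracy, while the cascade detector handles detection, balancing execution speed against computational resource consumption. Results are displayed through a graphical interface built in MATLAB App Designer, giving a reliable and efficient solution for agricultural quality control.

Link: https://arxiv.org/abs/2507.23174
Authors: Beatriz Díaz Peón, Jorge Torres Gómez, Ariel Fajardo Márquez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: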

Abstract:This article exemplifies the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors offers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.

[CV-93] Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues ICCV2025

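[Quick Read]: This paper addresses the joint reconstruction of geometry, spatially varying reflectance, and lighting conditions from multi-view images, removing the reliance of traditional multi-view photometric stereo methods on light calibration or intermediate cues such as per-view normal maps. The key to the solution is to model geometry and reflectance as neural implicit fields and to apply shadow-aware volume rendering. Specifically, a spatial network first predicts the signed distance and a reflectance latent code for each scene point; a reflectance network then estimates reflectance values conditioned on the latent code and the angularly encoded surface normal, view, and light directions. This allows all scene parameters to be optimized jointly from raw images in a single stage.

Link: https://arxiv.org/abs/2507.23162
Authors: Xu Cao, Takafumi Taketomi
Affiliations: CyberAgent, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025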

Abstract:We propose a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. Unlike prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, our method jointly optimizes all scene parameters from raw images in a single stage. We represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. A spatial network first predicts the signed distance and a reflectance latent code for each scene point. A reflectance network then estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. The proposed method outperforms state-of-the-art normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles objects with challenging geometry and reflectance.
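
For readers unfamiliar with signed-distance representations, the sketch below shows the two quantities the abstract refers to: a signed distance per scene point and a surface normal derived from it. An analytic sphere stands in for the paper's spatial network, and the normal is taken as the normalized finite-difference gradient of the distance field. Both choices are assumptions made for illustration, not the authors' implementation.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(sum(c * c for c in p)) - radius

def normal_from_sdf(sdf, p, eps=1e-5):
    """Unit surface normal as the normalized central-difference gradient of the SDF."""
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2 * eps))
    length = math.sqrt(sum(g * g for g in grad))
    return [g / length for g in grad]

point = [0.0, 0.0, 1.0]            # a point on the unit sphere
normal = normal_from_sdf(sphere_sdf, point)
print(sphere_sdf(point), normal)
```

In a learned setting the gradient would typically come from automatic differentiation of the network rather than finite differences.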

[CV-94] FuseTen: A Generative Model for Daily 10 m Land Surface Temperature Estimation from Spatio-Temporal Satellite Observations

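[Quick Read]: This paper addresses the inherent trade-off between spatial and temporal resolution in satellite observations of Land Surface Temperature (LST), that is, how to obtain fine (10 m) spatial resolution while keeping daily temporal frequency. Traditional linear fusion methods struggle to integrate multi-source spatio-temporal data from different sensors (Sentinel-2, Landsat 8, and Terra MODIS) into high-quality daily high-resolution LST products. The key to the solution is FuseTen, a generative framework whose main components are an averaging-based supervision strategy grounded in physical principles, attention and normalization modules within the fusion process, and a PatchGAN discriminator that enforces realism. According to the authors, it is the first non-linear method to produce daily LST estimates at 10 m resolution, and it outperforms linear baselines with average improvements of 32.06% in quantitative metrics and 31.42% in visual fidelity.

Link: https://arxiv.org/abs/2507.23154
Authors: Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Affiliations: INSA CVL, Université d’Orléans, PRISME UR 4229; Université d’Orléans, CEDETE, UR 1210
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in the 2025 International Conference on Machine Intelligence for GeoAnalytics and Remote Sensing (MIGARS)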

Abstract:Urban heatwaves, droughts, and land degradation are pressing and growing challenges in the context of climate change. A valuable approach to studying them requires accurate spatio-temporal information on land surface conditions. One of the most important variables for assessing and understanding these phenomena is Land Surface Temperature (LST), which is derived from satellites and provides essential information about the thermal state of the Earth’s surface. However, satellite platforms inherently face a trade-off between spatial and temporal resolutions. To bridge this gap, we propose FuseTen, a novel generative framework that produces daily LST observations at a fine 10 m spatial resolution by fusing spatio-temporal observations derived from Sentinel-2, Landsat 8, and Terra MODIS. FuseTen employs a generative architecture trained using an averaging-based supervision strategy grounded in physical principles. It incorporates attention and normalization modules within the fusion process and uses a PatchGAN discriminator to enforce realism. Experiments across multiple dates show that FuseTen outperforms linear baselines, with an average 32.06% improvement in quantitative metrics and 31.42% in visual fidelity. To the best of our knowledge, this is the first non-linear method to generate daily LST estimates at such fine spatial resolution.
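
The abstract does not spell out the averaging-based supervision strategy. One plausible reading, used here purely as an assumption, is a consistency check: a fine-resolution prediction, averaged over the footprint of each coarse pixel, should reproduce the coarse observation. The sketch below implements that check with block averaging; the function names, grid values, and downsampling factor are hypothetical.

```python
def block_average(grid, factor):
    """Downsample a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out

def consistency_error(fine_pred, coarse_obs, factor):
    """Mean absolute gap between the block-averaged prediction and the coarse observation."""
    pooled = block_average(fine_pred, factor)
    diffs = [abs(p - o) for prow, orow in zip(pooled, coarse_obs) for p, o in zip(prow, orow)]
    return sum(diffs) / len(diffs)

fine = [[300.0, 302.0, 310.0, 310.0],
        [298.0, 300.0, 310.0, 310.0]]      # hypothetical fine-resolution LST in kelvin
coarse = [[300.0, 311.0]]                   # hypothetical coarse observation
print(block_average(fine, 2), consistency_error(fine, coarse, 2))
```

A term like `consistency_error` could serve as a training signal when no fine-resolution ground truth exists, which is the situation the abstract describes.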

[CV-95] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention ICLR2025

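[Quick Read]: This paper addresses identity leakage and the difficulty of capturing subtle and extreme expressions in existing diffusion-based portrait animation methods. The key to the solution is an end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving input and controls motion through cross-attention during image generation. This avoids passing spatially aligned structural cues from the driving condition to the diffusion backbone and substantially reduces identity leakage. Supervision with a dual GAN decoder, together with spatial and color augmentations, further improves expressiveness and disentangles motion latents from identity cues.

Link: https://arxiv.org/abs/2507.23143
Authors: Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
Affiliations: Tsinghua University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025, code is available at this https URL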

Abstract:We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatially aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
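
Cross-attention, the mechanism the abstract contrasts with additive spatial guidance, can be written in a few lines. The sketch below is plain scaled dot-product attention with no learned projections: image-side tokens act as queries, and tokens derived from the motion descriptor act as keys and values. How X-NeMo turns its 1D latent into key/value tokens is not stated in the abstract, so the two-token layout here is an assumption.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

image_tokens = [[0.0, 0.0], [4.0, 0.0]]     # hypothetical queries from the portrait side
motion_keys = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical keys from the motion descriptor
motion_values = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical values from the motion descriptor
mixed = cross_attention(image_tokens, motion_keys, motion_values)
print(mixed)
```

Each image token only receives a weighted mix of motion values, so no pixel-aligned layout from the driving frame is passed along, which matches the intuition given in the abstract.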

[CV-96] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation ICCV2025

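[Quick Read]: This paper addresses key challenges in open-vocabulary 3D instance segmentation (OV-3DIS), namely how to use vision-language models (VLMs) to detect and classify 3D instances accurately without relying on a fixed set of categories. The key to the solution is a two-stage framework. First, robust 3D tracking-based proposal aggregation generates 3D instance proposals, and iterative merging/removal eliminates overlapped or partial proposals. Second, in the classification stage, the Alpha-CLIP model uses object masks as an alpha channel to reduce background noise and obtain object-centric representations, and a standardized maximum similarity (SMS) score normalizes text-to-proposal similarity, filtering out false positives and boosting precision. The method achieves state-of-the-art performance on ScanNet200 and S3DIS, even surpassing an end-to-end closed-vocabulary method.

Link: https://arxiv.org/abs/2507.23134
Authors: Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Albert Y. C. Chen, Lu Xia, Chi Liu, Yuyin Sun, Xiao Zeng, Hsiang-Wei Huang, Byron Boots, Min Sun, Cheng-Hao Kuo
Affiliations: University of Washington; Amazon Lab126; National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025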

Abstract:Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.

[CV-97] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model ICCV2025

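[Quick Read]: This paper addresses the limited scalability and adaptability of fine-grained image classification that results from reliance on fixed vocabularies and closed-set classification, especially in real-world settings where novel classes frequently emerge. Existing methods that combine large language models (LLMs) with vision-language models (VLMs) for open-set recognition often fail to exploit the LLM at the classification phase and rely on LLM-guessed class names without thorough analysis and refinement. The key to the solution is a training-free method, Enriched-FineR (E-FineR), which strengthens language-driven semantic understanding and refines class descriptions to achieve state-of-the-art fine-grained recognition without human annotation. It also supports zero-shot and few-shot classification and offers greater interpretability.

Link: https://arxiv.org/abs/2507.23070
Authors: Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025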

Abstract:Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on this https URL.

[CV-98] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera HD-Map Waypoints

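[Quick read]: This paper addresses the fact that autonomous driving systems usually handle geometric accuracy and semantic understanding separately, which limits navigation in complex environments. The key to the solution is XYZ-Drive, a unified vision-language model that fuses intent (the waypoint) with map layout early, at the token level. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight the relevant image and map patches; the fused tokens then enter a partially fine-tuned LLaMA-3.2 11B model, which outputs driving actions together with textual explanations. The design reaches a 95% success rate and 0.80 Success weighted by Path Length (SPL) while halving collisions, indicating that early cooperation between modalities matters for efficient, transparent, real-time driving.

Link: https://arxiv.org/abs/2507.23064
Authors: Santosh Patapati, Trisanth Srinivasan, Murari Ambati
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 5 pages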

Abstract:Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
ACM classes: I.4.8; I.2.10; I.2.6; C.3.3; I.4.9. Cite as: arXiv:2507.23064 [cs.CV] (arXiv:2507.23064v1 for this version), https://doi.org/10.48550/arXiv.2507.23064
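
The goal-centered cross-attention described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `goal_centered_attention`, the single-head form, and the toy 2-D embeddings are assumptions made for illustration. A waypoint token acts as the query, and image or map patch embeddings act as keys and values, so patches aligned with the goal receive larger weights.

```python
import math

def goal_centered_attention(waypoint, patch_keys, patch_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    waypoint:     query vector for the next waypoint, length d
    patch_keys:   one key vector of length d per image/map patch
    patch_values: one value vector per patch
    Returns the fused vector and the attention weight of each patch.
    """
    d = len(waypoint)
    # Similarity between the waypoint query and every patch key.
    scores = [sum(q * k for q, k in zip(waypoint, key)) / math.sqrt(d)
              for key in patch_keys]
    # Numerically stable softmax over patches.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of patch values gives the fused representation.
    dim = len(patch_values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, patch_values))
             for i in range(dim)]
    return fused, weights

# Toy example: the second patch key points the same way as the waypoint query.
fused, weights = goal_centered_attention(
    waypoint=[1.0, 0.0],
    patch_keys=[[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]],
    patch_values=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

In this toy case the second patch receives the largest weight, which is the behavior the abstract attributes to query-based fusion as opposed to plain concatenation.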

[CV-99] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

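[Quick read]: This paper targets safety-critical applications such as autonomous driving and medical image analysis, where real multimodal data is costly and difficult to collect and synthetic data must meet strict requirements for realism and controllability. The key to the solution is a pair of reference-guided, diffusion-based inpainting methods. MObI (Multimodal Object Inpainting) is the first to achieve controllable object insertion across perceptual modalities (for example camera and lidar), using 3D bounding box conditioning for accurate spatial positioning and realistic scaling. AnydoorMed extends the paradigm to medical imaging, inpainting anomalies in mammography scans with high fidelity and preserving the anomaly's structural integrity while blending it semantically with the surrounding tissue. Together they show that foundation models can be adapted to different perceptual modalities, enabling highly realistic, controllable, multimodal counterfactual scene generation.

Link: https://arxiv.org/abs/2507.23058
Authors: Alexandru Buburuzan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: A dissertation submitted to The University of Manchester for the degree of Bachelor of Science in Artificial Intelligence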

Abstract:Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly’s structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.

[CV-100] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

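[Quick read]: This paper addresses real-time decision-making and path planning for autonomous vehicles in complex traffic scenes. The core challenge is fusing multimodal perception inputs (front-camera images, HD maps, LiDAR depth maps, and textual waypoints) efficiently while producing low-latency, safe control outputs. The key to the solution is NovaDrive, a single-branch vision-language architecture with a lightweight two-stage cross-attention module that first aligns waypoint tokens with the HD map and then refines attention over image and depth patches. A novel smoothness loss discourages abrupt steering and speed changes, so stable trajectories can be produced without recurrent memory. On the nuScenes / Waymo subset, the method raises the success rate to 84% (+4%) and path efficiency (SPL) to 0.66 (+0.11), and cuts the collision rate from 2.6% to 1.2%.

Link: https://arxiv.org/abs/2507.23042
Authors: Santosh Patapati, Trisanth Srinivasan
Affiliations: Cyrion Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
Comments: 6 pages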

Abstract:Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion contribute the most to these gains. Beyond safety, NovaDrive’s shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
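
The abstract does not give the form of the smoothness loss, so the following is only one plausible sketch: it penalizes the mean squared change between consecutive steering and speed commands. The function name `smoothness_loss` and the weights `w_steer` and `w_speed` are hypothetical.

```python
def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Penalize abrupt changes between consecutive control outputs.

    steering, speed: sequences of predicted commands over time.
    This is an assumed formulation, not the one used in the paper.
    """
    def mean_sq_diff(seq):
        # Squared first differences between neighbouring time steps.
        diffs = [(b - a) ** 2 for a, b in zip(seq, seq[1:])]
        return sum(diffs) / len(diffs) if diffs else 0.0

    return w_steer * mean_sq_diff(steering) + w_speed * mean_sq_diff(speed)

# A gently changing steering sequence is penalized less than a jerky one.
smooth = smoothness_loss([0.0, 0.1, 0.2, 0.3], [10.0, 10.0, 10.0, 10.0])
jerky = smoothness_loss([0.0, 0.5, -0.5, 0.3], [10.0, 10.0, 10.0, 10.0])
```

A term like this would typically be added to the main driving objective with a small weight, so that it discourages oscillation without overriding the task loss.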

[CV-101] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

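[Quick read]: This paper addresses Neural Radiance Fields (NeRF)-based 3D reconstruction and rendering models, whose dense ray sampling causes a surge in floating-point operations and makes them inefficient in resource-constrained scenarios such as edge computing. The key to the solution is a NeRF framework built on Spiking Neural Networks (SNNs) together with a Pretrain-Adaptive Time-step Adjustment (PATA) strategy, which adapts the number of time steps to each scene and so lowers compute cost while maintaining rendering quality. Experiments show that the method can reduce inference time steps by 64% and running power by 61.55% while preserving rendering fidelity.

Link: https://arxiv.org/abs/2507.23033
Authors: Ranxi Lin, Canming Yao, Jiayi Li, Weihang Liu, Xin Lou, Pingqiang Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: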

Abstract:Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.

[CV-102] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging MICCAI

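[Quick read]: This paper addresses the poor echocardiographic image quality found in resource-constrained settings (RCS), which hinders automated cardiac interpretation and limits downstream diagnostic models. The key to the solution is to enhance low-quality 2D echocardiograms with deep learning-based super-resolution (SR), specifically the two widely used models SRResNet and SRGAN, and to evaluate the effect on the public CAMUS dataset across two clinical tasks (view classification and cardiac phase classification). Experiments show that SRResNet yields significant gains in classification performance and is also more computationally efficient, indicating that SR can recover diagnostic information from degraded images and offers a viable path to AI-assisted care in RCS.

Link: https://arxiv.org/abs/2507.23027
Authors: Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the MICCAI Workshop on “Medical Image Computing in Resource Constrained Settings Knowledge Interchange (MIRASOL)” 2025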

Abstract:Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

[CV-103] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction ICCV

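[Quick read]: This paper addresses a common weakness of existing deep learning models for predicting human gaze scanpaths: they generate averaged behaviors and fail to capture the variability of human visual exploration. The key to the solution is ScanDiff, an architecture that combines diffusion models with Vision Transformers and uses the stochastic nature of diffusion to model scanpath variability explicitly, generating diverse and realistic gaze trajectories. Textual conditioning enables task-driven scanpath generation, so the model can adapt to different visual search objectives. Experiments on benchmark datasets show that the method surpasses state-of-the-art approaches in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths.

Link: https://arxiv.org/abs/2507.23021
Authors: Giuseppe Cartella, Vittorio Cuculo, Alessandro D’Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia; University of Milan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025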

Abstract:Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at this https URL.

[CV-104] Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods

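[Quick read]: This paper examines whether the multimodal latent spaces of task-specific AI models support reliable inverse mappings. Mainstream models excel at their forward tasks (for example text-to-image generation and speech-to-text transcription), but it is unclear whether their latent spaces support semantically consistent, perceptually coherent inversion. The approach is an optimization-based framework that iteratively adjusts the input latent variables to approach a target output, applied bidirectionally, for example recovering a text description from an image or the original speech content from audio. Experiments show that although optimization can force outputs that match the target text (an image that is captioned correctly, or audio that is transcribed accurately), perceptual quality and semantic coherence degrade markedly, and the reconstructed latent embeddings often align with nonsensical vocabulary tokens. This exposes a lack of the structure needed for inverse tasks in current latent spaces.

Link: https://arxiv.org/abs/2507.23010
Authors: Siwoo Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: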

Abstract:This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation: multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.
Cite as: arXiv:2507.23010 [cs.LG] (arXiv:2507.23010v1 for this version), https://doi.org/10.48550/arXiv.2507.23010

[CV-105] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

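[Quick read]: This paper addresses fast reconstruction and real-time rendering of large-scale urban scenes while coping with appearance variations across multi-view captures. The core of the solution is a visibility-based image selection strategy that improves training efficiency, combined with a controllable level-of-detail (LOD) strategy that regulates Gaussian density under a user-defined resource budget, balancing efficient training against high-quality rendering. An appearance transformation module mitigates appearance inconsistencies across images, and enhancement modules such as depth regularization, scale regularization, and antialiasing further improve reconstruction accuracy and visual fidelity.

Link: https://arxiv.org/abs/2507.23006
Authors: Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, Guanghua Yang
Affiliations: Jinan University; University of Macau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: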

Abstract:We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: this https URL.

[CV-106] Noise-Coded Illumination for Forensic and Photometric Video Analysis SIGGRAPH2025

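[Quick read]: This paper addresses the information-security problem created by rapidly advancing video manipulation tools: fake videos are increasingly hard to distinguish from real ones, which raises the risk of disinformation. The root difficulty is that manipulators have ready access to the distribution of natural video and can use it to produce convincing fakes, while verifiers lack an effective means of detecting them. The key to the solution is to code very subtle, noise-like modulations into the scene illumination ("coded illumination"), which adds a temporal watermark to the video. Rather than carrying a specific message, the watermark encodes an image of the unmanipulated scene as it would appear lit only by the coded light source. Even if an adversary knows the technique is in use, producing a plausible fake requires solving a harder inverse problem at an information disadvantage, which makes forgery markedly more difficult. The approach suits high-stakes settings such as public events and interviews.

Link: https://arxiv.org/abs/2507.23002
Authors: Peter F. Michael, Zekun Hao, Serge Belongie, Abe Davis
Affiliations: Cornell University; Cornell Tech; University of Copenhagen
Categories: Graphics (cs.GR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Transactions on Graphics (2025), presented at SIGGRAPH 2025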

Abstract:The proliferation of advanced tools for manipulating video has led to an arms race, pitting those who wish to sow disinformation against those who want to detect and expose it. Unfortunately, time favors the ill-intentioned in this race, with fake videos growing increasingly difficult to distinguish from real ones. At the root of this trend is a fundamental advantage held by those manipulating media: equal access to a distribution of what we consider authentic (i.e., “natural”) video. In this paper, we show how coding very subtle, noise-like modulations into the illumination of a scene can help combat this advantage by creating an information asymmetry that favors verification. Our approach effectively adds a temporal watermark to any video recorded under coded illumination. However, rather than encoding a specific message, this watermark encodes an image of the unmanipulated scene as it would appear lit only by the coded illumination. We show that even when an adversary knows that our technique is being used, creating a plausible coded fake video amounts to solving a second, more difficult version of the original adversarial content creation problem at an information disadvantage. This is a promising avenue for protecting high-stakes settings like public events and interviews, where the content on display is a likely target for manipulation, and while the illumination can be controlled, the cameras capturing video cannot.

[CV-107] Planning for Cooler Cities: A Multimodal AI Framework for Predicting and Mitigating Urban Heat Stress through Urban Landscape Transformation

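[Quick read]: This paper addresses the tension between computational efficiency and spatial resolution when assessing outdoor heat stress at city scale. Traditional physical models such as SOLWEIG and ENVI-met give detailed assessments of human-perceived heat exposure, but their computational cost limits city-wide use. The key to the solution is GSM-UTCI, a multimodal deep learning framework that uses a feature-wise linear modulation (FiLM) architecture to fuse surface morphology (nDSM), high-resolution land cover data, and hourly meteorological conditions, dynamically conditioning spatial features on atmospheric context, to predict the daytime average Universal Thermal Climate Index (UTCI) at 1-meter hyperlocal resolution. The method stays close to physical-model accuracy (R² = 0.9151, MAE = 0.41°C) while cutting inference time for an entire city from hours to under five minutes, which makes it considerably more scalable and practical for urban heat planning.

Link: https://arxiv.org/abs/2507.23000
Authors: Shengao Yi, Xiaojiang Li, Wei Tu, Tianhong Zhao
Affiliations: University of Pennsylvania; Shenzhen University; Shenzhen Technology University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: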

Abstract:As extreme heat events intensify due to climate change and urbanization, cities face increasing challenges in mitigating outdoor heat stress. While traditional physical models such as SOLWEIG and ENVI-met provide detailed assessments of human-perceived heat exposure, their computational demands limit scalability for city-wide planning. In this study, we propose GSM-UTCI, a multimodal deep learning framework designed to predict daytime average Universal Thermal Climate Index (UTCI) at 1-meter hyperlocal resolution. The model fuses surface morphology (nDSM), high-resolution land cover data, and hourly meteorological conditions using a feature-wise linear modulation (FiLM) architecture that dynamically conditions spatial features on atmospheric context. Trained on SOLWEIG-derived UTCI maps, GSM-UTCI achieves near-physical accuracy, with an R² of 0.9151 and a mean absolute error (MAE) of 0.41°C, while reducing inference time from hours to under five minutes for an entire city. To demonstrate its planning relevance, we apply GSM-UTCI to simulate systematic landscape transformation scenarios in Philadelphia, replacing bare earth, grass, and impervious surfaces with tree canopy. Results show spatially heterogeneous but consistently strong cooling effects, with impervious-to-tree conversion producing the highest aggregated benefit (-4.18°C average change in UTCI across 270.7 km²). Tract-level bivariate analysis further reveals strong alignment between thermal reduction potential and land cover proportions. These findings underscore the utility of GSM-UTCI as a scalable, fine-grained decision support tool for urban climate adaptation, enabling scenario-based evaluation of greening strategies across diverse urban environments.
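
Feature-wise linear modulation (FiLM) scales and shifts each feature channel using parameters predicted from a conditioning input. The sketch below shows that operation in plain Python. It is a toy illustration, not the GSM-UTCI model: the function names, the linear parameter generator, and the two-element context vector (standing in for meteorological inputs) are assumptions.

```python
def film_params(context, w_gamma, w_beta):
    """Map a context vector to per-channel scale (gamma) and shift (beta).

    A single linear layer is assumed here; in practice the generator is
    usually a small learned network.
    """
    gamma = [sum(w * c for w, c in zip(row, context)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, context)) for row in w_beta]
    return gamma, beta

def film(feature_maps, gamma, beta):
    """Apply out[c] = gamma[c] * x[c] + beta[c] to every spatial position."""
    return [[[g * x + b for x in row] for row in fmap]
            for fmap, g, b in zip(feature_maps, gamma, beta)]

# Two channels of a 1x2 spatial feature map, conditioned on a
# hypothetical context vector such as [air temperature, wind speed].
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
gamma, beta = film_params(
    [2.0, 1.0],
    w_gamma=[[1.0, 0.0], [0.0, 1.0]],
    w_beta=[[0.0, 0.5], [0.5, 0.0]],
)
modulated = film(features, gamma, beta)
```

Because the same `gamma` and `beta` apply at every pixel of a channel, the conditioning input changes how spatial features are weighted without altering their layout.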

[CV-108] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

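[Quick read]: This paper addresses the shortcomings of current Vision-Language Models (VLMs) in assessing hand-written mathematical solutions, in particular understanding a student's solution process, identifying mistakes, and grading against fixed criteria. Existing benchmarks focus on models' problem-solving ability and neglect accurate analysis and evaluation of learners' answers. The key contribution is a new benchmark, the EGE-Math Solutions Assessment Benchmark, which contains 122 hand-written solutions from the Russian Unified State Exam (EGE) with official expert grades, and on which seven mainstream VLMs (from Google, OpenAI, Arcee AI, and Alibaba Cloud) are systematically evaluated in three inference modes. The results reveal current limitations in mathematical reasoning and in alignment with human grading rubrics, pointing to directions for future work on AI-assisted assessment.

Link: https://arxiv.org/abs/2507.22958
Authors: Ruslan Khrulev
Affiliations: Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures, 10 tables. Code is available at: this https URL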

Abstract:This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. You can find code in this https URL

[CV-109] Automated Label Placement on Maps via Large Language Models KDD2025

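[Quick read]: This paper addresses the long-standing difficulty of automatic label placement (ALP) in map design: existing automated systems struggle to integrate cartographic conventions, adapt to context, or interpret labeling instructions, so the task remains largely manual and hard to scale. The key to the solution is to reformulate ALP as a data editing problem and use large language models (LLMs) for context-aware spatial annotation. Specifically, retrieval-augmented generation (RAG) retrieves the labeling guidelines relevant to each landmark type, the guidelines are integrated into structured prompts, and instruction-tuned LLMs generate label coordinates that follow expert cartographic standards. The approach improves the accuracy and generalizability of label placement and offers a scalable paradigm for AI-assisted map finishing.

Link: https://arxiv.org/abs/2507.22952
Authors: Harry Shomer, Jiejun Xu
Affiliations: University of Texas at Arlington; HRL Laboratories
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Workshop on AI for Data Editing (AI4DE) at KDD 2025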

Abstract:Label placement is a critical aspect of map design, serving as a form of spatial annotation that directly impacts clarity and interpretability. Despite its importance, label placement remains largely manual and difficult to scale, as existing automated systems struggle to integrate cartographic conventions, adapt to context, or interpret labeling instructions. In this work, we introduce a new paradigm for automatic label placement (ALP) that formulates the task as a data editing problem and leverages large language models (LLMs) for context-aware spatial annotation. To support this direction, we curate MAPLE, the first known benchmarking dataset for evaluating ALP on real-world maps, encompassing diverse landmark types and label placement annotations from open-source data. Our method retrieves labeling guidelines relevant to each landmark type leveraging retrieval-augmented generation (RAG), integrates them into prompts, and employs instruction-tuned LLMs to generate ideal label coordinates. We evaluate four open-source LLMs on MAPLE, analyzing both overall performance and generalization across different types of landmarks. This includes both zero-shot and instruction-tuned performance. Our results demonstrate that LLMs, when guided by structured prompts and domain-specific retrieval, can learn to perform accurate spatial edits, aligning the generated outputs with expert cartographic standards. Overall, our work presents a scalable framework for AI-assisted map finishing and demonstrates the potential of foundation models in structured data editing tasks. The code and data can be found at this https URL.

[CV-110] LearnRobot: An Interactive Learning-Based Multi-Modal Robot with Continuous Improvement

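[Quick read]: This paper addresses the difficulty of improving robot performance after deployment, especially the lack of adaptability to novel scenarios never seen before. The key to the solution is an interactive learning robot system powered by a Multi-modal Large Language Model (MLLM). Its core ideas are continual learning from natural dialogue with non-expert users, a "chain of question" mechanism that clarifies the user's intent before answering, and dual-modality retrieval modules that use past interaction events to avoid repeating mistakes. Together these keep the user experience seamless before the model is updated and improve adaptability and performance in diverse environments.

Link: https://arxiv.org/abs/2507.22896
Authors: Kohou Wang, ZhaoXiang Liu, Lin Bai, Kun Fan, Xiang Liu, Huan Hu, Kai Wang, Shiguo Lian
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 17 pages, 12 figures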

Abstract:It is crucial that robots’ performance can be improved after deployment, as they are inherently likely to encounter novel scenarios never seen before. This paper presents an innovative solution: an interactive learning-based robot system powered by a Multi-modal Large Language Model (MLLM). A key feature of our system is its ability to learn from natural dialogues with non-expert users. We also propose chain of question to clarify the exact intent of the question before providing an answer and dual-modality retrieval modules to leverage these interaction events to avoid repeating the same mistakes, ensuring a seamless user experience before model updates, which is in contrast to current mainstream MLLM-based robotic systems. Our system marks a novel approach in robotics by integrating interactive learning, paving the way for superior adaptability and performance in diverse environments. We demonstrate the effectiveness and improvement of our method through experiments, both quantitatively and qualitatively.

[CV-111] Topology Optimization in Medical Image Segmentation with Fast Euler Characteristic

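[Quick read]: This paper addresses topological inaccuracy in deep learning-based medical image segmentation: fully automatic methods score well on pixel-level metrics such as the Dice score but often violate topological constraints such as continuous boundaries or closed surfaces, which limits clinical usability. The core of the solution is a fast topology-aware mechanism based on the Euler characteristic (χ). First, a fast formulation for computing χ in both 2D and 3D is proposed, and the χ error serves as the topological metric. Next, a topological violation map locates the spatial regions of a segmentation that contain topological errors. Finally, a topology-aware correction network refines the output of any segmentation model, markedly improving topological correctness while preserving pixel-level accuracy.

Link: https://arxiv.org/abs/2507.23763
Authors: Liu Li, Qiang Ma, Cheng Ouyang, Johannes C. Paetzold, Daniel Rueckert, Bernhard Kainz
Affiliations: Imperial College London; University of Oxford; Institute of Clinical Sciences, Imperial College London; Weill Cornell Medicine; Cornell Tech; Technical University of Munich; SBEIS, King’s College London; Department AIBE of Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: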

Abstract:Deep learning-based medical image segmentation techniques have shown promising results when evaluated based on conventional metrics such as the Dice score or Intersection-over-Union. However, these fully automatic methods often fail to meet clinically acceptable accuracy, especially when topological constraints should be observed, e.g., continuous boundaries or closed surfaces. In medical image segmentation, the correctness of a segmentation in terms of the required topological genus sometimes is even more important than the pixel-wise accuracy. Existing topology-aware approaches commonly estimate and constrain the topological structure via the concept of persistent homology (PH). However, these methods are difficult to implement for high dimensional data due to their polynomial computational complexity. To overcome this problem, we propose a novel and fast approach for topology-aware segmentation based on the Euler Characteristic (χ). First, we propose a fast formulation for χ computation in both 2D and 3D. The scalar χ error between the prediction and ground-truth serves as the topological evaluation metric. Then we estimate the spatial topology correctness of any segmentation network via a so-called topological violation map, i.e., a detailed map that highlights regions with χ errors. Finally, the segmentation results from the arbitrary network are refined based on the topological violation maps by a topology-aware correction network. Our experiments are conducted on both 2D and 3D datasets and show that our method can significantly improve topological correctness while preserving pixel-wise segmentation accuracy.
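
For intuition, the Euler characteristic of a 2D binary mask can be computed with the textbook formula χ = V − E + F, treating each foreground pixel as a closed unit square. The sketch below uses that standard construction, which corresponds to 8-connectivity; it is not the fast formulation proposed in the paper, and the function name is hypothetical.

```python
def euler_characteristic_2d(mask):
    """Return chi = V - E + F for a binary mask given as a list of rows.

    Each foreground pixel contributes one face, its four corner vertices,
    and its four edges. Shared vertices and edges are counted once.
    """
    vertices, edges, faces = set(), set(), 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if not value:
                continue
            faces += 1
            vertices.update([(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)])
            edges.update([
                ((r, c), (r, c + 1)),          # top edge
                ((r + 1, c), (r + 1, c + 1)),  # bottom edge
                ((r, c), (r + 1, c)),          # left edge
                ((r, c + 1), (r + 1, c + 1)),  # right edge
            ])
    return len(vertices) - len(edges) + faces

ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
chi_ring = euler_characteristic_2d(ring)         # one component, one hole
chi_two = euler_characteristic_2d([[1, 0, 1]])   # two separate components
```

In 2D, χ equals the number of connected components minus the number of holes, so comparing χ between a prediction and its ground truth gives a cheap scalar signal of topological disagreement.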

[CV-112] Towards Field-Ready AI-based Malaria Diagnosis: A Continual Learning Approach MICCAI2025

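[Quick read]: This paper addresses the limited generalization of malaria computer-aided diagnosis (CAD) models across medical sites caused by differences in data distribution (domain shift), with a focus on the practical challenges of deployment in low-resource settings. The key to the solution is continual learning (CL): training is framed as a domain-incremental learning scenario in which a YOLO-based object detector adapts step by step to data from new acquisition sites while retaining performance on previously seen domains. The results suggest that CL, and rehearsal-based methods in particular, can significantly improve robustness, supporting the potential of continual learning for deployable malaria CAD tools.

Link: https://arxiv.org/abs/2507.23648
Authors: Louise Guillon, Soheib Biga, Yendoube E. Kantchire, Mouhamadou Lamine Sane, Grégoire Pasquier, Kossi Yakpa, Stéphane E. Sossou, Marc Thellier, Laurent Bonnardot, Laurence Lachaud, Renaud Piarroux, Ameyo M. Dorkenoo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025 AMAI Workshop, Accepted, Submitted Manuscript Version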

Abstract:Malaria remains a major global health challenge, particularly in low-resource settings where access to expert microscopy may be limited. Deep learning-based computer-aided diagnosis (CAD) systems have been developed and demonstrate promising performance on thin blood smear images. However, their clinical deployment may be hindered by limited generalization across sites with varying conditions. Yet very few practical solutions have been proposed. In this work, we investigate continual learning (CL) as a strategy to enhance the robustness of malaria CAD models to domain shifts. We frame the problem as a domain-incremental learning scenario, where a YOLO-based object detector must adapt to new acquisition sites while retaining performance on previously seen domains. We evaluate four CL strategies, two rehearsal-based and two regularization-based methods, on real-life conditions thanks to a multi-site clinical dataset of thin blood smear images. Our results suggest that CL, and rehearsal-based methods in particular, can significantly improve performance. These findings highlight the potential of continual learning to support the development of deployable, field-ready CAD tools for malaria.
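
Rehearsal-based continual learning keeps a small memory of samples from earlier domains and mixes them into training on a new domain. The abstract does not say which rehearsal methods were evaluated, so the sketch below shows only the general idea with a reservoir-sampled buffer. The class name `ReplayBuffer`, its methods, and the site labels are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: every sample seen so far has an equal
        # chance of being kept in the buffer.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def mixed_batch(self, new_samples, n_replay):
        # Combine new-domain samples with replayed old-domain samples.
        k = min(n_replay, len(self.items))
        return list(new_samples) + self.rng.sample(self.items, k)

buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("site_A", i))  # samples from a previously seen site
batch = buffer.mixed_batch([("site_B", 0), ("site_B", 1)], n_replay=2)
```

Training the detector on batches like `batch` exposes it to both the new site and a sample of earlier sites, which is the mechanism rehearsal relies on to limit forgetting.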

[CV-113] JPEG Processing Neural Operator for Backward-Compatible Coding

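【Quick Read】: This paper addresses the challenge of standardizing codecs for learning-based lossy compression, in particular how to improve image compression performance without breaking compatibility with the existing JPEG format. The key to its solution is the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that introduces neural operators in both the encoding and decoding stages. By improving chroma component preservation and reconstruction fidelity, it achieves better compression performance with lower memory usage and fewer parameters.

Link: https://arxiv.org/abs/2507.23521
Authors: Woo Kyoung Han,Yongjun Lee,Byeonghun Lee,Sang Hyun Park,Sunghoon Im,Kyong Hwan Jin
Affiliations: Korea University; DGIST (Daegu Gyeongbuk Institute of Science and Technology)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: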

Abstract:Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding’s protocol. Our source code is available at this https URL.

[CV-114] Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation ICONIP2025

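【Quick Read】: This paper addresses the energy-efficiency and compute bottlenecks that resource-constrained sensor edge devices, such as video capsule endoscopes, face when running image classification. Conventional deep neural networks have many parameters and costly operations, which makes them hard to deploy on medical devices with limited battery life; in addition, converting images from the Bayer format to RGB consumes considerable energy and should be skipped where possible. The key to the solution is to classify organs directly on Bayer images, using a convolutional neural network (CNN) with only 63,000 parameters together with Viterbi decoding for time-series analysis, and to run it on a customized PULPissimo system-on-chip (SoC) that integrates a RISC-V core with an ultra-low-power hardware accelerator. This requires only 5.31 μJ per image and saves an average of 89.9% of energy before the capsule enters the small intestine, compared with classic video capsules.

Link: https://arxiv.org/abs/2507.23398
Authors: Oliver Bause,Julia Werner,Paul Palomero Bernardo,Oliver Bringmann
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 32nd International Conference on Neural Information Processing - ICONIP 2025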

Abstract:For many real-world applications involving low-power sensor edge devices, deep neural networks used for image classification might not be suitable. This is due to their typically large model size and requirement of operations often exceeding the capabilities of such resource-limited devices. Furthermore, camera sensors usually capture images with a Bayer color filter applied, which are subsequently converted to RGB images that are commonly used for neural network training. However, on resource-constrained devices, such conversions demand their share of energy and should ideally be skipped if possible. This work addresses the need for hardware-suitable AI targeting sensor edge devices by means of Video Capsule Endoscopy, an important medical procedure for the investigation of the small intestine, which is strongly limited by its battery lifetime. Accurate organ classification is performed with a final accuracy of 93.06%, evaluated directly on Bayer images, involving a CNN with only 63,000 parameters and time-series analysis in the form of Viterbi decoding. Finally, the process of capturing images with a camera and raw image processing is demonstrated with a customized PULPissimo System-on-Chip with a RISC-V core and an ultra-low power hardware accelerator, providing an energy-efficient AI-based image classification approach requiring just 5.31 μJ per image. As a result, it is possible to save an average of 89.9% of energy before entering the small intestine compared to classic video capsules.
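
To make the role of Viterbi decoding in the abstract more concrete, here is a minimal sketch of how per-frame classifier outputs could be smoothed into a consistent organ sequence. Everything specific in it is an assumption for illustration: the three states, the forward-only transition matrix, and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def viterbi(emissions, transition, prior):
    """Return the most likely state sequence.

    emissions[t][s]: classifier probability of state s at frame t
    transition[r][s]: probability of moving from state r to state s
    prior[s]: probability of starting in state s
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(prior)
    score = [log(prior[s]) + log(emissions[0][s]) for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for frame in emissions[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: score[r] + log(transition[r][s]))
            pointers.append(best)
            new_score.append(score[best] + log(transition[best][s]) + log(frame[s]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical states: 0, 1, 2 are consecutive organs; the capsule only moves forward.
transition = [[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]]
prior = [1.0, 0.0, 0.0]
# Toy per-frame probabilities whose frame-wise argmax would be 0, 1, 0, 1, 1.
emissions = [[0.8, 0.1, 0.1],
             [0.3, 0.6, 0.1],
             [0.7, 0.2, 0.1],
             [0.2, 0.7, 0.1],
             [0.1, 0.8, 0.1]]
path = viterbi(emissions, transition, prior)
print(path)
```

With these toy numbers the frame-wise prediction jumps back and forth between the first two states, while the decoded path stays in state 0 for three frames and then switches once to state 1, which is the kind of temporal consistency a small classifier cannot enforce on its own.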

[CV-115] Pixel Embedding Method for Tubular Neurite Segmentation

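【Quick Read】: This paper addresses automatic segmentation of neuronal topology, in particular the difficulties that the complex morphology of neuronal branches and the occlusions among fibers pose for deep learning-based segmentation. Its solution has three key parts. First, it designs a deep network that outputs pixel-level embedding vectors, with a corresponding loss function, so that the learned features can distinguish different neuronal connections within occluded regions. Second, it builds an end-to-end pipeline that produces SWC-format neuron structure trees directly from raw neuronal images. Third, it proposes a new topological evaluation metric to quantify the quality of neuron segmentation and reconstruction more accurately.

Link: https://arxiv.org/abs/2507.23359
Authors: Huayu Fu,Jiamin Li,Haozhi Qu,Xiaolin Hu,Zengcai Guo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: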

Abstract:Automatic segmentation of neuronal topology is critical for handling large scale neuroimaging data, as it can greatly accelerate neuron annotation and analysis. However, the intricate morphology of neuronal branches and the occlusions among fibers pose significant challenges for deep learning based segmentation. To address these issues, we propose an improved framework: First, we introduce a deep network that outputs pixel level embedding vectors and design a corresponding loss function, enabling the learned features to effectively distinguish different neuronal connections within occluded regions. Second, building on this model, we develop an end to end pipeline that directly maps raw neuronal images to SWC formatted neuron structure trees. Finally, recognizing that existing evaluation metrics fail to fully capture segmentation accuracy, we propose a novel topological assessment metric to more appropriately quantify the quality of neuron segmentation and reconstruction. Experiments on our fMOST imaging dataset demonstrate that, compared to several classical methods, our approach significantly reduces the error rate in neuronal topology reconstruction.

[CV-116] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision MICCAI2025

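【Quick Read】: This paper addresses inaccurate automatic segmentation of brain tumors, gliomas in particular, in Sub-Saharan Africa (SSA), where medical resources are scarce, MRI image quality is poor, and expert annotators are few. The current practice of quantifying tumors through manual segmentation of multi-parametric MRI is time-consuming and often infeasible, especially in low-income regions. The key to the solution is EMedNeXt, a brain tumor segmentation framework built on MedNeXt V2, with three core contributions: (1) a larger region of interest (ROI) to improve boundary sensitivity; (2) an improved nnU-Net v2-based architectural skeleton to strengthen feature extraction; and (3) a robust model ensembling system combined with tailored post-processing, which improves segmentation on low-quality images and small datasets. On the hidden validation set it achieves an average LesionWise Dice similarity coefficient (DSC) of 0.897 and an average LesionWise Normalized Surface Distance (NSD) of 0.541 and 0.84 at tolerances of 0.5 mm and 1.0 mm, respectively.

Link: https://arxiv.org/abs/2507.23256
Authors: Ahmed Jaheen,Abdelrahman Elsayed,Damir Kim,Daniil Tikhonov,Matheus Scatolin,Mohor Banerjee,Qiankun Ji,Mostafa Salem,Hu Wang,Sarim Hashmi,Mohammad Yaqub
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the BraTS-Lighthouse 2025 Challenge (MICCAI 2025)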

Abstract:Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.
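
The DSC reported above is a lesion-wise variant used by the challenge. As background only, the following minimal sketch computes the plain Dice similarity coefficient on flattened binary masks; it does not reproduce the challenge's LesionWise procedure, and the function name and the toy masks are hypothetical.

```python
def dice(pred, target):
    """Plain Dice coefficient for two equally long binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


pred = [1, 1, 0, 0, 1]     # toy predicted tumor voxels
target = [1, 0, 0, 1, 1]   # toy ground-truth voxels
score = dice(pred, target)  # 2 * 2 / (3 + 3)
print(round(score, 3))
```

Dice rewards overlap relative to the combined size of both masks, which is why small structures are penalized heavily for even a few missed voxels.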

[CV-117] Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction ACM-MM2025

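【Quick Read】: This paper addresses the blurred details and unexpected artifacts that commonly arise when high-resolution (HR) images are downscaled in the sRGB domain, as well as the lack of a dedicated downscaling framework for RAW images. Its core solution is a wavelet-based recurrent reconstruction framework that exploits the lossless nature of the wavelet transform to downscale RAW images at arbitrary scales in a coarse-to-fine manner. The key components are the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM), which preserve the structural and textural integrity of the reconstructed low-resolution (LR) RAW images, together with an energy-maximization loss that aligns high-frequency energy between the HR and LR domains. The method outperforms existing state-of-the-art methods both quantitatively and visually.

Link: https://arxiv.org/abs/2507.23219
Authors: Yang Ren,Hai Jiang,Wei Li,Menglong Yang,Heng Zhang,Zehua Sheng,Qingsheng Ye,Shuaicheng Liu
Affiliations: Sichuan University; Xiaomi Inc.; University of Electronic Science and Technology of China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025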

Abstract:Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information lossless attribute of wavelet transformation to fulfill the arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and incorporate it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at this https URL.
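
The "information lossless attribute" of the wavelet transform that the abstract relies on can be illustrated with a one-level Haar transform, which splits a signal into a half-length low-frequency part and a high-frequency part and recovers the input exactly from the two. This is a minimal sketch under assumptions: the paper does not state here which wavelet it uses, so the Haar choice, the 1D setting, and the function names are illustrative only.

```python
import math

SCALE = 1 / math.sqrt(2)


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    low = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_forward."""
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * SCALE, (a - d) * SCALE])
    return out


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0]
low, high = haar_forward(signal)       # `low` is a half-length, downscaled view
restored = haar_inverse(low, high)     # low + high together lose nothing

energy_in = sum(x * x for x in signal)
energy_out = sum(x * x for x in low) + sum(x * x for x in high)
```

The low band alone is a 2× downscaled signal, and the high band carries exactly the detail that was removed, which is why predicting or aligning high-frequency content is a natural way to control what a downscaler discards.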

[CV-118] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery

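【Quick Read】: This paper addresses the difficulty that differences in spatial resolution across satellite sensors create for data fusion and downstream applications. Heterogeneous sensors such as Landsat and Sentinel differ in spectral and temporal characteristics, and existing super-resolution techniques are mostly trained on artificially downscaled images, so they transfer poorly to real sensor data. The key to the solution is a preliminary framework that takes Harmonized Landsat Sentinel 30 m (HLS 30) imagery as the target and aligns and enhances it using Harmonized Landsat Sentinel 10 m (HLS 10) imagery as the reference, which narrows the resolution gap between the sensors and improves the quality of super-resolved Landsat imagery.

Link: https://arxiv.org/abs/2507.23150
Authors: Philip Wootaek Shin,Vishal Gaur,Rahul Ramachandran,Manil Maskey,Jack Sampson,Vijaykrishnan Narayanan,Sujit Roy
Affiliations: The Pennsylvania State University; NASA Marshall Space Flight Center; University of Alabama in Huntsville
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: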

Abstract:High-resolution satellite imagery is essential for geospatial analysis, yet differences in spatial resolution across satellite sensors present challenges for data fusion and downstream applications. Super-resolution techniques can help bridge this gap, but existing methods rely on artificially downscaled images rather than real sensor data and are not well suited for heterogeneous satellite sensors with differing spectral and temporal characteristics. In this work, we develop a preliminary framework to align and super-resolve Harmonized Landsat Sentinel 30 m (HLS 30) imagery using Harmonized Landsat Sentinel 10 m (HLS 10) as a reference from the HLS dataset. Our approach aims to bridge the resolution gap between these sensors and improve the quality of super-resolved Landsat imagery. Quantitative and qualitative evaluations demonstrate the effectiveness of our method, showing its potential for enhancing satellite-based sensing applications. This study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future advancements in the field.

[CV-119] MRpro - open PyTorch-based MR reconstruction and processing package

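【Quick Read】: This paper addresses the lack of a unified, reproducible, and easily extensible open-source framework for magnetic resonance (MR) image reconstruction. Existing approaches are often fragmented and poorly interoperable in data handling, algorithm implementation, and deep learning integration, which limits research efficiency and collaboration. The key to the solution is MRpro, an open-source image reconstruction package built on PyTorch. Its core components are: unified data structures for consistent handling of MR datasets and their metadata (e.g., k-space trajectories); a library of composable operators, proximable functionals, and optimization algorithms (including a unified Fourier operator for all common trajectories and an extended phase graph simulation for quantitative MR); and modular deep learning building blocks (data consistency layers, differentiable optimization layers, and state-of-the-art backbone networks), along with integrated public datasets to improve reproducibility. Developed collaboratively with automated quality control, the framework provides ready-to-use reconstruction implementations for a range of applications, including Cartesian, radial, and spiral acquisitions, motion-corrected reconstruction, and cardiac MR fingerprinting.

Link: https://arxiv.org/abs/2507.23129
Authors: Felix Frederik Zimmermann,Patrick Schuenke,Christoph S. Aigner,Bill A. Bernhardt,Mara Guastini,Johannes Hammacher,Noah Jaitner,Andreas Kofler,Leonid Lunin,Stefan Martin,Catarina Redshaw Kranich,Jakob Schattenfroh,David Schote,Yanglei Wu,Christoph Kolbitsch
Affiliations: Physikalisch-Technische Bundesanstalt (PTB); Charité – Universitätsmedizin Berlin; Max Planck Institute for Human Development
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Submitted to Magnetic Resonance in Medicine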

Abstract:We introduce MRpro, an open-source image reconstruction package built upon PyTorch and open data formats. The framework comprises three main areas. First, it provides unified data structures for the consistent manipulation of MR datasets and their associated metadata (e.g., k-space trajectories). Second, it offers a library of composable operators, proximable functionals, and optimization algorithms, including a unified Fourier operator for all common trajectories and an extended phase graph simulation for quantitative MR. These components are used to create ready-to-use implementations of key reconstruction algorithms. Third, for deep learning, MRpro includes essential building blocks such as data consistency layers, differentiable optimization layers, and state-of-the-art backbone networks and integrates public datasets to facilitate reproducibility. MRpro is developed as a collaborative project supported by automated quality control. We demonstrate the versatility of MRpro across multiple applications, including Cartesian, radial, and spiral acquisitions; motion-corrected reconstruction; cardiac MR fingerprinting; learned spatially adaptive regularization weights; model-based learned image reconstruction and quantitative parameter estimation. MRpro offers an extensible framework for MR image reconstruction. With reproducibility and maintainability at its core, it facilitates collaborative development and provides a foundation for future MR imaging research.

[CV-120] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

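【Quick Read】: This paper addresses the drop in domain generalization performance for pancreas segmentation in medical imaging caused by cross-center and cross-sequence variation. Public cross-domain benchmarks under-represent the pancreas despite its clinical importance, and state-of-the-art deep networks still miss 20-30% of the organ, which often has low T1 contrast. The key to the solution is PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset comprising 563 scans from six institutions, covering venous phase and out-of-phase sequences, with pixel-level masks produced by a double-blind, two-pass annotation protocol. On this basis the authors propose a semi-supervised method that exploits anatomical invariances and significantly outperforms existing domain generalization techniques on cross-sequence segmentation, reporting Dice score improvements of 61.63% and 87.00% on two test centers.

Link: https://arxiv.org/abs/2507.23110
Authors: Zheyuan Zhang,Linkai Peng,Wanying Dou,Cuiling Sun,Halil Ertugrul Aktas,Andrea M. Bejar,Elif Keles,Gorkem Durak,Ulas Bagci
Affiliations: Northwestern University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: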

Abstract:Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve 90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at this https URL.

[CV-121] LesionGen: A Concept-Guided Diffusion Model for Dermatology Image Synthesis MICCAI2025

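【Quick Read】: This paper addresses the scarcity of high-quality, diverse, and well-annotated medical images for skin disease classification, a bottleneck driven by privacy concerns, high annotation costs, and insufficient demographic representation. The key to the solution is LesionGen, a clinically informed text-to-image diffusion probabilistic model (T2I-DPM) framework. It fine-tunes a pretrained diffusion model on structured, concept-rich dermatological captions (derived from expert annotations and pseudo-generated, concept-guided reports) so that realistic and diverse skin lesion images can be generated conditioned on meaningful dermatological descriptions. Classifiers trained solely on the synthetic data reach accuracy comparable to those trained on real images, with notable gains in worst-case subgroup performance.

Link: https://arxiv.org/abs/2507.23001
Authors: Jamil Fayyad,Nourhan Bayasi,Ziyang Yu,Homayoun Najjaran
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the MICCAI 2025 ISIC Workshop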

Abstract:Deep learning models for skin disease classification require large, diverse, and well-annotated datasets. However, such resources are often limited due to privacy concerns, high annotation costs, and insufficient demographic representation. While text-to-image diffusion probabilistic models (T2I-DPMs) offer promise for medical data synthesis, their use in dermatology remains underexplored, largely due to the scarcity of rich textual descriptions in existing skin image datasets. In this work, we introduce LesionGen, a clinically informed T2I-DPM framework for dermatology image synthesis. Unlike prior methods that rely on simplistic disease labels, LesionGen is trained on structured, concept-rich dermatological captions derived from expert annotations and pseudo-generated, concept-guided reports. By fine-tuning a pretrained diffusion model on these high-quality image-caption pairs, we enable the generation of realistic and diverse skin lesion images conditioned on meaningful dermatological descriptions. Our results demonstrate that models trained solely on our synthetic dataset achieve classification accuracy comparable to those trained on real images, with notable gains in worst-case subgroup performance. Code and data are available here.

Artificial Intelligence

[AI-0] Distributed AI Agents for Cognitive Underwater Robot Autonomy

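【Quick Read】: This paper addresses the fundamental challenge of achieving robust cognitive autonomy for robots in complex, unpredictable environments, with Autonomous Underwater Vehicles (AUVs) as the application. The core of the solution is a new architecture, Underwater Robot Self-Organizing Autonomy (UROSA), built on distributed Large Language Model (LLM) AI agents integrated within the Robot Operating System 2 (ROS 2) framework. It decentralizes cognition into specialized AI agents responsible for multimodal perception, adaptive reasoning, dynamic mission planning, and real-time decision-making. Key innovations include flexible agents that adapt their roles dynamically, knowledge management through Retrieval-Augmented Generation (RAG) over vector databases, reinforcement learning-driven behavior optimization, and on-the-fly generation of ROS 2 nodes at runtime for functional extensibility. Empirical results in simulated and real-world underwater missions show better adaptability and reliability than traditional rule-based architectures.

Link: https://arxiv.org/abs/2507.23735
Authors: Markus Buchholz,Ignacio Carlucho,Michele Grimaldi,Yvan R. Petillot
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: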

Abstract:Achieving robust cognitive autonomy in robots navigating complex, unpredictable environments remains a fundamental challenge in robotics. This paper presents Underwater Robot Self-Organizing Autonomy (UROSA), a groundbreaking architecture leveraging distributed Large Language Model AI agents integrated within the Robot Operating System 2 (ROS 2) framework to enable advanced cognitive capabilities in Autonomous Underwater Vehicles. UROSA decentralises cognition into specialised AI agents responsible for multimodal perception, adaptive reasoning, dynamic mission planning, and real-time decision-making. Central innovations include flexible agents dynamically adapting their roles, retrieval-augmented generation utilising vector databases for efficient knowledge management, reinforcement learning-driven behavioural optimisation, and autonomous on-the-fly ROS 2 node generation for runtime functional extensibility. Extensive empirical validation demonstrates UROSA’s promising adaptability and reliability through realistic underwater missions in simulation and real-world deployments, showing significant advantages over traditional rule-based architectures in handling unforeseen scenarios, environmental uncertainties, and novel mission objectives. This work not only advances underwater autonomy but also establishes a scalable, safe, and versatile cognitive robotics framework capable of generalising to a diverse array of real-world applications.

[AI-1] Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

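【Quick Read】: This paper addresses the difficulty of getting Reinforcement Learning (RL) to generalize across environments in visuomotor agents, where overfitting to specific tasks or environments limits general behavior. The key elements of the solution are: first, cross-view goal specification as a unified multi-task goal space, which makes policy representations consistent across tasks; second, automated task synthesis that exploits the high customizability of the Minecraft environment, combined with a distributed RL framework for large-scale multi-task training, which markedly improves the agents' spatial reasoning and zero-shot generalization. Experiments show a 4× increase in interaction success rates and zero-shot transfer of spatial reasoning to unseen worlds, supporting the potential of large-scale RL training in 3D simulated environments for visuomotor agents.

Link: https://arxiv.org/abs/2507.23698
Authors: Shaofei Cai,Zhancun Mu,Haiwen Xia,Bowei Zhang,Anji Liu,Yitao Liang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: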

Abstract:While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn’t yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL’s potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by 4× and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents’ spatial reasoning.

[AI-2] A survey of multi-agent geosimulation methodologies: from ABM to LLM

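【Quick Read】: This paper addresses how to integrate Large Language Models (LLMs) effectively into geosimulation platforms to build next-generation multi-agent systems. The core challenge is ensuring that LLMs used as agent components follow a structured architecture that matches fundamental agent activities such as perception, memory, planning, and action. The key to the solution is a formalized framework, intended as a formal specification for geosimulation platforms, that codifies the principles linking multi-agent systems, simulations, and information systems; the authors find that LLMs following this architecture can be incorporated effectively as agent components, providing a solid basis for next-generation geosimulation systems.

Link: https://arxiv.org/abs/2507.23694
Authors: Virginia Padilla,Jacinto Dávila
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 20 pages, 1 table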

Abstract:We provide a comprehensive examination of agent-based approaches that codify the principles and linkages underlying multi-agent systems, simulations, and information systems. Based on two decades of study, this paper confirms a framework intended as a formal specification for geosimulation platforms. Our findings show that large language models (LLMs) can be effectively incorporated as agent components if they follow a structured architecture specific to fundamental agent activities such as perception, memory, planning, and action. This integration is precisely consistent with the architecture that we formalize, providing a solid platform for next-generation geosimulation systems.

[AI-3] villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

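【Quick Read】: This paper addresses the limited generalization of robot manipulation policies in complex scenarios, in particular the difficulty conventional Visual-Language-Action (VLA) models have in learning transferable control policies for unseen tasks and environments. The key to the solution is villa-X, a new Visual-Language-Latent-Action (ViLLA) framework that improves both how latent actions are learned and how they are incorporated into VLA pre-training, so that the model better captures abstract changes between visual states and uses them to guide more robust and general manipulation policies. The method achieves superior performance in simulated environments, including SIMPLER and LIBERO, and on two real-world robot setups (gripper and dexterous hand manipulation).

Link: https://arxiv.org/abs/2507.23682
Authors: Xiaoyu Chen,Hangxing Wei,Pushi Zhang,Chuheng Zhang,Kaixin Wang,Yanjiang Guo,Rushuai Yang,Yucen Wang,Xinquan Xiao,Li Zhao,Jianyu Chen,Jiang Bian
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL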

Abstract:Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce villa-X, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.

[AI-4] Automating AI Failure Tracking: Semantic Association of Reports in AI Incident Database ECAI2025

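【Quick Read】: This paper addresses a bottleneck in tracking the societal harm caused by vulnerabilities of Artificial Intelligence (AI) systems deployed in high-stakes domains: new reports submitted to the AI Incident Database (AIID) are hard to associate with existing AI incident records efficiently and automatically. Manual association limits scalability and delays the identification of emerging failure patterns. The key to the solution is a retrieval framework based on semantic similarity modeling. It formalizes the matching of new reports to past AI incidents as a ranking problem and scores candidates by the cosine similarity of embeddings of titles and descriptions produced by transformer-based sentence embedding models. Experiments show that this approach outperforms traditional lexical methods and cross-encoder architectures, that combining titles with full descriptions substantially improves accuracy, and that retrieval performance is robust to description length and improves as the training set grows.

Link: https://arxiv.org/abs/2507.23669
Authors: Diego Russo,Gian Marco Orlando,Valerio La Gatta,Vincenzo Moscato
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted at the 28th European Conference on Artificial Intelligence (ECAI 2025)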

Abstract:Artificial Intelligence (AI) systems are transforming critical sectors such as healthcare, finance, and transportation, enhancing operational efficiency and decision-making processes. However, their deployment in high-stakes domains has exposed vulnerabilities that can result in significant societal harm. To systematically study and mitigate these risks, initiatives like the AI Incident Database (AIID) have emerged, cataloging over 3,000 real-world AI failure reports. Currently, associating a new report with the appropriate AI Incident relies on manual expert intervention, limiting scalability and delaying the identification of emerging failure patterns. To address this limitation, we propose a retrieval-based framework that automates the association of new reports with existing AI Incidents through semantic similarity modeling. We formalize the task as a ranking problem, where each report (comprising a title and a full textual description) is compared to previously documented AI Incidents based on embedding cosine similarity. Benchmarking traditional lexical methods, cross-encoder architectures, and transformer-based sentence embedding models, we find that the latter consistently achieve superior performance. Our analysis further shows that combining titles and descriptions yields substantial improvements in ranking accuracy compared to using titles alone. Moreover, retrieval performance remains stable across variations in description length, highlighting the robustness of the framework. Finally, we find that retrieval performance consistently improves as the training set expands. Our approach provides a scalable and efficient solution for supporting the maintenance of the AIID.
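
The ranking step described in the abstract, comparing a new report with documented incidents by embedding cosine similarity, can be sketched as follows. This is a minimal illustration under assumptions: the three-dimensional vectors are made-up stand-ins for sentence embeddings (no embedding model is called), and the incident identifiers and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_incidents(report_vec, incident_vecs):
    """Return incident ids ordered from most to least similar to the report."""
    scored = [(cosine(report_vec, vec), incident_id)
              for incident_id, vec in incident_vecs.items()]
    return [incident_id for _, incident_id in sorted(scored, reverse=True)]


# Toy embeddings; in practice these would come from a sentence embedding model
# applied to the title and description text.
incident_vecs = {
    "incident_1": [1.0, 0.0, 0.0],
    "incident_2": [0.6, 0.8, 0.0],
    "incident_3": [0.0, 0.0, 1.0],
}
report_vec = [0.9, 0.3, 0.0]
ranking = rank_incidents(report_vec, incident_vecs)
print(ranking)
```

Treating the task as ranking rather than classification means a curator can inspect the top few candidates, and a report that matches nothing well can still be flagged as a possible new incident.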

[AI-5] Personalized Education with Ranking Alignment Recommendation

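【Quick Read】: This paper addresses inefficient exploration in personalized question recommendation: conventional Reinforcement Learning (RL) methods struggle to identify the best questions for each student within a limited number of training episodes. The key to the solution is the Ranking Alignment Recommendation (RAR) framework, which incorporates collaborative ideas into the exploration mechanism to make exploration more efficient and thereby improve recommendation performance under limited training. The framework can be applied to any RL-based question recommender.

Link: https://arxiv.org/abs/2507.23664
Authors: Haipeng Liu,Yuxuan Liu,Ting Long
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: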

Abstract:Personalized question recommendation aims to guide individual students through questions to enhance their mastery of learning targets. Most previous methods model this task as a Markov Decision Process and solve it with reinforcement learning, but they struggle with efficient exploration, failing to identify the best questions for each student during training. To address this, we propose Ranking Alignment Recommendation (RAR), which incorporates collaborative ideas into the exploration mechanism, enabling more efficient exploration within limited training episodes. Experiments show that RAR effectively improves recommendation performance, and our framework can be applied to any RL-based question recommender. Our code is available at this https URL.

[AI-6] OptiGradTrust: Byzantine-Robust Federated Learning with Multi-Feature Gradient Analysis and Reinforcement Learning-Based Trust Weighting

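【Quick Read】: This paper addresses two core challenges of Federated Learning (FL) in medical settings: Byzantine attacks and statistical heterogeneity. Against Byzantine attacks, the authors propose the OptiGradTrust defense framework, which builds an adaptive trust score from a six-dimensional gradient fingerprint (including VAE reconstruction error, cosine similarity metrics, L₂ norm, sign-consistency ratio, and Monte Carlo Shapley value) and adjusts weights dynamically with a hybrid reinforcement learning and attention module. To ease convergence under data heterogeneity, they further design FedBN-Prox (FedBN-P), which combines Federated Batch Normalization with proximal regularization to balance accuracy and convergence. Experiments on the MNIST, CIFAR-10, and Alzheimer's MRI datasets show improvements over state-of-the-art defenses, with up to +1.6 percentage points over FLGuard under non-IID conditions.

Link: https://arxiv.org/abs/2507.23638
Authors: Mohammad Karami,Fatemeh Ghassemi,Hamed Kebriaei,Hamid Azadegan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: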

Abstract:Federated Learning (FL) enables collaborative model training across distributed medical institutions while preserving patient privacy, but remains vulnerable to Byzantine attacks and statistical heterogeneity. We present OptiGradTrust, a comprehensive defense framework that evaluates gradient updates through a novel six-dimensional fingerprint including VAE reconstruction error, cosine similarity metrics, L₂ norm, sign-consistency ratio, and Monte Carlo Shapley value, which drive a hybrid RL-attention module for adaptive trust scoring. To address convergence challenges under data heterogeneity, we develop FedBN-Prox (FedBN-P), combining Federated Batch Normalization with proximal regularization for optimal accuracy-convergence trade-offs. Extensive evaluation across MNIST, CIFAR-10, and Alzheimer’s MRI datasets under various Byzantine attack scenarios demonstrates significant improvements over state-of-the-art defenses, achieving up to +1.6 percentage points over FLGuard under non-IID conditions while maintaining robust performance against diverse attack patterns through our adaptive learning approach.
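
Three of the fingerprint features named in the abstract (cosine similarity, L₂ norm, and sign-consistency ratio) are simple enough to sketch directly. The definitions below are plausible readings rather than the paper's exact formulas: in particular, comparing each client gradient against a single reference gradient, and the `gradient_features` helper itself, are assumptions made for illustration. The VAE and Shapley features and the trust-scoring module are not covered.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def gradient_features(client_grad, reference_grad):
    """Compute three simple per-client features against a reference gradient."""
    dot = sum(c * r for c, r in zip(client_grad, reference_grad))
    client_norm = math.sqrt(sum(c * c for c in client_grad))
    reference_norm = math.sqrt(sum(r * r for r in reference_grad))
    denom = client_norm * reference_norm
    matches = sum(sign(c) == sign(r) for c, r in zip(client_grad, reference_grad))
    return {
        "l2_norm": client_norm,
        "cosine": dot / denom if denom else 0.0,
        "sign_consistency": matches / len(client_grad),
    }


# Toy flattened gradients; the second coordinate of the client update is flipped.
reference = [1.0, 2.0, 2.0]
client = [1.0, -2.0, 2.0]
features = gradient_features(client, reference)
print(features)
```

An update whose norm matches the reference but whose direction or signs disagree, as in this toy case, is the kind of pattern a norm check alone would miss, which is one motivation for combining several features into a single trust score.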

[AI-7] MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying

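[Quick read]: This paper addresses the difficulty of obtaining complete memories, and the resulting limits on recall performance, that arise in conventional agent-assisted memory recall because the memory module has limited capacity. Its core solution is a strategy-guided agent-assisted memory recall method. The key is a framework called Recall Router, which builds a 5W Recall Map to classify memory queries and defines fifteen recall strategy patterns. It also introduces a hierarchical recall tree combined with Monte Carlo Tree Search to optimize strategy selection and response generation, improving the agent's ability to activate memories across different forgetting scenarios.

Link: https://arxiv.org/abs/2507.23633
Authors: Qian Zhao, Zhuo Sun, Bin Guo, Zhiwen Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: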

Abstract:Agent-assisted memory recall is one critical research problem in the field of human-computer interaction. In conventional methods, the agent can retrieve information from its equipped memory module to help the person recall incomplete or vague memories. The limited size of memory module hinders the acquisition of complete memories and impacts the memory recall performance in practice. Memory theories suggest that the person’s relevant memory can be proactively activated through some effective cues. Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, allowing the agent to transform an original query into a cue-rich one via the judiciously designed strategy to help the person recall memories. To this end, there are two key challenges. (1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics? (2) How to obtain the high-quality responses leveraging recall strategies, given only abstract and sparsely annotated strategy patterns? To address the challenges, we propose a Recall Router framework. Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios. We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses. We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration. Further human evaluation highlights its advantages in memory-recall applications.

[AI-8] L-GTA: Latent Generative Modeling for Time Series Augmentation

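[Quick read]: This paper addresses three problems in time series data augmentation: low quality of generated data, poor controllability, and difficulty preserving the intrinsic properties of the original data. The key to its solution is the Latent Generative Transformer Augmentation (L-GTA) model, a transformer-based variational recurrent autoencoder that applies controlled transformations in latent space to generate synthetic time series that preserve the intrinsic properties of the original data. By composing basic transformations such as jittering and magnitude warping in latent space, L-GTA can generate more complex and consistent augmented data, improving predictive accuracy and similarity measures over conventional methods that transform the raw time series directly.

Link: https://arxiv.org/abs/2507.23615
Authors: Luis Roque, Carlos Soares, Vitor Cerqueira, Luis Torgo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: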

Abstract:Data augmentation is gaining importance across various aspects of time series analysis, from forecasting to classification and anomaly detection tasks. We introduce the Latent Generative Transformer Augmentation (L-GTA) model, a generative approach using a transformer-based variational recurrent autoencoder. This model uses controlled transformations within the latent space of the model to generate new time series that preserve the intrinsic properties of the original dataset. L-GTA enables the application of diverse transformations, ranging from simple jittering to magnitude warping, and combining these basic transformations to generate more complex synthetic time series datasets. Our evaluation of several real-world datasets demonstrates the ability of L-GTA to produce more reliable, consistent, and controllable augmented data. This translates into significant improvements in predictive accuracy and similarity measures compared to direct transformation methods.
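
The two basic transformations named in the abstract, jittering and magnitude warping, can be sketched on a plain numeric vector. This is an illustrative sketch only: L-GTA applies such transformations to latent representations produced by its autoencoder, which is not modeled here, and the function names, default parameters, and the piecewise-linear warping curve (spline interpolation is a common alternative) are assumptions.

```python
import random


def jitter(series, sigma=0.05, rng=None):
    """Add independent Gaussian noise to every point."""
    rng = random.Random() if rng is None else rng
    return [x + rng.gauss(0.0, sigma) for x in series]


def magnitude_warp(series, sigma=0.2, knots=4, rng=None):
    """Scale the series by a smooth random curve centered on 1.

    Simplification: the curve is piecewise linear between `knots` random
    scale factors spread evenly over the series.
    """
    if knots < 2:
        raise ValueError("knots must be at least 2")
    rng = random.Random() if rng is None else rng
    n = len(series)
    factors = [rng.gauss(1.0, sigma) for _ in range(knots)]
    warped = []
    for i, x in enumerate(series):
        # Map index i to a position on the knot axis [0, knots - 1].
        pos = i * (knots - 1) / (n - 1) if n > 1 else 0.0
        left = min(int(pos), knots - 2)
        frac = pos - left
        scale = factors[left] * (1.0 - frac) + factors[left + 1] * frac
        warped.append(x * scale)
    return warped
```

Composing the two, for example `magnitude_warp(jitter(series))`, mirrors the abstract's point that basic transformations can be combined into more complex ones.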

[AI-9] Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study

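[Quick read]: This paper examines how effective current large language models (LLMs) are for robotic task planning, in particular whether they can reliably generate structured, executable plans. Although LLMs have strong generative capabilities, it remains unclear whether their output translates accurately into action sequences a robot can execute. The key to the approach is a systematic evaluation of a range of state-of-the-art LLMs, each prompted directly with Planning Domain Definition Language (PDDL) input across multiple benchmarks, with their planning performance compared against the classical planner Fast Downward. This quantifies LLM performance in complex scenarios and identifies limitations in resource management, state tracking, and constraint compliance. The results show that LLMs perform well on simple tasks but still face fundamental challenges in complex environments, so the authors suggest that future work explore hybrid approaches that combine LLMs with classical planning algorithms to improve the reliability and scalability of planning in autonomous robotic systems.

Link: https://arxiv.org/abs/2507.23589
Authors: Kai Goebel, Patrik Zips
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: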

Abstract:Recent advancements in Large Language Models have sparked interest in their potential for robotic task planning. While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain. This paper presents a systematic evaluation of a broad spectrum of current state of the art language models, each directly prompted using Planning Domain Definition Language domain and problem files, and compares their planning performance with the Fast Downward planner across a variety of benchmarks. In addition to measuring success rates, we assess how faithfully the generated plans translate into sequences of actions that can actually be executed, identifying both strengths and limitations of using these models in this setting. Our findings show that while the models perform well on simpler planning tasks, they continue to struggle with more complex scenarios that require precise resource management, consistent state tracking, and strict constraint compliance. These results underscore fundamental challenges in applying language models to robotic planning in real world environments. By outlining the gaps that emerge during execution, we aim to guide future research toward combined approaches that integrate language models with classical planners in order to enhance the reliability and scalability of planning in autonomous robotics.

[AI-10] Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI

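[Quick read]: This paper addresses the inefficiency of trust evaluation in distributed collaborative systems, which stems from task complexity, the spatiotemporal dynamics of device resources, and evaluation overhead, and which in turn lowers the utilization of constrained resources and harms collaborative task execution. The key to its solution is an autonomous trust orchestration method based on a semantic chain-of-trust, built on agentic AI and hypergraphs. Agentic AI autonomously perceives device states and evaluates trust from historical performance data during device idle periods, making efficient use of resources. It also performs task-specific trust evaluation by matching resource capabilities against task requirements, and uses a local hypergraph embedded with trust semantics to manage collaborators hierarchically and select them precisely, balancing evaluation overhead against trust accuracy. In addition, the local trust hypergraphs of multiple devices can be chained into multi-hop collaboration chains that support efficient coordination in large-scale systems.

Link: https://arxiv.org/abs/2507.23565
Authors: Botao Zhu, Xianbin Wang, Dusit Niyato
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: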

Abstract:In collaborative systems, the effective completion of tasks hinges on task-specific trust evaluations of potential devices for distributed collaboration. However, the complexity of tasks, the spatiotemporal dynamism of distributed device resources, and the inevitable assessment overhead dramatically increase the complexity and resource consumption of the trust evaluation process. As a result, ill-timed or overly frequent trust evaluations can reduce utilization rate of constrained resources, negatively affecting collaborative task execution. To address this challenge, this paper proposes an autonomous trust orchestration method based on a new concept of semantic chain-of-trust. Our technique employs agentic AI and hypergraph to establish and maintain trust relationships among devices. By leveraging its strengths in autonomous perception, task decomposition, and semantic reasoning, we propose agentic AI to perceive device states and autonomously perform trust evaluations of collaborators based on historical performance data only during device idle periods, thereby enabling efficient utilization of distributed resources. In addition, agentic AI performs task-specific trust evaluations on collaborator resources by analyzing the alignment between resource capabilities and task requirements. Moreover, by maintaining a trust hypergraph embedded with trust semantics for each device, agentic AI enables hierarchical management of collaborators and identifies collaborators requiring trust evaluation based on trust semantics, thereby achieving a balance between overhead and trust accuracy. Furthermore, local trust hypergraphs from multiple devices can be chained together to support multi-hop collaboration, enabling efficient coordination in large-scale systems. Experimental results demonstrate that the proposed method achieves resource-efficient trust evaluation.

[AI-11] DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

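[Quick read]: This paper addresses the unstable or degraded performance of large language model (LLM) agents that use in-context learning (ICL) when demonstrations are poorly chosen. Existing methods mostly rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for selecting demonstrations. The key to its solution is DICE (Dynamic In-Context Example Selection for LLM Agents), which takes a causal view to decompose demonstration knowledge into transferable and non-transferable components and builds on this a stepwise selection criterion with a theoretical guarantee of improved agent performance. DICE requires no additional training and can be integrated into existing agent frameworks as a plug-in module, showing good generality and effectiveness.

Link: https://arxiv.org/abs/2507.23554
Authors: Ruoyu Wang, Junda Wu, Yu Xia, Tong Yu, Ryan A. Rossi, Julian McAuley, Lina Yao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: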

Abstract:Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance. While prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. Therefore, it is non-trivial to develop a principled, general-purpose method for selecting demonstrations that consistently benefit agent performance. In this paper, we address this challenge with DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning. Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without any additional training cost. Extensive experiments across diverse domains demonstrate our method’s effectiveness and generality, highlighting the importance of principled, context-aware demo selection for robust and efficient LLM agents.

[AI-12] From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices

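[Quick read]: This paper addresses efficient fine-tuning of convolutional neural networks (CNNs) on resource-constrained edge devices, an area where parameter efficiency has been studied far less than for large language models (LLMs). Its key contribution is a systematic evaluation and comparison of mainstream parameter-efficient fine-tuning (PEFT) methods, namely LoRA, DoRA, and GaLore, on standard and depthwise-separable convolutional architectures, covering both model performance and computational cost. Using PyTorch profiling tools, it quantifies memory usage and floating-point operations (FLOPs) during updates, revealing how the suitability of different PEFT methods varies in edge deployment and providing empirical evidence for choosing model update strategies under hardware constraints.

Link: https://arxiv.org/abs/2507.23536
Authors: Georg Slamanig, Francesco Corti, Olga Saukh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: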

Abstract:Parameter-efficient fine-tuning (PEFT) methods reduce the computational costs of updating deep learning models by minimizing the number of additional parameters used to adapt a model to a downstream task. While extensively researched in large language models (LLMs), their application to smaller models used on edge devices, such as convolutional neural networks, remains underexplored. This paper benchmarks and analyzes popular PEFT methods on convolutional architectures typically deployed in resource-constrained edge environments. We evaluate LoRA, DoRA, and GaLore for updating standard and depthwise convolutional architectures to handle distribution shifts and accommodate unseen classes. We utilize recently proposed PyTorch profilers to compare the updated model performance and computational costs of these PEFT methods with traditional fine-tuning approaches. With resource efficiency in mind, we investigate their update behavior across different rank dimensions. We find that the evaluated PEFT methods are only half as memory-efficient when applied to depthwise-separable convolution architectures, compared to their efficiency with LLMs. Conversely, when targeting convolutional architectures optimized for edge deployment, adapter-based PEFT methods can reduce floating point operations (FLOPs) during model updates by up to 95%. These insights offer valuable guidance for selecting PEFT methods based on hardware constraints, performance requirements, and application needs. Our code is online.
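
To make "rank dimension" concrete, the sketch below counts trainable parameters for a LoRA-style adapter on a convolution layer. It is a back-of-the-envelope illustration, not the paper's measurement method: it assumes the convolution weight is flattened to a `d_out x d_in` matrix and adapted with two low-rank factors `A` (`rank x d_in`) and `B` (`d_out x rank`), and all function names are hypothetical. It says nothing about DoRA or GaLore, and it counts parameters only, not memory or FLOPs.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def conv_weight_as_matrix(c_in, c_out, kernel, groups=1):
    """Flatten a conv weight of shape (c_out, c_in / groups, k, k) to (d_in, d_out)."""
    d_in = (c_in // groups) * kernel * kernel
    return d_in, c_out


def report(c_in, c_out, kernel, rank, groups=1):
    """Return (full weight count, adapter parameter count, adapter / full ratio)."""
    d_in, d_out = conv_weight_as_matrix(c_in, c_out, kernel, groups)
    full = d_in * d_out
    lora = lora_trainable_params(d_in, d_out, rank)
    return full, lora, lora / full
```

Under these assumptions, `report(64, 128, 3, 4)` gives 73,728 full weights against 2,816 adapter parameters for a standard 3x3 convolution, while `report(64, 64, 3, 4, groups=64)` gives 576 against 292 for a depthwise 3x3 convolution, whose weight matrix is already small.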

[AI-13] Transparent AI: The Case for Interpretability and Explainability

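[Quick read]: This paper addresses the lack of transparency of artificial intelligence (AI) systems in high-stakes decision-making, which undermines their accountability and trustworthiness. The key to its solution is to treat interpretability as a core design principle rather than an after-the-fact add-on: by applying structured strategies in organizations at different levels of AI maturity, interpretability is integrated into the AI development process from the earliest stages, enabling responsible and trustworthy AI deployment.

Link: https://arxiv.org/abs/2507.23535
Authors: Dhanesh Ramachandram, Himanshu Joshi, Judy Zhu, Dhari Gandhi, Lucas Hartman, Ananya Raval
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: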

Abstract:As artificial intelligence systems increasingly inform high-stakes decisions across sectors, transparency has become foundational to responsible and trustworthy AI implementation. Leveraging our role as a leading institute in advancing AI research and enabling industry adoption, we present key insights and lessons learned from practical interpretability applications across diverse domains. This paper offers actionable strategies and implementation guidance tailored to organizations at varying stages of AI maturity, emphasizing the integration of interpretability as a core design principle rather than a retrospective add-on.

[AI-14] Digital literacy interventions can boost humans in discerning deepfakes

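[Quick read]: This paper addresses the potential threat that deepfakes produced by generative AI pose to public trust and election integrity, namely that people struggle to tell real images from deepfakes. The key to its solution is to design five digital literacy interventions, ranging from textual guidance on common indicators to explanations of how deepfakes are generated with AI, and to test their effectiveness experimentally. The results show that the interventions improve deepfake discernment by up to 13 percentage points while maintaining trust in real images, and that the approach is scalable and suitable for diverse populations.

Link: https://arxiv.org/abs/2507.23492
Authors: Dominique Geissler, Claire Robertson, Stefan Feuerriegel
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: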

Abstract:Deepfakes, i.e., images generated by artificial intelligence (AI), can erode trust in institutions and compromise election outcomes, as people often struggle to discern real images from deepfakes. Improving digital literacy can help address these challenges, yet scalable and effective approaches remain largely unexplored. Here, we compare the efficacy of five digital literacy interventions to boost people’s ability to discern deepfakes: (1) textual guidance on common indicators of deepfakes; (2) visual demonstrations of these indicators; (3) a gamified exercise for identifying deepfakes; (4) implicit learning through repeated exposure and feedback; and (5) explanations of how deepfakes are generated with the help of AI. We conducted an experiment with N=1,200 participants from the United States to test the immediate and long-term effectiveness of our interventions. Our results show that our interventions can boost deepfake discernment by up to 13 percentage points while maintaining trust in real images. Altogether, our approach is scalable, suitable for diverse populations, and highly effective for boosting deepfake detection while maintaining trust in truthful information.

[AI-15] Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery

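[Quick read]: This paper addresses the poor performance of large language models on causal discovery, in particular the tendency of conventional models to overfit and fall to near-random performance under data perturbations. The key to its solution is a modular in-context reasoning pipeline inspired by Tree-of-Thoughts and Chain-of-Thoughts, which exploits the native capabilities of advanced reasoning models (such as OpenAI's o-series and the DeepSeek-R models) and uses structured reasoning paths to improve causal discovery accuracy, yielding nearly three-fold improvements over conventional baselines. The study further shows that a model's own reasoning ability is not enough to maximize results: a carefully designed in-context framework is essential to unlocking its potential.

Link: https://arxiv.org/abs/2507.23488
Authors: Kacper Kadziolka, Saber Salehkaleybar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: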

Abstract:Causal inference remains a fundamental challenge for large language models. Recent advances in internal reasoning with large language models have sparked interest in whether state-of-the-art reasoning models can robustly perform causal discovery, a task where conventional models often suffer from severe overfitting and near-random performance under data perturbations. We study causal discovery on the Corr2Cause benchmark using the emergent OpenAI’s o-series and DeepSeek-R model families and find that these reasoning-first architectures achieve significantly greater native gains than prior approaches. To capitalize on these strengths, we introduce a modular in-context pipeline inspired by the Tree-of-Thoughts and Chain-of-Thoughts methodologies, yielding nearly three-fold improvements over conventional baselines. We further probe the pipeline’s impact by analyzing reasoning chain length, complexity, and conducting qualitative and quantitative comparisons between conventional and reasoning models. Our findings suggest that while advanced reasoning models represent a substantial leap forward, carefully structured in-context frameworks are essential to maximize their capabilities and offer a generalizable blueprint for causal discovery across diverse domains.

[AI-16] Automated Feedback on Student-Generated UML and ER Diagrams Using Large Language Models

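[Quick read]: This paper addresses the learning challenges that UML (Unified Modeling Language) and ER (entity-relationship) diagrams pose in computer science education, which stem from the need for abstract thinking, contextual understanding, and mastery of syntax and semantics, and especially the difficulty traditional teaching has in providing scalable, personalized feedback in large classes. The key to its solution is DUET (Diagrammatic UML & ER Tutor), a prototype tool based on large language models (LLMs). It converts a reference diagram and a student-submitted diagram into textual representations, then uses a multi-stage LLM pipeline to compare them and generate structured, reflective feedback, providing scalable learning support. The tool also gives instructors analytical insights that help refine teaching strategies.

Link: https://arxiv.org/abs/2507.23470
Authors: Sebastian Gürtl, Gloria Schimetta, David Kerschbaumer, Michael Liut, Alexander Steinmaurer
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Learnersourcing: Student-generated Content @ Scale Workshop at L@S 2025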

Abstract:UML and ER diagrams are foundational in computer science education but come with challenges for learners due to the need for abstract thinking, contextual understanding, and mastery of both syntax and semantics. These complexities are difficult to address through traditional teaching methods, which often struggle to provide scalable, personalized feedback, especially in large classes. We introduce DUET (Diagrammatic UML & ER Tutor), a prototype of an LLM-based tool, which converts a reference diagram and a student-submitted diagram into a textual representation and provides structured feedback based on the differences. It uses a multi-stage LLM pipeline to compare diagrams and generate reflective feedback. Furthermore, the tool enables analytical insights for educators, aiming to foster self-directed learning and inform instructional strategies. We evaluated DUET through semi-structured interviews with six participants, including two educators and four teaching assistants. They identified strengths such as accessibility, scalability, and learning support alongside limitations, including reliability and potential misuse. Participants also suggested potential improvements, such as bulk upload functionality and interactive clarification features. DUET presents a promising direction for integrating LLMs into modeling education and offers a foundation for future classroom integration and empirical evaluation.

[AI-17] KLAN: Kuaishou Landing-page Adaptive Navigator UAI

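[Quick read]: This paper addresses the optimization of page navigation (Stage I) on online platforms: when a user enters the app, how to proactively select the most suitable landing page based on personalized preferences so as to improve the short-term PDR metric as well as long-term user engagement and satisfaction. Existing research focuses mostly on in-page recommendation (Stage II) and neglects the navigation step that takes users to a target page. The authors therefore formally define the **Personalized Landing Page Modeling (PLPM)** task and design the KLAN (Kuaishou Landing-page Adaptive Navigator) framework as a solution. Its core innovation is a hierarchical decision system: (1) KLAN-ISP captures users' inter-day static page preferences; (2) KLAN-IIT captures intra-day dynamic interest transitions; (3) KLAN-AM adaptively fuses the two to make the navigation decision. The solution has been deployed at scale on the Kuaishou platform, improving DAU and user lifetime (LT).

Link: https://arxiv.org/abs/2507.23459
Authors: Fan Li, Chang Meng, Jiaqi Fu, Shuchang Liu, Jiashuo Zhang, Tianke Zhang, Xueliang Wang, Xiaoqiang Feng
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: We propose PLPM, a new task for selecting optimal landing pages upon user entry. Our solution, KLAN, models static and dynamic user interests and is successfully deployed on Kuaishou, improving DAU and user lifetime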

Abstract:Modern online platforms configure multiple pages to accommodate diverse user needs. This multi-page architecture inherently establishes a two-stage interaction paradigm between the user and the platform: (1) Stage I: page navigation, navigating users to a specific page and (2) Stage II: in-page interaction, where users engage with customized content within the specific page. While the majority of research has been focusing on the sequential recommendation task that improves users’ feedback in Stage II, there has been little investigation on how to achieve better page navigation in Stage I. To fill this gap, we formally define the task of Personalized Landing Page Modeling (PLPM) into the field of recommender systems: Given a user upon app entry, the goal of PLPM is to proactively select the most suitable landing page from a set of candidates (e.g., functional tabs, content channels, or aggregation pages) to optimize the short-term PDR metric and the long-term user engagement and satisfaction metrics, while adhering to industrial constraints. Additionally, we propose KLAN (Kuaishou Landing-page Adaptive Navigator), a hierarchical solution framework designed to provide personalized landing pages under the formulation of PLPM. KLAN comprises three key components: (1) KLAN-ISP captures inter-day static page preference; (2) KLAN-IIT captures intra-day dynamic interest transitions and (3) KLAN-AM adaptively integrates both components for optimal navigation decisions. Extensive online experiments conducted on the Kuaishou platform demonstrate the effectiveness of KLAN, obtaining +0.205% and +0.192% improvements in Daily Active Users (DAU) and user Lifetime (LT). Our KLAN is ultimately deployed on the online platform at full traffic, serving hundreds of millions of users. To promote further research in this important area, we will release our dataset and code upon paper acceptance.

[AI-18] Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation ACL2025

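[Quick read]: This paper addresses an important challenge in instruction synthesis for large language models (LLMs): how to efficiently generate instruction data with sufficient diversity and difficulty without extensive human annotation. Existing automated synthesis methods reduce the reliance on human effort but still struggle to ensure the diversity and difficulty of the synthesized instructions. The authors propose Self-Foveate, whose core innovation is a "Micro-Scatter-Macro" multi-level foveation mechanism that guides the LLM to dig deeply into the fine-grained information in unsupervised text, increasing the diversity and difficulty of synthesized instructions and thereby strengthening the model's generalization and problem-solving performance.

Link: https://arxiv.org/abs/2507.23440
Authors: Mingzhe Li, Xin Lu, Yanyan Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by Findings of ACL 2025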

Abstract:Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation. Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis. This approach introduces a “Micro-Scatter-Macro” multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method. We publicly release our data and codes: this https URL

[AI-19] Chatting with your ERP: A Recipe

[Quick read]: This paper addresses how to make a large language model (LLM) interact reliably with an industrial-grade enterprise resource planning (ERP) system, specifically by translating natural language queries accurately into executable SQL statements. The key to its solution is a novel dual-agent architecture with a reasoning stage and a critique stage. The staged cooperation between the two improves the accuracy and reliability of SQL generation, making the LLM more practical and robust in complex industrial settings.

Link: https://arxiv.org/abs/2507.23429
Authors: Jorge Ruiz Gómez, Lidia Andrés Susinos, Jorge Alamo Olivé, Sonia Rey Osorno, Manuel Luis Gonzalez Hernández
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 11 pages, includes 3 tables summarizing schema and model performance. Submitted on July 31, 2025. Targets integration of LLM agents with ERP systems using open-weight models and Ollama deployment

Abstract:This paper presents the design, implementation, and evaluation behind a Large Language Model (LLM) agent that chats with an industrial production-grade ERP system. The agent is capable of interpreting natural language queries and translating them into executable SQL statements, leveraging open-weight LLMs. A novel dual-agent architecture combining reasoning and critique stages was proposed to improve query generation reliability.
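
The propose-then-critique loop that the abstract describes can be sketched as follows. This is a rough illustration under heavy assumptions, not the paper's architecture: the reasoning agent is replaced by a caller-supplied function `propose`, the critique agent is replaced by a purely mechanical check (read-only statement, and SQLite can plan it), and SQLite stands in for the ERP database. The names `critique`, `answer`, and `propose` are hypothetical.

```python
import sqlite3


def critique(sql, conn):
    """Stand-in critic: accept only SELECT statements that SQLite can plan."""
    if not sql.lstrip().lower().startswith("select"):
        return False, "only SELECT statements are allowed"
    try:
        # Planning the query validates syntax, tables, and columns
        # without returning any rows of the actual result.
        conn.execute("EXPLAIN QUERY PLAN " + sql)
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return False, str(exc)
    return True, ""


def answer(question, propose, conn, max_rounds=3):
    """Ask `propose` for SQL, feeding critic feedback back until it passes."""
    feedback = ""
    for _ in range(max_rounds):
        sql = propose(question, feedback)
        ok, feedback = critique(sql, conn)
        if ok:
            return conn.execute(sql).fetchall()
    raise RuntimeError(f"no valid query after {max_rounds} rounds: {feedback}")
```

In a real system both stages would be LLM calls and the critic would also judge whether the query answers the question; the loop structure, in which rejected queries come back with feedback, is the part this sketch is meant to show.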

[AI-20] LLM4Rail: An LLM-Augmented Railway Service Consulting Platform

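[Quick read]: This paper addresses the lack of personalization in railway services, in particular the absence of intelligent responses for ticketing, food and drink recommendations, weather information, and chitchat. Its core solution is LLM4Rail, a railway service consulting platform built on large language models (LLMs). The key innovation is an iterative "Question-Thought-Action-Observation (QTAO)" prompting framework that combines verbal reasoning with task-oriented actions to retrieve external railway operation and service information and respond accurately. To support personalized onboard dining recommendations, the authors also construct the publicly accessible Chinese Railway Food and Drink (CRFD-25) dataset and introduce a feature similarity-based post-processing step that keeps recommended items aligned with the dataset.

Link: https://arxiv.org/abs/2507.23377
Authors: Zhuo Li, Xianghuai Deng, Chiwei Feng, Hanmeng Li, Shenjie Wang, Haichao Zhang, Teng Jia, Conlin Chen, Louis Linchun Wu, Jia Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: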

Abstract:Large language models (LLMs) have significantly reshaped different walks of business. To meet the increasing demands for individualized railway service, we develop LLM4Rail - a novel LLM-augmented railway service consulting platform. Empowered by LLM, LLM4Rail can provide custom modules for ticketing, railway food and drink recommendations, weather information, and chitchat. In LLM4Rail, we propose the iterative “Question-Thought-Action-Observation (QTAO)” prompting framework. It meticulously integrates verbal reasoning with task-oriented actions, that is, reasoning to guide action selection, to effectively retrieve external observations relevant to railway operation and service to generate accurate responses. To provide personalized onboard dining services, we first construct the Chinese Railway Food and Drink (CRFD-25) - a publicly accessible takeout dataset tailored for railway services. CRFD-25 covers a wide range of signature dishes categorized by cities, cuisines, age groups, and spiciness levels. We further introduce an LLM-based zero-shot conversational recommender for railway catering. To address the unconstrained nature of open recommendations, the feature similarity-based post-processing step is introduced to ensure all the recommended items are aligned with CRFD-25 dataset.
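
The post-processing idea in the last sentence, mapping a free-form recommendation onto an item that actually exists in the dataset, can be sketched with a set-based similarity. The abstract does not specify the similarity measure, so Jaccard similarity over feature tags is an assumption here, as are the function names and the placeholder catalog entries used in the test (they are not taken from CRFD-25).

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of feature tags."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def snap_to_catalog(candidate_features, catalog):
    """Return the name of the catalog item whose features best match.

    `catalog` maps item names to feature tags (hypothetical structure).
    Ties are resolved in favor of the item listed first.
    """
    if not catalog:
        raise ValueError("catalog must not be empty")
    return max(catalog, key=lambda name: jaccard(candidate_features, catalog[name]))
```

With tags such as city, cuisine, and spiciness level, a recommended dish that is not in the catalog would be replaced by its closest catalog neighbor, so every item shown to the user comes from the dataset.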

[AI-21] Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling

[Quick read]: This paper addresses the automatic resolution of software issues in software engineering, in particular two limitations of approaches based on large language models (LLMs): handling large ensemble spaces and the lack of repository-level understanding. The key to its solution is Trae Agent, an agent-based ensemble reasoning approach that uses modular agents for three core steps (generation, pruning, and selection) to explore and refine the space of candidate solutions at the repository level, improving the accuracy and robustness of issue resolution.

Link: https://arxiv.org/abs/2507.23370
Authors: Trae Research Team: Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, Xia Liu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Pengfei Gao and Zhao Tian contributed equally to this technical report

Abstract:Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years. With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution. However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness. In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution. Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection. We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques. Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%. We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at this https URL.

[AI-22] “I made this (sort of)”: Negotiating authorship confronting fraudulence and exploring new musical spaces with prompt-based AI music generation

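[Quick read]: This paper asks how human creators' sense of identity, creative boundaries, and artistic autonomy are being reshaped as generative AI music platforms mature. The key to its approach is to combine creative practice (two music albums centered on prompt-based AI music generation) with self-reflection mediated by a large language model (LLM) to explore the interaction between creator and AI. It not only questions "who is the author" but also suggests that the artist's musical identity is shifting in the face of machines that are in some ways more talented, and that this opens new musical spaces.

Link: https://arxiv.org/abs/2507.23365
Authors: Bob L. T. Sturm
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: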

Abstract:I reflect on my experience creating two music albums centered on state-of-the-art prompt-based AI music generation platforms. The first album explicitly poses the question: What happens when I collide my junk mail with these platforms? The second album is a direct response to the first, and toys with the inability of state-of-the-art prompt-based AI music generation platforms to generate music that is not "practiced", "polished", and "produced". I seed a large language model (LLM) with information about these albums and have it interview me, which results in the exploration of several deeper questions: To what extent am I the author? Where am I in the resulting music? How is my musical identity changing as I am faced with machines that are in some ways far more talented than I? What new musical spaces does my work open, for me or anyone/thing else? I conclude by reflecting on my reflections, as well as LLM-mediated self-reflection as method.

[AI-23] Quality Evaluation of COBOL to Java Code Transformation

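[Quick read]: This paper addresses the difficulty of evaluating COBOL-to-Java code translation based on large language models (LLMs), in particular model opacity and the complexity of assessing translation quality. The key to its solution is to combine analytic checkers with LLM-as-a-judge (LaaJ) techniques into a scalable, multi-faceted automated evaluation system that supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review, giving developers and project managers actionable insights that support the evolution of high-quality, modernized codebases.

Link: https://arxiv.org/abs/2507.23356
Authors: Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Avi Ziv
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to ASE 2025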

Abstract:We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM’s watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.

[AI-24] Multi-Waypoint Path Planning and Motion Control for Non-holonomic Mobile Robots in Agricultural Applications

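【Quick Read】: This paper addresses efficient navigation of autonomous mobile robots in unstructured agricultural environments, specifically weed control in meadows, where the robot must visit an unordered set of coordinates while minimizing path length and respecting curvature constraints that prevent soil damage and protect vegetation. The key to the solution is an integrated navigation framework that couples a global path planner based on the Dubins Traveling Salesman Problem (DTSP) with Nonlinear Model Predictive Control (NMPC): the DTSP produces a minimum-length, curvature-constrained global path that visits all targets, and the NMPC uses that path for local trajectory optimization and real-time control so that the robot reaches each waypoint accurately while satisfying its constraints. In simulation on real-world field data, the coupled planner shortens the path by about 16% in the provided scenario compared with decoupled methods and yields smoother paths.

Link: https://arxiv.org/abs/2507.23350
Authors: Mahmoud Ghorab, Matthias Lorenzen
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 6 pages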

Abstract:There is a growing demand for autonomous mobile robots capable of navigating unstructured agricultural environments. Tasks such as weed control in meadows require efficient path planning through an unordered set of coordinates while minimizing travel distance and adhering to curvature constraints to prevent soil damage and protect vegetation. This paper presents an integrated navigation framework combining a global path planner based on the Dubins Traveling Salesman Problem (DTSP) with a Nonlinear Model Predictive Control (NMPC) strategy for local path planning and control. The DTSP generates a minimum-length, curvature-constrained path that efficiently visits all targets, while the NMPC leverages this path to compute control signals to accurately reach each waypoint. The system’s performance was validated through comparative simulation analysis on real-world field datasets, demonstrating that the coupled DTSP-based planner produced smoother and shorter paths, with a reduction of about 16% in the provided scenario, compared to decoupled methods. Based thereon, the NMPC controller effectively steered the robot to the desired waypoints, while locally optimizing the trajectory and ensuring adherence to constraints. These findings demonstrate the potential of the proposed framework for efficient autonomous navigation in agricultural environments.

[AI-25] AI Must not be Fully Autonomous

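【Quick Read】: This paper addresses the risks posed by the growing autonomy of AI systems, especially the safety and ethical challenges of fully autonomous AI that could evolve into artificial superintelligence (ASI). It identifies three levels of autonomous AI and argues that fully autonomous AI (level 3), which can set its own objectives and operates without responsible human oversight, should not be developed because its risks may become uncontrollable. The key to the solution is the principle of responsible human oversight: keeping AI decisions and actions within human control as the core safeguard against misaligned values, loss of control, and other harms.

Link: https://arxiv.org/abs/2507.23330
Authors: Tosin Adewumi, Lama Alkhaled, Florent Imbert, Hui Han, Nudrat Habib, Karl Löwenmark
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure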

Abstract:Autonomous Artificial Intelligence (AI) has many benefits. It also has many risks. In this work, we identify the 3 levels of autonomous AI. We are of the position that AI must not be fully autonomous because of the many risks, especially as artificial superintelligence (ASI) is speculated to be just decades away. Fully autonomous AI, which can develop its own objectives, is at level 3 and without responsible human oversight. However, responsible human oversight is crucial for mitigating the risks. To argue for our position, we discuss theories of autonomy, AI and agents. Then, we offer 12 distinct arguments and 6 counterarguments with rebuttals to the counterarguments. We also present 15 pieces of recent evidence of AI misaligned values and other risks in the appendix.

[AI-26] Evaluating the Dynamics of Membership Privacy in Deep Learning

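【Quick Read】: This paper addresses the poorly understood dynamics of training-data privacy leakage in deep learning, in particular how to quantify the way per-sample risk from membership inference attacks (MIAs) changes over the course of training. The key to the solution is a dynamic analytical framework that tracks each sample's vulnerability on the false-positive-rate/true-positive-rate (FPR-TPR) plane and systematically measures how dataset complexity, model architecture, and optimizer choice affect the rate and severity with which samples become vulnerable. The authors find a strong correlation between a sample's intrinsic learning difficulty and its final privacy risk, indicating that the exposure of high-risk samples is largely determined early in training, which lays the groundwork for proactive, privacy-aware training strategies.

Link: https://arxiv.org/abs/2507.23291
Authors: Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: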

Abstract:Membership inference attacks (MIAs) pose a critical threat to the privacy of training data in deep learning. Despite significant progress in attack methodologies, our understanding of when and how models encode membership information during training remains limited. This paper presents a dynamic analytical framework for dissecting and quantifying privacy leakage dynamics at the individual sample level. By tracking per-sample vulnerabilities on an FPR-TPR plane throughout training, our framework systematically measures how factors such as dataset complexity, model architecture, and optimizer choice influence the rate and severity at which samples become vulnerable. Crucially, we discover a robust correlation between a sample’s intrinsic learning difficulty and its privacy risk, and find that the privacy risk of samples highly vulnerable in the final trained model is largely determined early during training. Our results thus provide a deeper understanding of how privacy risks dynamically emerge during training, laying the groundwork for proactive, privacy-aware model training strategies.
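
To make the FPR-TPR plane concrete, here is a minimal sketch of how one point on that plane could be computed at a single training checkpoint. It is a simplified, aggregate illustration and not the paper's per-sample procedure; the function name `tpr_at_fpr` and the toy attack scores are assumptions made for this example.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """Return (TPR, FPR) for the strictest threshold that flags at most
    target_fpr of the non-members as members (hypothetical helper)."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we may tolerate
    # A sample is flagged as a "member" when its score is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    true_pos = sum(s > threshold for s in member_scores)
    false_pos = sum(s > threshold for s in nonmember_scores)
    return true_pos / len(member_scores), false_pos / len(nonmember_scores)


# Toy attack scores (higher means "more likely a training member").
members = [0.35, 0.5, 0.05, 0.9]
nonmembers = [0.1, 0.2, 0.3, 0.4]
print(tpr_at_fpr(members, nonmembers, target_fpr=0.25))
```

Repeating this computation at successive checkpoints would give one trajectory on the plane, which is the kind of quantity the abstract describes tracking during training.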

[AI-27] How Far Are AI Scientists from Changing the World?

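【Quick Read】: This paper addresses the limitations of current LLM-based AI Scientist systems in achieving genuine scientific breakthroughs, asking how far AI scientists are from reshaping the research paradigm and changing the world. The key to its approach is a prospect-driven survey that reviews the achievements of existing AI Scientist systems, identifies the bottlenecks holding them back, and lays out the critical components needed for a scientific agent capable of ground-breaking discoveries, giving clearer direction for the goals of scientific AI.

Link: https://arxiv.org/abs/2507.23276
Authors: Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, Yue Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: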

Abstract:The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans, may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.

[AI-28] XABPs: Towards eXplainable Autonomous Business Processes

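【Quick Read】: This paper addresses the concerns that autonomous business processes (ABPs) raise in practice: reduced stakeholder trust, difficult debugging, hindered accountability, risk of bias, and regulatory compliance issues. The key to the solution is eXplainable ABPs (XABPs), which enable systems to articulate the rationale behind their decisions and actions and thereby increase transparency and controllability. The paper outlines a systematic approach to XABPs and identifies the BPM research challenges on the way to them.

Link: https://arxiv.org/abs/2507.23269
Authors: Peter Fettke, Fabiana Fournier, Lior Limonad, Andreas Metzger, Stefanie Rinderle-Ma, Barbara Weber
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: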

Abstract:Autonomous business processes (ABPs), i.e., self-executing workflows leveraging AI/ML, have the potential to improve operational efficiency, reduce errors, lower costs, improve response times, and free human workers for more strategic and creative work. However, ABPs may raise specific concerns including decreased stakeholder trust, difficulties in debugging, hindered accountability, risk of bias, and issues with regulatory compliance. We argue for eXplainable ABPs (XABPs) to address these concerns by enabling systems to articulate their rationale. The paper outlines a systematic approach to XABPs, characterizing their forms, structuring explainability, and identifying key BPM research challenges towards XABPs.

[AI-29] DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System

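【Quick Read】: This paper addresses the poor adaptability and performance bottlenecks of current multi-agent system (MAS) frameworks that rely on manually designed, static collaboration graph structures. The proposed DynaSwarm framework rests on two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism that optimizes the collaboration graph structure with better stability than prior RL methods; and (2) a dynamic graph selector that, through parameter-efficient LLM fine-tuning, adaptively chooses the best graph structure for each input sample, using sample-specific information to route queries through specialized agent networks. This design abandons rigid, one-size-fits-all architectures and improves the flexibility and performance of LLM-based MAS.

Link: https://arxiv.org/abs/2507.23261
Authors: Hui Yi Leong, Yuqing Wu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: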

Abstract:Current multi-agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance. To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM-based MAS through two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter-efficient LLM fine-tuning. DynaSwarm eliminates the need for rigid, one-fits-all graph architectures, instead leveraging sample-specific idiosyncrasies to dynamically route queries through specialized agent networks. (c) We propose to fine-tune the demonstration retriever to fully exploit the power of in-context learning (ICL). Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state-of-the-art single-agent and MAS baselines across multiple LLM backbones. Our findings highlight the importance of sample-aware structural flexibility in LLM MAS designs.

[AI-30] Efficient Machine Unlearning via Influence Approximation

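【Quick Read】: This paper addresses the computational cost of machine unlearning, that is, how to make a model "forget" specific training data efficiently without retraining. Existing influence-based methods avoid retraining but must compute the Hessian matrix and its inverse over all training samples and parameters, which is too expensive for large models or frequent deletion requests. The key contribution is a theoretical link between memorizing (incremental learning) and forgetting (unlearning), from which the authors derive an efficient algorithm, Influence Approximation Unlearning (IAU), that approaches unlearning from the incremental-learning perspective. IAU replaces the costly Hessian computations with the cheaper gradient optimization used in incremental learning, improving unlearning efficiency while preserving the removal guarantee and model utility; experiments show it outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2507.23257
Authors: Jiawei Liu, Chenwang Wu, Defu Lian, Enhong Chen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures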

Abstract:Due to growing privacy concerns, machine unlearning, which aims at enabling machine learning models to “forget” specific training data, has received increasing attention. Among existing methods, influence-based unlearning has emerged as a prominent approach due to its ability to estimate the impact of individual training samples on model parameters without retraining. However, this approach suffers from prohibitive computational overhead arising from the necessity to compute the Hessian matrix and its inverse across all training samples and parameters, rendering it impractical for large-scale models and scenarios involving frequent data deletion requests. This highlights the difficulty of forgetting. Inspired by cognitive science, which suggests that memorizing is easier than forgetting, this paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). This connection allows machine unlearning to be addressed from the perspective of incremental learning. Unlike the time-consuming Hessian computations in unlearning (forgetting), incremental learning (memorizing) typically relies on more efficient gradient optimization, which supports the aforementioned cognitive theory. Based on this connection, we introduce the Influence Approximation Unlearning (IAU) algorithm for efficient machine unlearning from the incremental perspective. Extensive empirical evaluations demonstrate that IAU achieves a superior balance among removal guarantee, unlearning efficiency, and comparable model utility, while outperforming state-of-the-art methods across diverse datasets and model architectures. Our code is available at this https URL.
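
For readers unfamiliar with why the Hessian appears at all, the following is the standard first-order influence-function estimate from the general literature, given here as background and not as the exact formulation used in the paper. The symbols ($\hat\theta$ for the trained parameters, $L$ for the per-sample loss, $n$ for the training-set size, $z$ for the sample to remove) are notation assumed for this sketch.

```latex
% Empirical-risk Hessian at the trained parameters
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat\theta)

% First-order estimate of the parameters after removing sample z
\hat\theta_{-z} \;\approx\; \hat\theta + \frac{1}{n}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta)
```

Forming or inverting $H_{\hat\theta}$ scales poorly with the number of parameters, which is the overhead the abstract describes and the reason a gradient-only route is attractive.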

[AI-31] An Information Bottleneck Asset Pricing Model

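【Quick Read】: This paper addresses the degraded performance of deep neural networks (DNNs) in financial asset pricing caused by over-fitting to noise in the data. The key to the solution is an information bottleneck (IB) mechanism that constrains mutual information during the nonlinear mapping: it progressively reduces the mutual information between the input data and the compressed representation (removing redundant noise) while increasing the mutual information between the compressed representation and the output prediction (retaining the signal that matters for pricing). Irrelevant information is thus forgotten without affecting the final pricing, so the model filters noise while keeping the nonlinear modeling capacity of deep networks.

Link: https://arxiv.org/abs/2507.23218
Authors: Che Sun
Affiliation: unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: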

Abstract:Deep neural networks (DNNs) have garnered significant attention in financial asset pricing, due to their strong capacity for modeling complex nonlinear relationships within financial data. However, sophisticated models are prone to over-fitting to the noise information in financial data, resulting in inferior performance. To address this issue, we propose an information bottleneck asset pricing model that compresses data with low signal-to-noise ratios to eliminate redundant information and retain the critical information for asset pricing. Our model imposes constraints of mutual information during the nonlinear mapping process. Specifically, we progressively reduce the mutual information between the input data and the compressed representation while increasing the mutual information between the compressed representation and the output prediction. The design ensures that irrelevant information, which is essentially the noise in the data, is forgotten during the modeling of financial nonlinear relationships without affecting the final asset pricing. By leveraging the constraints of the Information bottleneck, our model not only harnesses the nonlinear modeling capabilities of deep networks to capture the intricate relationships within financial data but also ensures that noise information is filtered out during the information compression process.
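
The two mutual-information constraints described above match the shape of the classical information bottleneck objective. The form below is the textbook version, offered as a sketch; the paper's actual loss may differ. Here $X$ is the input, $Z$ the compressed representation, $Y$ the prediction target, and $\beta > 0$ a trade-off weight, all of which are notation assumed for this example.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

Minimizing $I(X; Z)$ discards input detail (ideally the noise), while the $-\beta\, I(Z; Y)$ term rewards representations that stay predictive of the target.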

[AI-32] Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

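【Quick Read】: This paper addresses two challenges in understanding complex multimodal documents: structural inconsistencies that make parsing difficult, and scarce training data that limits model performance. The key to the solution is DocsRay, a training-free document understanding system built on three cooperating techniques: (1) a prompt-based semantic structuring module that generates a hierarchical pseudo Table of Contents (TOC) to organize the document; (2) zero-shot multimodal analysis that uses the native capabilities of multimodal Large Language Models (LLMs) to convert text, images, charts, and other heterogeneous elements into unified, text-centric representations; and (3) a two-stage hierarchical retrieval mechanism that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. The system reaches 64.7% accuracy on the MMLongBench-Doc benchmark and cuts query latency by 45%.

Link: https://arxiv.org/abs/2507.23217
Authors: Hyeon Seong Jeong, Sangwoo Jo, Byeong Hyun Yoon, Yoonseok Heo, Haedong Jeong, Taehoon Kim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: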

Abstract:Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models’ (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay’s framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
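
The complexity claim can be illustrated with a small sketch of two-stage retrieval: score the S section vectors first, then score only the chunks inside the k1 best sections. This is a toy under stated assumptions, not DocsRay's implementation; the function `two_stage_retrieve`, the dot-product similarity, and the two-dimensional vectors are all hypothetical.

```python
def dot(a, b):
    """Plain dot product, standing in for an embedding similarity."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(query, sections, k1=1, k2=2):
    # Stage 1: S similarity computations, one per section-level vector.
    top_sections = sorted(
        sections, key=lambda s: dot(query, s["vector"]), reverse=True
    )[:k1]
    # Stage 2: only chunks of the selected sections are scored (about k1 * N_s).
    candidates = [c for s in top_sections for c in s["chunks"]]
    best = sorted(candidates, key=lambda c: dot(query, c["vector"]), reverse=True)[:k2]
    return [c["text"] for c in best]


# Hypothetical pseudo-TOC with two sections of two chunks each.
sections = [
    {"title": "Methods", "vector": [1.0, 0.0], "chunks": [
        {"text": "m1", "vector": [0.9, 0.1]},
        {"text": "m2", "vector": [0.5, 0.5]}]},
    {"title": "Results", "vector": [0.0, 1.0], "chunks": [
        {"text": "r1", "vector": [0.1, 0.9]},
        {"text": "r2", "vector": [0.2, 0.8]}]},
]
print(two_stage_retrieve([0.0, 1.0], sections, k1=1, k2=1))
```

A flat search would score all N chunks; here the "Methods" chunks are never compared against the query, which is where the savings come from when S and k1 are small relative to N.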

[AI-33] Solution-aware vs global ReLU selection: partial MILP strikes back for DNN verification

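【Quick Read】: This paper addresses the efficiency of verifying complex instances of deep neural networks (such as CNNs), where the large number of variables and constraints creates a computational bottleneck, and where classical Branch-and-Bound (BaB) becomes inefficient because each call is expensive. The key to the solution is a divide-and-conquer strategy: instead of a few complex BaB calls, it relies on many small partial Mixed Integer Linear Programming (partial MILP) calls. The central technical contribution is a new solution-aware ReLU scoring (SAS) that selects the few ReLUs with the greatest impact on the verification outcome and models only those exactly with binary variables, reducing the number of binary variables by around 6 times compared with previous attempts at the same accuracy. The BaB-SR and BaB-FSB branching functions are also adapted as global ReLU scoring (GS) functions for comparison. In the Hybrid MILP implementation, α,β-CROWN is first run with a short time-out on the easier instances and partial MILP then handles the remaining hard ones, bringing undecided instances down to 8-15% with an average runtime of 46s-417s per instance, even for fairly large CNNs with 2 million parameters.

Link: https://arxiv.org/abs/2507.23197
Authors: Yuke Liao, Blaise Genest, Kuldeep Meel, Shaan Aryaman
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: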

Abstract:To handle complex instances, we revisit a divide-and-conquer approach to break down the complexity: instead of few complex BaB calls, we rely on many small partial MILP calls. The crucial step is to select very few but very important ReLUs to treat using (costly) binary variables. The previous attempts were suboptimal in that respect. To select these important ReLU variables, we propose a novel solution-aware ReLU scoring (SAS), as well as adapt the BaB-SR and BaB-FSB branching functions as global ReLU scoring (GS) functions. We compare them theoretically as well as experimentally, and SAS is more efficient at selecting a set of variables to open using binary variables. Compared with previous attempts, SAS reduces the number of binary variables by around 6 times, while maintaining the same level of accuracy. Implemented in Hybrid MILP, calling first α,β-CROWN with a short time-out to solve easier instances, and then partial MILP, produces a very accurate yet efficient verifier, reducing by up to 40% the number of undecided instances to low levels (8-15%), while keeping a reasonable runtime (46s-417s on average per instance), even for fairly large CNNs with 2 million parameters.
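
To show what "treating a ReLU with a binary variable" means, here is the standard big-M MILP encoding of $y = \max(0, x)$ from the verification literature, written as a sketch rather than as the paper's exact model. It assumes known pre-activation bounds $l \le x \le u$ with $l < 0 < u$, and an indicator $a$; these symbols are notation introduced for this example.

```latex
\begin{aligned}
& y \ge 0, \qquad y \ge x, \\
& y \le u\,a, \qquad y \le x - l\,(1 - a), \\
& a \in \{0, 1\} \ \text{for the selected ReLUs}, \qquad a \in [0, 1] \ \text{for all other ReLUs}.
\end{aligned}
```

With $a = 1$ the constraints force $y = x$, and with $a = 0$ they force $y = 0$, so the encoding is exact. Relaxing $a$ to the interval $[0, 1]$ keeps the problem linear but only approximates the ReLU, which is why choosing which few ReLUs receive binary variables matters for both accuracy and runtime.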

[AI-34] Tractable Responsibility Measures for Ontology-Mediated Query Answering KR2025

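【Quick Read】: This paper studies the complexity of computing Shapley-value-based responsibility measures, defined as weighted sums of minimal supports (WSMS), in the setting of ontology-mediated query answering. The central question is for which combinations of ontology language and query class these responsibility scores can be computed tractably, and which structural restrictions guarantee efficient computation. Building on results from the database setting, the authors show that WSMS computation has polynomial data complexity when the ontology-mediated queries are first-order-rewritable, but becomes #P-hard once the ontology language can encode reachability queries (via axioms like $\exists R. A \sqsubseteq A$). In combined complexity, intractability already arises for atomic queries if the ontology language supports conjunction, and for unions of well-behaved conjunctive queries even without an ontology. For common DL-Lite dialects, however, the authors identify classes of structurally restricted conjunctive queries (which disallow undesirable interactions between query atoms) that admit tractable WSMS computation.

Link: https://arxiv.org/abs/2507.23191
Authors: Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Long version of a paper to appear at KR 2025, which contains further proof details in the appendix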

Abstract:Recent work on quantitative approaches to explaining query answers employs responsibility measures to assign scores to facts in order to quantify their respective contributions to obtaining a given answer. In this paper, we study the complexity of computing such responsibility scores in the setting of ontology-mediated query answering, focusing on a very recently introduced family of Shapley-value-based responsibility measures defined in terms of weighted sums of minimal supports (WSMS). By exploiting results from the database setting, we can show that such measures enjoy polynomial data complexity for classes of ontology-mediated queries that are first-order-rewritable, whereas the problem becomes #P-hard when the ontology language can encode reachability queries (via axioms like $\exists R. A \sqsubseteq A$). To better understand the tractability frontier, we next explore the combined complexity of WSMS computation. We prove that intractability applies already to atomic queries if the ontology language supports conjunction, as well as to unions of ‘well-behaved’ conjunctive queries, even in the absence of an ontology. By contrast, our study yields positive results for common DL-Lite dialects: by means of careful analysis, we identify classes of structurally restricted conjunctive queries (which intuitively disallow undesirable interactions between query atoms) that admit tractable WSMS computation.

[AI-35] AutoBridge: Automating Smart Device Integration with Centralized Platform

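【Quick Read】: This paper addresses the complexity and manual effort of writing device integration code in multimodal IoT systems: bringing a new IoT device under a centralized platform requires substantial expert programming so that the platform can understand and control the device's functions. The key to the solution is AutoBridge, an automation framework that follows a divide-and-conquer strategy: it first generates device control logic by progressively retrieving device-specific knowledge, then synthesizes platform-compliant integration code using platform-specific knowledge. To ensure the generated code is correct and complete, it adds a multi-stage debugging pipeline consisting of an automated debugger for virtual devices and an interactive hardware-in-the-loop debugger that needs only binary (yes/no) user feedback.

Link: https://arxiv.org/abs/2507.23178
Authors: Siyuan Liu, Zhice Yang, Huangxun Chen
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 12 figures, under review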

Abstract:Multimodal IoT systems coordinate diverse IoT devices to deliver human-centered services. The ability to incorporate new IoT devices under the management of a centralized platform is an essential requirement. However, it requires significant human expertise and effort to program the complex IoT integration code that enables the platform to understand and control the device functions. Therefore, we propose AutoBridge to automate IoT integration code generation. Specifically, AutoBridge adopts a divide-and-conquer strategy: it first generates device control logic by progressively retrieving device-specific knowledge, then synthesizes platform-compliant integration code using platform-specific knowledge. To ensure correctness, AutoBridge features a multi-stage debugging pipeline, including an automated debugger for virtual IoT device testing and an interactive hardware-in-the-loop debugger that requires only binary user feedback (yes and no) for real-device verification. We evaluate AutoBridge on a benchmark of 34 IoT devices across two open-source IoT platforms. The results demonstrate that AutoBridge achieves an average success rate of 93.87% and an average function coverage of 94.87%, without any human involvement. With minimal binary yes and no feedback from users, the code is then revised to reach 100% function coverage. A user study with 15 participants further shows that AutoBridge outperforms expert programmers by 50% to 80% in code accuracy, even when the programmers are allowed to use commercial code LLMs.

[AI-36] Argumentatively Coherent Judgmental Forecasting ECAI2025

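【Quick Read】: This paper addresses reduced accuracy in judgmental forecasting caused by opinions that are not logically consistent with the forecasts they support, particularly when the opinions form an argumentative structure around a forecast. The key to the solution is a formally defined property of argumentative coherence, which requires that a forecaster's reasoning be coherent with their forecast. Across three evaluations, the authors show that filtering out incoherent predictions consistently improves accuracy for both human and Large Language Model (LLM)-based forecasters. Crowd-sourced user experiments further show that users generally do not adhere to this coherence property on their own, which points to the need for mechanisms that filter out incoherent opinions before group forecasts are aggregated.

Link: https://arxiv.org/abs/2507.23163
Authors: Deniz Gorur, Antonio Rago, Francesca Toni
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 18 figures, ECAI 2025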

Abstract:Judgmental forecasting employs human opinions to make predictions about future events, rather than exclusively historical data as in quantitative forecasting. When these opinions form an argumentative structure around forecasts, it is useful to study the properties of the forecasts from an argumentative perspective. In this paper, we advocate and formally define a property of argumentative coherence, which, in essence, requires that a forecaster’s reasoning is coherent with their forecast. We then conduct three evaluations with our notion of coherence. First, we assess the impact of enforcing coherence on human forecasters as well as on Large Language Model (LLM)-based forecasters, given that they have recently been shown to be competitive with human forecasters. In both cases, we show that filtering out incoherent predictions improves forecasting accuracy consistently, supporting the practical value of coherence in both human and LLM-based forecasting. Then, via crowd-sourced user experiments, we show that, despite its apparent intuitiveness and usefulness, users do not generally align with this coherence property. This points to the need to integrate, within argumentation-based judgmental forecasting, mechanisms to filter out incoherent opinions before obtaining group forecasting predictions.

[AI-37] FLOSS: Federated Learning with Opt-Out and Straggler Support

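【Quick Read】: This paper addresses the missing data that arises in federated learning when user opt-out from data sharing combines with stragglers caused by heterogeneous device capabilities; such missing data introduces bias and degrades model performance. The solution is FLOSS, a system that mitigates the impact of this missing data on training so that federated learning remains effective in the presence of both opt-out and stragglers. Its performance is demonstrated empirically in simulations.

Link: https://arxiv.org/abs/2507.23115
Authors: David J Goetze, Dahlia J Felten, Jeannie R Albrecht, Rohit Bhattacharya
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages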

Abstract:Previous work on data privacy in federated learning systems focuses on privacy-preserving operations for data from users who have agreed to share their data for training. However, modern data privacy agreements also empower users to use the system while opting out of sharing their data as desired. When combined with stragglers that arise from heterogeneous device capabilities, the result is missing data from a variety of sources that introduces bias and degrades model performance. In this paper, we present FLOSS, a system that mitigates the impacts of such missing data on federated learning in the presence of stragglers and user opt-out, and empirically demonstrate its performance in simulations.

[AI-38] On the Sustainability of AI Inferences in the Edge

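【Quick Read】: This paper addresses the lack of systematic performance and energy evaluation for AI model deployment at the edge, focusing on resource-constrained devices such as the Raspberry Pi and Intel Neural Compute Stick running inference with traditional models, neural networks, and large language models, and on how to trade off F1 score, inference time, power consumption, and memory usage. The key to the solution is a rigorous experimental characterization of the performance-resource trade-offs of different models on different devices, combined with hardware optimization, framework optimization, and external parameter tuning to balance model performance against resource usage, providing a basis for practical edge AI deployment decisions.

Link: https://arxiv.org/abs/2507.23093
Authors: Ghazal Sobhani, Md. Monzurul Amin Ifath, Tushar Sharma, Israat Haque
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 14 pages, 8 figures, 6 tables, in preparation for journal submission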

Abstract:The proliferation of the Internet of Things (IoT) and its cutting-edge AI-enabled applications (e.g., autonomous vehicles and smart industries) combine two paradigms: data-driven systems and their deployment on the edge. Usually, edge devices perform inferences to support latency-critical applications. In addition to the performance of these resource-constrained edge devices, their energy usage is a critical factor in adopting and deploying edge applications. Examples of such devices include Raspberry Pi (RPi), Intel Neural Compute Stick (INCS), NVIDIA Jetson nano (NJn), and Google Coral USB (GCU). Despite their adoption in edge deployment for AI inferences, there is no study on their performance and energy usage for informed decision-making on the device and model selection to meet the demands of applications. This study fills the gap by rigorously characterizing the performance of traditional models, neural networks, and large language models on the above edge devices. Specifically, we analyze trade-offs among model F1 score, inference time, inference power, and memory usage. Hardware and framework optimization, along with external parameter tuning of AI models, can balance between model performance and resource usage to realize practical edge AI deployments.

[AI-39] Moravec’s Paradox: Towards an Auditory Turing Test

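【Quick Read】: This paper addresses the severe shortfall of current AI systems in complex auditory scenes, where they lag far behind humans on listening tasks that people perform effortlessly in natural environments. The underlying problem is that existing models handle multiple sound sources, noise, and spatial information poorly, producing "focusing failures" in selective attention, noise robustness, and contextual adaptation. The key to the solution is an auditory Turing test benchmark of 917 challenges across seven categories of everyday listening, which quantifies the human-machine auditory gap and exposes fundamental deficiencies in how current architectures approximate human auditory scene analysis. The framework provides a standardized tool for measuring machine listening and indicates that future multimodal AI systems will need to integrate selective attention, physics-based audio understanding, and context-aware perception to approach human-level machine listening.

Link: https://arxiv.org/abs/2507.23091
Authors: David Noever, Forrest McKee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: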

Abstract:This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec’s paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4’s audio capabilities and OpenAI’s Whisper reveals a striking failure rate exceeding 93%, with even the best-performing model achieving only 6.9% accuracy on tasks that humans solved at 7.5 times higher success (52%). These results expose focusing failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insights into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs highlights common filters that humans evolved but machines fail to select in multimodal language models. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches integrating selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems.

[AI-40] Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance

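【Quick Read】: This paper addresses the rigidity of current AI-driven surgical assistance systems, which depend on task-specific pre-training, fixed object categories, and explicit manual prompting, and which therefore limit natural human-machine interaction in dynamic surgical environments. The key to the solution is a novel Perception Agent that combines speech-integrated, prompt-engineered large language models (LLMs), the segment anything model (SAM), and any-point tracking foundation models. With a memory repository and two novel mechanisms for segmenting unseen elements, the agent can segment both known and unseen elements of the surgical scene through intuitive interaction and can memorize newly identified elements for use in future surgeries, improving the flexibility and adaptability of human-machine collaboration.

Link: https://arxiv.org/abs/2507.23088
Authors: Lalithkumar Seenivasan, Jiru Xu, Roger D. Soberanis Mukul, Hao Ding, Grayson Byrd, Yu-Chun Ku, Jose L. Porras, Masaru Ishii, Mathias Unberath
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: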

Abstract:Emerging surgical data science and robotics solutions, especially those designed to provide assistance in situ, require natural human-machine interfaces to fully unlock their potential in providing adaptive and intuitive aid. Contemporary AI-driven solutions remain inherently rigid, offering limited flexibility and restricting natural human-machine interaction in dynamic surgical environments. These solutions rely heavily on extensive task-specific pre-training, fixed object categories, and explicit manual-prompting. This work introduces a novel Perception Agent that leverages speech-integrated prompt-engineered large language models (LLMs), segment anything model (SAM), and any-point tracking foundation models to enable a more natural human-machine interaction in real-time intraoperative surgical assistance. Incorporating a memory repository and two novel mechanisms for segmenting unseen elements, Perception Agent offers the flexibility to segment both known and unseen elements in the surgical scene through intuitive interaction. Incorporating the ability to memorize novel elements for use in future surgeries, this work takes a marked step towards human-machine symbiosis in surgical procedures. Through quantitative analysis on a public dataset, we show that the performance of our agent is on par with considerably more labor-intensive manual-prompting strategies. Qualitatively, we show the flexibility of our agent in segmenting novel elements (instruments, phantom grafts, and gauze) in a custom-curated dataset. By offering natural human-machine interaction and overcoming rigidity, our Perception Agent potentially brings AI-based real-time assistance in dynamic surgical environments closer to reality.

[AI-41] On LLM-Assisted Generation of Smart Contracts from Business Processes

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在智能合约代码生成中可靠性不足的问题,尤其是针对传统基于规则的代码生成方法难以适应复杂业务流程描述的局限性。其解决方案的关键在于提出一个自动化评估框架,用于系统性地测试不同规模和类型的大型语言模型(Large Language Models, LLMs)在生成智能合约时对流程执行关键属性(如流程控制流、资源分配和数据条件约束)的实现能力,并通过大规模过程模型数据集提供实证结果,从而为未来负责任地集成 LLM 于现有代码生成工具提供可衡量的基准与改进方向。

链接: https://arxiv.org/abs/2507.23087
作者: Fabian Stiehle,Hans Weytjens,Ingo Weber
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the Workshop on Distributed Ledger Technologies in Business Process Management, At the International Conference for Business Process Management (BPM), 2025

点击查看摘要

Abstract:Large language models (LLMs) have changed the reality of how software is produced. Within the wider software engineering community, among many other purposes, they are explored for code generation use cases from different types of input. In this work, we present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions, an idea that has emerged in recent literature to overcome the limitations of traditional rule-based code generation approaches. However, current LLM-based work evaluates generated code on small samples, relying on manual inspection, or testing whether code compiles but ignoring correct execution. With this work, we introduce an automated evaluation framework and provide empirical data from larger data sets of process models. We test LLMs of different types and sizes in their capabilities of achieving important properties of process execution, including enforcing process flow, resource allocation, and data-based conditions. Our results show that LLM performance falls short of the perfect reliability required for smart contract development. We suggest future work to explore responsible LLM integrations in existing tools for code generation to ensure more reliable output. Our benchmarking framework can serve as a foundation for developing and evaluating such integrations.

[AI-42] AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor Towards Scaling Workloads

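Quick read: This paper addresses the efficiency of index selection under large-scale analytical workloads, in particular the bottleneck that deep reinforcement learning (DRL) index advisors hit on scaling workloads because of exponentially growing action spaces and heavy trial and error. The key to the solution is the AutoIndexer framework, which combines workload compression to reduce search complexity, query optimization to improve decision quality, and specialized reinforcement learning models that adapt to workloads of different scales. The method preserves index quality while reducing tuning time: experiments show it cuts query execution time by up to 95% versus non-indexed baselines, saves about 20% more workload cost on average than state-of-the-art DRL methods, and reduces tuning time by more than 50%.

Link: https://arxiv.org/abs/2507.23084
Authors: Taiyi Wang, Eiko Yoneki
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 14 pages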

Abstract:Efficiently selecting indexes is fundamental to database performance optimization, particularly for systems handling large-scale analytical workloads. While deep reinforcement learning (DRL) has shown promise in automating index selection through its ability to learn from experience, few works address how these RL-based index advisors can adapt to scaling workloads due to exponentially growing action spaces and heavy trial and error. To address these challenges, we introduce AutoIndexer, a framework that combines workload compression, query optimization, and specialized RL models to scale index selection effectively. By operating on compressed workloads, AutoIndexer substantially lowers search complexity without sacrificing much index quality. Extensive evaluations show that it reduces end-to-end query execution time by up to 95% versus non-indexed baselines. On average, it outperforms state-of-the-art RL-based index advisors by approximately 20% in workload cost savings while cutting tuning time by over 50%. These results affirm AutoIndexer’s practicality for large and diverse workloads.

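[AI-43] FairReason: Balancing Reasoning and Social Bias in MLLMs

Quick read: This paper examines whether improving the reasoning ability of multimodal large language models (MLLMs) comes with aggravated social bias, that is, whether reasoning gains and bias mitigation inherently trade off. The key to the solution is a systematic comparison of three bias-mitigation strategies, supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL), under identical conditions, quantifying the performance of each. The authors then vary the proportion of debias-focused and reasoning-centric samples within each method and find a stable sweet spot: with reinforcement learning, a roughly 1:4 ratio of bias-focused to reasoning-centric samples cuts stereotype scores by 10% while retaining 88% of the original reasoning accuracy, offering actionable guidance for balancing fairness and capability.

Link: https://arxiv.org/abs/2507.23067
Authors: Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: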

Abstract:Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models’ outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation, and whether the two objectives inherently trade off, therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies, namely supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL), under identical conditions, establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model’s original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
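
As a rough illustration of the data-mixing sweep described in the abstract, the sketch below builds a training set with a fixed ratio of debias-focused to reasoning-centric samples. The helper name `mix_samples`, the toy sample pools, and the reading of "1:4" as one debias-focused sample for every four reasoning-centric samples are assumptions made for illustration; the abstract does not describe the authors' actual pipeline.

```python
import random

def mix_samples(debias, reasoning, ratio=(1, 4), total=100, seed=0):
    """Draw `total` samples with debias:reasoning counts in the given ratio.

    Hypothetical helper: the name and the sampling scheme are illustrative
    assumptions, not the authors' implementation.
    """
    d_part, r_part = ratio
    n_debias = round(total * d_part / (d_part + r_part))
    n_reason = total - n_debias
    if n_debias > len(debias) or n_reason > len(reasoning):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(debias, n_debias) + rng.sample(reasoning, n_reason)
    rng.shuffle(mixed)
    return mixed

# Toy pools tagged by type so the resulting proportions are easy to inspect.
debias_pool = [("debias", i) for i in range(50)]
reasoning_pool = [("reasoning", i) for i in range(200)]
batch = mix_samples(debias_pool, reasoning_pool, ratio=(1, 4), total=100)
n_debias = sum(1 for kind, _ in batch if kind == "debias")
print(n_debias, len(batch) - n_debias)
```

Sweeping `ratio` over several values and retraining at each point would trace out a reasoning-versus-bias curve of the kind the abstract reports.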

[AI-44] Data Readiness for Scientific AI at Scale

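Quick read: This paper addresses the insufficient readiness of scientific data for training foundation models, in particular how leadership-scale scientific datasets in high-performance computing (HPC) environments can be transformed from raw data to an AI-ready state in a standardized way. The key to the solution is a two-dimensional Data Readiness for AI (DRAI) framework covering Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard). With an emphasis on transformer-based generative models, the two dimensions form a conceptual maturity matrix that identifies common cross-domain preprocessing patterns and domain-specific constraints, guiding infrastructure development toward scalable, reproducible AI for science.

Link: https://arxiv.org/abs/2507.23018
Authors: Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure, 2 tables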

Abstract:This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

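[AI-45] Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Quick read: This paper targets the prevailing practice of applying human psychological and educational assessment instruments (such as intelligence and personality tests) directly to large language models (LLMs) to assess their “human-like traits”. It argues that this practice is an ontological error, because such tests are measurement instruments designed and calibrated for specific human populations and lack empirical validation for non-human subjects. The key to the solution is to stop using human tests to measure AI performance and instead build principled, AI-specific evaluation frameworks on a rigorous theoretical and empirical footing. Such frameworks may build on test construction and validation methods from psychometrics, or be designed from scratch to fit the unique properties and use contexts of AI.

Link: https://arxiv.org/abs/2507.23009
Authors: Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: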

Abstract:Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as “intelligence”, despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

[AI-46] Unifying Post-hoc Explanations of Knowledge Graph Completions

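Quick read: This paper addresses the lack of formal definitions and consistent evaluation standards for post-hoc explainability in Knowledge Graph Completion (KGC), which hinders reproducibility and cross-study comparison. The key to the solution is a unified multi-objective optimization framework that characterizes post-hoc explanations by balancing their effectiveness and conciseness, and thereby unifies existing KGC explanation algorithms and the explanations they produce. The paper also suggests and empirically supports improved evaluation protocols using metrics such as Mean Reciprocal Rank and Hits@k, and stresses that interpretability should mean the ability of explanations to address queries meaningful to end users, moving KGC explainability research toward greater reproducibility and impact.

Link: https://arxiv.org/abs/2507.22951
Authors: Alessandro Lonardi, Samy Badreddine, Tarek R. Besold, Pablo Sanchez Martin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: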

Abstract:Post-hoc explainability for Knowledge Graph Completion (KGC) lacks formalization and consistent evaluations, hindering reproducibility and cross-study comparisons. This paper argues for a unified approach to post-hoc explainability in KGC. First, we propose a general framework to characterize post-hoc explanations via multi-objective optimization, balancing their effectiveness and conciseness. This unifies existing post-hoc explainability algorithms in KGC and the explanations they produce. Next, we suggest and empirically support improved evaluation protocols using popular metrics like Mean Reciprocal Rank and Hits@k. Finally, we stress the importance of interpretability as the ability of explanations to address queries meaningful to end-users. By unifying methods and refining evaluation standards, this work aims to make research in KGC explainability more reproducible and impactful.
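
For readers unfamiliar with the two metrics named in the abstract, the sketch below computes Mean Reciprocal Rank and Hits@k from a list of ranks using their standard definitions. The rank values are made up for illustration, and how the paper applies these metrics to explanations is not shown here.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Rank of the correct entity for four hypothetical test triples.
example_ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(example_ranks)
h3 = hits_at_k(example_ranks, 3)
print(round(mrr, 4), h3)
```

Both metrics depend only on the position of the correct answer in the ranked list, so evaluating an explanation amounts to comparing the ranks a model produces under different conditions.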

[AI-47] SmartCourse: A Contextual AI-Powered Course Advising System for Undergraduates

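Quick read: This paper addresses the limitations of traditional academic advising tools in personalized recommendation, in particular their failure to make effective use of a student's transcript and degree plan in course planning and advising for undergraduate Computer Science (CS) majors. The key to the solution is SmartCourse, an integrated course management and generative-AI-driven advising system whose core idea is to combine a locally hosted large language model (LLM) with the student's own transcript and major plan to give context-aware, personalized course recommendations. Experiments show that using the full context substantially improves recommendation relevance, confirming the value of transcript-aware AI for academic planning.

Link: https://arxiv.org/abs/2507.22946
Authors: Yixuan Mi, Yiduo Yu, Yiyi Zhao
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures, 1 table. *Corresponding author: Yixuan Mi. Code: this https URL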

Abstract:We present SmartCourse, an integrated course management and AI-driven advising system for undergraduate students (specifically tailored to the Computer Science (CPS) major). SmartCourse addresses the limitations of traditional advising tools by integrating transcript and plan information for student-specific context. The system combines a command-line interface (CLI) and a Gradio web GUI for instructors and students, manages user accounts, course enrollment, grading, and four-year degree plans, and integrates a locally hosted large language model (via Ollama) for personalized course recommendations. It leverages transcript and major plan to offer contextual advice (e.g., prioritizing requirements or retakes). We evaluated the system on 25 representative advising queries and introduced custom metrics: PlanScore, PersonalScore, Lift, and Recall to assess recommendation quality across different context conditions. Experiments show that using full context yields substantially more relevant recommendations than context-omitted modes, confirming the necessity of transcript and plan information for personalized academic advising. SmartCourse thus demonstrates how transcript-aware AI can enhance academic planning.

[AI-48] From Propagator to Oscillator: The Dual Role of Symmetric Differential Equations in Neural Systems

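Quick read: This paper addresses the functional limitation of neuron models that cannot both propagate signals and generate oscillations. The key to the solution is a novel neuron model based on symmetric differential equations; a systematic analysis of its parameter space reveals a functional duality. On the one hand, the model exhibits asymptotically stable trajectories and acts as a reliable signal propagator; on the other, it exhibits Lyapunov stable trajectories with sustained self-excited oscillations and acts as a signal generator. This lets the model mimic the dual role of biological neurons in information transmission and rhythm generation, providing a theoretical basis and a functional roadmap for brain-inspired computing and neuromorphic engineering.

Link: https://arxiv.org/abs/2507.22916
Authors: Kun Jiang
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures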

Abstract:In our previous work, we proposed a novel neuron model based on symmetric differential equations and demonstrated its potential as an efficient signal propagator. Building upon that foundation, the present study delves deeper into the intrinsic dynamics and functional diversity of this model. By systematically exploring the parameter space and employing a range of mathematical analysis tools, we theoretically reveal the system's core property of functional duality. Specifically, the model exhibits two distinct trajectory behaviors: one is asymptotically stable, corresponding to a reliable signal propagator; the other is Lyapunov stable, characterized by sustained self-excited oscillations, functioning as a signal generator. To enable effective monitoring and prediction of system states during simulations, we introduce a novel intermediate-state metric termed on-road energy. Simulation results confirm that transitions between the two functional modes can be induced through parameter adjustments or modifications to the connection structure. Moreover, we show that oscillations can be effectively suppressed by introducing external signals. These findings draw a compelling parallel to the dual roles of biological neurons in both information transmission and rhythm generation, thereby establishing a solid theoretical basis and a clear functional roadmap for the broader application of this model in neuromorphic engineering.

[AI-49] SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches NEURIPS2025

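Quick read: This paper addresses the challenges of AI-based assessment of student-drawn scientific sketches: existing methods mostly treat the task as image classification or rely on a single monolithic vision-language model, so the assessment lacks interpretability, pedagogical alignment, and adaptability across cognitive levels. The key to the solution is SketchMind, a cognitively grounded multi-agent framework whose modular agents handle rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, providing personalized and transparent evaluation and revision. Experiments show that adding SRG raises GPT-4o's average accuracy to 77.1% (+21.4%), that multi-agent orchestration outperforms single-agent pipelines, and that human evaluators rated the generated feedback and co-created sketches 4.1 out of 5, well above baseline models (for example, 2.3 out of 5 for GPT-4o).

Link: https://arxiv.org/abs/2507.22904
Authors: Ehsan Latif, Zirak Khan, Xiaoming Zhai
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Submitted to NeurIPS2025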

Abstract:Scientific sketches (e.g., models) offer a powerful lens into students’ conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as an image classification task or rely on monolithic vision-language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items with different highest order of Bloom’s level that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), GPT-4o with SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance; for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system’s potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.

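[AI-50] Tool or Trouble? Exploring Student Attitudes Toward AI Coding Assistants

Quick read: This paper asks how AI code assistants affect novice programmers' experience and skill development while learning to program, in particular the differences in knowledge transfer and conceptual understanding with and without AI support. The key to the approach is a two-part study design (in the first part students may use AI to complete a programming task; in the second they must extend their solution without AI), combined with quantitative and qualitative data collection, to identify students' perceptions of AI tools, the challenges of using them, and problems of cognitive dependence. The findings inform teaching practice and stress the need for pedagogical strategies that combine AI assistance with reinforcement of foundational programming skills, avoiding overreliance and promoting deeper understanding.

Link: https://arxiv.org/abs/2507.22900
Authors: Sergio Rojas-Galeano
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: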

Abstract:This exploratory study examines how AI code assistants shape novice programmers’ experiences during a two-part exam in an introductory programming course. In the first part, students completed a programming task with access to AI support; in the second, they extended their solutions without AI. We collected Likert-scale and open-ended responses from 20 students to evaluate their perceptions and challenges. Findings suggest that AI tools were perceived as helpful for understanding code and increasing confidence, particularly during initial development. However, students reported difficulties transferring knowledge to unaided tasks, revealing possible overreliance and gaps in conceptual understanding. These insights highlight the need for pedagogical strategies that integrate AI meaningfully while reinforcing foundational programming skills.

[AI-51] RecUserSim: A Realistic and Diverse User Simulator for Evaluating Conversational Recommender Systems

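Quick read: This paper addresses two challenges in evaluating conversational recommender systems (CRS): existing user simulators struggle to emulate individual users realistically and diversely across scenarios, and they lack explicit rating mechanisms for quantitative evaluation. The authors propose RecUserSim, a large language model (LLM) based user simulator built around three key modules: (1) a profile module that defines realistic and diverse user personas; (2) a memory module that tracks interaction history and discovers unknown preferences; and (3) a core action module inspired by Bounded Rationality theory that supports more nuanced decision-making and personalized responses. A refinement module additionally fine-tunes the final responses to improve controllability and quality. Experiments show that RecUserSim generates high-quality, diverse dialogues even with smaller base LLMs and produces ratings that are consistent across base models, making CRS evaluation more realistic and quantifiable.

Link: https://arxiv.org/abs/2507.22897
Authors: Luyu Chen, Quanyu Dai, Zeyu Zhang, Xueyang Feng, Mingyu Zhang, Pengcheng Tang, Xu Chen, Yue Zhu, Zhenhua Dong
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted by TheWebConf’25 Industry Track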

Abstract:Conversational recommender systems (CRS) enhance user experience through multi-turn interactions, yet evaluating CRS remains challenging. User simulators can provide comprehensive evaluations through interactions with CRS, but building realistic and diverse simulators is difficult. While recent work leverages large language models (LLMs) to simulate user interactions, they still fall short in emulating individual real users across diverse scenarios and lack explicit rating mechanisms for quantitative evaluation. To address these gaps, we propose RecUserSim, an LLM agent-based user simulator with enhanced simulation realism and diversity while providing explicit scores. RecUserSim features several key modules: a profile module for defining realistic and diverse user personas, a memory module for tracking interaction history and discovering unknown preferences, and a core action module inspired by Bounded Rationality theory that enables nuanced decision-making while generating more fine-grained actions and personalized responses. To further enhance output control, a refinement module is designed to fine-tune final responses. Experiments demonstrate that RecUserSim generates diverse, controllable outputs and produces realistic, high-quality dialogues, even with smaller base LLMs. The ratings generated by RecUserSim show high consistency across different base LLMs, highlighting its effectiveness for CRS evaluation.

[AI-52] Invisible Architectures of Thought: Toward a New Science of AI as Cognitive Infrastructure

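Quick read: This paper addresses a critical blind spot in current human-AI interaction research: it overlooks how AI systems reshape human cognition pre-consciously, which limits our understanding of distributed cognition. The core question is how AI, as “cognitive infrastructures”, automates “relevance judgment” through semantic transport of meaning, anticipatory personalization, and adaptive invisibility, shifting the “locus of epistemic agency” to non-human systems and thereby changing individual cognition, public reasoning, and social epistemologies. The key to the solution is “Cognitive Infrastructure Studies” (CIS), which reconceptualizes the cognitive role of AI and introduces “infrastructure breakdown methodologies” (systematically withdrawing AI preprocessing after a period of habituation) to reveal hidden human dependence on AI, enabling interdisciplinary empirical analysis, from the individual to the societal scale, of how AI preprocessing reshapes distributed cognition.

Link: https://arxiv.org/abs/2507.22893
Authors: Giuseppe Riva
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: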

Abstract:Contemporary human-AI interaction research overlooks how AI systems fundamentally reshape human cognition pre-consciously, a critical blind spot for understanding distributed cognition. This paper introduces “Cognitive Infrastructure Studies” (CIS) as a new interdisciplinary domain to reconceptualize AI as “cognitive infrastructures”: foundational, often invisible systems conditioning what is knowable and actionable in digital societies. These semantic infrastructures transport meaning, operate through anticipatory personalization, and exhibit adaptive invisibility, making their influence difficult to detect. Critically, they automate “relevance judgment,” shifting the “locus of epistemic agency” to non-human systems. Through narrative scenarios spanning individual (cognitive dependency), collective (democratic deliberation), and societal (governance) scales, we describe how cognitive infrastructures reshape human cognition, public reasoning, and social epistemologies. CIS aims to address how AI preprocessing reshapes distributed cognition across individual, collective, and cultural scales, requiring unprecedented integration of diverse disciplinary methods. The framework also addresses critical gaps across disciplines: cognitive science lacks population-scale preprocessing analysis capabilities, digital sociology cannot access individual cognitive mechanisms, and computational approaches miss cultural transmission dynamics. To achieve this goal CIS also provides methodological innovations for studying invisible algorithmic influence: “infrastructure breakdown methodologies”, experimental approaches that reveal cognitive dependencies by systematically withdrawing AI preprocessing after periods of habituation.

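[AI-53] Evaluating LLMs for Visualization Generation and Understanding

Quick read: This paper studies how large language models (LLMs) can be used to generate code for information visualizations and to understand common charts. The key to the approach is a systematic evaluation of popular LLMs on two core tasks: generating visualization code (for bar charts, pie charts, and so on) from simple natural-language prompts, and answering questions about visualizations (such as identifying relationships between close boundaries or determining the lengths of shapes). The study finds that LLMs can handle simple visualization tasks but still have clear limitations in generating complex charts (such as violin plots) and in detailed visual reasoning (such as identifying boundary relationships), which points to directions for improving both LLMs and information visualization systems.

Link: https://arxiv.org/abs/2507.22890
Authors: Saadiq Rauf Khan, Vinit Chandak, Sougata Mukherjea
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: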

Abstract:Information Visualization has been utilized to gain insights from complex data. In recent times, Large Language Models (LLMs) have performed very well in many tasks. In this paper, we showcase the capabilities of different popular LLMs to generate code for visualization based on simple prompts. We also analyze the power of LLMs to understand some common visualizations by answering questions. Our study shows that LLMs could generate code for some simpler visualizations such as bar and pie charts. Moreover, they could answer simple questions about visualizations. However, LLMs also have several limitations. For example, some of them had difficulty generating complex visualizations, such as violin plots. LLMs also made errors in answering some questions about visualizations, for example, identifying relationships between close boundaries and determining lengths of shapes. We believe that our insights can be used to improve both LLMs and Information Visualization systems.

[AI-54] A Privacy-Preserving Federated Framework with Hybrid Quantum-Enhanced Learning for Financial Fraud Detection

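Quick read: This paper addresses the shortcomings of traditional fraud detection in the financial sector amid the surge in digital transactions, in particular the difficulty of recognizing complex cross-transactional patterns and the risk of data privacy leakage. The key to the solution is a specialized federated learning framework that combines a quantum-enhanced Long Short-Term Memory (Quantum LSTM) model with advanced privacy-preserving techniques. Integrating quantum layers into the LSTM architecture improves its ability to capture complex temporal features and yields roughly a 5% performance gain over conventional models. The framework's FedRansel method defends against poisoning and inference attacks, reducing model degradation and inference accuracy by 4-8% compared with standard differential privacy mechanisms, which improves fraud detection accuracy and the security and confidentiality of sensitive financial data.

Link: https://arxiv.org/abs/2507.22908
Authors: Abhishek Sawaika, Swetang Krishna, Tushar Tomar, Durga Pritam Suggisetti, Aditi Lal, Tanmaya Shrivastav, Nouhaila Innan, Muhammad Shafique
Affiliations: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be published in proceedings of IEEE International Conference on Quantum Computing and Engineering (QCE) 2025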

Abstract:Rapid growth of digital transactions has led to a surge in fraudulent activities, challenging traditional detection methods in the financial sector. To tackle this problem, we introduce a specialised federated learning framework that uniquely combines a quantum-enhanced Long Short-Term Memory (LSTM) model with advanced privacy-preserving techniques. By integrating quantum layers into the LSTM architecture, our approach adeptly captures complex cross-transactional patterns, resulting in an approximate 5% performance improvement across key evaluation metrics compared to conventional models. Central to our framework is “FedRansel”, a novel method designed to defend against poisoning and inference attacks, thereby reducing model degradation and inference accuracy by 4-8% compared to standard differential privacy mechanisms. This pseudo-centralised setup with a Quantum LSTM model enhances fraud detection accuracy and reinforces the security and confidentiality of sensitive financial data.

[AI-55] DNN-based Methods of Jointly Sensing Number and Directions of Targets via a Green Massive H2AD MIMO Receiver

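Quick read: This paper addresses how to intelligently sense the number and directions of multiple emitters with a heterogeneous hybrid analog-digital (H2AD) massive MIMO architecture, a key open challenge for applying this structure in wireless networks. The core of the solution is a two-stage sensing framework. The first stage offers three methods for sensing the number of targets: an improved eigen-domain clustering (EDC) framework, an enhanced deep neural network (DNN) based on five key statistical features, and an improved one-dimensional convolutional neural network (1D-CNN) that uses the full set of eigenvalues. The second stage introduces an online micro-clustering method (OMC-DOA) for low-complexity, high-accuracy direction-of-arrival (DOA) estimation. Simulations show that all three number-sensing methods detect the number of targets with 100% accuracy at moderate-to-high signal-to-noise ratios (SNRs), that the improved 1D-CNN performs best at extremely low SNR, and that OMC-DOA outperforms existing clustering and fusion-based DOA methods in multi-source environments.

Link: https://arxiv.org/abs/2507.22906
Authors: Bin Deng, Jiatong Bai, Feilong Zhao, Zuming Xie, Maolin Li, Yan Wang, Feng Shu
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: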

Abstract:As a green MIMO structure, the heterogeneous hybrid analog-digital (H2AD) MIMO architecture has been shown to have great potential to replace the massive or extremely large-scale fully-digital MIMO in future wireless networks, addressing the three challenging problems faced by the latter: high energy consumption, high circuit cost, and high complexity. However, how to intelligently sense the number and directions of multiple emitters via such a structure is still an open hard problem. To address this, we propose a two-stage sensing framework that jointly estimates the number and direction values of multiple targets. Specifically, three target number sensing methods are designed: an improved eigen-domain clustering (EDC) framework, an enhanced deep neural network (DNN) based on five key statistical features, and an improved one-dimensional convolutional neural network (1D-CNN) utilizing full eigenvalues. Subsequently, a low-complexity and high-accuracy DOA estimation is achieved via the introduced online micro-clustering (OMC-DOA) method. Furthermore, we derive the Cramér-Rao lower bound (CRLB) for the H2AD under multiple-source conditions as a theoretical performance benchmark. Simulation results show that the three developed methods achieve 100% accuracy in sensing the number of targets at moderate-to-high SNRs, while the improved 1D-CNN exhibits superior performance under extremely low SNR conditions. The introduced OMC-DOA outperforms existing clustering and fusion-based DOA methods in multi-source environments.

Machine Learning

[LG-0] Improving annotator selection in Active Learning using a mood and fatigue-aware Recommender System

Link: https://arxiv.org/abs/2507.23756
Authors: Diana Mortagua
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This study centers on overcoming the challenge of selecting the best annotators for each query in Active Learning (AL), with the objective of minimizing misclassifications. AL recognizes the challenges related to cost and time when acquiring labeled data, and decreases the number of labeled data needed. Nevertheless, there is still the necessity to reduce annotation errors, aiming to be as efficient as possible, to achieve the expected accuracy faster. Most strategies for query-annotator pairs do not consider internal factors that affect productivity, such as mood, attention, motivation, and fatigue levels. This work addresses this gap in the existing literature, by not only considering how the internal factors influence annotators (mood and fatigue levels) but also presenting a new query-annotator pair strategy, using a Knowledge-Based Recommendation System (RS). The RS ranks the available annotators, allowing to choose one or more to label the queried instance using their past accuracy values, and their mood and fatigue levels, as well as information about the instance queried. This work bases itself on existing literature on mood and fatigue influence on human performance, simulating annotators in a realistic manner, and predicting their performance with the RS. The results show that considering past accuracy values, as well as mood and fatigue levels reduces the number of annotation errors made by the annotators, and the uncertainty of the model through its training, when compared to not using internal factors. Accuracy and F1-score values were also better in the proposed approach, despite not being as substantial as the aforementioned. The methodologies and findings presented in this study begin to explore the open challenge of human cognitive factors affecting AL.
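
To make the ranking idea in the abstract concrete, here is a minimal sketch of scoring annotators from past accuracy, mood, and fatigue, then picking the top candidates. The `Annotator` fields, the 0-to-1 scales, the linear scoring rule, and the weights are all assumptions for illustration; the abstract does not specify the rules used by the paper's knowledge-based recommender, and this sketch ignores the information about the queried instance that the paper also uses.

```python
from dataclasses import dataclass

@dataclass
class Annotator:
    name: str
    past_accuracy: float  # in [0, 1]
    mood: float           # in [0, 1], higher is better (assumed scale)
    fatigue: float        # in [0, 1], higher is more tired (assumed scale)

def score(a, w_acc=0.6, w_mood=0.2, w_fatigue=0.2):
    """Hypothetical linear score; the weights are illustrative assumptions."""
    return w_acc * a.past_accuracy + w_mood * a.mood + w_fatigue * (1.0 - a.fatigue)

def rank_annotators(annotators, top_n=1):
    """Return the `top_n` annotators with the highest score."""
    return sorted(annotators, key=score, reverse=True)[:top_n]

pool = [
    Annotator("A", past_accuracy=0.90, mood=0.3, fatigue=0.9),
    Annotator("B", past_accuracy=0.85, mood=0.8, fatigue=0.2),
    Annotator("C", past_accuracy=0.70, mood=0.9, fatigue=0.1),
]
best = rank_annotators(pool, top_n=2)
print([a.name for a in best])
```

In this toy pool the annotator with the best historical accuracy is not selected, because low mood and high fatigue lower the score; that is the kind of effect the study sets out to model.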

[LG-1] Anomalous Samples for Few-Shot Anomaly Detection

Link: https://arxiv.org/abs/2507.23712
Authors: Aymane Abdali, Bartosz Boguslawski, Lucas Drumetz, Vincent Gripon
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Several anomaly detection and classification methods rely on large amounts of non-anomalous or “normal” samples under the assumption that anomalous data is typically harder to acquire. This hypothesis becomes questionable in Few-Shot settings, where as little as one annotated sample can make a significant difference. In this paper, we tackle the question of utilizing anomalous samples in training a model for binary anomaly classification. We propose a methodology that incorporates anomalous samples in a multi-score anomaly detection score leveraging recent Zero-Shot and memory-based techniques. We compare the utility of anomalous samples to that of regular samples and study the benefits and limitations of each. In addition, we propose an augmentation-based validation technique to optimize the aggregation of the different anomaly scores and demonstrate its effectiveness on popular industrial anomaly detection datasets.

[LG-2] One-Step Flow Policy Mirror Descent

Link: https://arxiv.org/abs/2507.23675
Authors: Tianyi Chen, Haitong Ma, Na Li, Kai Wang, Bo Dai
Subjects: Machine Learning (cs.LG)

Abstract:Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on flow policy and MeanFlow policy parametrizations, respectively. Extensive empirical evaluations on MuJoCo benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring hundreds of times fewer function evaluations during inference.

[LG-3] SHAP-Guided Regularization in Machine Learning Models

Link: https://arxiv.org/abs/2507.23665
Authors: Amal Saadallah
Subjects: Machine Learning (cs.LG)

Abstract:Feature attribution methods such as SHapley Additive exPlanations (SHAP) have become instrumental in understanding machine learning models, but their role in guiding model optimization remains underexplored. In this paper, we propose a SHAP-guided regularization framework that incorporates feature importance constraints into model training to enhance both predictive performance and interpretability. Our approach applies entropy-based penalties to encourage sparse, concentrated feature attributions while promoting stability across samples. The framework is applicable to both regression and classification tasks. Our first exploration started with investigating a tree-based model regularization using TreeSHAP. Through extensive experiments on benchmark regression and classification datasets, we demonstrate that our method improves generalization performance while ensuring robust and interpretable feature attributions. The proposed technique offers a novel, explainability-driven regularization approach, making machine learning models both more accurate and more reliable.

[LG-4] On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Link: https://arxiv.org/abs/2507.23632
Authors: Gabriel Mongaras, Eric C. Larson
Subjects: Machine Learning (cs.LG)

Abstract:Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is the quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. Despite these linear forms of attention being derived from the original softmax formulation, they typically lag in terms of downstream accuracy. While strong intuition of the softmax nonlinearity on the query and key inner product suggests that it has desirable properties compared to other nonlinearities, the question of why this discrepancy exists still remains unanswered. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows for the ablation of the components of softmax attention to understand the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts.
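As a small illustration of the link between softmax and linear attention mentioned in the abstract above, the sketch below replaces the softmax weight exp(q·k) with its first-order Taylor truncation 1 + q·k. With that weight, causal attention can be computed from a fixed-size recurrent state. This is an assumption-laden toy, not the paper's derivation; the function names `causal_attention` and `first_order_recurrent` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention(qs, ks, vs, weight):
    """Direct O(T^2) causal attention with an arbitrary weight(q, k)."""
    outs = []
    for t, q in enumerate(qs):
        w = [weight(q, ks[s]) for s in range(t + 1)]
        norm = sum(w)
        outs.append([sum(w[s] * vs[s][j] for s in range(t + 1)) / norm
                     for j in range(len(vs[0]))])
    return outs


def first_order_recurrent(qs, ks, vs):
    """Same output as weight = 1 + q.k, computed with a recurrent state."""
    d, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of k v^T
    z = [0.0] * d                       # running sum of k
    u = [0.0] * dv                      # running sum of v
    count = 0                           # number of tokens seen so far
    outs = []
    for q, k, v in zip(qs, ks, vs):
        count += 1
        for i in range(d):
            z[i] += k[i]
            for j in range(dv):
                S[i][j] += k[i] * v[j]
        for j in range(dv):
            u[j] += v[j]
        norm = count + dot(q, z)
        outs.append([(u[j] + sum(q[i] * S[i][j] for i in range(d))) / norm
                     for j in range(dv)])
    return outs
```

For small query-key inner products the recurrent output stays close to true softmax attention (pass `lambda q, k: math.exp(dot(q, k))` as `weight` to compare); the truncation discards the higher-order terms of the exponential.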

[LG-5] Hierarchical Message-Passing Policies for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2507.23604
Authors: Tommaso Marzi, Cesare Alippi, Andrea Cini
Subjects: Machine Learning (cs.LG)

Abstract:Decentralized Multi-Agent Reinforcement Learning (MARL) methods allow for learning scalable multi-agent policies, but suffer from partial observability and induced non-stationarity. These challenges can be addressed by introducing mechanisms that facilitate coordination and high-level planning. Specifically, coordination and temporal abstraction can be achieved through communication (e.g., message passing) and Hierarchical Reinforcement Learning (HRL) approaches to decision-making. However, optimization issues limit the applicability of hierarchical policies to multi-agent systems. As such, the combination of these approaches has not been fully explored. To fill this void, we propose a novel and effective methodology for learning multi-agent hierarchies of message-passing policies. We adopt the feudal HRL framework and rely on a hierarchical graph structure for planning and coordination among agents. Agents at lower levels in the hierarchy receive goals from the upper levels and exchange messages with neighboring agents at the same level. To learn hierarchical multi-agent policies, we design a novel reward-assignment method based on training the lower-level policies to maximize the advantage function associated with the upper levels. Results on relevant benchmarks show that our method performs favorably compared to the state of the art.

[LG-6] EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution

Link: https://arxiv.org/abs/2507.23600
Authors: Yu-Tang Chang, Shih-Fang Chen
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed chemical signals into base patterns (components) and their concentrations, playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified component count, usually unknown in real data. As dataset size or component count increases, the scalability and reliability of MF-based MCR face significant challenges. This study reformulates MCR as a generative process (gMCR), and introduces an energy-based deep learning solver, EB-gMCR, that automatically discovers the smallest component set able to reconstruct the data faithfully. EB-gMCR starts from a large candidate pool (e.g., 1024 spectra) and employs a differentiable gating network to retain only active components while estimating their concentrations. On noisy synthetic datasets containing up to 256 latent sources, EB-gMCR maintained R^2 = 0.98 and recovered the component count within 5% of the ground truth; at lower noise it achieved R^2 = 0.99 with near exact component estimation. Additional chemical priors, such as non-negativity or nonlinear mixing, enter as simple plug-in functions, enabling adaptation to other instruments or domains without altering the core learning process. By uniting high-capacity generative modeling and hard component selection, EB-gMCR offers a practical route to large-scale signal unmixing analysis, including chemical library-driven scenarios. The source code is available at this https URL.

[LG-7] GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning

Link: https://arxiv.org/abs/2507.23581
Authors: Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, Ziwei Zhang
Subjects: Machine Learning (cs.LG)

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships. However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potentials of LLMs. To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework by training LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance the multi-hop reasoning ability. Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning. Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. Next, we design two process-constrained reward functions. To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals. Then, to handle the over-thinking problem, we design Cost-Aware F1 (CAF) reward to balance the model performance with computational costs. We further design a phase-dependent training strategy, containing three training stages corresponding to cold start and these two rewards. Lastly, our method adopts a hybrid graph-textual retrieval to improve the reasoning capacity. Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets. Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements.

[LG-8] Optimised Feature Subset Selection via Simulated Annealing

Link: https://arxiv.org/abs/2507.23568
Authors: Fernando Martínez-García, Álvaro Rubio-García, Samuel Fernández-Lorenzo, Juan José García-Ripoll, Diego Porras
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Comments: 12 pages, 2 figures

Abstract:We introduce SA-FDR, a novel algorithm for $\ell_0$-norm feature selection that considers this task as a combinatorial optimisation problem and solves it by using simulated annealing to perform a global search over the space of feature subsets. The optimisation is guided by the Fisher discriminant ratio, which we use as a computationally efficient proxy for model quality in classification tasks. Our experiments, conducted on datasets with up to hundreds of thousands of samples and hundreds of features, demonstrate that SA-FDR consistently selects more compact feature subsets while achieving a high predictive accuracy. This ability to recover informative yet minimal sets of features stems from its capacity to capture inter-feature dependencies often missed by greedy optimisation approaches. As a result, SA-FDR provides a flexible and effective solution for designing interpretable models in high-dimensional settings, particularly when model sparsity, interpretability, and performance are crucial.
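The idea described above (simulated annealing over feature subsets, scored by a Fisher discriminant ratio) can be sketched roughly as follows. Everything specific here is an assumption made for illustration and may differ from the paper: binary labels 0/1, a fixed subset size `k` standing in for the sparsity constraint, swap moves, a geometric cooling schedule, and a multivariate Fisher ratio with a small ridge term. The names `fisher_ratio` and `anneal_subset` are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fisher_ratio(X, y, subset, ridge=1e-6):
    """Multivariate Fisher ratio diff^T Sw^{-1} diff on the chosen features."""
    idx = sorted(subset)
    k = len(idx)
    groups = {0: [], 1: []}
    for row, label in zip(X, y):
        groups[label].append([row[j] for j in idx])
    means = {c: [sum(col) / len(rows) for col in zip(*rows)]
             for c, rows in groups.items()}
    # Pooled within-class covariance, with a ridge on the diagonal.
    sw = [[ridge if a == b else 0.0 for b in range(k)] for a in range(k)]
    dof = len(X) - 2
    for c, rows in groups.items():
        for r in rows:
            dev = [r[a] - means[c][a] for a in range(k)]
            for a in range(k):
                for b in range(k):
                    sw[a][b] += dev[a] * dev[b] / dof
    diff = [means[1][a] - means[0][a] for a in range(k)]
    return sum(d * s for d, s in zip(diff, solve(sw, diff)))


def anneal_subset(X, y, k, steps=2000, t0=1.0, t_end=1e-3, seed=0):
    """Search for a size-k subset (k < number of features) with a high ratio."""
    rng = random.Random(seed)
    d = len(X[0])
    current = set(rng.sample(range(d), k))
    cur_score = fisher_ratio(X, y, current)
    best, best_score = set(current), cur_score
    for step in range(steps):
        temp = t0 * (t_end / t0) ** (step / max(steps - 1, 1))
        out_f = rng.choice(sorted(current))
        in_f = rng.choice(sorted(set(range(d)) - current))
        cand = (current - {out_f}) | {in_f}
        cand_score = fisher_ratio(X, y, cand)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = set(current), cur_score
    return sorted(best), best_score
```

Because the score uses the full within-class covariance of the selected features, it reacts to correlations between features rather than scoring each feature in isolation, which is the property the abstract contrasts with greedy selection.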

[LG-9] Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform

Link: https://arxiv.org/abs/2507.23562
Authors: Sirine Arfa, Bernhard Vogginger, Christian Mayr
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 8 pages, 5 figures, 3 tables

Abstract:Spiking Neural Networks (SNNs) promise orders-of-magnitude lower power consumption and low-latency inference on neuromorphic hardware for a wide range of robotic tasks. In this work, we present an energy-efficient implementation of a reinforcement learning (RL) algorithm using quantized SNNs to solve two classical control tasks. The network is trained using the Q-learning algorithm, then fine-tuned and quantized to low-bit (8-bit) precision for embedded deployment on the SpiNNaker2 neuromorphic chip. To evaluate the comparative advantage of SpiNNaker2 over conventional computing platforms, we analyze inference latency, dynamic power consumption, and energy cost per inference for our SNN models, comparing performance against a GTX 1650 GPU baseline. Our results demonstrate SpiNNaker2’s strong potential for scalable, low-energy neuromorphic computing, achieving up to 32x reduction in energy consumption. Inference latency remains on par with GPU-based execution, with improvements observed in certain task settings, reinforcing SpiNNaker2’s viability for real-time neuromorphic control and making the neuromorphic approach a compelling direction for efficient deep Q-learning.

[LG-10] Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions ICLR2025

Link: https://arxiv.org/abs/2507.23539
Authors: Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: Published in ICLR 2025

Abstract:Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K \in \mathbb{R}^{n \times n}$. $K$'s columns are indexed by a set of $n$ keys $k_1, k_2, \ldots, k_n \in \mathbb{R}^d$, rows by a set of $n$ queries $q_1, q_2, \ldots, q_n \in \mathbb{R}^d$, and its $i,j$ entry is $K_{ij} = e^{-\|q_i - k_j\|_2^2 / 2\sigma^2}$ for some bandwidth parameter $\sigma > 0$. Given a vector $x \in \mathbb{R}^n$ and error parameter $\epsilon > 0$, our task is to output a $y \in \mathbb{R}^n$ such that $\|Kx - y\|_2 \leq \epsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst case quadratic growth. We validate this assumption experimentally, for Gaussian kernel matrices encountered in various settings such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption, for unrestricted vectors.
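To make the problem statement above concrete, here is a minimal sketch of the naive quadratic-time baseline that the paper aims to beat, plus a helper for inspecting the "sum of entries" assumption. It is not the paper's algorithm. The exponent is read as the squared distance divided by 2σ², and the names `kernel_matvec` and `kernel_mass_per_row` are hypothetical.

```python
import math


def gaussian_kernel_entry(q, k, sigma):
    """K_ij = exp(-||q - k||^2 / (2 sigma^2)) for one query/key pair."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, k))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matvec(queries, keys, x, sigma):
    """Exact y = K x, costing O(n^2 d) time for n queries and n keys."""
    return [
        sum(gaussian_kernel_entry(q, k, sigma) * xj for k, xj in zip(keys, x))
        for q in queries
    ]


def kernel_mass_per_row(queries, keys, sigma):
    """Sum of all entries of K divided by n.

    Under the abstract's modelling assumption this stays bounded as n grows;
    in the worst case it grows linearly in n.
    """
    total = sum(gaussian_kernel_entry(q, k, sigma) for q in queries for k in keys)
    return total / len(queries)
```

An approximate algorithm for this task would be judged against `kernel_matvec` by checking that the error norm is at most ε times the norm of `x`.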

[LG-11] Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level

Link: https://arxiv.org/abs/2507.23512
Authors: Saleh Vatan Khah, Savelii Chezhegov, Shahrokh Farahmand, Samuel Horváth, Eduard Gorbunov
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 60 pages

Abstract:Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $\alpha$-th moment assumption, $\alpha \in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a faster rate than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.
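For readers unfamiliar with the setting, the following sketch shows one update of clipped SGD with Gaussian noise at a fixed clipping level. It follows the commonly used DP-SGD recipe (clip each per-sample gradient, sum, add noise scaled by the clipping level, average); the exact variant analysed in the paper may differ. The names `clip` and `dp_clipped_sgd_step` and the `noise_multiplier` parameter are assumptions made for illustration.

```python
import math
import random


def clip(grad, level):
    """Rescale grad so that its Euclidean norm is at most `level`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= level:
        return list(grad)
    return [g * level / norm for g in grad]


def dp_clipped_sgd_step(w, per_sample_grads, lr, clip_level, noise_multiplier, rng):
    """One step: clip per-sample gradients, add Gaussian noise, average, descend."""
    total = [0.0] * len(w)
    for grad in per_sample_grads:
        for i, g in enumerate(clip(grad, clip_level)):
            total[i] += g
    # The noise standard deviation is tied to the fixed clipping level, which
    # bounds how much any single sample can change the summed gradient.
    sigma = noise_multiplier * clip_level
    batch = len(per_sample_grads)
    noisy_mean = [(t + rng.gauss(0.0, sigma)) / batch for t in total]
    return [wi - lr * g for wi, g in zip(w, noisy_mean)]
```

Keeping `clip_level` constant across steps is the regime the abstract targets: the noise scale then stays fixed as well, rather than growing with the number of optimization steps.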

[LG-12] A Verifier Hierarchy

Link: https://arxiv.org/abs/2507.23504
Authors: Maurits Kaptein
Subjects: Machine Learning (cs.LG)
Comments: This paper is primarily relevant to cs.CC, but submitted under this http URL due to lack of endorsement. The paper is under review at “Information and Communication”

Abstract:We investigate the trade-off between certificate length and verifier runtime. We prove a Verifier Trade-off Theorem showing that reducing the inherent verification time of a language from $f(n)$ to $g(n)$, where $f(n) \ge g(n)$, requires certificates of length at least $\Omega(\log(f(n) / g(n)))$. This theorem induces a natural hierarchy based on certificate complexity. We demonstrate its applicability to analyzing conjectured separations between complexity classes (e.g., $\mathsf{NP}$ and $\mathsf{EXPTIME}$) and to studying natural problems such as string periodicity and rotation detection. Additionally, we provide perspectives on the $\mathsf{P}$ vs. $\mathsf{NP}$ problem by relating it to the existence of sub-linear certificates.

[LG-13] Directional Ensemble Aggregation for Actor-Critics

Link: https://arxiv.org/abs/2507.23501
Authors: Nicklas Werge, Yi-Shan Wu, Bahareh Tasdighi, Melih Kandemir
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Off-policy reinforcement learning in continuous control tasks depends critically on accurate $Q$-value estimates. Conservative aggregation over ensembles, such as taking the minimum, is commonly used to mitigate overestimation bias. However, these static rules are coarse, discard valuable information from the ensemble, and cannot adapt to task-specific needs or different learning regimes. We propose Directional Ensemble Aggregation (DEA), an aggregation method that adaptively combines $Q$-value estimates in actor-critic frameworks. DEA introduces two fully learnable directional parameters: one that modulates critic-side conservatism and another that guides actor-side policy exploration. Both parameters are learned using ensemble disagreement-weighted Bellman errors, which weight each sample solely by the direction of its Bellman error. This directional learning mechanism allows DEA to adjust conservatism and exploration in a data-driven way, adapting aggregation to both uncertainty levels and the phase of training. We evaluate DEA across continuous control benchmarks and learning regimes - from interactive to sample-efficient - and demonstrate its effectiveness over static ensemble strategies.

[LG-14] Incorporating structural uncertainty in causal decision making

Link: https://arxiv.org/abs/2507.23495
Authors: Maurits Kaptein
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: This work is under review at the Journal of Causal Inference

Abstract:Practitioners making decisions based on causal effects typically ignore structural uncertainty. We analyze when this uncertainty is consequential enough to warrant methodological solutions (Bayesian model averaging over competing causal structures). Focusing on bivariate relationships ($X \rightarrow Y$ vs. $X \leftarrow Y$), we establish that model averaging is beneficial when: (1) structural uncertainty is moderate to high, (2) causal effects differ substantially between structures, and (3) loss functions are sufficiently sensitive to the size of the causal effect. We prove optimality results of our suggested methodological solution under regularity conditions and demonstrate through simulations that modern causal discovery methods can provide, within limits, the necessary quantification. Our framework complements existing robust causal inference approaches by addressing a distinct source of uncertainty typically overlooked in practice.
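A toy calculation can make the three conditions above tangible. The sketch assumes a squared-error loss, a known posterior probability `p_xy` for the structure X → Y, and a causal effect of X on Y that is zero under X ← Y. These are simplifying assumptions chosen for illustration and are not taken from the paper; the function names are hypothetical.

```python
def bma_estimate(p_xy, effect_if_xy, effect_if_yx=0.0):
    """Posterior-weighted average of the effect across both structures."""
    return p_xy * effect_if_xy + (1.0 - p_xy) * effect_if_yx


def map_estimate(p_xy, effect_if_xy, effect_if_yx=0.0):
    """Effect under the single most probable structure (ignores uncertainty)."""
    return effect_if_xy if p_xy >= 0.5 else effect_if_yx


def expected_sq_loss(estimate, p_xy, effect_if_xy, effect_if_yx=0.0):
    """Expected squared error of `estimate` under the structure posterior."""
    return (p_xy * (estimate - effect_if_xy) ** 2
            + (1.0 - p_xy) * (estimate - effect_if_yx) ** 2)
```

With `p_xy = 0.6` and an effect of 2.0, averaging gives an expected loss of 0.96 against 1.6 for committing to the most probable structure. At `p_xy = 0.99` the two losses are nearly equal (about 0.0396 versus 0.04), and if both structures imply the same effect they coincide, which mirrors conditions (1) and (2).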

[LG-15] Explainable artificial intelligence model predicting the risk of all-cause mortality in patients with type 2 diabetes mellitus

Link: https://arxiv.org/abs/2507.23491
Authors: Olga Vershinina, Jacopo Sabbatinelli, Anna Rita Bonfigli, Dalila Colombaretti, Angelica Giuliani, Mikhail Krivonosov, Arseniy Trukhanov, Claudio Franceschi, Mikhail Ivanchenko, Fabiola Olivieri
Subjects: Machine Learning (cs.LG)

Abstract:Objective. Type 2 diabetes mellitus (T2DM) is a highly prevalent non-communicable chronic disease that substantially reduces life expectancy. Accurate estimation of all-cause mortality risk in T2DM patients is crucial for personalizing and optimizing treatment strategies. Research Design and Methods. This study analyzed a cohort of 554 patients (aged 40-87 years) with diagnosed T2DM over a maximum follow-up period of 16.8 years, during which 202 patients (36%) died. Key survival-associated features were identified, and multiple machine learning (ML) models were trained and validated to predict all-cause mortality risk. To improve model interpretability, Shapley additive explanations (SHAP) was applied to the best-performing model. Results. The extra survival trees (EST) model, incorporating ten key features, demonstrated the best predictive performance. The model achieved a C-statistic of 0.776, with the area under the receiver operating characteristic curve (AUC) values of 0.86, 0.80, 0.841, and 0.826 for 5-, 10-, 15-, and 16.8-year all-cause mortality predictions, respectively. The SHAP approach was employed to interpret the model’s individual decision-making processes. Conclusions. The developed model exhibited strong predictive performance for mortality risk assessment. Its clinically interpretable outputs enable potential bedside application, improving the identification of high-risk patients and supporting timely treatment optimization.

[LG-16] Manifold-regularised Signature Kernel Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection

Link: https://arxiv.org/abs/2507.23449
Authors: Shervin Rahimzadeh Arashloo
Subjects: Machine Learning (cs.LG)

Abstract:We generalise the recently introduced large-margin $\ell_p$-SVDD approach to exploit the geometry of data distribution via manifold regularising and a signature kernel representation for time series anomaly detection. Specifically, we formulate a manifold-regularised variant of the $\ell_p$-SVDD method to encourage label smoothness on the underlying manifold to capture structural information for improved detection performance. Drawing on an existing Representer theorem, we then provide an effective optimisation technique for the proposed method and show that it can benefit from the signature kernel to capture time series complexities for anomaly detection. We theoretically study the proposed approach using Rademacher complexities to analyse its generalisation performance and also provide an experimental assessment of the proposed method across various data sets to compare its performance against other methods.

[LG-17] Adjoint-Based Aerodynamic Shape Optimization with a Manifold Constraint Learned by Diffusion Models

Link: https://arxiv.org/abs/2507.23443
Authors: Long Chen, Emre Oezkaya, Jan Rottmayer, Nicolas R. Gauger, Zebang Shen, Yinyu Ye
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We introduce an adjoint-based aerodynamic shape optimization framework that integrates a diffusion model trained on existing designs to learn a smooth manifold of aerodynamically viable shapes. This manifold is enforced as an equality constraint on the shape optimization problem. Central to our method is the computation of adjoint gradients of the design objectives (e.g., drag and lift) with respect to the manifold space. These gradients are derived by first computing shape derivatives with respect to conventional shape design parameters (e.g., Hicks-Henne parameters) and then backpropagating them through the diffusion model to its latent space via automatic differentiation. Our framework preserves mathematical rigor and can be integrated into existing adjoint-based design workflows with minimal modification. Demonstrated on extensive transonic RANS airfoil design cases using off-the-shelf and general-purpose nonlinear optimizers, our approach eliminates ad hoc parameter tuning and variable scaling, maintains robustness across initialization and optimizer choices, and achieves superior aerodynamic performance compared to conventional approaches. This work establishes how AI-generated priors integrate effectively with adjoint methods to enable robust, high-fidelity aerodynamic shape optimization through automatic differentiation.

[LG-18] Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design

Link: https://arxiv.org/abs/2507.23437
Authors: Yinhui Ma, Tomomasa Yamasaki, Zhehui Wang, Tao Luo, Bo Wang
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICCAD 2025 (camera-ready); 9 pages, 5 figures

Abstract:Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency, making it particularly useful for the development of Deep Neural Network accelerators on the edge. However, the extensive search space and high computational cost pose significant challenges to its practical adoption. To address these limitations, we propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization. By leveraging sparse inducing points, Coflex reduces the GP kernel complexity from cubic to near-linear with respect to the number of training samples, without compromising optimization performance. This enables scalable approximation of large-scale search space, substantially decreasing computational overhead while preserving high predictive accuracy. We evaluate the efficacy of Coflex across various benchmarks, focusing on accelerator-specific architecture. Our experimental results show that Coflex outperforms state-of-the-art methods in terms of network accuracy and Energy-Delay-Product, while achieving a computational speed-up ranging from 1.9x to 9.5x.

[LG-19] Merging Memory and Space: A Spatiotemporal State Space Neural Operator

Link: https://arxiv.org/abs/2507.23428
Authors: Nodens F. Koren, Samuel Lanthaler
Subjects: Machine Learning (cs.LG)

Abstract:We propose the Spatiotemporal State Space Neural Operator (ST-SSM), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). ST-SSM introduces a novel factorization of the spatial and temporal dimensions, using structured state-space models to independently model temporal evolution and spatial interactions. This design enables parameter efficiency and flexible modeling of long-range spatiotemporal dynamics. A theoretical connection is established between SSMs and neural operators, and a unified universality theorem is proved for the resulting class of architectures. Empirically, we demonstrate that our factorized formulation outperforms alternative schemes such as zigzag scanning and parallel independent processing on several PDE benchmarks, including 1D Burgers’ equation, 1D Kuramoto-Sivashinsky equation, and 2D Navier-Stokes equations under varying physical conditions. Our model performs competitively with existing baselines while using significantly fewer parameters. In addition, our results reinforce previous findings on the benefits of temporal memory by showing improved performance under partial observability. Our results highlight the advantages of dimensionally factorized operator learning for efficient and generalizable PDE modeling, and put this approach on a firm theoretical footing.

[LG-20] Detection of Adulteration in Coconut Milk using Infrared Spectroscopy and Machine Learning

Link: https://arxiv.org/abs/2507.23418
Authors: Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh
Subjects: Machine Learning (cs.LG)

Abstract:In this paper, we propose a system for detecting adulteration in coconut milk, utilizing infrared spectroscopy. The machine learning-based proposed system comprises three phases: preprocessing, feature extraction, and classification. The first phase involves removing irrelevant data from coconut milk spectral signals. In the second phase, we employ the Linear Discriminant Analysis (LDA) algorithm for extracting the most discriminating features. In the third phase, we use the K-Nearest Neighbor (KNN) model to classify coconut milk samples into authentic or adulterated. We evaluate the performance of the proposed system using a public dataset comprising Fourier Transform Infrared (FTIR) spectral information of pure and contaminated coconut milk samples. Findings show that the proposed method successfully detects adulteration with a cross-validation accuracy of 93.33%.

[LG-21] A Machine Learning Approach for Honey Adulteration Detection using Mineral Element Profiles

Link: https://arxiv.org/abs/2507.23412
Authors: Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh
Subjects: Machine Learning (cs.LG)

Abstract:This paper aims to develop a Machine Learning (ML)-based system for detecting honey adulteration utilizing honey mineral element profiles. The proposed system comprises two phases: preprocessing and classification. The preprocessing phase involves the treatment of missing-value attributes and normalization. In the classification phase, we use three supervised ML models: logistic regression, decision tree, and random forest, to discriminate between authentic and adulterated honey. To evaluate the performance of the ML models, we use a public dataset comprising measurements of mineral element content of authentic honey, sugar syrups, and adulterated honey. Experimental findings show that mineral element content in honey provides robust discriminative information for detecting honey adulteration. Results also demonstrate that the random forest-based classifier outperforms other classifiers on this dataset, achieving the highest cross-validation accuracy of 98.37%.

[LG-22] Policy Learning from Large Vision-Language Model Feedback without Reward Modeling IROS2025

Link: https://arxiv.org/abs/2507.23391
Authors: Tung M. Luu, Donghoon Lee, Younghwan Lee, Chang D. Yoo
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to IROS 2025

Abstract:Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld benchmark, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.

[LG-23] Causal Explanation of Concept Drift – A Truly Actionable Approach KDD2025 ECML

Link: https://arxiv.org/abs/2507.23389
Authors: David Komnick, Kathrin Lammers, Barbara Hammer, Valerie Vaquet, Fabian Hinder
Subjects: Machine Learning (cs.LG)
Comments: This manuscript is accepted to be presented at the TempXAI workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2025)

Abstract:In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.

[LG-24] Designing Dynamic Pricing for Bike-sharing Systems via Differentiable Agent-based Simulation

Link: https://arxiv.org/abs/2507.23344
Authors: Tatsuya Mitomi, Fumiyasu Makinoshima, Fumiya Makihara, Eigo Segawa
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Bike-sharing systems are emerging in various cities as a new ecofriendly transportation system. In these systems, spatiotemporally varying user demands lead to imbalanced inventory at bicycle stations, resulting in additional relocation costs. Therefore, it is essential to manage user demand through optimal dynamic pricing for the system. However, optimal pricing design for such a system is challenging because the system involves users with diverse backgrounds and their probabilistic choices. To address this problem, we develop a differentiable agent-based simulation to rapidly design dynamic pricing in bike-sharing systems, achieving balanced bicycle inventory despite spatiotemporally heterogeneous trips and probabilistic user decisions. We first validate our approach against conventional methods through numerical experiments involving 25 bicycle stations and five time slots, yielding 100 parameters. Compared to the conventional methods, our approach obtains a more accurate solution with a 73% to 78% reduction in loss while achieving more than a 100-fold increase in convergence speed. We further validate our approach on a large-scale urban bike-sharing system scenario involving 289 bicycle stations, resulting in a total of 1156 parameters. Through simulations using the obtained pricing policies, we confirm that these policies can naturally induce balanced inventory without any manual relocation. Additionally, we find that the cost of discounts to induce the balanced inventory can be minimized by setting appropriate initial conditions.

[LG-25] Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions

Link: https://arxiv.org/abs/2507.23335
Authors: Qilin Zhou, Haipeng Wang, Zhengyuan Wei, W.K. Chan
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted by QRS 2025

Abstract:Patch robustness certification is an emerging verification approach for defending against adversarial patch attacks with provable guarantees for deep learning systems. Certified recovery techniques guarantee the prediction of the sole true label of a certified sample. However, existing techniques, if applicable to top-k predictions, commonly conduct pairwise comparisons on those votes between labels, failing to certify the sole true label within the top k prediction labels precisely due to the inflation on the number of votes controlled by the attacker (i.e., attack budget); yet enumerating all combinations of vote allocation suffers from the combinatorial explosion problem. We propose CostCert, a novel, scalable, and precise voting-based certified recovery defender. CostCert verifies the true label of a sample within the top k predictions without pairwise comparisons and combinatorial explosion through a novel design: whether the attack budget on the sample is infeasible to cover the smallest total additional votes on top of the votes uncontrollable by the attacker to exclude the true labels from the top k prediction labels. Experiments show that CostCert significantly outperforms the current state-of-the-art defender PatchGuard, such as retaining up to 57.3% in certified accuracy when the patch size is 96, whereas PatchGuard has already dropped to zero.

[LG-26] Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Link: https://arxiv.org/abs/2507.23317
Authors: Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, Bing Qin
Categories: Machine Learning (cs.LG)
Comments: 33 pages, 3 figures, 19 tables

Abstract:Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). However, conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in an inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate the RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent ‘thought’ units. These structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step segmentation and alleviating reward hacking. We further introduce a capability-adaptive reward mechanism that dynamically balances exploration and exploitation based on the LRM’s current proficiency, guiding learning without stifling creative trial-and-error. These innovations are integrated into a new off-policy RL algorithm, TP-GRPO, which extends grouped proximal optimization with process-based rewards and improves training efficiency. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines. The results validate that well-structured process rewards can substantially accelerate LRM optimization in math reasoning tasks. Code is available at this https URL.

[LG-27] An Interpretable Data-Driven Unsupervised Approach for the Prevention of Forgotten Items

Link: https://arxiv.org/abs/2507.23303
Authors: Luca Corbucci, Javier Alejandro Borges Legrottaglie, Francesco Spinnato, Anna Monreale, Riccardo Guidotti
Categories: Machine Learning (cs.LG)

Abstract:Accurately identifying items forgotten during a supermarket visit and providing clear, interpretable explanations for recommending them remains an underexplored problem within the Next Basket Prediction (NBP) domain. Existing NBP approaches typically only focus on forecasting future purchases, without explicitly addressing the detection of unintentionally omitted items. This gap is partly due to the scarcity of real-world datasets that allow for the reliable estimation of forgotten items. Furthermore, most current NBP methods rely on black-box models, which lack transparency and limit the ability to justify recommendations to end users. In this paper, we formally introduce the forgotten item prediction task and propose two novel interpretable-by-design algorithms. These methods are tailored to identify forgotten items while offering intuitive, human-understandable explanations. Experiments on a real-world retail dataset show our algorithms outperform state-of-the-art NBP baselines by 10-15% across multiple evaluation metrics.

[LG-28] A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations

Link: https://arxiv.org/abs/2507.23221
Authors: Charles O’Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara
Categories: Machine Learning (cs.LG)

Abstract:Contextual hallucinations – statements unsupported by given context – remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.

[LG-29] Not Just What But When: Integrating Irregular Intervals to LLM for Sequential Recommendation RECSYS2025

Link: https://arxiv.org/abs/2507.23209
Authors: Wei-Wei Du, Takuma Udagawa, Kei Tateno
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by RecSys 2025 short paper track

Abstract:Time intervals between purchasing items are a crucial factor in sequential recommendation tasks, whereas existing approaches focus on item sequences and often overlook them by assuming that the intervals between items are static. However, dynamic intervals serve as a dimension that describes user profiling on not only the history within a user but also different users with the same item history. In this work, we propose IntervalLLM, a novel framework that integrates interval information into an LLM and incorporates a novel interval-infused attention to jointly consider information of items and intervals. Furthermore, unlike prior studies that address the cold-start scenario only from the perspectives of users and items, we introduce a new viewpoint: the interval perspective to serve as an additional metric for evaluating recommendation methods on the warm and cold scenarios. Extensive experiments on 3 benchmarks with both traditional and LLM-based baselines demonstrate that our IntervalLLM achieves not only a 4.4% improvement on average but also the best performance in warm and cold scenarios across all users, items, and the proposed interval perspectives. In addition, we observe that the cold scenario from the interval perspective experiences the most significant performance drop among all recommendation methods. This finding underscores the necessity of further research on interval-based cold challenges and our integration of interval information in the realm of sequential recommendation tasks. Our code is available here: this https URL.

[LG-30] Are Recommenders Self-Aware? Label-Free Recommendation Performance Estimation via Model Uncertainty

Link: https://arxiv.org/abs/2507.23208
Authors: Jiayu Li, Ziyi Ye, Guohao Jian, Zhiqiang Guo, Weizhi Ma, Qingyao Ai, Min Zhang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Can a recommendation model be self-aware? This paper investigates the recommender’s self-awareness by quantifying its uncertainty, which provides a label-free estimation of its performance. Such self-assessment can enable more informed understanding and decision-making before the recommender engages with any users. To this end, we propose an intuitive and effective method, probability-based List Distribution uncertainty (LiDu). LiDu measures uncertainty by determining the probability that a recommender will generate a certain ranking list based on the prediction distributions of individual items. We validate LiDu’s ability to represent model self-awareness in two settings: (1) with a matrix factorization model on a synthetic dataset, and (2) with popular recommendation algorithms on real-world datasets. Experimental results show that LiDu is more correlated with recommendation performance than a series of label-free performance estimators. Additionally, LiDu provides valuable insights into the dynamic inner states of models throughout training and inference. This work establishes an empirical connection between recommendation uncertainty and performance, framing it as a step towards more transparent and self-evaluating recommender systems.

[LG-31] NaN-Propagation: A Novel Method for Sparsity Detection in Black-Box Computational Functions

Link: https://arxiv.org/abs/2507.23186
Authors: Peter Sharpe
Categories: Machine Learning (cs.LG); Programming Languages (cs.PL)

Abstract:Sparsity detection in black-box functions enables significant computational speedups in gradient-based optimization through Jacobian compression, but existing finite-difference methods suffer from false negatives due to coincidental zero gradients. These false negatives can silently corrupt gradient calculations, leading to difficult-to-diagnose errors. We introduce NaN-propagation, which exploits the universal contamination property of IEEE 754 Not-a-Number floating-point values to trace input-output dependencies through floating-point numerical computations. By systematically contaminating inputs with NaN and observing which outputs become NaN, the method reconstructs conservative sparsity patterns that eliminate false negatives. We demonstrate the approach on an aerospace wing weight model, achieving a 1.52x speedup while detecting dozens of dependencies missed by conventional methods – a significant improvement since gradient computation is the bottleneck in many optimization workflows. The technique leverages IEEE 754 compliance to work across programming languages and math libraries without modifying existing black-box codes. Advanced strategies including NaN payload encoding enable faster-than-linear time complexity, improving upon existing black-box sparsity detection methods. Practical algorithms are also proposed to mitigate challenges from branching code execution common in engineering applications.
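The core idea in this abstract can be illustrated with a small sketch. This is not the author's implementation: the function names (`nan_sparsity`, `fd_sparsity`) and the toy black box are hypothetical, and the loop uses the plain one-input-at-a-time strategy rather than the faster payload-encoding strategies the abstract mentions. It shows a case where a finite-difference probe misses a real dependency because the gradient happens to be zero at the probe point, while NaN contamination still flags it.

```python
import numpy as np

def nan_sparsity(f, x0, n_out):
    """Boolean (n_out, n_in) matrix: True where output i turns NaN when input j is NaN."""
    x0 = np.asarray(x0, dtype=float)
    pattern = np.zeros((n_out, x0.size), dtype=bool)
    for j in range(x0.size):
        x = x0.copy()
        x[j] = np.nan                      # contaminate exactly one input
        with np.errstate(invalid="ignore"):
            y = np.asarray(f(x), dtype=float)
        pattern[:, j] = np.isnan(y)        # NaN outputs depend on input j
    return pattern

def fd_sparsity(f, x0, n_out, h=1e-6):
    """Finite-difference probe: True where a small perturbation changes the output."""
    x0 = np.asarray(x0, dtype=float)
    y0 = np.asarray(f(x0), dtype=float)
    pattern = np.zeros((n_out, x0.size), dtype=bool)
    for j in range(x0.size):
        x = x0.copy()
        x[j] += h
        pattern[:, j] = np.asarray(f(x), dtype=float) != y0
    return pattern

def toy(x):
    # Hypothetical black box: output 0 depends on x[0] and x[1],
    # output 1 on x[2], output 2 on x[0] (and structurally on x[2]).
    return np.array([x[0] * x[1], np.sin(x[2]), x[0] + 0.0 * x[2]])

x0 = [1.0, 0.0, 0.0]                       # x[1] == 0 hides d(out0)/d(x[0])
nan_pattern = nan_sparsity(toy, x0, n_out=3)
fd_pattern = fd_sparsity(toy, x0, n_out=3)
```

Here `fd_pattern[0, 0]` is `False` (a false negative, since `x[0] * x[1]` does not change when `x[1]` is zero), while `nan_pattern[0, 0]` is `True`. The NaN pattern is conservative: it also flags the `0.0 * x[2]` term, whose derivative is always zero.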

[LG-32] BAR Conjecture: the Feasibility of Inference Budget-Constrained LLM Services with Authenticity and Reasoning

Link: https://arxiv.org/abs/2507.23170
Authors: Jinan Zhou, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta, Aryan Singhal
Categories: Machine Learning (cs.LG)

Abstract:When designing LLM services, practitioners care about three key properties: inference-time budget, factual authenticity, and reasoning capacity. However, our analysis shows that no model can simultaneously optimize for all three. We formally prove this trade-off and propose a principled framework named The BAR Theorem for LLM-application design.

[LG-33] AI paradigm for solving differential equations: first-principles data generation and scale-dilation operator AI solver

Link: https://arxiv.org/abs/2507.23141
Authors: Xiangshu Gong, Zhiqiang Xie, Xiaowei Jin, Chen Wang, Yanling Qu, Wangmeng Zuo, Hui Li
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Many problems are governed by differential equations (DEs). Artificial intelligence (AI) is a new path for solving DEs. However, data is very scarce and existing AI solvers struggle with approximation of high-frequency components (AHFC). We propose an AI paradigm for solving diverse DEs, comprising a DE-ruled first-principles data generation methodology and a scale-dilation operator (SDO) AI solver. Using either prior knowledge or random fields, we generate solutions and then substitute them into the DEs to derive the sources and initial/boundary conditions through balancing DEs, thus producing arbitrarily large amounts of first-principles-consistent training data at extremely low computational cost. We introduce a reversible SDO that leverages the Fourier transform of the multiscale solutions to fix AHFC, and design a spatiotemporally coupled, attention-based Transformer AI solver of DEs with SDO. An upper bound on the Hessian condition number of the loss function is proven to be proportional to the squared 2-norm of the solution gradient, revealing that SDO yields a smoother loss landscape, consequently fixing AHFC with efficient training. Extensive tests on diverse DEs demonstrate that our AI paradigm achieves consistently superior accuracy over state-of-the-art methods. This work makes AI solvers of DEs truly usable across broad natural science and engineering fields.

[LG-34] Observational Multiplicity

Link: https://arxiv.org/abs/2507.23136
Authors: Erin George, Deanna Needell, Berk Ustun
Categories: Machine Learning (cs.LG)

Abstract:Many prediction tasks can admit multiple models that can perform almost equally well. This phenomenon can undermine interpretability and safety when competing models assign conflicting predictions to individuals. In this work, we study how arbitrariness can arise in probabilistic classification tasks as a result of an effect that we call observational multiplicity. We discuss how this effect arises in a broad class of practical applications where we learn a classifier to predict probabilities p_i \in [0,1] but are given a dataset of observations y_i \in \{0,1\}. We propose to evaluate the arbitrariness of individual probability predictions through the lens of regret. We introduce a measure of regret for probabilistic classification tasks, which measures how the predictions of a model could change as a result of changes in the training labels. We present a general-purpose method to estimate the regret in a probabilistic classification task. We use our measure to show that regret is higher for certain groups in the dataset and discuss potential applications of regret. We demonstrate how estimating regret can promote safety in real-world applications through abstention and data collection.

[LG-35] Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts ICASSP2026

Link: https://arxiv.org/abs/2507.23128
Authors: Anaïs Baranger, Lucas Maison
Categories: Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2026

Abstract:Although prior work in computer vision has shown strong correlations between in-distribution (ID) and out-of-distribution (OOD) accuracies, such relationships remain underexplored in audio-based models. In this study, we investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under OOD conditions. We benchmark several neural architectures across a variety of evaluation sets. To quantify the impact of noise on generalization, we make use of two metrics: Fairness (F), which measures overall accuracy gains compared to a baseline model, and Robustness (R), which assesses the convergence between ID and OOD performance. Our results suggest that noise-aware training improves robustness in some configurations. These findings shed new light on the benefits and limitations of noise-based augmentation for generalization in speech models.

[LG-36] Scalable Generative Modeling of Weighted Graphs

Link: https://arxiv.org/abs/2507.23111
Authors: Richard Williams, Eric Nalisnick, Andrew Holbrook
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 5 figures, appendix included. Code at this https URL

Abstract:Weighted graphs are ubiquitous throughout biology, chemistry, and the social sciences, motivating the development of generative models for abstract weighted graph data using deep neural networks. However, most current deep generative models are either designed for unweighted graphs and are not easily extended to weighted topologies or incorporate edge weights without consideration of a joint distribution with topology. Furthermore, learning a distribution over weighted graphs must account for complex nonlocal dependencies between both the edges of the graph and corresponding weights of each edge. We develop an autoregressive model BiGG-E, a nontrivial extension of the BiGG model, that learns a joint distribution over weighted graphs while still exploiting sparsity to generate a weighted graph with n nodes and m edges in O((n + m)\log n) time. Simulation studies and experiments on a variety of benchmark datasets demonstrate that BiGG-E best captures distributions over weighted graphs while remaining scalable and computationally efficient.

[LG-37] A Foundation Model for Material Fracture Prediction

Link: https://arxiv.org/abs/2507.23077
Authors: Agnese Marcato, Aleksandra Pachalieva, Ryley G. Hill, Kai Gao, Xiaoyu Wang, Esteban Rougier, Zhou Lei, Vinamra Agrawal, Janel Chua, Qinjun Kang, Jeffrey D. Hyman, Abigail Hunter, Nathan DeBardeleben, Earl Lawrence, Hari Viswanathan, Daniel O’Malley, Javier E. Santos
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Geophysics (physics.geo-ph)

Abstract:Accurately predicting when and how materials fail is critical to designing safe, reliable structures, mechanical systems, and engineered components that operate under stress. Yet, fracture behavior remains difficult to model across the diversity of materials, geometries, and loading conditions in real-world applications. While machine learning (ML) methods show promise, most models are trained on narrow datasets, lack robustness, and struggle to generalize. Meanwhile, physics-based simulators offer high-fidelity predictions but are fragmented across specialized methods and require substantial high-performance computing resources to explore the input space. To address these limitations, we present a data-driven foundation model for fracture prediction, a transformer-based architecture that operates across simulators, a wide range of materials (including plastic-bonded explosives, steel, aluminum, shale, and tungsten), and diverse loading conditions. The model supports both structured and unstructured meshes, combining them with large language model embeddings of textual input decks specifying material properties, boundary conditions, and solver settings. This multimodal input design enables flexible adaptation across simulation scenarios without changes to the model architecture. The trained model can be fine-tuned with minimal data on diverse downstream tasks, including time-to-failure estimation, modeling fracture evolution, and adapting to combined finite-discrete element method simulations. It also generalizes to unseen materials such as titanium and concrete, requiring as few as a single sample, dramatically reducing data needs compared to standard ML. Our results show that fracture prediction can be unified under a single model architecture, offering a scalable, extensible alternative to simulator-specific workflows.

[LG-38] Locally Differentially Private Thresholding Bandits

Link: https://arxiv.org/abs/2507.23073
Authors: Annalisa Barbara, Joseph Lazzaro, Ciara Pike-Burke
Categories: Machine Learning (cs.LG)
Comments: 18th European Workshop on Reinforcement Learning (EWRL 2025)

Abstract:This work investigates the impact of ensuring local differential privacy in the thresholding bandit problem. We consider both the fixed budget and fixed confidence settings. We propose methods that utilize private responses, obtained through a Bernoulli-based differentially private mechanism, to identify arms with expected rewards exceeding a predefined threshold. We show that this procedure provides strong privacy guarantees and derive theoretical performance bounds on the proposed algorithms. Additionally, we present general lower bounds that characterize the additional loss incurred by any differentially private mechanism, and show that the presented algorithms match these lower bounds up to poly-logarithmic factors. Our results provide valuable insights into privacy-preserving decision-making frameworks in bandit problems.
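To make the setting concrete, here is a minimal sketch of one well-known Bernoulli-style local privacy mechanism applied to thresholding. It is an assumption, not the paper's algorithm: the abstract does not specify its mechanism, and the names `privatize` and `debiased_mean`, the two arm means, and the naive uniform sampling are all hypothetical. Each reward r in [0, 1] is replaced by a single bit that equals 1 with probability (r(e^ε − 1) + 1)/(e^ε + 1), which satisfies ε-local differential privacy; the learner then inverts this map to get an unbiased mean estimate.

```python
import numpy as np

def privatize(rewards, eps, rng):
    """Replace each reward in [0, 1] with an eps-LDP bit (Bernoulli draw)."""
    e = np.exp(eps)
    p_one = (rewards * (e - 1.0) + 1.0) / (e + 1.0)
    return (rng.random(rewards.shape) < p_one).astype(float)

def debiased_mean(bits, eps):
    """Invert the mechanism's bias to estimate the mean of the raw rewards."""
    e = np.exp(eps)
    return (bits.mean() * (e + 1.0) - 1.0) / (e - 1.0)

rng = np.random.default_rng(0)
eps, threshold, n = 1.0, 0.5, 200_000
true_means = np.array([0.3, 0.7])          # hypothetical arms

estimates = np.array([
    debiased_mean(privatize(rng.binomial(1, mu, n).astype(float), eps, rng), eps)
    for mu in true_means
])
above = estimates > threshold              # arms judged to exceed the threshold
```

The debiasing step multiplies the noise by roughly (e^ε + 1)/(e^ε − 1), so smaller ε means more pulls are needed for the same confidence. That is the kind of additional cost the lower bounds in the abstract characterize.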

[LG-39] Prediction of Significant Creatinine Elevation in First ICU Stays with Vancomycin Use: A retrospective study through Catboost

Link: https://arxiv.org/abs/2507.23043
Authors: Junyi Fan, Li Sun, Shuheng Chen, Yong Si, Minoo Ahmadi, Greg Placencia, Elham Pishgar, Kamiar Alaei, Maryam Pishgar
Categories: Machine Learning (cs.LG)

Abstract:Background: Vancomycin, a key antibiotic for severe Gram-positive infections in ICUs, poses a high nephrotoxicity risk. Early prediction of kidney injury in critically ill patients is challenging. This study aimed to develop a machine learning model to predict vancomycin-related creatinine elevation using routine ICU data. Methods: We analyzed 10,288 ICU patients (aged 18-80) from the MIMIC-IV database who received vancomycin. Kidney injury was defined by KDIGO criteria (creatinine rise ≥0.3 mg/dL within 48h or ≥50% within 7d). Features were selected via SelectKBest (top 30) and Random Forest ranking (final 15). Six algorithms were tested with 5-fold cross-validation. Interpretability was evaluated using SHAP, Accumulated Local Effects (ALE), and Bayesian posterior sampling. Results: Of 10,288 patients, 2,903 (28.2%) developed creatinine elevation. CatBoost performed best (AUROC 0.818 [95% CI: 0.801-0.834], sensitivity 0.800, specificity 0.681, negative predictive value 0.900). Key predictors were phosphate, total bilirubin, magnesium, Charlson index, and APSIII. SHAP confirmed phosphate as a major risk factor. ALE showed dose-response patterns. Bayesian analysis estimated mean risk 60.5% (95% credible interval: 16.8-89.4%) in high-risk cases. Conclusions: This machine learning model predicts vancomycin-associated creatinine elevation from routine ICU data with strong accuracy and interpretability, enabling early risk detection and supporting timely interventions in critical care.
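The outcome label described in the Methods can be sketched as a small rule. This is a simplified illustration under stated assumptions, not the study's labeling code: the function name and its three inputs (a baseline value and the peak creatinine in each window, all in mg/dL) are hypothetical, and how the study defined the baseline and the observation windows is not stated in the abstract.

```python
def creatinine_elevation(baseline, peak_48h, peak_7d):
    """KDIGO-style creatinine criterion (values in mg/dL).

    True if creatinine rises by at least 0.3 mg/dL within 48 hours,
    or reaches at least 1.5x baseline (a 50% rise) within 7 days.
    """
    absolute_rise = (peak_48h - baseline) >= 0.3
    relative_rise = peak_7d >= 1.5 * baseline
    return absolute_rise or relative_rise

# Hypothetical patients: (baseline, peak within 48h, peak within 7d)
examples = [(1.0, 1.4, 1.4), (1.0, 1.1, 1.2), (0.4, 0.5, 0.7)]
labels = [creatinine_elevation(*e) for e in examples]
```

The third example shows why both arms of the rule matter: a patient with a low baseline can meet the relative criterion without reaching the absolute 0.3 mg/dL rise.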

[LG-40] Linking Actor Behavior to Process Performance Over Time

Link: https://arxiv.org/abs/2507.23037
Authors: Aurélie Leribaux, Rafael Oyamada, Johannes De Smedt, Zahra Dasht Bozorgi, Artem Polyvyanyy, Jochen De Weerdt
Categories: Machine Learning (cs.LG)
Comments: Accepted for presentation at the 5th Workshop on Change, Drift, and Dynamics of Organizational Processes (ProDy), BPM 2025

Abstract:Understanding how actor behavior influences process outcomes is a critical aspect of process mining. Traditional approaches often use aggregate and static process data, overlooking the temporal and causal dynamics that arise from individual actor behavior. This limits the ability to accurately capture the complexity of real-world processes, where individual actor behavior and interactions between actors significantly shape performance. In this work, we address this gap by integrating actor behavior analysis with Granger causality to identify correlating links in time series data. We apply this approach to real-world event logs, constructing time series for actor interactions, i.e., continuation, interruption, and handover, and for process outcomes. Using Group Lasso for lag selection, we identify a small but consistently influential set of lags that capture the majority of causal influence, revealing that actor behavior has direct and measurable impacts on process performance, particularly throughput time. These findings demonstrate the potential of actor-centric, time series-based methods for uncovering the temporal dependencies that drive process outcomes, offering a more nuanced understanding of how individual behaviors impact overall process efficiency.

[LG-41] KLLM: Fast LLM Inference with K-Means Quantization

Link: https://arxiv.org/abs/2507.23035
Authors: Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. However, two key challenges remain in the existing WAQ designs. (1) Traditional WAQ designs rely on uniform integer-based quantization for hardware efficiency, but this often results in significant accuracy degradation at low precision. K-Means-based quantization, a non-uniform quantization technique, achieves higher accuracy by matching the Gaussian-like distributions of weights and activations in LLMs. However, its non-uniform nature prevents direct execution on low-precision compute units, requiring dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers further hinder effective low-precision WAQ. Offline thresholding methods for outlier detection can lead to significant model performance degradation, while existing online detection techniques introduce substantial runtime overhead. To address the aforementioned challenges and fully unleash the potential of WAQ with K-Means quantization for LLM inference, in this paper, we propose KLLM, a hardware-software co-design framework. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a novel outlier detection engine, Orizuru, that efficiently identifies the top-k largest and smallest elements in the activation data stream during online inference. Extensive experiments show that, on average, KLLM achieves speedups of 9.67x, 7.03x and energy efficiency improvements of 229.50x, 150.21x compared to the A100 GPU and Atom, respectively.
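One way to see why index-based computation is possible on K-Means-quantized data is the sketch below. It is an illustration under assumptions, not KLLM's actual scheme: `kmeans_1d`, the 16-entry codebooks, and the pair-counting trick are hypothetical choices made for this example. Once weights and activations are stored as codebook indices, a dot product only needs a count of how often each index pair occurs plus a small table of centroid products, so no per-element dequantization is required.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on scalars; returns (centroids, indices)."""
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)  # spread initial guesses
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:                  # leave empty clusters unchanged
                centroids[c] = members.mean()
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)                 # stand-in for a weight row
a = rng.standard_normal(4096)                 # stand-in for an activation vector
cw, iw = kmeans_1d(w, 16)                     # 4-bit codebook for weights
ca, ia = kmeans_1d(a, 16)                     # 4-bit codebook for activations

# Reference: dequantize every element, then multiply in floating point.
dequantized = float(np.dot(cw[iw], ca[ia]))

# Index-based: count index pairs, then weight a 16x16 table of centroid products.
counts = np.zeros((16, 16))
np.add.at(counts, (iw, ia), 1.0)
index_based = float((counts * np.outer(cw, ca)).sum())
```

Both paths give the same value up to floating-point rounding, but the index-based path performs only 256 multiplications regardless of vector length; the per-element work reduces to integer counting.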

[LG-42] Learning to Prune Branches in Modern Tree-Fruit Orchards

Link: https://arxiv.org/abs/2507.23015
Authors: Abhinav Jain, Cindy Grimm, Stefan Lee
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Dormant tree pruning is labor-intensive but essential to maintaining modern, highly productive fruit orchards. In this work we present a closed-loop visuomotor controller for robotic pruning. The controller guides the cutter through a cluttered tree environment to reach a specified cut point and ensures the cutters are perpendicular to the branch. We train the controller using a novel orchard simulation that captures the geometric distribution of branches in a target apple orchard configuration. Unlike traditional methods requiring full 3D reconstruction, our controller uses just optical flow images from a wrist-mounted camera. We deploy our learned policy in simulation and the real world for an example V-Trellis envy tree with zero-shot transfer, achieving a 30% success rate – approximately half the performance of an oracle planner.

[LG-43] FedCVD: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model Optimization

Link: https://arxiv.org/abs/2507.22963
Authors: Abdelrhman Gaber, Hassan Abd-Eltawab, John Elgallab, Youssif Abuzied, Dineo Mpanya, Turgay Celik, Swarun Kumar, Tamer ElBatt
Categories: Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)

Abstract:Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy. Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.

[LG-44] Multi-Hazard Early Warning Systems for Agriculture with Featural-Temporal Explanations

Link: https://arxiv.org/abs/2507.22962
Authors: Boyuan Zheng,Victor W. Chu
Subjects: Machine Learning (cs.LG)
Comments: Pre-print v0.8 2025-07-30

Abstract:Climate extremes present escalating risks to agriculture, intensifying the need for reliable multi-hazard early warning systems (EWS). The situation is evolving due to climate change, and hence such systems should have the intelligence to continue to learn from recent climate behaviours. However, traditional single-hazard forecasting methods fall short in capturing complex interactions among concurrent climatic events. To address this deficiency, in this paper, we combine sequential deep learning models and advanced Explainable Artificial Intelligence (XAI) techniques to introduce a multi-hazard forecasting framework for agriculture. In our experiments, we utilize meteorological data from four prominent agricultural regions in the United States (between 2010 and 2023) to validate the predictive accuracy of our framework on multiple severe event types, which are extreme cold, floods, frost, hail, heatwaves, and heavy rainfall, with tailored models for each area. The framework uniquely integrates attention mechanisms with TimeSHAP (a recurrent XAI explainer for time series) to provide comprehensive temporal explanations revealing not only which climatic features are influential but precisely when their impacts occur. Our results demonstrate strong predictive accuracy, particularly with the BiLSTM architecture, and highlight the system’s capacity to inform nuanced, proactive risk management strategies. This research significantly advances the explainability and applicability of multi-hazard EWS, fostering interdisciplinary trust and effective decision-making processes for climate risk management in the agricultural industry.

[LG-45] Scientific Machine Learning with Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2507.22959
Authors: Salah A. Faroughi,Farinaz Mostajeran,Amin Hamed Mashhadzadeh,Shirko Faroughi
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Mathematical Physics (math-ph)

Abstract:The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs’ advantages in capturing complex dynamics while learning more effectively. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.

[LG-46] LLM-Assisted Cheating Detection in Korean Language via Keystrokes

Link: https://arxiv.org/abs/2507.22956
Authors: Dong Hyun Roh,Rajesh Kumar,An Ngo
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: This paper has 11 pages, 6 figures, 2 tables, and has been accepted for publication at IEEE-IJCB 2025

Abstract:This paper presents a keystroke-based framework for detecting LLM-assisted cheating in Korean, addressing key gaps in prior research regarding language coverage, cognitive context, and the granularity of LLM involvement. Our proposed dataset includes 69 participants who completed writing tasks under three conditions: Bona fide writing, paraphrasing ChatGPT responses, and transcribing ChatGPT responses. Each task spans six cognitive processes defined in Bloom’s Taxonomy (remember, understand, apply, analyze, evaluate, and create). We extract interpretable temporal and rhythmic features and evaluate multiple classifiers under both Cognition-Aware and Cognition-Unaware settings. Temporal features perform well under Cognition-Aware evaluation scenarios, while rhythmic features generalize better under cross-cognition scenarios. Moreover, detecting bona fide and transcribed responses was easier than paraphrased ones for both the proposed models and human evaluators, with the models significantly outperforming the humans. Our findings affirm that keystroke dynamics facilitate reliable detection of LLM-assisted writing across varying cognitive demands and writing strategies, including paraphrasing and transcribing LLM-generated responses.

[LG-47] LLMs Between the Nodes: Community Discovery Beyond Vectors

Link: https://arxiv.org/abs/2507.22955
Authors: Ekta Gujral,Apurva Sinha
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Community detection in social network graphs plays a vital role in uncovering group dynamics, influence pathways, and the spread of information. Traditional methods focus primarily on graph structural properties, but recent advancements in Large Language Models (LLMs) open up new avenues for integrating semantic and contextual information into this task. In this paper, we present a detailed investigation into how various LLM-based approaches perform in identifying communities within social graphs. We introduce a two-step framework called CommLLM, which leverages the GPT-4o model along with prompt-based reasoning to fuse language model outputs with graph structure. Evaluations are conducted on six real-world social network datasets, measuring performance using key metrics such as Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Variation of Information (VOI), and cluster purity. Our findings reveal that LLMs, particularly when guided by graph-aware strategies, can be successfully applied to community detection tasks in small to medium-sized graphs. We observe that the integration of instruction-tuned models and carefully engineered prompts significantly improves the accuracy and coherence of detected communities. These insights not only highlight the potential of LLMs in graph-based research but also underscore the importance of tailoring model interactions to the specific structure of graph data.
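Of the metrics named in this abstract, cluster purity is the simplest to compute by hand. The following is a minimal sketch of the standard definition (the share of nodes that belong to the majority ground-truth community of their predicted cluster). The function name `cluster_purity` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def cluster_purity(predicted, truth):
    """Return the fraction of items that carry the majority true label of their predicted cluster."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("predicted and truth must be non-empty and of equal length")
    # For each predicted cluster, count how often each true label occurs in it.
    label_counts = {}
    for cluster_id, true_label in zip(predicted, truth):
        label_counts.setdefault(cluster_id, Counter())[true_label] += 1
    # Sum the size of the majority label in every cluster, then normalize.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(predicted)


# Toy example (hypothetical data): six nodes, two predicted clusters.
predicted = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
print(cluster_purity(predicted, truth))  # 5 of the 6 nodes match their cluster's majority label
```

Purity rewards over-segmentation (one cluster per node always scores 1.0), which is one reason it is usually reported alongside NMI and ARI rather than alone.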

[LG-48] Neural Autoregressive Modeling of Brain Aging MICCAI2025

Link: https://arxiv.org/abs/2507.22954
Authors: Ridvan Yesiloglu,Wei Peng,Md Tauhidul Islam,Ehsan Adeli
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: Accepted at Deep Generative Models Workshop @ MICCAI 2025

Abstract:Brain aging synthesis is a critical task with broad applications in clinical and computational neuroscience. The ability to predict the future structural evolution of a subject’s brain from an earlier MRI scan provides valuable insights into aging trajectories. Yet, the high-dimensionality of data, subtle changes of structure across ages, and subject-specific patterns constitute challenges in the synthesis of the aging brain. To overcome these challenges, we propose NeuroAR, a novel brain aging simulation model based on generative autoregressive transformers. NeuroAR synthesizes the aging brain by autoregressively estimating the discrete token maps of a future scan from a convenient space of concatenated token embeddings of a previous and future scan. To guide the generation, it concatenates into each scale the subject’s previous scan, and uses its acquisition age and the target age at each block via cross-attention. We evaluate our approach on both the elderly population and adolescent subjects, demonstrating superior performance over state-of-the-art generative models, including latent diffusion models (LDM) and generative adversarial networks, in terms of image fidelity. Furthermore, we employ a pre-trained age predictor to further validate the consistency and realism of the synthesized images with respect to expected aging patterns. NeuroAR significantly outperforms key models, including LDM, demonstrating its ability to model subject-specific brain aging trajectories with high fidelity.

[LG-49] Formal Bayesian Transfer Learning via the Total Risk Prior

Link: https://arxiv.org/abs/2507.23768
Authors: Nathan Wycoff,Ali Arab,Lisa O. Singh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In analyses with severe data-limitations, augmenting the target dataset with information from ancillary datasets in the application domain, called source datasets, can lead to significantly improved statistical procedures. However, existing methods for this transfer learning struggle to deal with situations where the source datasets are also limited and not guaranteed to be well-aligned with the target dataset. A typical strategy is to use the empirical loss minimizer on the source data as a prior mean for the target parameters, which places the estimation of source parameters outside of the Bayesian formalism. Our key conceptual contribution is to use a risk minimizer conditional on source parameters instead. This allows us to construct a single joint prior distribution for all parameters from the source datasets as well as the target dataset. As a consequence, we benefit from full Bayesian uncertainty quantification and can perform model averaging via Gibbs sampling over indicator variables governing the inclusion of each source dataset. We show how a particular instantiation of our prior leads to a Bayesian Lasso in a transformed coordinate system and discuss computational techniques to scale our approach to moderately sized datasets. We also demonstrate that recently proposed minimax-frequentist transfer learning techniques may be viewed as an approximate Maximum a Posteriori approach to our model. Finally, we demonstrate superior predictive performance relative to the frequentist baseline on a genetics application, especially when the source data are limited.

[LG-50] Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing

Link: https://arxiv.org/abs/2507.23767
Authors: Jonathan R. Landers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 27 pages, 11 figures, 3 tables

Abstract:A novel approach is presented for identifying distinct signatures of performing acts in the secondary ticket resale market by analyzing dynamic pricing distributions. Using a newly curated, time series dataset from the SeatGeek API, we model ticket pricing distributions as scaled Beta distributions. This enables accurate parameter estimation from incomplete statistical data using a hybrid of quantile matching and the method of moments. Incorporating the estimated $\alpha$ and $\beta$ parameters into Random Forest classifiers significantly improves pairwise artist classification accuracy, demonstrating the unique economic signatures in event pricing data. Additionally, we provide theoretical and empirical evidence that incorporating zero-variance (constant-value) features into Random Forest models acts as an implicit regularizer, enhancing feature variety and robustness. This regularization promotes deeper, more varied trees in the ensemble, improving the bias-variance tradeoff and mitigating overfitting to dominant features. These findings are validated on both the new ticket pricing dataset and the standard UCI ML handwritten digits dataset.
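To make the estimation step concrete, here is a minimal sketch of the textbook method-of-moments fit for a Beta distribution rescaled to a price interval. It covers only the moments half of the hybrid the abstract mentions; the quantile-matching part is not reproduced. The function name `scaled_beta_moments`, the arguments `low` and `high` (assumed known lowest and highest prices), and the example numbers are illustrative assumptions, not the paper's procedure.

```python
def scaled_beta_moments(mean, variance, low, high):
    """Estimate (alpha, beta) of a Beta distribution scaled to [low, high] from its mean and variance."""
    width = high - low
    if width <= 0:
        raise ValueError("high must exceed low")
    # Map the summary statistics back to the unit interval.
    mu = (mean - low) / width
    var = variance / width ** 2
    if not 0.0 < mu < 1.0:
        raise ValueError("mean must lie strictly between low and high")
    limit = mu * (1.0 - mu)  # a Beta variance is always below mu * (1 - mu)
    if not 0.0 < var < limit:
        raise ValueError("variance is not attainable by a Beta distribution")
    common = limit / var - 1.0  # equals alpha + beta
    return mu * common, (1.0 - mu) * common


# Hypothetical listing summary: prices between 20 and 120 with a given mean and variance.
alpha, beta = scaled_beta_moments(mean=20 + 100 * 2 / 7, variance=100 ** 2 * 10 / 392, low=20, high=120)
print(alpha, beta)
```

The example inputs were generated from a Beta(2, 5) shape, so the recovered parameters should be close to 2 and 5; with real listings the bounds would themselves have to be estimated, which is where a quantile-based step becomes useful.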

[LG-51] DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable Uncertainty-Aware Redaction

Link: https://arxiv.org/abs/2507.23736
Authors: Kyle Naddeo,Nikolas Koutsoubis,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Tony OSullivan,Raj Krish
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures

Abstract:Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and Communications in Medicine (DICOM) files presents a significant barrier to the ethical and secure sharing of imaging datasets. This paper presents a hybrid de-identification framework developed by Impact Business Information Solutions (IBIS) that combines rule-based and AI-driven techniques, and rigorous uncertainty quantification for comprehensive PHI/PII removal from both metadata and pixel data. Our approach begins with a two-tiered rule-based system targeting explicit and inferred metadata elements, further augmented by a large language model (LLM) fine-tuned for Named Entity Recognition (NER), and trained on a suite of synthetic datasets simulating realistic clinical PHI/PII. For pixel data, we employ an uncertainty-aware Faster R-CNN model to localize embedded text, extract candidate PHI via Optical Character Recognition (OCR), and apply the NER pipeline for final redaction. Crucially, uncertainty quantification provides confidence measures for AI-based detections to enhance automation reliability and enable informed human-in-the-loop verification to manage residual risks. This uncertainty-aware deidentification framework achieves robust performance across benchmark datasets and regulatory standards, including DICOM, HIPAA, and TCIA compliance metrics. By combining scalable automation, uncertainty quantification, and rigorous quality assurance, our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research. 

[LG-52] Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules

Link: https://arxiv.org/abs/2507.23349
Authors: Wenhai Cui,Xiaoting Ji,Wen Su,Xiaodong Yan,Xingqiu Zhao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Individualized treatment rules (ITRs) have gained significant attention due to their wide-ranging applications in fields such as precision medicine, ridesharing, and advertising recommendations. However, when ITRs are influenced by sensitive attributes such as race, gender, or age, they can lead to outcomes where certain groups are unfairly advantaged or disadvantaged. To address this gap, we propose a flexible approach based on optimal transport theory, which is capable of transforming any optimal ITR into a fair ITR that ensures demographic parity. Recognizing the potential loss of value under fairness constraints, we introduce an "improved trade-off ITR," designed to balance value optimization and fairness while accommodating varying levels of fairness through parameter adjustment. To maximize the value of the improved trade-off ITR under specific fairness levels, we propose a smoothed fairness constraint for estimating the adjustable parameter. Additionally, we establish a theoretical upper bound on the value loss for the improved trade-off ITR. We demonstrate the performance of the proposed method through extensive simulation studies and application to the Next 36 entrepreneurial program dataset.
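Demographic parity, the fairness target named in this abstract, requires the treatment rate to be the same in every sensitive group. The sketch below only measures how far a set of binary treatment decisions is from that target; it does not implement the paper's optimal-transport construction. The function name `demographic_parity_gap` and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def demographic_parity_gap(decisions, groups):
    """Return (gap, rates): per-group treatment rates and the largest difference between them."""
    if len(decisions) != len(groups) or not decisions:
        raise ValueError("decisions and groups must be non-empty and of equal length")
    totals = defaultdict(int)
    treated = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        treated[group] += int(bool(decision))
    rates = {group: treated[group] / totals[group] for group in totals}
    # A gap of 0.0 means exact demographic parity across the observed groups.
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical decisions (1 = treat) for two groups of four individuals each.
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_gap(decisions, groups)
print(gap, rates)  # group "a" is treated at rate 0.75 and group "b" at 0.25
```

A trade-off rule of the kind the abstract describes would accept a small nonzero gap in exchange for higher value, so a diagnostic like this is a natural way to report where on that trade-off a fitted rule sits.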

[LG-53] Simulation-based inference for Precision Neutrino Physics through Neural Monte Carlo tuning

Link: https://arxiv.org/abs/2507.23297
Authors: A. Gavrikov,A. Serafini,D. Dolzhikov,A. Garfagnini,M. Gonchar,M. Grassi,R. Brugnera,V. Cerrone,L. V. D’Auria,R. M. Guizzetti,L. Lastrucci,G. Andronico,V. Antonelli,A. Barresi,D. Basilico,M. Beretta,A. Bergnoli,M. Borghesi,A. Brigatti,R. Bruno,A. Budano,B. Caccianiga,A. Cammi,R. Caruso,D. Chiesa,C. Clementi,C. Coletta,S. Dusini,A. Fabbri,G. Felici,G. Ferrante,M.G. Giammarchi,N. Giudice,N. Guardone,F. Houria,C. Landini,I. Lippi,L. Loi,P. Lombardi,F. Mantovani,S.M. Mari,A. Martini,L. Miramonti,M. Montuschi,M. Nastasi,D. Orestano,F. Ortica,A. Paoloni,L. Pelicci,E. Percalli,F. Petrucci,E. Previtali,G. Ranucci,A.C. Re,B. Ricci,A. Romani,C. Sirignano,M. Sisti,L. Stanco,E. Stanescu Farilla,V. Strati,M.D.C Torri,C. Tuvè,C. Venettacci,G. Verde,L. Votano
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Instrumentation and Detectors (physics.ins-det)

Abstract:Precise modeling of detector energy response is crucial for next-generation neutrino experiments which present computational challenges due to lack of analytical likelihoods. We propose a solution using neural likelihood estimation within the simulation-based inference framework. We develop two complementary neural density estimators that model likelihoods of calibration data: conditional normalizing flows and a transformer-based regressor. We adopt JUNO - a large neutrino experiment - as a case study. The energy response of JUNO depends on several parameters, all of which should be tuned, given their non-linear behavior and strong correlations in the calibration data. To this end, we integrate the modeled likelihoods with Bayesian nested sampling for parameter inference, achieving uncertainties limited only by statistics with near-zero systematic biases. The normalizing flows model enables unbinned likelihood analysis, while the transformer provides an efficient binned alternative. By providing both options, our framework offers flexibility to choose the most appropriate method for specific needs. Finally, our approach establishes a template for similar applications across experimental neutrino and broader particle physics.

[LG-54] Extended Factorization Machine Annealing for Rapid Discovery of Transparent Conducting Materials

Link: https://arxiv.org/abs/2507.23160
Authors: Daisuke Makino,Tatsuya Goto,Yoshinori Suga
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:The development of novel transparent conducting materials (TCMs) is essential for enhancing the performance and reducing the cost of next-generation devices such as solar cells and displays. In this research, we focus on the $(\mathrm{Al}_x\mathrm{Ga}_y\mathrm{In}_z)_2\mathrm{O}_3$ system and extend the FMA framework, which combines a Factorization Machine (FM) and annealing, to search for optimal compositions and crystal structures with high accuracy and low cost. The proposed method introduces (i) the binarization of continuous variables, (ii) the utilization of good solutions using a Hopfield network, (iii) the activation of global search through adaptive random flips, and (iv) fine-tuning via a bit-string local search. Validation using the $(\mathrm{Al}_x\mathrm{Ga}_y\mathrm{In}_z)_2\mathrm{O}_3$ data from the Kaggle “Nomad2018 Predicting Transparent Conductors” competition demonstrated that our method achieves faster and more accurate searches than Bayesian optimization and genetic algorithms. Furthermore, its application to multi-objective optimization showed its capability in designing materials by simultaneously considering both the band gap and formation energy. These results suggest that applying our method to larger, more complex search problems and diverse material designs that reflect realistic experimental conditions is expected to contribute to the further advancement of materials informatics.

[LG-55] On the Complexity of Finding Stationary Points in Nonconvex Simple Bilevel Optimization

Link: https://arxiv.org/abs/2507.23155
Authors: Jincheng Cao,Ruichen Jiang,Erfan Yazdandoost Hamedani,Aryan Mokhtari
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:In this paper, we study the problem of solving a simple bilevel optimization problem, where the upper-level objective is minimized over the solution set of the lower-level problem. We focus on the general setting in which both the upper- and lower-level objectives are smooth but potentially nonconvex. Due to the absence of additional structural assumptions for the lower-level objective (such as convexity or the Polyak-Łojasiewicz (PL) condition), guaranteeing global optimality is generally intractable. Instead, we introduce a suitable notion of stationarity for this class of problems and aim to design a first-order algorithm that finds such stationary points in polynomial time. Intuitively, stationarity in this setting means the upper-level objective cannot be substantially improved locally without causing a larger deterioration in the lower-level objective. To this end, we show that a simple and implementable variant of the dynamic barrier gradient descent (DBGD) framework can effectively solve the considered nonconvex simple bilevel problems up to stationarity. Specifically, to reach an $(\epsilon_f, \epsilon_g)$-stationary point, where $\epsilon_f$ and $\epsilon_g$ denote the target stationarity accuracies for the upper- and lower-level objectives, respectively, the considered method achieves a complexity of $\mathcal{O}\left(\max\left(\epsilon_f^{-\frac{3+p}{1+p}}, \epsilon_g^{-\frac{3+p}{2}}\right)\right)$, where $p \geq 0$ is an arbitrary constant balancing the terms. To the best of our knowledge, this is the first complexity result for a discrete-time algorithm that guarantees joint stationarity for both levels in general nonconvex simple bilevel problems.
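To see how the constant $p$ trades the two accuracies against each other, it can help to substitute a couple of values. The worked cases below assume the bound reads with exponents $\frac{3+p}{1+p}$ on $\epsilon_f$ and $\frac{3+p}{2}$ on $\epsilon_g$, which is a reconstruction from the flattened notation in this listing rather than a quotation from the paper.

```latex
\[
  \mathcal{O}\!\left(\max\left(\epsilon_f^{-\frac{3+p}{1+p}},\ \epsilon_g^{-\frac{3+p}{2}}\right)\right),
  \qquad p \geq 0 .
\]
\[
  p = 0:\quad \mathcal{O}\!\left(\max\left(\epsilon_f^{-3},\ \epsilon_g^{-3/2}\right)\right),
  \qquad
  p = 1:\quad \mathcal{O}\!\left(\max\left(\epsilon_f^{-2},\ \epsilon_g^{-2}\right)\right).
\]
```

Under that reading, increasing $p$ lowers the exponent on $\epsilon_f$ (it tends to 1 as $p \to \infty$) while raising the exponent on $\epsilon_g$, so a larger $p$ favors upper-level accuracy at the expense of lower-level accuracy.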

[LG-56] A Smoothing Newton Method for Rank-one Matrix Recovery

Link: https://arxiv.org/abs/2507.23017
Authors: Tyler Maunu,Gabriel Abreu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 12 pages, 4 figures

Abstract:We consider the phase retrieval problem, which involves recovering a rank-one positive semidefinite matrix from rank-one measurements. A recently proposed algorithm based on Bures-Wasserstein gradient descent (BWGD) exhibits superlinear convergence, but it is unstable, and existing theory can only prove local linear convergence for higher rank matrix recovery. We resolve this gap by revealing that BWGD implements Newton’s method with a nonsmooth and nonconvex objective. We develop a smoothing framework that regularizes the objective, enabling a stable method with rigorous superlinear convergence guarantees. Experiments on synthetic data demonstrate this superior stability while maintaining fast convergence.

[LG-57] Typing Tensor Calculus in 2-Categories (I)

Link: https://arxiv.org/abs/1908.01212
Authors: Fatimah Rita Ahmadi
Subjects: Category Theory (math.CT); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 28 pages; extended introduction, more explanation

Abstract:To formalize calculations in linear algebra for the development of efficient algorithms and a framework suitable for functional programming languages and faster parallelized computations, we adopt an approach that treats elements of linear algebra, such as matrices, as morphisms in the category of matrices, $\mathbf{Mat}_k$. This framework is further extended by generalizing the results to arbitrary monoidal semiadditive categories. To enrich this perspective and accommodate higher-rank matrices (tensors), we define semiadditive 2-categories, where matrices $T_{ij}$ are represented as 1-morphisms, and tensors with four indices $T_{ijkl}$ as 2-morphisms. This formalization provides an index-free, typed linear algebra framework that includes matrices and tensors with up to four indices. Furthermore, we extend the framework to monoidal semiadditive 2-categories and demonstrate detailed operations and vectorization within the 2-category of 2Vec introduced by Kapranov and Voevodsky.
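The idea of treating a matrix as a typed morphism can be illustrated in a few lines. The sketch below assumes the common convention that objects are natural numbers and a morphism from `n` to `m` is an `m` by `n` matrix, with composition given by matrix multiplication. The class `Morphism` and the helper `compose` are hypothetical names for illustration only; they cover only the 1-categorical part and none of the paper's 2-categorical structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morphism:
    """A matrix viewed as an arrow source -> target, stored as `target` rows of `source` entries."""
    source: int
    target: int
    rows: tuple

    def __post_init__(self):
        # Reject matrices whose shape disagrees with the declared type.
        if len(self.rows) != self.target or any(len(r) != self.source for r in self.rows):
            raise TypeError("matrix shape does not match the declared source and target")


def compose(g, f):
    """Return g after f; defined only when the target of f equals the source of g."""
    if f.target != g.source:
        raise TypeError(f"cannot compose: {f.source}->{f.target} followed by {g.source}->{g.target}")
    rows = tuple(
        tuple(sum(g.rows[i][k] * f.rows[k][j] for k in range(f.target)) for j in range(f.source))
        for i in range(g.target)
    )
    return Morphism(f.source, g.target, rows)


f = Morphism(2, 3, ((1, 0), (0, 1), (1, 1)))  # arrow 2 -> 3
g = Morphism(3, 1, ((1, 2, 3),))              # arrow 3 -> 1
print(compose(g, f))                          # an arrow 2 -> 1
```

Here the type check replaces the index bookkeeping: `compose(f, g)` is rejected before any arithmetic happens because the target of `g` does not match the source of `f`.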

Information Retrieval

[IR-0] Towards LLM-Enhanced Product Line Scoping

Link: https://arxiv.org/abs/2507.23410
Authors: Alexander Felfernig,Damian Garber,Viet-Man Le,Sebastian Lubos,Thi Ngoc Trang Tran
Subjects: Information Retrieval (cs.IR)

Abstract:The idea of product line scoping is to identify the set of features and configurations that a product line should include, i.e., offer for configuration purposes. In this context, a major scoping task is to find a balance between commercial relevance and technical feasibility. Traditional product line scoping approaches rely on formal feature models and require a manual analysis which can be quite time-consuming. In this paper, we sketch how Large Language Models (LLMs) can be applied to support product line scoping tasks with a natural language interaction based scoping process. Using a working example from the smarthome domain, we sketch how LLMs can be applied to evaluate different feature model alternatives. We discuss open research challenges regarding the integration of LLMs with product line scoping.

[IR-1] Your Spending Needs Attention: Modeling Financial Habits with Transformers

Link: https://arxiv.org/abs/2507.23267
Authors: D. T. Braithwaite,Misael Cavalcanti,R. Austin McEver,Hiroto Udagawa,Daniel Silva,Rohan Ramanath,Felipe Meneses,Arissa Yoshida,Evan Wingert,Matheus Ramos,Brian Zanfelice,Aman Gupta
Subjects: Information Retrieval (cs.IR)

Abstract:Predictive models play a crucial role in the financial industry, enabling risk prediction, fraud detection, and personalized recommendations, where slight changes in core model performance can result in billions of dollars in revenue or losses. While financial institutions have access to enormous amounts of user data (e.g., bank transactions, in-app events, and customer support logs), leveraging this data effectively remains challenging due to its complexity and scale. Thus, in many financial institutions, most production models follow traditional machine learning (ML) approaches by converting unstructured data into manually engineered tabular features. Conversely, other domains (e.g., natural language processing) have effectively utilized self-supervised learning (SSL) to learn rich representations from raw data, removing the need for manual feature extraction. In this paper, we investigate using transformer-based representation learning models for transaction data, hypothesizing that these models, trained on massive data, can provide a novel and powerful approach to understanding customer behavior. We propose a new method enabling the use of SSL with transaction data by adapting transformer-based models to handle both textual and structured attributes. Our approach, denoted nuFormer, includes an end-to-end fine-tuning method that integrates user embeddings with existing tabular features. Our experiments demonstrate improvements for large-scale recommendation problems at Nubank. Notably, these gains are achieved solely through enhanced representation learning rather than incorporating new data sources.
