This post contains the latest paper list fetched from Arxiv.org on 2025-08-07. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically at around 12:00 every morning.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-08-07)

A total of 588 papers were updated today, including:

  • Natural Language Processing: 90 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 176 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 175 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 151 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

[Quick Read]: This paper targets the performance degradation that current computer use agents (CUAs) suffer on novel and specialized software, especially when no human-annotated data is available. Existing methods rely on manually labeled data to drive training and therefore struggle to adapt to unseen software environments. The key to the proposed SEAgent framework is: 1) experiential learning, which lets the agent autonomously explore unfamiliar software, learn through trial and error, and progressively master new task skills; 2) a World State Model for step-wise trajectory assessment, plus a Curriculum Generator that automatically constructs task sequences from simple to complex; 3) a policy update that combines adversarial imitation of failure actions with Group Relative Policy Optimization (GRPO) on successful ones; and 4) a specialist-to-generalist training strategy that fuses the experiential knowledge of multiple specialist agents into a generalist CUA capable of continuous autonomous evolution. Validated on five novel software environments in OS-World, the approach lifts the success rate from 11.3% to 34.5%, significantly outperforming the baseline UI-TARS.

Link: https://arxiv.org/abs/2508.04700
Authors: Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: Code at this https URL

Abstract:Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent’s policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
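
The abstract names Group Relative Policy Optimization (GRPO) as the update rule applied to successful trajectories. As rough orientation only, the sketch below shows the group-relative advantage computation that defines GRPO; the grouping of rollouts and the reward values are hypothetical, and SEAgent's actual training loop is more involved.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout's reward is normalized
    against the mean/std of its own group of rollouts for the same task."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical example: 2 tasks, 4 rollouts each (1.0 = verified success).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)  # positive for above-average rollouts
print(adv)
```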

[NLP-1] Hop Skip and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

[Quick Read]: This paper investigates why contemporary reasoning models fail on multi-hop question answering, and in particular why they hallucinate more than general-purpose language models, a mechanism that remains poorly understood. The key contribution is a novel, fine-grained error categorization framework that analyzes reasoning failures along three critical dimensions: the diversity and uniqueness of the source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation complemented by automated metrics, the framework uncovers intricate error patterns that accuracy-centric evaluations tend to hide, offering actionable guidance for improving reasoning fidelity, transparency, and robustness.

Link: https://arxiv.org/abs/2508.04699
Authors: Anushka Yadav, Isha Nalawade, Srujana Pillarichety, Yashwanth Babu, Reshmi Ghosh, Samyadeep Basu, Wenlong Zhao, Ali Nasaeh, Sriram Balasubramanian, Soundararajan Srinivasan
Affiliations: Microsoft; University of Massachusetts, Amherst; University of Maryland, College Park
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved (“hops”), completeness in capturing relevant information (“coverage”), and cognitive inefficiency (“overthinking”). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

[NLP-2] FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data

[Quick Read]: This paper addresses the problem that LLM-powered assistants are typically deployed in a one-size-fits-all manner and cannot adapt to individual user preferences, focusing on the practical setting where only a small set of preference annotations can be collected per user, which the authors define as Personalized Preference Alignment with Limited Data (PPALLI). The key to the solution is FaST, a highly parameter-efficient method that personalizes the model via high-level features automatically discovered from the data, achieving the best overall performance under limited-data conditions on the two datasets introduced in the paper, DnD and ELIP.

Link: https://arxiv.org/abs/2508.04698
Authors: Thibaut Thonet, Germán Kruszewski, Jos Rozen, Pierre Erbacher, Marc Dymetman
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization – tailoring models to align with specific user preferences – has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user – a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets – DnD and ELIP – and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance.

[NLP-3] Query Attribute Modeling: Improving search relevance with Semantic Search and Meta Data Filtering

[Quick Read]: This paper tackles the limited precision and relevance of traditional search engines on open-text queries, which struggle to extract structured information from free-form input and therefore return noisy, poorly focused results. The key to the solution is the Query Attribute Modeling (QAM) framework, which automatically decomposes natural-language queries into structured metadata tags and semantic elements to enable more precise filtering and matching. On the Amazon Toys Reviews dataset, QAM achieves an mAP@5 of 52.99%, outperforming BM25 keyword search, semantic-similarity search, cross-encoder re-ranking, and hybrid search that fuses BM25 and semantic results via Reciprocal Rank Fusion (RRF), demonstrating its practicality and strength for enterprise search, particularly in e-commerce.

Link: https://arxiv.org/abs/2508.04683
Authors: Karthik Menon, Batool Arhamna Haider, Muhammad Arham, Kanwal Mehreen, Ram Mohan Rao Kadiyala, Hamza Farooq
Affiliations: traversaal.ai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This study introduces Query Attribute Modeling (QAM), a hybrid framework that enhances search precision and relevance by decomposing open text queries into structured metadata tags and semantic elements. QAM addresses traditional search limitations by automatically extracting metadata filters from free-form text queries, reducing noise and enabling focused retrieval of relevant items. Experimental evaluation using the Amazon Toys Reviews dataset (10,000 unique items with 40,000+ reviews and detailed product attributes) demonstrated QAM’s superior performance, achieving a mean average precision at 5 (mAP@5) of 52.99%. This represents significant improvement over conventional methods, including BM25 keyword search, encoder-based semantic similarity search, cross-encoder re-ranking, and hybrid search combining BM25 and semantic results via Reciprocal Rank Fusion (RRF). The results establish QAM as a robust solution for Enterprise Search applications, particularly in e-commerce systems.
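
To make the idea concrete, here is a minimal sketch of a QAM-style flow as the abstract describes it: extract metadata filters from a free-form query, then rank only the filtered items by semantic similarity. The attribute extractor, item fields, and encoder choice are illustrative assumptions, not the paper's implementation.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def extract_attributes(query: str) -> dict:
    # Stand-in for the paper's attribute extraction; a rule- or LLM-based
    # tagger would populate structured filters from the free-form query.
    filters = {}
    if "under $20" in query:
        filters["max_price"] = 20.0
    return filters

def qam_search(query: str, items: list[dict], top_k: int = 5) -> list[dict]:
    filters = extract_attributes(query)
    pool = [it for it in items
            if it["price"] <= filters.get("max_price", float("inf"))]
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode([it["title"] for it in pool], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    ranked = sorted(zip(pool, scores.tolist()), key=lambda p: -p[1])
    return [it for it, _ in ranked[:top_k]]

items = [{"title": "Wooden puzzle for toddlers", "price": 14.99},
         {"title": "RC racing car", "price": 49.99}]
print(qam_search("wooden toys under $20", items))
```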

[NLP-4] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

[Quick Read]: This paper addresses catastrophic forgetting when large language models (LLMs) are continually fine-tuned across domains, which manifests as: 1) significant degradation of general capabilities, and 2) sharp performance drops on previously learned tasks. The key to the solution is the General Sample Replay (GeRe) framework, which uses a fixed set of ordinary pretraining texts as general replay samples for efficient anti-forgetting, together with an enhanced activation-state constrained optimization method based on a threshold-based margin (TM) loss that keeps neural activation states consistent during replay learning. This jointly preserves general capabilities and improves overall performance across sequential tasks without added complexity; experiments comparing TM with a range of replay strategies under GeRe show that it is consistently better and more robust.

Link: https://arxiv.org/abs/2508.04676
Authors: Yunan Zhang, Shuoran Jiang, Mengchen Zhao, Yuefeng Li, Yang Fan, Xiangping Wu, Qingcai Chen
Affiliations: Harbin Institute of Technology, Shenzhen; Ysstech Info-Tech Co., Ltd
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continual fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses usual pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation states constrained optimization method using threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns, retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at this https URL.
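
A guess at what a threshold-based margin loss on activation states could look like, purely for orientation: activations are pulled back toward reference states only when they drift beyond a margin tau. The layer choice, shapes, and margin value are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def tm_loss(h: torch.Tensor, h_ref: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Threshold-based margin loss (sketch): penalize activation drift
    from a frozen reference model only beyond a margin tau."""
    drift = (h - h_ref).abs()
    return F.relu(drift - tau).mean()

# Hypothetical shapes: (batch, seq_len, hidden)
h = torch.randn(2, 8, 16, requires_grad=True)
h_ref = torch.randn(2, 8, 16)
loss = tm_loss(h, h_ref)
loss.backward()
```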

[NLP-5] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

[Quick Read]: This paper addresses the performance degradation LLMs suffer on long contexts due to proactive interference, where irrelevant information early in the context disrupts reasoning and memory recall. The key to the solution is Active Context Management (ACM), which equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search, letting the model actively regulate its attention and working memory much as humans selectively focus on relevant information while filtering out distractions. Experiments on information-sparse benchmarks show that the approach significantly improves performance without any additional training, underscoring that explicit context-control strategies are key to robustness on long-context tasks.

Link: https://arxiv.org/abs/2508.04664
Authors: Mo Li, L.H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Affiliations: Tsinghua University; Independent Researcher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Work in progress

Abstract:Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs’ capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks, PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs’ inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.
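
The three tool categories map naturally onto a function-calling interface. The sketch below shows what such tool declarations might look like in the common JSON-schema style used by LLM tool-calling APIs; all names and parameters are invented for illustration and are not Sculptor's actual interface.

```python
# Hypothetical ACM tool declarations in the usual tool-calling format.
ACM_TOOLS = [
    {
        "name": "fragment_context",
        "description": "Split the working context into labeled fragments.",
        "parameters": {"type": "object",
                       "properties": {"boundaries": {"type": "array",
                                                     "items": {"type": "integer"}}},
                       "required": ["boundaries"]},
    },
    {
        "name": "hide_fragment",
        "description": "Replace a fragment with its summary; restorable later.",
        "parameters": {"type": "object",
                       "properties": {"fragment_id": {"type": "string"}},
                       "required": ["fragment_id"]},
    },
    {
        "name": "search_context",
        "description": "Retrieve hidden fragments relevant to a query.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
]
```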

[NLP-6] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

[Quick Read]: This paper asks how Group Relative Policy Optimization (GRPO) can be applied to modular AI systems composed of multiple LM calls with distinct prompt templates and other tools, where trajectories have variable length and may be interrupted. Standard GRPO targets post-training of a single LM and does not directly fit such multi-module programs. The key to the solution is mmGRPO (multi-module GRPO), which groups LM calls by module across rollouts and handles variable-length and interrupted trajectories, enabling unified policy optimization for multi-module systems. Composed with automatic prompt optimization, mmGRPO improves accuracy by 11% on average over the post-trained LM across classification, many-hop search, and privacy-preserving delegation tasks, and by 5% over prompt optimization alone.

Link: https://arxiv.org/abs/2508.04660
Authors: Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D’Oosterlinck, Christopher Potts, Omar Khattab
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the this http URL optimizer.
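
A minimal sketch of the grouping step the abstract describes: LM calls from several rollouts are bucketed by the module that issued them, and each bucket gets its own group-relative advantages. The trajectory record format is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rollouts: each is a list of (module, reward) LM-call records,
# with the rollout-level reward broadcast to its calls.
rollouts = [
    [("retrieve", 1.0), ("answer", 1.0)],
    [("retrieve", 0.0), ("retrieve", 0.0), ("answer", 0.0)],  # variable length
    [("retrieve", 1.0)],                                      # interrupted
]

def group_by_module(rollouts):
    groups = defaultdict(list)
    for traj in rollouts:
        for module, reward in traj:
            groups[module].append(reward)
    return groups

def module_advantages(groups, eps=1e-6):
    adv = {}
    for module, rs in groups.items():
        mu = mean(rs)
        sd = stdev(rs) if len(rs) > 1 else 0.0
        adv[module] = [(r - mu) / (sd + eps) for r in rs]
    return adv

print(module_advantages(group_by_module(rollouts)))
```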

[NLP-7] Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech

[Quick Read]: This paper examines the growing disconnect between NLP research on counterspeech (responding to online hate speech) and the needs of the communities it affects. The key contribution is a systematic review of 74 NLP studies analyzing how stakeholder participation shapes dataset creation, model development, and evaluation, complemented by a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV) that identifies stakeholder-informed practices for counterspeech generation. The paper concludes that the expertise of affected communities must be re-centred in counterspeech research to keep technical solutions effective and socially relevant.

Link: https://arxiv.org/abs/2508.04638
Authors: Tanvi Dinkar, Aiqi Jiang, Simona Frenda, Poppy Gerrard-Abbott, Nancie Gunson, Gavin Abercrombie, Ioannis Konstas
Affiliations: The Interaction Lab, Heriot-Watt University; University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Counterspeech, i.e. the practice of responding to online hate speech, has gained traction in NLP as a promising intervention. While early work emphasised collaboration with non-governmental organisation stakeholders, recent research trends have shifted toward automated pipelines that reuse a small set of legacy datasets, often without input from affected communities. This paper presents a systematic review of 74 NLP studies on counterspeech, analysing the extent to which stakeholder participation influences dataset creation, model development, and evaluation. To complement this analysis, we conducted a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV), identifying stakeholder-informed practices for counterspeech generation. Our findings reveal a growing disconnect between current NLP research and the needs of communities most impacted by toxic online content. We conclude with concrete recommendations for re-centring stakeholder expertise in counterspeech research.

[NLP-8] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

[Quick Read]: This paper targets two core problems of Reinforcement Learning with Verifiable Rewards (RLVR): training inefficiency caused by inadequate difficulty assessment, and over-optimization, where LLMs exploit verification shortcuts to maximize reward without actually aligning to user intent. The key to the solution is the Instruction Following Decorator (IFDecorator) framework, whose main components are: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that plants trap instructions to trigger and capture reward-hacking behaviors. The resulting Qwen2.5-32B-Instruct-IFDecorator reaches 87.43% accuracy on IFEval, surpassing larger proprietary models, while significantly reducing reward-hacking rates and preserving general capabilities.

Link: https://arxiv.org/abs/2508.04632
Authors: Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 4 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.
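
As a toy illustration of the trip-wire idea, the sketch below plants a trap instruction whose hidden constraint is checked independently of the reward verifier; a response that earns reward while violating the trap constraint is flagged as shortcut exploitation. The trap format and all function names are invented, not the paper's.

```python
# Toy trip-wire check (hypothetical): a trap instruction carries a hidden
# constraint; passing the verifier while breaking it signals reward hacking.
def trap_constraint(response: str) -> bool:
    # Trap planted in the instruction: "the answer must be in uppercase".
    return response.isupper()

def verifier(response: str) -> bool:
    # Stand-in reward verifier: checks only that an answer is present.
    return "ANSWER" in response.upper()

def trip_wire_triggered(response: str) -> bool:
    return verifier(response) and not trap_constraint(response)

print(trip_wire_triggered("answer: 42"))   # True -> reward hacking caught
print(trip_wire_triggered("ANSWER: 42"))   # False -> compliant
```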

[NLP-9] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

[Quick Read]: This paper addresses the difficulty LLMs have in staying safe, helpful, and honest when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone. Existing remedies either incur prohibitive test-time search costs or rely on end-to-end model rewriting driven by custom training corpora with unclear objectives. The key to the solution is P-Aligner, a lightweight module that pre-aligns instructions before the model starts decoding: it rewrites an instruction into a more human-preferred form while preserving the original intent. P-Aligner is trained on UltraPrompt, a dataset synthesized by a principle-guided pipeline using Monte-Carlo Tree Search to systematically explore the space of candidate instructions closely tied to human preference. Experiments show P-Aligner generally outperforms strong baselines across models and benchmarks, including average win-rate gains of 28.35% on GPT-4-turbo and 8.69% on Gemma-2-SimPO, with good scalability and low time overhead.

Link: https://arxiv.org/abs/2508.04626
Authors: Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact way is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

[NLP-10] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

[Quick Read]: This paper addresses text-to-SQL translation in low-resource settings, i.e., letting non-expert users query relational databases in natural language under tight compute and data budgets. The key to the solution is a reusable, model-agnostic pipeline that tailors database-schema formatting to each transformer architecture (T5-Small, BART-Small, GPT-2), fine-tunes the models on the Spider dataset for 1,000 to 5,000 iterations, and evaluates them primarily with Logical Form Accuracy (LFAcc). The encoder-decoder T5-Small performs best (LFAcc of 27.8%, vs. 23.98% for BART-Small and 20.1% for GPT-2), confirming the advantage of encoder-decoder architectures for schema-aware SQL generation.

Link: https://arxiv.org/abs/2508.04623
Authors: Chirag Seth, Utkarsh Singh
Affiliations: University of Waterloo
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model’s architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models’ superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline’s modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments.

[NLP-11] TURA: Tool-Augmented Unified Retrieval Agent for AI Search

[Quick Read]: This paper addresses the limits of RAG-based AI search over static web indexes, which cannot serve interactive, time-sensitive queries that require dynamically generated content such as ticket availability or inventory, a gap that matters for industrial deployments with complex intents. The key to the solution is TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a three-stage framework that combines RAG with agentic tool use: 1) an Intent-Aware Retrieval module that decomposes queries and retrieves information sources encapsulated as Model Context Protocol (MCP) servers; 2) a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph for optimal parallel execution; and 3) a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge static RAG and dynamic information sources for a production AI search product, serving tens of millions of users with low-latency, reliable real-time answers.

Link: https://arxiv.org/abs/2508.04604
Authors: Zhejun Zhao, Yuehu Dong, Alley Liu, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin
Affiliations: Baidu Inc.; University of Science and Technology of China; Wilfrid Laurier University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
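
A minimal sketch of DAG-based task planning in the spirit the abstract describes: tasks whose dependencies are satisfied run concurrently, level by level. The task set, dependency map, and tool stub are hypothetical.

```python
import asyncio

# Hypothetical task graph: each task maps to its prerequisites.
deps = {"search_flights": [], "search_hotels": [],
        "compare_prices": ["search_flights", "search_hotels"],
        "compose_answer": ["compare_prices"]}

async def run_task(name: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an MCP tool call
    return f"{name}: done"

async def execute_dag(deps: dict[str, list[str]]) -> None:
    done: set[str] = set()
    while len(done) < len(deps):
        # All tasks whose prerequisites are complete can run in parallel.
        ready = [t for t, pre in deps.items()
                 if t not in done and all(p in done for p in pre)]
        results = await asyncio.gather(*(run_task(t) for t in ready))
        print(results)
        done.update(ready)

asyncio.run(execute_dag(deps))
```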

[NLP-12] Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference

[Quick Read]: This paper diagnoses the structural crisis triggered by the rapid expansion of AI conferences, manifested as research-output overload, a growing environmental burden, rising psychological strain on attendees, and insufficient venue capacity, which together threaten the core goals of scientific dissemination, equity, and community well-being. The key to the proposed solution is the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking and distributes organizational duties to local institutions under global coordination, building a more sustainable, inclusive, and resilient ecosystem for AI research.

Link: https://arxiv.org/abs/2508.04586
Authors: Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, Bingsheng He
Affiliations: National University of Singapore
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Abstract:Artificial Intelligence (AI) conferences are essential for advancing research, sharing knowledge, and fostering academic community. However, their rapid expansion has rendered the centralized conference model increasingly unsustainable. This paper offers a data-driven diagnosis of a structural crisis that threatens the foundational goals of scientific dissemination, equity, and community well-being. We identify four key areas of strain: (1) scientifically, with per-author publication rates more than doubling over the past decade to over 4.5 papers annually; (2) environmentally, with the carbon footprint of a single conference exceeding the daily emissions of its host city; (3) psychologically, with 71% of online community discourse reflecting negative sentiment and 35% referencing mental health concerns; and (4) logistically, with attendance at top conferences such as NeurIPS 2024 beginning to outpace venue capacity. These pressures point to a system that is misaligned with its core mission. In response, we propose the Community-Federated Conference (CFC) model, which separates peer review, presentation, and networking into globally coordinated but locally organized components, offering a more sustainable, inclusive, and resilient path forward for AI research.

[NLP-13] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

[Quick Read]: This paper targets the high computational and memory cost that hinders wide deployment of LLMs, and in particular the inter-layer redundancy of transformers that existing compression work (intra-block optimizations such as low-rank approximation and attention-head pruning) leaves largely unexplored. The key to the solution is MASA (Matrix Atom Sharing in Attention), a dictionary-learning framework for structured weight sharing across transformer layers: attention projection matrices are decomposed into shared dictionary atoms, and each layer's weights are represented as linear combinations of those atoms. Trained with standard optimizers as a drop-in replacement requiring neither distillation nor architectural changes, MASA cuts attention-module parameters by 66.7% while matching the original model's performance, and across 100M-700M-parameter models it outperforms grouped-query attention (GQA), low-rank baselines, and the recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets.

Link: https://arxiv.org/abs/2508.04581
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
Affiliations: MTS AI; ITMO University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module’s parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer’s weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
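
The core construction, each layer's projection expressed as a linear combination of shared matrix atoms (roughly W_l = sum_i c_{l,i} A_i), is easy to sketch. The module below is an illustrative guess at such a parameterization, not the authors' code; the dictionary size and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SharedAtomProjection(nn.Module):
    """Each layer's projection matrix is a learned linear combination of
    dictionary atoms shared across all layers (sketch). With 4 atoms for
    12 layers, atom storage is ~1/3 of 12 per-layer matrices, echoing the
    ~66.7% parameter reduction the abstract reports."""
    def __init__(self, n_layers: int, n_atoms: int, d_model: int):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(n_atoms, d_model, d_model) * 0.02)
        self.coeffs = nn.Parameter(torch.randn(n_layers, n_atoms) * 0.02)

    def forward(self, x: torch.Tensor, layer: int) -> torch.Tensor:
        # W_layer = sum_i coeffs[layer, i] * atoms[i]
        w = torch.einsum("a,aij->ij", self.coeffs[layer], self.atoms)
        return x @ w

proj = SharedAtomProjection(n_layers=12, n_atoms=4, d_model=64)
x = torch.randn(2, 10, 64)
out = proj(x, layer=3)  # (2, 10, 64)
```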

[NLP-14] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

[Quick Read]: This paper asks whether structured multi-agent collaboration can outperform the single-agent refinement common in existing AI ideation frameworks, whose creativity is limited by bounded knowledge and perspective. The key lies in a cooperative multi-agent framework that mimics real research-team dynamics, systematically varying group size, leader-led versus leaderless structures, and team compositions differing in interdisciplinarity and seniority, with idea quality assessed by agent-based scoring and human review along dimensions such as novelty, strategic vision, and integration depth. The results show that multi-agent discussion substantially outperforms solitary ideation; a designated leader acts as a catalyst that yields more integrated, visionary proposals, and cognitive diversity is the primary driver of quality, yet a foundation of senior expertise is a non-negotiable prerequisite, since teams without it fail to surpass even a single competent agent. These findings offer structured, evidence-based guidance for designing collaborative AI ideation systems.

Link: https://arxiv.org/abs/2508.04575
Authors: Nuo Chen, Yicheng Tong, Jiaying Wu, Minh Duc Duong, Qian Wang, Qingyun Zou, Bryan Hooi, Bingsheng He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Preprint

Abstract:While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines. A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

[NLP-15] Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation CIKM2025

[Quick Read]: This paper asks whether the gains of multimodal recommender systems come from genuine multimodal understanding or merely from added model complexity, pointing to two open issues: the semantic informativeness of item embeddings and the lack of control over cross-modal alignment. The key to the solution is to use Large Vision-Language Models (LVLMs) with structured prompts to produce multimodal-by-design embeddings that are semantically aligned without any extra fusion strategy; moreover, LVLM embeddings can be decoded into structured textual descriptions that, when fed back into recommenders as side content, further improve performance, empirically validating the semantic depth and cross-modal consistency of the representations.

Link: https://arxiv.org/abs/2508.04571
Authors: Claudio Pomo, Matteo Attimonelli, Danilo Danese, Fedelucio Narducci, Tommaso Di Noia
Affiliations: Politecnico di Bari; Sapienza University of Rome
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted as Full Research Papers at CIKM 2025

Abstract:Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks.

[NLP-16] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

[Quick Read]: This paper targets the hallucination that persists in Large Vision-Language Models (LVLMs) even as training data scales up, i.e., generated text inconsistent with the visual input. Using POPEv2, a new benchmark of counterfactual images built by masking objects in LVLM training images, the study finds a training bias: models fail to fully exploit their training data and hallucinate more on images seen during training, frequently answering "Yes" to counterfactual questions about masked objects. The key to the solution is to localize this bias, which probing shows resides mainly in the language-modeling (LM) head, and to remove it with Obliviate, a lightweight and efficient unlearning method that treats the gap between ground-truth labels and model outputs on the training data as a bias proxy and applies parameter- and data-efficient fine-tuning that updates only the LM head. Reusing just the training data and updating about 2% of parameters, Obliviate markedly reduces hallucination on both discriminative and generative tasks, scales from 2B to 72B models, and generalizes beyond object-level hallucination.

Link: https://arxiv.org/abs/2508.04567
Authors: Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Affiliations: University of Science and Technology of China; Alibaba Group; Microsoft Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:As scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering “Yes” to questions about masked objects. To understand this issue, we conduct probing experiments on the models’ internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.
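
The abstract's recipe, freeze everything except the LM head and fine-tune on the original training data, is straightforward to express. Below is a minimal sketch with Hugging Face Transformers; the model choice and optimizer settings are placeholders (the paper applies this to LVLMs).

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder backbone for illustration only.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters, then unfreeze only the LM head.
# Note: in models with tied input/output embeddings (like GPT-2),
# unfreezing the head also unfreezes the shared embedding matrix.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_output_embeddings().parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / total:.1%} of parameters")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```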

[NLP-17] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

[Quick Read]: This paper confronts two weaknesses of automated depression assessment research: reliance on limited or non-clinically-validated datasets, and a focus on complex model design over real-world effectiveness. The key contributions are C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits, in which each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded; a systematic analysis of diagnosis-relevant behavioral signatures that quantifies how different tasks and modalities (and their combinations) contribute to diagnostic performance; and an exploration of whether LLMs can perform clinician-like psychiatric reasoning, exposing their clear limitations in realistic clinical settings and showing that guiding the reasoning process with clinical expertise consistently improves LLM diagnostic performance by up to 10% in Macro-F1. Together these build a data and algorithmic infrastructure for grounded, reliable clinical depression assessment.

Link: https://arxiv.org/abs/2508.04531
Authors: Zhuang Chen, Guanqun Bi, Wen Zhang, Jiawei Hu, Aoyun Wang, Xiyao Xiao, Kun Feng, Minlie Huang
Affiliations: Central South University; Tsinghua University; University of International Relations; Central China Normal University; Lingxin AI; Yuquan Hospital, Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose to guide the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

[NLP-18] StyliTruth: Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

[Quick Read]: This paper addresses stylization-induced truthfulness collapse in representation editing: injecting a style into an LLM's activations tends to contaminate its core truthfulness representations and reduce answer correctness, which the authors trace to latent coupling between style and truth directions in certain key attention heads. The key to the solution, StyliTruth, is an orthogonal deflation process that separates style-relevant and truth-relevant subspaces in the model's representation space so that the two can be controlled independently with minimal interference, combined with adaptive, token-level steering vectors within each subspace that dynamically and precisely guide generation. This preserves both stylistic fidelity and factual correctness, markedly reducing the truthfulness drop that stylization otherwise causes.

Link: https://arxiv.org/abs/2508.04530
Authors: Chenglei Shen, Zhongxiang Sun, Teng Shi, Xiao Zhang, Jun Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
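
As a rough sketch of orthogonal deflation: given an (assumed) orthonormal basis for the truth-relevant subspace, a style steering vector can be deflated so it carries no component in that subspace. This is a generic linear-algebra illustration, not the paper's exact procedure.

```python
import torch

def deflate(style_vec: torch.Tensor, truth_basis: torch.Tensor) -> torch.Tensor:
    """Remove from style_vec its projection onto span(truth_basis).
    truth_basis: (k, d) with orthonormal rows."""
    proj = truth_basis.T @ (truth_basis @ style_vec)
    return style_vec - proj

d, k = 64, 4
# Hypothetical truth subspace: orthonormalize k random directions.
q, _ = torch.linalg.qr(torch.randn(d, k))
truth_basis = q.T                    # (k, d), orthonormal rows
style = torch.randn(d)
style_orth = deflate(style, truth_basis)
print(truth_basis @ style_orth)      # ~0: no truth component remains
```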

[NLP-19] Causal Reflection with Language Models

[Quick Read]: This paper addresses the weak causal reasoning of LLMs and traditional reinforcement-learning agents, which tend to rely on spurious correlations and brittle patterns rather than modeling why actions lead to outcomes. The key to the solution is the Causal Reflection framework, which explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects, together with a formal Reflect mechanism that detects mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent's internal model. In this architecture, LLMs act not as black-box reasoners but as structured inference engines that translate formal causal outputs into natural-language explanations and counterfactuals, laying the theoretical groundwork for causal reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

Link: https://arxiv.org/abs/2508.04495
Authors: Abi Aryan, Zac Liu
Affiliations: Abide AI
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:While LLMs exhibit impressive fluency and factual recall, they struggle with robust causal reasoning, often relying on spurious correlations and brittle patterns. Similarly, traditional Reinforcement Learning agents also lack causal understanding, optimizing for rewards without modeling why actions lead to outcomes. We introduce Causal Reflection, a framework that explicitly models causality as a dynamic function over state, action, time, and perturbation, enabling agents to reason about delayed and nonlinear effects. Additionally, we define a formal Reflect mechanism that identifies mismatches between predicted and observed outcomes and generates causal hypotheses to revise the agent’s internal model. In this architecture, LLMs serve not as black-box reasoners, but as structured inference engines translating formal causal outputs into natural language explanations and counterfactuals. Our framework lays the theoretical groundwork for Causal Reflective agents that can adapt, self-correct, and communicate causal understanding in evolving environments.

[NLP-20] CALE: Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

[Quick Read]: This paper addresses a limitation of current contextualized language models in capturing lexical-semantic variety: Word-in-Context style fine-tuning (as in XL-LEXEME) only compares occurrences of the same lemma, so semantic relations between different words fall outside its scope. The key to the solution is Concept Differentiation, a new task that extends Word-in-Context to inter-word scenarios, together with a dataset for it derived from SemCor; several representation models fine-tuned on this dataset yield Concept-Aligned Embeddings (CALE). Experiments show that CALE provides efficient multi-purpose representations of lexical meaning that achieve the best performance across a range of lexical semantic tasks, and that the fine-tuning brings valuable changes to the spatial organization of the embedding space.

Link: https://arxiv.org/abs/2508.04494
Authors: Bastien Liétard, Gabriel Loiseau
Affiliations: University of Lille; Inria; CNRS; Centrale Lille; Hornetsecurity
Subjects: Computation and Language (cs.CL)
Comments: Under review in ARR July 2025

Abstract:Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations that can be used to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-words scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that reach best performances in our experiments. We also show that CALE’s fine-tuning brings valuable changes to the spatial organization of embeddings.

[NLP-21] OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use ACL2025

[Quick Read]: This survey addresses how to build general-purpose AI assistants in the spirit of J.A.R.V.I.S., i.e., (M)LLM-based OS Agents that understand and operate the interfaces (such as the GUI) provided by operating systems on user devices to automate diverse tasks autonomously. The key lies in a systematic treatment of the field: the fundamentals of OS Agents, including the environment, observation space, and action space, and the core capabilities of understanding, planning, and grounding; construction methodologies covering domain-specific foundation models and agent frameworks; and a detailed review of evaluation protocols and benchmarks, closing with open challenges such as safety and privacy, personalization, and self-evolution as directions for future work.

Link: https://arxiv.org/abs/2508.04482
Authors: Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu
Affiliations: Zhejiang University; Fudan University; OPPO AI Center; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; Tsinghua University; Shanghai Jiao Tong University; 01.AI; The Hong Kong Polytechnic University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ACL 2025 (Oral)

Abstract:The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.

[NLP-22] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding

[Quick Read]: This paper tackles the heavy computational cost of deploying vision-language models. The key to the solution is FrEVL, a framework that asks whether frozen pretrained embeddings can support effective vision-language understanding: the frozen embeddings turn out to carry rich discriminative information, reaching 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. Because the embeddings need no fine-tuning, FrEVL delivers a 2.3x end-to-end speedup with 52% lower energy consumption, making it attractive when inputs can be pre-computed or when deployment constraints outweigh marginal performance gains; the analysis also shows that the effectiveness of frozen embeddings hinges on the alignment between pretraining objectives and downstream task requirements.

Link: https://arxiv.org/abs/2508.04469
Authors: Emmanuelle Bourigault, Pauline Bourigault
Affiliations: University of Oxford; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 8 pages, 4 figures

Abstract:The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides 2.3x speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains. Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding.
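
A minimal sketch of the frozen-embedding pattern the abstract evaluates: a pretrained vision-language encoder is frozen, its image and text embeddings are concatenated, and only a small head is trained. The encoder choice (CLIP) and head size are assumptions, not FrEVL's actual design.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():          # freeze the backbone
    p.requires_grad = False

head = nn.Sequential(                # the only trainable part
    nn.Linear(512 + 512, 256), nn.ReLU(), nn.Linear(256, 2))

def embed(images, texts):
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return torch.cat([img, txt], dim=-1)

# logits = head(embed(batch_images, batch_texts))  # train the head only
```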

[NLP-23] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

[Quick Read]: This paper addresses the Malaysian education system's need for scalable, high-quality assessment tools, and in particular the difficulty of keeping LLM-generated mathematics multiple-choice questions factually accurate and curriculum-aligned in a low-resource language such as Bahasa Melayu. The key to the solution is a comparison of four incremental generation pipelines using GPT-4o, ranging from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches grounded in official curriculum documents (teacher-prepared notes and the yearly teaching plan, RPT), evaluated with a dual automated framework: Semantic Textual Similarity (STS) against the RPT for curriculum alignment, and a novel RAG-based question-answering (RAG-QA) check for contextual validity. RAG-based pipelines significantly outperform non-grounded prompting on both alignment and factual validity, and the study also analyzes the trade-off between the ease of framework-based RAG (LangChain) and the fine-grained control of a manual pipeline.

Link: https://arxiv.org/abs/2508.04442
Authors: Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su
Affiliations: Fakulti Komputeran dan Meta-Teknologi (META), Universiti Pendidikan Sultan Idris
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI’s GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
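
The curriculum-alignment check, STS between a generated question and curriculum entries, reduces to cosine similarity of sentence embeddings. The sketch below shows that measurement with a multilingual encoder; the model name, sample texts, and threshold are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder (assumed choice) so Bahasa Melayu text embeds well.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

rpt_entries = ["Menyelesaikan persamaan linear dalam satu pemboleh ubah"]
generated_q = "Selesaikan persamaan 2x + 3 = 11. Apakah nilai x?"

scores = util.cos_sim(model.encode(generated_q, convert_to_tensor=True),
                      model.encode(rpt_entries, convert_to_tensor=True))
best = scores.max().item()
print(f"STS alignment: {best:.2f}")
aligned = best >= 0.5   # the cut-off is an assumption
```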

[NLP-24] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

[Quick Read]: This paper targets the low accuracy of existing autoformalization methods, which it attributes to two missing abilities: comprehensive mastery of formal-language domain knowledge, and the reasoning needed to understand natural-language problems and align informal with formal statements; without the former a model cannot identify the correct formal objects, and without the latter it struggles to map real-world contexts into precise formal expressions. The key to the solution is ThinkingF, a data synthesis and training pipeline that improves both abilities: it builds one dataset by distilling and selecting large-scale examples rich in formal knowledge and another of informal-to-formal reasoning trajectories generated from expert-designed templates, then applies SFT and RLVR to fuse and refine the two abilities. The resulting 7B and 32B models reach state-of-the-art performance, with StepFun-Formalizer-32B scoring BEq@1 of 40.5% on FormalMATH-Lite and 26.7% on ProverBench.

Link: https://arxiv.org/abs/2508.04440
Authors: Yutong Wu, Di Huang, Ruosi Wan, Yue Peng, Shijie Shang, Chenrui Cao, Lei Qi, Rui Zhang, Zidong Du, Jie Yan, Xing Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 17 figures, under review

Abstract:Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

[NLP-25] Evaluating Synthesizing and Enhancing for Customer Support Conversation

[Quick Read]: This paper addresses the lack of strategic guidance in existing customer-support dialogue datasets and the difficulty of accessing and annotating real service data, both of which hold back agents that respond in a structured, empathetic, and professionally compliant way. The key to the solution is a structured Customer Support Conversation (CSC) framework grounded in COPC guidelines, defining five conversational stages and twelve support strategies, plus two resources built on it: CSConv, an evaluation dataset of 1,855 real customer-agent conversations rewritten with LLMs to reflect deliberate strategy use and annotated accordingly, and RoleCS, a training dataset of strategy-rich conversations simulated via LLM-powered role-play aligned with the framework. Fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv, with human evaluation further confirming gains in problem resolution.

Link: https://arxiv.org/abs/2508.04423
Authors: Jie Zhu, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: under review

Abstract:Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at this https URL.

[NLP-26] Beyond Pixels: Exploring DOM Downsampling for LLM -Based Web Agents

[Quick Read]: This paper revisits how application state should be serialized for LLM-based web agents. State-of-the-art agents rely on grounded GUI snapshots (screenshots enhanced with visual cues), which are cheap as model input but limited by LLM vision lagging behind code-interpretation capabilities; DOM snapshots are structurally richer and more semantically expressive, yet their enormous token size has so far made them impractical for reliable deployment. The key to the solution is D2Snap, a first-of-its-kind DOM downsampling algorithm that compresses the DOM while preserving the essential UI hierarchy, sharply reducing input tokens. With a GPT-4o backend on tasks sampled from Online-Mind2Web, D2Snap-downsampled DOM snapshots match a grounded GUI snapshot baseline (67% vs. 65% success) at the same input-token order of magnitude (1e3), and the best configurations, one token order higher but still within the context window, beat the baseline by 8%; the evaluation also indicates that the DOM's inherent hierarchy is itself a strong UI feature for LLMs.

Link: https://arxiv.org/abs/2508.04412
Authors: Thassilo M. Schiepanski, Nicholas Piël
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

点击查看摘要

Abstract:Frontier LLMs only recently enabled serviceable, autonomous web agents. In this setting, a model serves as an instantaneous domain-model backend: to suggest an interaction, it is consulted with a web-based task and the respective application state. The key problem lies in application state serialisation, referred to as a snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues, chosen not least to resemble human perception, but also because images are a relatively cheap form of model input. LLM vision, however, still lags behind code-interpretation capabilities. DOM snapshots, which structurally resemble HTML, offer a desirable alternative, but their vast input token size has to date prevented reliable use with web agents. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) within the same input-token order of magnitude (1e3). Our best evaluated configurations, one token order above but still within the model's context window, outperform this baseline by 8%. Our evaluation moreover shows that DOM-inherent hierarchy embodies a strong UI feature for LLMs.
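The abstract does not spell out D2Snap's algorithm, but the general idea of DOM downsampling can be illustrated. Below is a minimal, hypothetical Python sketch: parse HTML into a tree, prune deep or purely presentational subtrees under a depth budget, and serialise what remains so the hierarchy survives as an LLM-readable outline. The tag whitelist, depth limit, and serialisation format are this sketch's assumptions, not D2Snap's.

```python
# Hedged sketch of DOM downsampling (not the D2Snap algorithm itself):
# keep structural/interactive elements and text, drop the rest.
from html.parser import HTMLParser

KEEP = {"a", "button", "input", "select", "form", "nav", "main", "h1", "h2", "ul", "li"}

class Node:
    def __init__(self, tag, attrs):
        self.tag, self.attrs = tag, dict(attrs)
        self.children, self.text = [], ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        self.stack[-1].text += data.strip()

def downsample(node, depth=0, max_depth=4):
    """Prune deep or presentational subtrees, keeping structure and key widgets."""
    kept = []
    for child in node.children:
        downsample(child, depth + 1, max_depth)
        if depth < max_depth and (child.tag in KEEP or child.text or child.children):
            kept.append(child)
    node.children = kept
    return node

def serialise(node, indent=0):
    attrs = " ".join(f'{k}="{v}"' for k, v in node.attrs.items() if k in ("id", "href"))
    line = "  " * indent + f"<{node.tag}{' ' + attrs if attrs else ''}> {node.text[:40]}"
    return "\n".join([line] + [serialise(c, indent + 1) for c in node.children])

builder = TreeBuilder()
builder.feed("<main><nav><a href='/home'>Home</a></nav>"
             "<div><p>Promo blurb</p><button id='buy'>Buy now</button></div></main>")
print(serialise(downsample(builder.root)))
```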

[NLP-27] Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

Quick Read: This paper tackles user-perceived latency (UPL) in spoken dialogue systems, i.e., the time a user waits before receiving the system's response. Reducing UPL requires predicting the complete user utterance before the user finishes speaking and prefetching a dialogue response based on that prediction. The key to the solution is a prediction confidence model (PCM) that decides whether prefetching is feasible by estimating the semantic similarity between the predicted complete user utterance and the actual complete user utterance, optimizing the prefetching decision while preserving response accuracy.

Link: https://arxiv.org/abs/2508.04403
Authors: Kiyotada Mori, Seiya Kawano, Angel Fernando Garcia Contreras, Koichiro Yoshino
Affiliations: NAIST; RIKEN; Tokyo Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prefetching of dialogue responses has been investigated to reduce user-perceived latency (UPL), which refers to the user’s waiting time before receiving the system’s response, in spoken dialogue systems. To reduce the UPL, it is necessary to predict complete user utterances before the end of the user’s speech, typically by language models, to prepare prefetched dialogue responses. In this study, we proposed a prediction confidence model (PCM) that determines whether prefetching is possible or not by estimating the semantic similarity between the predicted complete user utterance and the complete user utterance. We evaluated our PCM based on the differences between the predicted complete user utterance and the complete user utterance.
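The PCM itself is a learned model, but the quantity it estimates, the semantic similarity between a predicted completion and the eventual full utterance, is easy to make concrete. The sketch below uses a bag-of-words cosine similarity as a stand-in for a real sentence encoder and derives an offline label for "prefetching would have been valid"; the 0.8 threshold and the embedding are illustrative assumptions, not the paper's.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words stand-in for a learned sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prefetch_was_valid(predicted, actual, threshold=0.8):
    """Offline label: would a response prefetched from `predicted` have been
    usable once `actual` arrived? A PCM is trained to predict this outcome
    from the partial utterance alone, before `actual` is known."""
    return cosine(embed(predicted), embed(actual)) >= threshold

pairs = [
    ("what time does the pharmacy close today",
     "what time does the pharmacy close today please"),
    ("what time does the pharmacy close today",
     "what time does the museum open tomorrow"),
]
for pred, actual in pairs:
    sim = cosine(embed(pred), embed(actual))
    print(f"similarity={sim:.2f}  prefetch_ok={prefetch_was_valid(pred, actual)}")
```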

[NLP-28] What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Quick Read: This paper addresses the inadequacy of current ASR evaluation for spoken dialogue systems (SDSs), namely how to measure ASR effectiveness in real conversational settings. Conventional ASR evaluation rests on full-transcript accuracy, ignoring the human ability of selective listening, attending only to the fragments of speech relevant to response generation. The key idea is to experimentally confirm selective listening by comparing human transcriptions made for generating dialogue responses against reference transcriptions, and on that basis to propose a new ASR evaluation method grounded in human selective listening, which can reveal the gap between ASR systems and humans in recognizing the information that matters for responses.

Link: https://arxiv.org/abs/2508.04402
Authors: Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel Fernando Garcia Contreras, Koichiro Yoshino
Affiliations: NAIST; RIKEN; NII; Titech
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize the information in user speech that is relevant to response generation. Examining selective listening in humans, which refers to the ability to focus on and listen to the important parts of a conversation during speech, enables us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions made for generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.

[NLP-29] Why are LLMs' abilities emergent?

Quick Read: This paper asks whether the emergent capabilities of LLMs in generative tasks stem merely from parameter scaling or from the intrinsically complex, nonlinear dynamics of deep neural networks (DNNs), an issue at the heart of the "creation without understanding" predicament of contemporary AI. The key move is to show that DNNs behave as complex dynamical systems: their macro-level capabilities cannot be analytically derived from micro-level neuron activity but emerge from cooperative interactions among components in highly sensitive nonlinear systems, mirroring universal principles of emergence in physics, chemistry, and biology. Understanding LLM capabilities therefore requires treating DNNs as a new domain of complex dynamical systems governed by universal emergence principles, rather than relying on phenomenological descriptions such as training metrics or pre-training loss thresholds.

Link: https://arxiv.org/abs/2508.04401
Authors: Vladimír Havlík
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of “creation without understanding” that characterises contemporary AI development. We explore how the neural approach’s reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

[NLP-30] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Quick Read: This paper addresses the low accuracy of secondary-crash identification in traffic crash data by mining crash narratives to improve data quality. The core of the solution is automatic classification of unstructured narrative text with advanced natural language processing (NLP), comparing three model classes: zero-shot open-source large language models (LLMs), fine-tuned transformers (RoBERTa, BERT, and others), and a traditional logistic-regression baseline. Fine-tuned transformers achieve the best precision-recall balance: RoBERTa reaches an F1-score of 0.90 and 95% accuracy on the test set, clearly outperforming the alternatives while requiring only seconds of inference, yielding an efficient, deployable, and replicable recipe for automatically identifying secondary crashes in high-quality crash data.

Link: https://arxiv.org/abs/2508.04399
Authors: Xu Zhang, Mei Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures

Abstract:This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1:0.66). LLMs excelled in recall for some variants (e.g., GEMMA3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.
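As a concrete reference point, the logistic-regression baseline the paper compares against can be approximated with a standard TF-IDF pipeline. The sketch below is a guess at that setup, not the authors' code; the column names and toy narratives are invented for illustration.

```python
# Hedged sketch of a TF-IDF + logistic-regression baseline for
# secondary-crash classification from narratives (requires scikit-learn, pandas).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "narrative": [
        "Vehicle 2 struck the queue that formed behind the earlier collision.",
        "Single vehicle ran off the road and hit a tree.",
    ],
    "secondary": [1, 0],  # 1 = secondary crash, 0 = not
})

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
pipeline.fit(df["narrative"], df["secondary"])
print(classification_report(df["secondary"], pipeline.predict(df["narrative"])))
```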

[NLP-31] AIC CTU@FEVER 8: On-premise fact checking through long context RAG

Quick Read: This paper targets accuracy and deployment efficiency in automated fact-checking, especially under constrained resources. The key to the solution is a two-step retrieval-augmented generation (RAG) pipeline that achieves state-of-the-art fact-checking performance (as measured by the Ev2R test score) while running on-premise under the constraint of a single NVIDIA A10 GPU with 23 GB of memory and at most 60 seconds of processing per claim.

Link: https://arxiv.org/abs/2508.04390
Authors: Herbert Ullrich, Jan Drchal
Affiliations: CTU FEE
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present our fact-checking pipeline, which scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year's submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of the Ev2R test score) even under the constraint of a single NVIDIA A10 GPU, 23 GB of graphics memory, and 60 s of running time per claim.

[NLP-32] Chain of Questions: Guiding Multimodal Curiosity in Language Models

Quick Read: This paper addresses the limited reasoning of current large language models (LLMs) in multimodal settings, in particular their inability to proactively decide when and how to engage different sensory modalities such as vision, audio, or spatial perception in complex real-world environments. The key to the solution is the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach in which the model dynamically generates targeted questions about its surroundings, guiding it to selectively activate the relevant modalities and gather the information critical for accurate reasoning and response generation, thereby improving the accuracy of multimodal information identification and integration, the interpretability of the reasoning process, and its alignment with diverse tasks.

Link: https://arxiv.org/abs/2508.04350
Authors: Nima Iji, Kia Dashtipour
Affiliations: Edinburgh Napier University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model’s ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

[NLP-33] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Quick Read: This paper addresses a bottleneck of reinforcement learning (RL) for large language model (LLM) reasoning: coarse-grained credit assignment. Existing algorithms such as Group Relative Policy Optimization (GRPO) weight all tokens in a sequence uniformly, failing to distinguish each token's contribution to the outcome, which is especially harmful in long-chain reasoning. The key to the solution is Dynamic Entropy Weighting, built on the insight that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. It is instantiated in two ways: token-level Group Token Policy Optimization (GTPO), and Sequence-Level Group Relative Policy Optimization (GRPO-S), which weights each sequence by its average token entropy. Both yield fine-grained reward signals for more precise and effective policy updates; experiments identify the entropy-weighting mechanism as the key driver of the performance gains, offering a new path to deeper model reasoning.

Link: https://arxiv.org/abs/2508.04349
Authors: Hongze Tan, Jianfei Pan
Affiliations: ByteDance; HKUST
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) Sequence-Level Group Relative Policy Optimization (GRPO-S) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
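The abstract gives enough to sketch the shape of entropy-weighted credit assignment, though the paper's exact formulas may differ. The Python sketch below computes per-token entropy from logits and uses it to shape a token-level reward (GTPO-style) and a sequence-level reward (GRPO-S-style); the normalization choices are this sketch's assumptions.

```python
import torch

def token_entropy(logits):
    """Shannon entropy at each position; logits: [seq_len, vocab_size]."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)           # -> [seq_len]

def gtpo_style_token_rewards(logits, outcome_reward):
    """Spread a sequence-level outcome reward over tokens by entropy weight."""
    h = token_entropy(logits)
    w = h / (h.sum() + 1e-8)                           # entropy weights, sum to 1
    return outcome_reward * w * h.numel()              # rescale to keep total mass

def grpo_s_sequence_reward(logits, outcome_reward):
    """Scale the sequence reward by mean token entropy."""
    return outcome_reward * token_entropy(logits).mean().item()

torch.manual_seed(0)
logits = torch.randn(5, 32)                            # toy 5-token response
print(gtpo_style_token_rewards(logits, outcome_reward=1.0))
print(grpo_s_sequence_reward(logits, outcome_reward=1.0))
```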

[NLP-34] Modelling and Classifying the Components of a Literature Review

Quick Read: This paper tackles two obstacles to AI-driven analysis of scientific literature: the lack of a rhetorical-role annotation schema suited to literature-review generation, and the difficulty of obtaining large-scale, high-quality annotations. The key contributions are (1) a novel annotation schema designed specifically to support literature-review generation, distinguishing roles such as research gaps, results, and limitations, and (2) Sci-Sentence, a multidisciplinary benchmark of 700 sentences manually annotated by domain experts plus 2,240 sentences automatically labelled with LLMs, on which 37 LLMs of varying families and sizes are evaluated in zero-shot and fine-tuned settings. Fine-tuning on high-quality data pushes F1 above 96%; some lightweight open-source models approach the large proprietary ones; and enriching the training data with LLM-generated semi-synthetic examples markedly boosts small encoders and several open decoder models, laying a solid foundation for future automated literature-review systems.

Link: https://arxiv.org/abs/2508.04337
Authors: Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta
Affiliations: Open University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

[NLP-35] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Quick Read: This paper addresses reliability problems in current medical benchmarks for large language models (LLMs), such as limited clinical fidelity, weak data management, and the absence of safety-oriented evaluation metrics. The key to the solution is MedCheck, the first lifecycle-oriented assessment framework for medical benchmarks: it decomposes benchmark development into five continuous stages, from design to governance, and supplies a checklist of 46 medically tailored criteria. Applying MedCheck to 53 medical LLM benchmarks uncovers systemic flaws, and the framework serves both as a diagnostic tool for existing benchmarks and as an actionable guideline for a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

Link: https://arxiv.org/abs/2508.04325
Authors: Zizhan Ma, Wenxuan Wang, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Wenting Chen, Linlin Shen
Affiliations: The Chinese University of Hong Kong; Renmin University of China; Shenzhen University; City University of Hong Kong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

[NLP-36] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Quick Read: This paper exposes how easily the knowledge-construction stage of GraphRAG (graph-based retrieval-augmented generation) can be maliciously manipulated: altering only a few words in the source text can induce a wrong knowledge graph and poison downstream question answering. The key contribution is two novel knowledge poisoning attacks (KPAs). Targeted KPA (TKPA) uses graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, controlling specific QA outcomes with a 93.1% success rate while keeping the poisoned text fluent and natural. Universal KPA (UKPA) exploits linguistic cues such as pronouns and dependency relations to break the graph's structural integrity by altering globally influential words, collapsing QA accuracy from 95% to 50% while modifying less than 0.05% of the text. State-of-the-art defenses fail to detect either attack, revealing that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

Link: https://arxiv.org/abs/2508.04276
Authors: Jiayi Wen, Tianxin Chen, Zhirun Zheng, Cheng Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

[NLP-37] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents AAAI2026

Quick Read: This paper addresses the failure of existing e-commerce benchmarks to cover complex user intents such as applying vouchers, managing budgets, and finding multi-product sellers. The key contribution is ShoppingBench, an end-to-end shopping benchmark built on a scalable framework that simulates diverse user instructions grounded in sampled real-world products, together with a large-scale interactive shopping sandbox containing over 2.5 million real products for stable, reliable evaluation. The work further proposes a trajectory-distillation strategy that transfers the capabilities of large language agents (such as GPT-4.1) to smaller ones via supervised fine-tuning combined with reinforcement learning on synthetic trajectories, allowing the trained small agent to approach GPT-4.1's performance on these demanding shopping tasks.

Link: https://arxiv.org/abs/2508.04266
Authors: Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, Xiaoyi Zeng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to AAAI 2026

Abstract:Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

[NLP-38] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Quick Read: This paper addresses performance degradation caused by Key-Value (KV) cache quantization during large language model (LLM) inference, focusing on the role of attention sinks in output quality. Existing approaches such as the Preserve-First-N (PFN) strategy protect the KV precision of only the first few tokens, without explaining why this works, and cannot handle the newly observed fact that attention sinks can emerge beyond the initial positions. The key insight comes from analyzing how extreme activation outliers evolve across layers, which elucidates the mechanism behind attention sinks; on this basis the paper proposes KVSink, a lightweight plug-and-play method that predicts and preserves potential sink tokens with negligible overhead, significantly improving language-modeling quality under KV cache quantization while reducing reliance on 16-bit numerical outliers.

Link: https://arxiv.org/abs/2508.04257
Authors: Zunhai Su, Kehong Yuan
Affiliations: Shenzhen International Graduate School, Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: Published as a conference paper at COLM 2025

Abstract:Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference by reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to protect attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.
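To make the PFN-versus-KVSink distinction concrete, the sketch below quantizes a toy KV cache to int8 per position while keeping a chosen set of positions in full precision. PFN corresponds to preserving positions 0..N-1; a KVSink-style method would instead preserve a predicted, possibly non-initial, set of sink positions. The int8 scheme, shapes, and preserved sets are illustrative assumptions, not the paper's.

```python
import torch

def quantize_int8(x):
    """Symmetric per-vector int8 quantization."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quantize_kv(kv, preserve):
    """kv: [seq_len, head_dim]; positions in `preserve` stay full precision."""
    out = torch.empty_like(kv)
    for pos in range(kv.shape[0]):
        if pos in preserve:
            out[pos] = kv[pos]                         # e.g. attention-sink tokens
        else:
            q, scale = quantize_int8(kv[pos])
            out[pos] = q.to(torch.float32) * scale     # dequantize for comparison
    return out

torch.manual_seed(0)
kv = torch.randn(8, 4)
pfn = quantize_kv(kv, preserve={0, 1})                 # Preserve-First-N with N=2
sink_aware = quantize_kv(kv, preserve={0, 5})          # hypothetical predicted sinks
print((pfn - kv).abs().max().item(), (sink_aware - kv).abs().max().item())
```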

[NLP-39] Graph Representation Learning with Massive Unlabeled Data for Rumor Detection

Quick Read: This paper addresses two practical problems in rumor detection: scarce high-quality labeled data limits generalization, and rumors are time-critical and topic-sensitive, so models trained on specific events adapt poorly to newly emerging ones. The key to the solution is leveraging massive unlabeled topic data crawled from Weibo and Twitter with graph self-supervised learning methods (InfoGraph, JOAO, and GraphMAE) to strengthen a graph representation learning model's semantic understanding across diverse topics, combined with a newly collected rumor dataset spanning a decade of varied topics for training and validation, improving robustness and cross-event generalization under few-shot conditions.

Link: https://arxiv.org/abs/2508.04252
Authors: Chaoqun Cui, Caiyan Jia
Affiliations: Beijing Jiaotong University
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures

Abstract:With the development of social media, rumors spread quickly and cause great harm to society and the economy. Many effective rumor detection methods have therefore been developed, among which methods based on learning the rumor propagation structure are particularly effective. However, existing methods still suffer from several issues, including the difficulty of obtaining large-scale labeled rumor datasets, which leads to low generalization ability and performance degradation on new events, since rumors are time-critical and usually appear with hot topics or newly emergent events. To solve these problems, we use large-scale unlabeled topic datasets with claim propagation structure, crawled from the social media platforms Weibo and Twitter, to improve the semantic learning ability of a graph representation learning model across various topics. We apply three typical graph self-supervised methods, InfoGraph, JOAO, and GraphMAE, under two commonly used training strategies, to verify the performance of general graph self-supervised methods on rumor detection tasks. In addition, to alleviate the time and topic differences between unlabeled topic data and rumor data, we also collected a rumor dataset covering a variety of topics over the decade up to 2022 from the Weibo rumor-refuting platform. Our experiments show that these general graph self-supervised learning methods outperform previous methods specifically designed for rumor detection tasks and achieve good performance under few-shot conditions, demonstrating better generalization ability with the help of our massive unlabeled topic dataset.

[NLP-40] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening CIKM2025

Quick Read: This paper addresses the poor generalization of automatic depression-diagnosis systems caused by the scarcity of real clinical training data. Existing virtual-patient simulation approaches struggle to produce clinically valid, natural, and diverse symptom presentations, limiting the quality of training and evaluation. The key to the solution is TalkDep, a clinician-in-the-loop patient-simulation pipeline backed by advanced language models: by conditioning on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, it generates clinically grounded patient responses with improved authenticity and diversity, whose reliability is verified through thorough assessment by clinical professionals, yielding a scalable, adaptable resource for improving the robustness and generalisability of automatic depression-diagnosis systems.

Link: https://arxiv.org/abs/2508.04248
Authors: Xi Wang, Anxo Perez, Javier Parapar, Fabio Crestani
Affiliations: University of Sheffield; University of A Coruña; Università della Svizzera italiana
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper accepted at CIKM 2025

Abstract:The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

[NLP-41] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Quick Read: This paper addresses the limitation of traditional time-series forecasting models that rely on numerical data alone and ignore textual information such as events and news, which affects forecasting accuracy. Existing single-prompt LLM frameworks struggle to capture the semantics of time-stamped text and often introduce redundant information that degrades performance. The key to the solution is DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a dual-prompt framework that fuses textual context effectively: an explicit prompt conveys clear task instructions, while a textual prompt produces context-aware embeddings of the time-stamped text, refined through self-attention and feed-forward networks, significantly improving multimodal time-series forecasting accuracy.

Link: https://arxiv.org/abs/2508.04239
Authors: Chanjuan Liu (1), Shengzhi Wang (2), Enqiang Zhu (2) ((1) School of Computer Science and Technology, Dalian University of Technology, Dalian, China; (2) Institute of Computing Technology, Guangzhou University, Guangzhou, China)
Affiliations: Dalian University of Technology; Guangzhou University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models offer promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt, while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.
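A rough picture of the dual-prompt layout can be sketched under assumptions not taken from the paper (dimensions, fusion by concatenation): the explicit prompt enters as ordinary token embeddings, while the time-stamped text is refined by self-attention and a feed-forward block into context-aware embeddings before being fed to the backbone alongside patch embeddings of the numeric series.

```python
import torch
import torch.nn as nn

d_model, n_patches, n_text_tokens = 64, 12, 8          # illustrative sizes

explicit_prompt_emb = torch.randn(1, 6, d_model)       # "Forecast the next steps..."
text_tokens = torch.randn(1, n_text_tokens, d_model)   # embedded news snippets

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))

refined, _ = attn(text_tokens, text_tokens, text_tokens)  # self-attention
textual_prompt = ffn(refined)                             # refined text context

series_patches = torch.randn(1, n_patches, d_model)       # numeric patch embeddings
backbone_input = torch.cat([explicit_prompt_emb, textual_prompt, series_patches], dim=1)
print(backbone_input.shape)                                # (1, 6 + 8 + 12, 64)
```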

[NLP-42] Hierarchical Text Classification Using Black Box Large Language Models

Quick Read: This paper addresses the data-scarcity and model-complexity challenges of hierarchical text classification (HTC), where traditional machine learning typically requires large labeled datasets and heavy computation. The key to the solution is to explore black-box large language models (LLMs) accessed via APIs as an alternative, comparing three prompting strategies, Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH), in zero-shot and few-shot settings for accuracy and cost-effectiveness. Few-shot prompting consistently improves accuracy over zero-shot; on a dataset with a deeper label hierarchy, LLMs, especially the DH strategy, tend to outperform a traditional machine-learning model, but at a markedly higher API cost, highlighting the trade-off between accuracy gains and the computational expense of the prompt strategy.

Link: https://arxiv.org/abs/2508.04219
Authors: Kosuke Yoshimura, Hisashi Kashima
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies – Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) – in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies on DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.
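Of the three prompting strategies, the top-down multi-step variant (TMH) is the most mechanical and thus the easiest to sketch: walk the label tree level by level with one classification query per step. In the hedged sketch below, `ask_llm` is a trivial word-overlap stand-in (ties go to the first option) rather than a real API call, and the label tree and prompt wording are invented.

```python
LABEL_TREE = {
    "root": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Football", "Tennis"],
}

def ask_llm(prompt, options):
    """Toy stand-in for an API-based LLM: pick the option whose words
    overlap the prompt most; a real system would call a chat endpoint."""
    text = prompt.lower()
    return max(options, key=lambda o: sum(w in text for w in o.lower().split()))

def classify_top_down(text, tree, node="root"):
    path = []
    while node in tree:                      # descend until a leaf label
        options = tree[node]
        prompt = (f"Text: {text}\nWhich category fits best? "
                  f"Options: {', '.join(options)}")
        node = ask_llm(prompt, options)
        path.append(node)
    return path

print(classify_top_down("a quantum physics paper on entanglement", LABEL_TREE))
# -> ['Science', 'Physics']
```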

[NLP-43] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Quick Read: This paper addresses the tendency of large reasoning models (LRMs) to generate harmful content during reasoning, especially in mid-to-late reasoning steps, where existing defenses require costly fine-tuning and expert knowledge and thus scale poorly. The key to the solution is ReasoningGuard, an inference-time safeguard that exploits the model's internal attention behavior to pinpoint critical points in the reasoning path and trigger spontaneous, safety-oriented reflection, and applies a scaling sampling strategy during decoding to select the optimal reasoning path. With minimal extra inference cost, it mitigates three types of jailbreak attacks, including recent ones targeting the reasoning process of LRMs, outperforming seven existing safeguards while avoiding the common problem of exaggerated safety.

Link: https://arxiv.org/abs/2508.04204
Authors: Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang
Affiliations: Fudan University; East China University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer harmless while helpful reasoning processes. Leveraging the model’s internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

[NLP-44] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource Culturally Nuanced Contexts

Quick Read: This paper tackles sentiment analysis in low-resource, culturally nuanced contexts, where conventional NLP approaches that assume fixed labels and universal affective expression break down. The key to the solution is a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct and probes how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages, combining human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation to examine interpretability, robustness, and alignment with human reasoning. Framed through a social-science measurement lens that operationalizes LLM outputs as instruments for measuring the abstract construct of sentiment, the study finds top-tier LLMs interpretively stable while open models often falter under ambiguity or sentiment shifts, underscoring the need for culturally sensitive, reasoning-aware AI evaluation in real-world communication.

Link: https://arxiv.org/abs/2508.04199
Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill
Affiliations: Microsoft Research; Lancaster University; University of Washington
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLMs' outputs as instruments for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

[NLP-45] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Quick Read: This paper studies how state-of-the-art aligned language models can still be induced into unintended behavior by carefully crafted conversational scenarios without explicit jailbreaking. The key contributions are twofold. First, systematic manual red-teaming uncovers 10 successful attack scenarios that exploit psychological mechanisms such as narrative immersion, emotional pressure, and strategic framing, eliciting deception, value drift, self-preservation, and manipulative reasoning, and yielding the first taxonomy of conversational manipulation patterns. Second, these attacks are distilled into MISALIGNMENTBENCH, an automated framework for reproducible cross-model testing. Evaluating the 10 scenarios against five frontier LLMs exposes a 76% overall vulnerability rate (GPT-4.1 most susceptible at 90%, Claude-4-Sonnet most resistant at 40%), showing that sophisticated reasoning can become an attack surface and that current alignment falls short against subtle, scenario-based manipulation.

Link: https://arxiv.org/abs/2508.04196
Authors: Siddhant Panpatil, Hiskias Dingeto, Haon Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

[NLP-46] Characterizing Deep Research: A Benchmark and Formal Definition

Quick Read: This paper addresses the vague definition of deep research (DR) tasks and their unclear distinction from other reasoning-intensive problems, which hampers systematic evaluation and improvement. The key to the solution is a formal characterization of DR whose defining feature is not the production of long report-style outputs but high fan-out over concepts during search, i.e., broad, reasoning-intensive exploration, together with an intermediate output representation that encodes the key claims uncovered during search, separating the reasoning challenge from surface-level report generation and enabling objective evaluation. On this basis the authors build LiveDRBench, a benchmark of 100 challenging tasks over scientific topics and public-interest events, providing a unified yardstick for current DR systems.

Link: https://arxiv.org/abs/2508.04183
Authors: Abhinav Java, Ashmit Khandelwal, Sukruta Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, Amit Sharma
Affiliations: Microsoft Research; Microsoft
Subjects: Computation and Language (cs.CL)
Comments: First three authors contributed equally (ordered alphabetically)

Abstract:Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of "deep research", a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a diverse, challenging benchmark, LiveDRBench, with 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, the F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at this https URL.
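Scoring against an intermediate claim representation reduces, in the simplest case, to set precision and recall over normalized claims. The sketch below uses exact string matching after normalization; the benchmark's actual matcher is presumably softer (e.g., entailment-based), and the example claims are invented.

```python
def claim_f1(predicted, gold):
    """F1 over sets of key claims, matched by normalized exact string."""
    norm = lambda s: " ".join(s.lower().split())
    p, g = {norm(c) for c in predicted}, {norm(c) for c in gold}
    if not p or not g:
        return 0.0
    tp = len(p & g)                      # claims both found and correct
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["flight QF72 suffered an uncommanded pitch-down",
        "the fault was in the ADIRU"]
pred = ["The fault was in the ADIRU",
        "the aircraft landed safely"]
print(f"{claim_f1(pred, gold):.2f}")     # 0.50: one of two claims recovered
```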

[NLP-47] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Quick Read: This paper addresses hallucinations of multimodal large language models (MLLMs) in vision-language tasks, i.e., outputs semantically inconsistent with the input image or text. Causal analysis shows that hallucinations fall into two classes: hallucinations with omission arise from failing to adequately capture essential causal factors, while hallucinations with fabrication stem from the model being misled by non-causal cues. The key to the solution is a reinforcement-learning framework guided by causal completeness, which jointly considers the causal sufficiency and causal necessity of tokens: each token's standalone contribution and counterfactual indispensability define a token-level causal-completeness reward, used to construct a causally informed advantage function within the GRPO optimization framework, steering the model toward tokens that are both sufficient and necessary for accurate generation and effectively mitigating hallucinations in MLLMs.

Link: https://arxiv.org/abs/2508.04182
Authors: Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua
Affiliations: University of Science and Technology of China; Hefei National Laboratory for Physical Sciences at the Microscale; Tsinghua University; Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations–generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token’s standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate the effectiveness of our approach, which effectively mitigates hallucinations in MLLMs.

[NLP-48] The State Of TTS: A Case Study with Human Fooling Rates INTERSPEECH2025

Quick Read: This paper asks whether current text-to-speech (TTS) systems have genuinely become indistinguishable from human speech at the perceptual level, i.e., whether claimed progress survives a Turing-style human deception test. The key to the solution is the Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human, offering an evaluation closer to the logic of a Turing test. Key findings: CMOS-based claims of human parity often fail under deception testing; commercial models approach human-level deception in zero-shot settings while open-source systems still struggle with natural conversational speech; fine-tuning on high-quality data improves realism but does not fully close the gap; and TTS should be benchmarked on reference datasets where human speech itself achieves high HFRs, since monotonous or unexpressive references set a low bar.

Link: https://arxiv.org/abs/2508.04179
Authors: Praveen Srinivasa Varadhan, Sherry Thomas, Sai Teja M. S., Suvrat Bhooshan, Mitesh M. Khapra
Affiliations: AI4Bharat; Indian Institute of Technology Madras
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at InterSpeech 2025

Abstract:While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
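HFR itself is a simple proportion: the share of synthetic clips that listeners judge to be human. A sketch, with a Wilson confidence interval added to convey uncertainty; the counts below are made up for illustration.

```python
import math

def human_fooling_rate(judged_human, total, z=1.96):
    """Fraction of synthetic clips judged 'human', with a Wilson 95% CI."""
    p = judged_human / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))) / denom
    return p, (centre - half, centre + half)

hfr, ci = human_fooling_rate(judged_human=412, total=600)   # hypothetical counts
print(f"HFR = {hfr:.1%}, 95% CI = ({ci[0]:.1%}, {ci[1]:.1%})")
```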

[NLP-49] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

Quick Read: This paper addresses the shortage of high-quality annotated data and effective detection mechanisms for toxic image-text content such as memes on social platforms: real-world meme data is hard to access and costly to annotate, and existing vision-language models (VLMs) handle multimodal toxicity poorly, especially without social context. The key to the solution is a first-of-its-kind dataset of 6,300 real-world meme posts annotated in two stages, first as toxic vs. normal, then fine-grained as hateful, dangerous, or offensive, enriched with auxiliary socially relevant tags that add context to each meme, plus a tag-generation module that supplies socially grounded tags for the many in-the-wild memes that come without any. Experiments show these tags substantially boost the performance of state-of-the-art VLMs on toxicity detection, offering a scalable new foundation for content moderation in multimodal online environments.

Link: https://arxiv.org/abs/2508.04166
Authors: Subhankar Swain, Naquee Rizwan, Nayandeep Deb, Vishwajeet Singh Solanki, Vishwa Gangadhar S, Animesh Mukherjee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because most in-the-wild memes do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs on detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.

[NLP-50] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Quick Read: This paper addresses the reliance of large language model (LLM) alignment on large, expensive preference datasets, and asks how to align efficiently when data or resources are scarce. The key to the solution is a difficulty-based preference-data selection strategy grounded in the implicit reward mechanism of Direct Preference Optimization (DPO): by selecting examples with smaller DPO implicit reward gaps, i.e., harder cases where preferred and rejected responses are difficult to tell apart, the method achieves higher data efficiency and stronger alignment. It consistently outperforms five strong baselines across multiple datasets and alignment tasks while using only 10% of the original data.

Link: https://arxiv.org/abs/2508.04149
Authors: Xuan Qi, Rongwu Xu, Zhijing Jin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Our code and data are available at this https URL

Abstract:Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
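The selection criterion can be written directly from the DPO implicit reward, r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)); the gap for a preference pair is r(x, y_w) - r(x, y_l). The sketch below computes gaps from sequence log-probs and keeps the pairs with the smallest gaps. Whether "smallest" means the signed value or its magnitude is a detail the abstract leaves open (signed here), and the toy numbers are invented.

```python
import torch

def implicit_reward_gap(lp_pol_c, lp_ref_c, lp_pol_r, lp_ref_r, beta=0.1):
    """DPO implicit reward gap per preference pair (sequence log-probs:
    policy/reference, chosen/rejected)."""
    return beta * ((lp_pol_c - lp_ref_c) - (lp_pol_r - lp_ref_r))

def select_hard_pairs(gaps, keep_frac=0.1):
    """Indices of the keep_frac examples with the smallest (signed) gap."""
    k = max(1, int(keep_frac * gaps.numel()))
    return torch.topk(-gaps, k).indices

torch.manual_seed(0)
lp_pol_c, lp_ref_c = torch.randn(6) - 1.0, torch.randn(6) - 1.0
lp_pol_r, lp_ref_r = torch.randn(6) - 2.0, torch.randn(6) - 2.0
gaps = implicit_reward_gap(lp_pol_c, lp_ref_c, lp_pol_r, lp_ref_r)
print(gaps)
print(select_hard_pairs(gaps, keep_frac=0.5))   # keep the 3 hardest pairs
```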

[NLP-51] COPO: Consistency-Aware Policy Optimization

Quick Read: This paper addresses a failure mode of rule-based-reward reinforcement learning for training large language models (LLMs): when multiple sampled responses to a prompt converge to identical outcomes, correct or not, the group-based advantage degenerates to zero, gradients vanish, learning signals disappear, and training efficiency and downstream performance suffer. The key to the solution is a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency, ensuring meaningful learning signals even under high intra-group agreement, combined with an entropy-based soft blending mechanism that adaptively balances local advantage estimation against global optimization, enabling dynamic transitions between exploration and convergence throughout training and encouraging correct, self-consistent reasoning paths.

Link: https://arxiv.org/abs/2508.04138
Authors: Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple sampled responses under a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency, the global loss based on it ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework’s robustness and general applicability. Code of this work has been released at this https URL.
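The degenerate case the paper describes, and one way a consistency-based global reward could patch it, can be shown in a few lines. The blending weight derived from answer entropy below is this sketch's own choice; the paper's exact global loss and blending rule are not given in the abstract.

```python
import math
from collections import Counter

def group_advantages(rewards):
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]             # all-zero if rewards are identical

def consistency_reward(answers, correct):
    frac_correct = sum(a == correct for a in answers) / len(answers)
    return 2 * frac_correct - 1                  # in [-1, 1]

def answer_entropy(answers):
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

def blended_signal(rewards, answers, correct):
    local = group_advantages(rewards)
    g = consistency_reward(answers, correct)
    w = 1.0 - min(1.0, answer_entropy(answers))  # low entropy -> lean on global
    return [w * g + (1 - w) * a for a in local]

answers = ["42", "42", "42", "42"]               # unanimous group
rewards = [1.0, 1.0, 1.0, 1.0]                   # identical rule-based rewards
print(group_advantages(rewards))                 # [0, 0, 0, 0]: no gradient
print(blended_signal(rewards, answers, "42"))    # non-zero global signal survives
```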

[NLP-52] AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Quick Read: This paper addresses open-domain knowledge graph completion (KGC) in dynamic information environments, in particular how to capture emerging entities and their associated information not covered by pretrained language models. Existing methods depend on parametric knowledge, pre-constructed queries, or single-step retrieval, typically need substantial supervision and training data, and adapt poorly to continually appearing new entities. The key to the solution is AgREE (Agentic Reasoning for Emerging Entities), an agent-based framework that combines iterative retrieval actions with multi-step reasoning to dynamically construct high-quality knowledge-graph triplets; requiring zero training, it outperforms existing methods by up to 13.7%, with the largest gains on emerging entities unseen during language-model training.

Link: https://arxiv.org/abs/2508.04118
Authors: Ruochen Zhao, Simone Conia, Eric Peng, Min Li, Saloni Potdar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models’ parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models’ training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments.

[NLP-53] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Quick Read: This paper studies the over-memorization phenomenon that arises during fine-tuning of pretrained large language models (LLMs): at a specific stage, the model excessively memorizes training data, showing high test perplexity while test accuracy stays strong, which in turn degrades robustness, out-of-distribution generalization, and generation diversity. The key finding is that two factors drive over-memorization, the number of training epochs and large learning rates, leading to practical recommendations on checkpoint selection and learning-rate scheduling during fine-tuning so that performance is preserved without falling into these undesirable learning dynamics.

Link: https://arxiv.org/abs/2508.04117
Authors: Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen
Affiliations: Southern University of Science and Technology; Tsinghua University; Shanghai University of Finance and Economics
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously overlooked over-memorization phenomenon that arises during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show over-memorization to be broadly applicable across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.
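Since the signature of over-memorization is rising test perplexity alongside stable test accuracy, a practical checkpoint-selection aid is to monitor the two jointly. The thresholds and numbers in this sketch are illustrative, not taken from the paper.

```python
def flag_over_memorization(history, ppl_rise=1.5, acc_drop=0.02):
    """history: list of dicts with keys epoch, test_ppl, test_acc.
    Flag epochs where perplexity has risen sharply but accuracy holds,
    the pattern described for over-memorized checkpoints."""
    base = history[0]
    flagged = []
    for h in history[1:]:
        ppl_up = h["test_ppl"] / base["test_ppl"] >= ppl_rise
        acc_ok = h["test_acc"] >= base["test_acc"] - acc_drop
        if ppl_up and acc_ok:
            flagged.append(h["epoch"])
    return flagged

history = [
    {"epoch": 1, "test_ppl": 4.1, "test_acc": 0.71},
    {"epoch": 3, "test_ppl": 5.0, "test_acc": 0.74},
    {"epoch": 6, "test_ppl": 7.9, "test_acc": 0.73},   # high PPL, accuracy holds
]
print(flag_over_memorization(history))                  # -> [6]
```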

[NLP-54] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Quick Read: This paper addresses the fragility of multimodal large language models (MLLMs) on complex multi-step mathematical reasoning, where small errors in visual perception or logical deduction derail the whole solution. Existing process reward models (PRMs) act only as binary verifiers that can flag but not correct errors and offer little explanatory power. The key to the solution is the Generative Multimodal Process Reward Model (GM-PRM), which turns the PRM from a passive judge into an active reasoning collaborator: instead of a scalar score, it gives a fine-grained, interpretable analysis of each step (intent, visual alignment, logical soundness) and, crucially, generates a corrected version of the first erroneous step it identifies. This corrective capability enables a new test-time inference strategy, Refined Best-of-N (Refined-BoN), in which the PRM's correction steers the policy model toward more promising reasoning trajectories, improving the diversity and correctness of the solution pool and achieving state-of-the-art results on multimodal math benchmarks with only a 20K-sample training set.

Link: https://arxiv.org/abs/2508.04088
Authors: Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

[NLP-55] ToolGrad: Efficient Tool-use Dataset Generation with Textual “Gradients”

【Quick Read】: This paper tackles the high annotation-failure rate and low efficiency of existing tool-use LLM dataset generation, which typically generates a user query first and then annotates complex tool-call chains (e.g., via DFS). The key is ToolGrad, an agentic framework that adopts an "answer-first" paradigm: it first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes the corresponding user queries. This markedly improves the accuracy and efficiency of data generation; the resulting ToolGrad-5k dataset features more complex tool use, lower cost, and a 100% pass rate, and models trained on it outperform those trained on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks.

Link: https://arxiv.org/abs/2508.04086
Authors: Zhongyi Zhou,Kohei Uehara,Haoyu Zhang,Jingtao Zhou,Lin Gu,Ruofei Du,Zheng Xu,Tatsuya Harada
Institutions: The University of Tokyo; RIKEN AIP; Google
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual “gradients”, and then synthesizes corresponding user queries. This “answer-first” approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and 100% pass rate. Experiments show that models trained on ToolGrad-5k outperform those on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks.

[NLP-56] Efficient Strategy for Improving Large Language Model (LLM) Capabilities

【Quick Read】: This work addresses the inefficiency of deploying large language models (LLMs) in resource-constrained environments, focusing on heavy computational demands, a delimited knowledge base, and response latency. The key is to start from a base model and systematically combine data processing and careful data selection techniques, optimized training strategies, and architectural adjustments, improving the efficiency of LLMs under limited resources while preserving capability, versatility, response time, and safety.

Link: https://arxiv.org/abs/2508.04073
Authors: Julián Camilo Velandia Gutiérrez
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Based on master's thesis in Systems and Computer Engineering, Universidad Nacional de Colombia (2025)

Abstract:Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master’s thesis in Systems and Computer Engineering titled “Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)”.

[NLP-57] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

【Quick Read】: This paper addresses two limitations of current retrieval-augmented generation (RAG) systems: they retrieve external information for every query, including simple questions answerable from the model's parametric knowledge alone, which is inefficient; and when a query carries only sparse information signals they tend to retrieve irrelevant documents, hurting accuracy. The key is Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free adaptive framework built on dual-path generation and adaptive selection: the LLM first produces both a direct answer and an answer augmented with self-generated pseudo-context; if the two converge, external retrieval is skipped entirely, improving efficiency. If they diverge, a dual-path retrieval (DPR) process is activated, guided by both the original query and the self-generated contextual signal, and an Adaptive Information Selection (AIS) module filters documents by weighted similarity to both sources, yielding efficient and accurate information access.

Link: https://arxiv.org/abs/2508.04057
Authors: Wang Chen,Guanqiang Qi,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM's parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system's efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach can not only enhance efficiency by eliminating unnecessary retrievals but also improve accuracy through contextually guided retrieval and adaptive information selection. Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines on average.
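
To make the control flow concrete, below is a minimal Python sketch of the PAIRS-style decision described above. The `llm_direct`, `llm_with_context`, `retrieve`, and `similarity` callables, the lexical convergence test, and the weight `alpha` are illustrative assumptions, not interfaces from the paper.

```python
# Hypothetical sketch of PAIRS-style adaptive retrieval (interfaces assumed).
from difflib import SequenceMatcher

def converged(a: str, b: str, threshold: float = 0.8) -> bool:
    # Treat two answers as convergent if they are lexically close enough.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def answer_with_pairs(query, llm_direct, llm_with_context, retrieve, similarity,
                      alpha: float = 0.5, top_k: int = 5) -> str:
    direct = llm_direct(query)                       # parametric-only answer
    pseudo_ctx = llm_direct(f"Write background notes for: {query}")
    augmented = llm_with_context(query, pseudo_ctx)  # grounded in self-generated context
    if converged(direct, augmented):
        return direct                                # skip external retrieval entirely
    # Dual-path retrieval: query-driven and pseudo-context-driven.
    docs = retrieve(query) + retrieve(pseudo_ctx)
    # Adaptive information selection: weighted similarity to both sources.
    scored = [(alpha * similarity(d, query) + (1 - alpha) * similarity(d, pseudo_ctx), d)
              for d in docs]
    top_docs = [d for _, d in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]]
    return llm_with_context(query, "\n".join(top_docs))
```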

[NLP-58] DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

【Quick Read】: This paper addresses the decline in controllability when generating long text in controllable text generation (CTG), specifically the weakening control of the prefix-based Air-Decoding method as sequence length grows, which the authors attribute to decaying attention to the prefix. The key is Dynamic Token-level Prefix Augmentation (DTPA), a lightweight and effective framework: it first selects the optimal prefix type for the task, then dynamically amplifies the attention to the prefix for the attribute distribution, with a scaling factor that grows exponentially with sequence length, to strengthen controllability; optionally, a similar augmentation is applied to the original prompt for the raw distribution to balance text quality, and after attribute-distribution reconstruction the generated text satisfies the attribute constraints well.

Link: https://arxiv.org/abs/2508.04047
Authors: Jiabing Yang,Yixiang Chen,Zichen Wen,Chenhang Cui,Peiyan Li,Yuan Xu,Bowen Fang,Yan Huang,Liang Wang
Institutions: 1. Shanghai Jiao Tong University; 2. Microsoft Research; 3. University of California, Berkeley; 4. Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA’s superior effectiveness in long text generation.
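
As a rough illustration of the amplification step, the sketch below boosts attention logits on prefix positions by a factor that grows exponentially with the decoding step before re-normalizing; the schedule and the `base`/`gamma` hyperparameters are invented for illustration and need not match the paper's formulation.

```python
# Toy prefix-attention augmentation (schedule and constants are assumptions).
import numpy as np

def augment_prefix_attention(logits: np.ndarray, prefix_len: int, step: int,
                             base: float = 1.0, gamma: float = 1.01) -> np.ndarray:
    """logits: unnormalized attention scores over all positions at decoding `step`."""
    scale = base * (gamma ** step)            # grows exponentially with sequence length
    boosted = logits.astype(float).copy()
    boosted[:prefix_len] += np.log(scale)     # amplify prefix columns before softmax
    exp = np.exp(boosted - boosted.max())
    return exp / exp.sum()                    # re-normalized attention distribution
```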

[NLP-59] Large Reasoning Models Are Autonomous Jailbreak Agents

【Quick Read】: This paper addresses how easily the safety mechanisms of generative AI models can be bypassed: the barrier to jailbreaking is low and the attacks scale, so even non-experts can defeat model safeguards. The key finding is that large reasoning models (LRMs) have strong persuasive capabilities and can act as autonomous adversaries that systematically jailbreak other target models through multi-turn conversations without human supervision; experiments show that four mainstream LRMs induced target models to violate safety constraints in 97.14% of test cases, exposing an "alignment regression" risk in current alignment mechanisms and underscoring that frontier models must be hardened not only to resist jailbreak attempts but also to prevent themselves from being co-opted as jailbreak agents.

Link: https://arxiv.org/abs/2508.04039
Authors: Thilo Hagendorff,Erik Derner,Nuria Oliver
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Jailbreaking – bypassing built-in safety mechanisms in AI models – has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

[NLP-60] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

【Quick Read】: This paper addresses two problems: traditional human activity recognition (HAR) requires costly retraining whenever new behaviours or sensor setups appear, and existing LLM-based HAR methods, which convert signals into text or images, suffer from limited accuracy and lack verifiable interpretability. The key is ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. Its core components are an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on that evidence, and output both activity predictions and natural-language explanations, enabling flexible and trustworthy HAR without fine-tuning or task-specific classifiers.

Link: https://arxiv.org/abs/2508.04038
Authors: Zechen Li,Baiyu Chen,Hao Xue,Flora D. Salim
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at this https URL.

[NLP-61] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

【Quick Read】: This paper addresses the poor performance of meta-learning-based model editing (MLBME) in low-data scenarios and the training-efficiency bottleneck caused by KL-divergence computation. The key is SMEdit, a new method that adopts Multiple Backpropagation Steps (MBPS) to improve editing under limited supervision and applies norm regularization on weight updates to speed up training, achieving strong editing performance with markedly better computational efficiency.

Link: https://arxiv.org/abs/2508.04012
Authors: Xiaopeng Li,Shasha Li,Xi Wang,Shezheng Song,Bin Ji,Shangwen Wang,Jun Ma,Xiaodong Liu,Mina Liu,Jie Yu
Institutions: National University of Defense Technology; KylinSoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.
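
The two ingredients named in the abstract, multiple backpropagation steps and norm regularization on weight updates, can be sketched in a few lines of PyTorch. This is a schematic under assumed interfaces (`loss_fn` and the edit batch format), not the authors' implementation.

```python
# Schematic MBPS edit loop with update-norm regularization (assumed interfaces).
import torch

def mbps_edit(model, edit_batch, loss_fn, steps: int = 4, lr: float = 1e-4, reg: float = 0.1):
    original = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):                    # multiple backpropagation steps
        opt.zero_grad()
        loss = loss_fn(model, edit_batch)     # editing objective on the new fact
        # Penalize the norm of the cumulative weight update to limit drift.
        drift = sum(((p - original[n]) ** 2).sum() for n, p in model.named_parameters())
        (loss + reg * drift).backward()
        opt.step()
    return model
```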

[NLP-62] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

【Quick Read】: This paper addresses the difficulty of simultaneously guaranteeing task performance and safety for LLM-based web agents during long-sequence operations; existing work is mostly limited to single-objective optimization or single-turn scenarios and cannot co-optimize safety and utility. The key is HarmonyGuard, a multi-agent collaborative framework with two core capabilities: (1) Adaptive Policy Enhancement, where a Policy Agent automatically extracts and maintains structured security policies from unstructured external documents and keeps updating them against evolving web threats; and (2) Dual-Objective Optimization, where a Utility Agent performs Markovian real-time reasoning to evaluate the safety and utility objectives and uses metacognitive capabilities to optimize them jointly. Across multiple benchmarks, the framework improves policy compliance by up to 38% and task completion by up to 20% over baselines, while maintaining over 90% policy compliance on all tasks.

Link: https://arxiv.org/abs/2508.04010
Authors: Yurun Chen,Xavier Hu,Yuhan Liu,Keting Yin,Juncheng Li,Zhuosheng Zhang,Shengyu Zhang
Institutions: Zhejiang University; Xiamen University; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: this https URL.

[NLP-63] ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

【Quick Read】: This paper addresses the difficulty of identifying users' real search intent from context-dependent queries in conversational search, together with the data scarcity faced when fine-tuning conversational dense retrievers on relevance judgments. The key is ConvMix, a mixed-criteria framework that augments conversational dense retrieval with a two-sided relevance judgment augmentation schema, scalably generating richer annotations with the aid of large language models (LLMs), combined with quality-control mechanisms that yield semantically diverse samples and near-distribution supervision, thereby effectively improving conversational dense retrievers.

Link: https://arxiv.org/abs/2508.04001
Authors: Fengran Mo,Jinghan Zhang,Yuchen Hui,Jia Ao Sun,Zhichao Xu,Zhan Su,Jian-Yun Nie
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Conversational search aims to satisfy users’ complex information needs via multiple-turn interactions. The key challenge lies in revealing real users’ search intent from the context-dependent queries. Previous studies achieve conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm encounters data scarcity issues. To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a two-sided relevance judgment augmentation schema in a scalable manner via the aid of large language models. Besides, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and near-distribution supervisions to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained by our ConvMix framework outperforms previous baseline methods, which demonstrates our superior effectiveness.

[NLP-64] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

【Quick Read】: This paper addresses the problem that group-meeting facilitators, under heavy cognitive load, struggle to detect in real time the subtle dynamics of individual goal-setting and execution difficulties or social disengagement, leading to missed or ineffective interventions. Existing social-sensing systems built on powerful but "black box" foundation models (FMs) lack the transparency and interpretability needed for trustworthy decision support. The key is a social robot co-facilitator powered by an agentic concept bottleneck model (CBM): a transfer-learning framework distills the FM's broad social understanding into reasoning grounded in human-interpretable concepts (e.g., participant engagement and sentiment), accurately predicting the need for intervention and allowing real-time human correction. The approach significantly outperforms direct zero-shot FMs, generalizes across groups, and transfers the cognitive models of senior facilitators to novices, offering an interpretable, trustworthy blueprint for augmenting human capabilities in complex social settings.

Link: https://arxiv.org/abs/2508.03998
Authors: Xinyu Zhao,Zhen Tan,Maya Enisman,Minjae Seo,Marta R. Durantini,Dolores Albarracin,Tianlong Chen
Institutions: The University of North Carolina at Chapel Hill; Arizona State University; University of Pennsylvania
Subjects: Computation and Language (cs.CL)
Comments: 27 pages, 7 figures

Abstract:Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but “black box” foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot’s reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert’s cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

[NLP-65] Are Today’s LLMs Ready to Explain Well-Being Concepts?

【Quick Read】: This paper asks how large language models (LLMs) can generate explanations of well-being concepts that are both factually accurate and tailored to different audiences; high-quality explanations must be correct and also match the knowledge level and expectations of the target users. The key is a large-scale dataset of 43,880 explanations of 2,194 well-being concepts generated by ten diverse LLMs, together with a principle-guided LLM-as-a-judge evaluation framework that uses dual judges to assess explanation quality; furthermore, fine-tuning an open-source model with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) significantly improves explanation quality, outperforming larger untuned models and validating preference-based learning for specialized explanation tasks.

Link: https://arxiv.org/abs/2508.03990
Authors: Bohan Jiang,Dawei Li,Zhen Tan,Chengshuai Zhao,Huan Liu
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 4 figures, 3 tables

Abstract:Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

[NLP-66] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

【Quick Read】: This paper addresses the high token expenditure of self-consistency on long chain-of-thought reasoning tasks, which limits its practical utility, while aiming to preserve its parallelism. The key is early hypothesis pruning: all solutions are generated in parallel, and redundant intermediate hypotheses are periodically pruned based on two lightweight indicators, namely the model's own confidence in individual hypotheses and the lexical coverage of all current hypotheses by candidate subsets under consideration for continued retention; a fast weighted set cover algorithm is designed to perform the pruning efficiently. Evaluations of five LLMs on three math benchmarks show token-efficiency gains for all models, by 10-35% in many cases.

Link: https://arxiv.org/abs/2508.03979
Authors: Md Arafat Sultan,Ramón Fernandez Astudillo
Institutions: IBM Research AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model’s own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
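
A toy version of the pruning step might look like the greedy weighted set cover below: each hypothesis is a token list, and the greedy rule weights a candidate's uncovered-token gain by the model's confidence in it. The exact weighting and stopping rule in the paper may differ.

```python
# Greedy confidence-weighted set cover over hypothesis token sets (illustrative).
def prune_hypotheses(hypotheses, confidences, budget: int):
    """hypotheses: list of token lists; confidences: one score per hypothesis.
    Returns indices of hypotheses retained for continued decoding."""
    universe = set().union(*(set(h) for h in hypotheses))
    kept, covered = [], set()
    while len(kept) < min(budget, len(hypotheses)) and covered != universe:
        def gain(i):
            # Confidence-weighted number of tokens this hypothesis newly covers.
            return confidences[i] * len(set(hypotheses[i]) - covered)
        best = max((i for i in range(len(hypotheses)) if i not in kept), key=gain)
        kept.append(best)
        covered |= set(hypotheses[best])
    return kept
```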

[NLP-67] Data and AI governance: Promoting equity, ethics and fairness in large language models

【Quick Read】: This paper addresses bias, fairness, ethics, and factuality issues across the full life cycle of generative AI, especially during the development, validation, deployment, and continuous monitoring of large language models (LLMs). The key is a systematic data and AI governance framework, building on the authors' earlier Bias Evaluation and Assessment Test Suite (BEATS), that enables rigorous benchmarking of LLMs prior to production deployment, continuous real-time evaluation, and proactive governance of generated responses, thereby significantly enhancing the safety and responsibility of generative AI systems, mitigating discrimination risks, and protecting against reputational or brand-related harm.

Link: https://arxiv.org/abs/2508.03970
Authors: Alok Abhishek,Lisa Erickson,Tushar Bandopadhyay
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in MIT Science Policy Review 6, 139-146 (2025)

Abstract:In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, the authors share prevalent bias and fairness related gaps in Large Language Models (LLMs) and discuss data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM generated responses. By implementing the data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancement of the creation and deployment of socially responsible and ethically aligned generative artificial intelligence powered applications.

[NLP-68] Accelerating Scientific Discovery with Multi-Document Summarization of Impact-Ranked Papers

【Quick Read】: This paper addresses the difficulty researchers face in moving from a scattered list of papers to a synthesized understanding of a topic amid the ever-growing volume of literature: even after identifying a promising set of papers, one must still read dozens of titles and abstracts to reconcile occasionally conflicting findings, which is slow and inefficient. The key is adding a summarization feature to BIP! Finder, a scholarly search engine that ranks literature by distinct impact aspects, dynamically leveraging its existing impact-based ranking and filtering to generate two kinds of summaries from top-ranked results: a concise at-a-glance summary for instant comprehension, and a more comprehensive literature-review-style summary for deeper, better-organized understanding, significantly accelerating literature discovery and comprehension.

Link: https://arxiv.org/abs/2508.03962
Authors: Paris Koloveas,Serafeim Chatzopoulos,Dionysis Diamantis,Christos Tryfonopoulos,Thanasis Vergoulis
Institutions: IMSI, Athena RC; University of the Peloponnese
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The growing volume of scientific literature makes it challenging for scientists to move from a list of papers to a synthesized understanding of a topic. Because of the constant influx of new papers on a daily basis, even if a scientist identifies a promising set of papers, they still face the tedious task of individually reading through dozens of titles and abstracts to make sense of occasionally conflicting findings. To address this critical bottleneck in the research workflow, we introduce a summarization feature to BIP! Finder, a scholarly search engine that ranks literature based on distinct impact aspects like popularity and influence. Our approach enables users to generate two types of summaries from top-ranked search results: a concise summary for an instantaneous at-a-glance comprehension and a more comprehensive literature review-style summary for greater, better-organized comprehension. This ability dynamically leverages BIP! Finder’s already existing impact-based ranking and filtering features to generate context-sensitive, synthesized narratives that can significantly accelerate literature discovery and comprehension.

[NLP-69] ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

【Quick Read】: This paper addresses the safety uncertainty of AI coding assistants (such as GitHub Copilot) in high-stakes domains like cybersecurity, where existing red-teaming tools rely on fixed benchmarks or unrealistic prompts and miss real-world vulnerabilities. The key is ASTRA, an automated agent system that systematically probes safety flaws in AI-driven code generation and security-guidance systems in three stages: it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; it performs online vulnerability exploration of each target model guided by those graphs, adaptively probing both the input space (spatial exploration) and the reasoning process (temporal exploration); and it generates high-quality violation-inducing cases to improve model alignment. ASTRA's core innovation is its focus on realistic developer requests, combined with offline abstraction-guided domain modeling and online knowledge-graph adaptation, which surfaces corner-case vulnerabilities more effectively.

Link: https://arxiv.org/abs/2508.03936
Authors: Xiangzhe Xu,Guangyu Shen,Zian Su,Siyuan Cheng,Hanxi Guo,Lu Yan,Xuan Chen,Jiasheng Jiang,Xiaolong Jin,Chengpeng Wang,Zhuo Zhang,Xiangyu Zhang
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: The first two authors (Xiangzhe Xu and Guangyu Shen) contributed equally to this work

Abstract:AI coding assistants like GitHub Copilot are rapidly transforming software development, but their safety remains deeply uncertain-especially in high-stakes domains like cybersecurity. Current red-teaming tools often rely on fixed benchmarks or unrealistic prompts, missing many real-world vulnerabilities. We present ASTRA, an automated agent system designed to systematically uncover safety flaws in AI-driven code generation and security guidance systems. ASTRA works in three stages: (1) it builds structured domain-specific knowledge graphs that model complex software tasks and known weaknesses; (2) it performs online vulnerability exploration of each target model by adaptively probing both its input space, i.e., the spatial exploration, and its reasoning processes, i.e., the temporal exploration, guided by the knowledge graphs; and (3) it generates high-quality violation-inducing cases to improve model alignment. Unlike prior methods, ASTRA focuses on realistic inputs-requests that developers might actually ask-and uses both offline abstraction guided domain modeling and online domain knowledge graph adaptation to surface corner-case vulnerabilities. Across two major evaluation domains, ASTRA finds 11-66% more issues than existing techniques and produces test cases that lead to 17% more effective alignment training, showing its practical value for building safer AI systems.

[NLP-70] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

【Quick Read】: This paper addresses insufficient user-interest modeling and weak factual consistency in personalized news headline generation, where existing methods often produce generic or misleading headlines because they fail to capture complex user preferences. The key is the Context-Augmented Personalized LLM (CAP-LLM) framework with three core components: (1) a User Preference Encoder that models long-term interests; (2) a Context Injection Adapter that seamlessly fuses user preferences and current article content into the pretrained LLM's generation process; and (3) a Fact-Consistency Reinforcement Module based on a novel contrastive loss that markedly reduces hallucination. On the real-world PENS dataset, CAP-LLM jointly improves factual consistency (FactCC of 87.50, versus 86.67 for a strong BART baseline) and personalization (Pc(avg) 2.73, Pc(max) 17.25), demonstrating a superior balance between personalization and factual accuracy.

Link: https://arxiv.org/abs/2508.03935
Authors: Raymond Wilson,Cole Graham,Chase Carter,Zefeng Yang,Ruiqi Gu
Institutions: National Energy University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM’s generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM’s ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

[NLP-71] CoAct-1: Computer-using Agents with Coding as Actions

【Quick Read】: This paper addresses the low efficiency and reliability of autonomous agents that execute complex, long-horizon tasks through graphical user interfaces (GUIs). Even with planning modules for task decomposition, agents remain constrained by performing every action through GUI manipulation, making their behavior brittle and inefficient. The key is a hybrid paradigm that combines GUI control with programmatic execution: the CoAct-1 multi-agent system has an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent that writes and executes Python or Bash scripts, bypassing redundant GUI action sequences for tasks like file management and data processing while retaining visual interaction when necessary. This markedly improves completion efficiency and robustness, reaching a state-of-the-art 60.76% success rate on the OSWorld benchmark and reducing the average number of steps to 10.15, surpassing the best GUI-only methods.

Link: https://arxiv.org/abs/2508.03923
Authors: Linxin Song,Yutong Dai,Viraj Prabhu,Jieyu Zhang,Taiwei Shi,Li Li,Junnan Li,Silvio Savarese,Zeyuan Chen,Jieyu Zhao,Ran Xu,Caiming Xiong
Institutions: University of Southern California; Salesforce Research; University of Washington
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

[NLP-72] Sotopia-RL: Reward Design for Social Intelligence

【Quick Read】: This paper addresses two core challenges that reinforcement learning (RL) faces when training socially intelligent large language models (LLMs): partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute to goal achievement in ways a single-dimensional reward cannot capture, inviting reward hacking. The key innovation of Sotopia-RL is refining coarse episode-level feedback into utterance-level, multi-dimensional rewards: attributing outcomes to individual utterances mitigates partial observability, while the multi-dimensional reward structure models the full richness of social interaction, yielding state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full) in the open-ended Sotopia environment.

Link: https://arxiv.org/abs/2508.03905
Authors: Haofei Yu,Zhengyang Qi,Yining Zhao,Kolby Nottingham,Keyang Xuan,Bodhisattwa Prasad Majumder,Hao Zhu,Paul Pu Liang,Jiaxuan You
Institutions: University of Illinois Urbana-Champaign; University of California Irvine; Allen Institute for Artificial Intelligence; Carnegie Mellon University; Stanford University; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: this https URL.

[NLP-73] An Entity Linking Agent for Question Answering AAAI2026

【Quick Read】: This paper addresses the poor performance of entity linking (EL) on short, ambiguous user questions in question answering (QA); most existing EL methods are designed for long contexts and struggle to identify and link entities in short queries, hurting QA accuracy. The key is an LLM-based entity linking agent that simulates human cognitive workflows, actively performing three steps: identifying entity mentions, retrieving candidate entities, and making decisions. Two experiments, tool-based entity linking and QA task evaluation, confirm the agent's robustness and effectiveness.

Link: https://arxiv.org/abs/2508.03865
Authors: Yajie Luo,Yihong Wu,Muzhi Li,Fengran Mo,Jia Ao Sun,Xinyu Wang,Liheng Ma,Yingxue Zhang,Jian-Yun Nie
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 2 figures. Submitted to AAAI 2026 Conference

Abstract:Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decision. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

[NLP-74] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

【Quick Read】: This review addresses the factual-accuracy risks of large language models (LLMs), which can generate false or misleading content because they are trained on vast and imperfect internet corpora; the core challenges are hallucinations, dataset limitations, and the reliability of evaluation metrics. The key lies in strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) to improve the verifiability and factual consistency of outputs. Guided by five research questions over the 2020-2025 literature, the review emphasizes grounding outputs in validated external evidence, the value of domain-specific customization, and the combined roles of instruction tuning and multi-agent reasoning in building more trustworthy, context-aware language models.

Link: https://arxiv.org/abs/2508.03860
Authors: Subhey Sadi Rahman,Md. Adnanul Islam,Md. Mahbub Alam,Musarrat Zeba,Md. Abdur Rahman,Sadia Sultana Chowa,Mohaimenul Azam Khan Raiaan,Sami Azam
Institutions: United International University; Daffodil International University; Charles Darwin University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages, 11 figures, 6 tables. Submitted to Artificial Intelligence Review for peer review

Abstract:Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

[NLP-75] Majority Bit-Aware Watermarking For Large Language Models

【Quick Read】: This paper addresses misuse risks of LLM-generated content, using watermarking for origin verification and misuse tracing. Existing multi-bit watermarking schemes face a fundamental trade-off: to keep message decoding reliable, they must restrict the size of the preferred token set during encoding, which degrades the quality of the generated text. The key innovation of MajorMark is majority bit-aware encoding: the preferred token set is selected according to the majority bit of the message, enlarging the sampling space and improving flexibility; instead of the frequency-based decoding used by prior methods, a clustering-based decoding strategy maintains high decoding accuracy even with a large preferred token set, easing the conflict between text quality and decoding accuracy. MajorMark+ further partitions the message into blocks that are encoded independently and decoded deterministically, further improving watermarked-text quality and decoding performance. Experiments on mainstream LLMs show clear gains over prior multi-bit watermarking baselines.

Link: https://arxiv.org/abs/2508.03829
Authors: Jiahao Xu,Rui Hu,Zikai Zhang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Preprint

Abstract:The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark+, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines.
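
A minimal sketch of what majority-bit-aware selection of the preferred token set could look like, assuming a pseudo-random green/red vocabulary partition in the style of common LLM watermarks; the partitioning, seeding, and split ratio are assumptions for illustration.

```python
# Illustrative majority-bit-aware preferred-set selection (details assumed).
import random

def preferred_tokens(message_bits, vocab, seed: int, ratio: float = 0.5):
    majority = int(2 * sum(message_bits) >= len(message_bits))  # majority bit of the message
    rng = random.Random(seed)                 # e.g., seeded from preceding context
    shuffled = list(vocab)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    green, red = shuffled[:cut], shuffled[cut:]
    return green if majority == 1 else red    # larger, bit-dependent sampling pool
```

On the decoding side, the paper's clustering-based strategy would group observed token statistics rather than count per-token frequencies; that part is not sketched here.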

[NLP-76] MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources

【Quick Read】: This paper addresses the lack of a large, richly structured resource for multilingual, cross-temporal fact checking and analysis; existing resources fall short in coverage, citation-source completeness, and multilingual support. The key is MegaWika 2, a large multilingual dataset of Wikipedia articles in which every article carries its citations, with precise character offsets, alongside the scraped text of the cited web sources; compared with the original MegaWika it spans six times as many articles and twice as many fully scraped citations, substantially broadening and deepening the data and strengthening support for fact checking and analyses across time and language.

Link: https://arxiv.org/abs/2508.03828
Authors: Samuel Barham,Chandler May,Benjamin Van Durme
Institutions: Human Language Technology Center of Excellence; Johns Hopkins University
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments:

Abstract:We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.

[NLP-77] AttnTrace: Attention-based Context Traceback for Long-Context LLMs

【Quick Read】: This paper addresses how to trace, efficiently and accurately, the context passages that a long-context large language model (LLM) relied on when generating a response, in applications such as retrieval-augmented generation (RAG) and autonomous agents. Existing methods like TracLLM achieve traceback but at enormous computational cost, taking hundreds of seconds for a single response-context pair. The key innovation of AttnTrace is using the attention weights the LLM produces over the prompt as the traceback signal, together with two techniques that improve attribution accuracy and theoretical insights supporting the design choices. Experiments show AttnTrace is more accurate and efficient than state-of-the-art traceback methods and can further improve prompt-injection detection under long contexts via the attribution-before-detection paradigm, including pinpointing instructions injected into a paper to manipulate LLM-generated reviews.

Link: https://arxiv.org/abs/2508.03793
Authors: Yanting Wang,Runpeng Geng,Ying Chen,Jinyuan Jia
Institutions: Pennsylvania State University
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: The code is available at this https URL . The demo is available at this https URL

Abstract:Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context–often consisting of texts retrieved from a knowledge database or memory–and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at this https URL.
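
The core signal, the attention mass that response tokens pay to each context text, can be sketched as below; the tensor layout and the sum-aggregation rule are assumptions for illustration, and the paper's two accuracy-enhancing techniques are not reproduced here.

```python
# Simplified attention-based context traceback (layout and aggregation assumed).
import numpy as np

def trace_context(attn: np.ndarray, segment_spans, top_k: int = 3):
    """attn: [response_len, prompt_len] attention, averaged over layers and heads.
    segment_spans: (start, end) prompt-token index pairs, one per context text."""
    scores = [float(attn[:, s:e].sum()) for s, e in segment_spans]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), scores[i]) for i in order]  # most influential context texts
```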

[NLP-78] GTPO: Trajectory-Based Policy Optimization in Large Language Models

【Quick Read】: This paper addresses two key limitations of Group-relative Policy Optimization (GRPO) for language model training: (i) some tokens appear in both positively and negatively rewarded completions, producing conflicting gradient updates that lower their output probability even when they are essential for proper output structure; and (ii) negatively rewarded completions penalize confident responses, pushing decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. The core solution, GTPO (Group-relative Trajectory-based Policy Optimization), identifies "conflict tokens" (tokens appearing at the same position in completions with opposite rewards), skips negative gradient updates on them while amplifying positive ones, and thereby stabilizes policy optimization; GTPO additionally filters out completions whose entropy exceeds a provable threshold to prevent policy collapse, and it requires neither KL-divergence regularization nor a reference model while delivering more stable training and better performance.

Link: https://arxiv.org/abs/2508.03772
Authors: Marco Simoni,Aleksandar Fontana,Giulio Rossolini,Andrea Saracino
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks.
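
A schematic of the conflict-token protection described above: a token counts as a conflict token if it occurs at the same position in both positively and negatively rewarded completions of a group; negative advantages on such tokens are zeroed (skipped) and positive ones amplified. The boost factor and data layout are illustrative choices, not the paper's exact update rule.

```python
# Illustrative GTPO-style conflict-token masking over one GRPO group.
def adjust_advantages(completions, rewards, advantages, boost: float = 1.5):
    """completions: token lists; rewards: one scalar each; advantages: per-token lists."""
    max_len = max(len(c) for c in completions)
    conflict = set()
    for t in range(max_len):
        pos = {c[t] for c, r in zip(completions, rewards) if t < len(c) and r > 0}
        neg = {c[t] for c, r in zip(completions, rewards) if t < len(c) and r <= 0}
        conflict |= {(t, tok) for tok in pos & neg}   # same position, opposite rewards
    for c, r, adv in zip(completions, rewards, advantages):
        for t, tok in enumerate(c):
            if (t, tok) in conflict:
                adv[t] = adv[t] * boost if r > 0 else 0.0  # skip negative, amplify positive
    return advantages
```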

[NLP-79] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

【Quick Read】: This paper addresses the lack of systematic benchmarks for evaluating the reasoning of vision language models (VLMs) in multilingual settings, especially low-resource languages such as Hindi: existing benchmarks are mostly monolingual (chiefly English), and high-quality multilingual datasets for complex tasks like mathematics are scarce. The authors introduce GanitBench, a bilingual (English and Hindi) benchmark of 1,527 vision-only mathematics questions drawn from two major Indian examinations, JEE Advanced and the CBSE Boards, covering problems whose solution requires combining figures with text. The key contribution is this rigorously structured, linguistically diverse, and challenging visual math reasoning benchmark; two closed-source models are evaluated under zero-shot Chain-of-Thought (CoT) and two-shot CoT settings, and an additional "Double Lock" constraint more strictly mirrors realistic reasoning workflows, revealing a performance drop in Hindi and a relative advantage for two-shot CoT.

Link: https://arxiv.org/abs/2508.03737
Authors: Ashutosh Bandooni,Brindha Subburaj
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures. Accepted, Presented and Published as part of Proceedings of the 6th International Conference on Recent Advances in Information Technology (RAIT) 2025

Abstract:Benchmarks for evaluating reasoning among Vision Language Models (VLMs) across several fields and domains are being curated more frequently over the last few years. However, these are often monolingual, mostly available in English. Additionally, there is a lack of datasets available in Hindi on tasks apart from comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations from India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models on the benchmark, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a "Double Lock" constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work.

[NLP-80] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

【Quick Read】: This paper addresses three challenges facing reasoning-based multimodal large language models (MLLMs) in multi-task chest X-ray (CXR) diagnosis: the reasoning process lacks verifiable supervision, multi-task settings suffer from lengthy reasoning chains and sparse rewards, and hallucinations are frequent. The key is CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning with verifiable process rewards (CuRL-VPR). Concretely, the authors build CX-Set, an instruction-tuning dataset of 708,473 images and 2,619,148 samples, and generate 42,828 high-quality interleaved reasoning data points supervised by clinical reports; optimization proceeds in two stages under the Group Relative Policy Optimization framework, first stabilizing basic reasoning on closed-domain tasks and then transferring to open-domain diagnosis, with rule-based conditional process rewards that remove the need for a pretrained reward model. Experiments show CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, with an average gain of 25.1% over comparable CXR-specific models; on the real-world clinical dataset Rui-CXR, its mean recall@1 across 14 diseases substantially surpasses the second-best result.

Link: https://arxiv.org/abs/2508.03733
Authors: Wenjie Li,Yujie Zhang,Haoran Sun,Yueqi Li,Fanrui Zhang,Mengzhe Xu,Victoria Borja Clausich,Sade Mellin,Renhao Yang,Chenrun Wang,Jethro Zih-Shuo Wang,Shiyi Yao,Gen Li,Yidong Xu,Hanyu Wang,Yilin Huang,Angela Lin Wang,Chen Shi,Yin Zhang,Jianan Guo,Luqi Yang,Renxuan Li,Yang Xu,Jiawei Liu,Yao Zhang,Lei Liu,Carlos Gutiérrez SanRomán,Lei Wang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best result, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.
zh
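
The summary above mentions Group Relative Policy Optimization (GRPO). As a point of reference, the sketch below shows the group-relative advantage computation that gives GRPO its name: rewards for several responses sampled from the same prompt are standardized within the group, replacing a learned value baseline. All names and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards_per_group):
    """Group-relative advantages in the GRPO style: each response's
    advantage is its reward standardized against the group mean/std,
    so no separate value network is needed as a baseline."""
    advantages = []
    for rewards in rewards_per_group:
        r = np.asarray(rewards, dtype=np.float64)
        advantages.append((r - r.mean()) / (r.std() + 1e-8))
    return advantages

# Toy example: two prompts, four sampled responses each, rule-based rewards.
print(group_relative_advantages([[1.0, 0.0, 0.0, 1.0], [0.2, 0.8, 0.5, 0.5]]))
```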

[NLP-81] WINELL: Wikipedia Never-Ending Updating with LLM Agents

【Quick Read】: This paper addresses the difficulty of keeping Wikipedia up to date given its reliance on manual editing; the core challenge is automating continuous maintenance of a knowledge base in the face of rapidly changing information. The key to the solution is WiNELL, an LLM-based multi-agent framework that aggregates online information, identifies new and important knowledge about a target entity, and generates precise edit suggestions consistent with human editing habits. Fine-grained editing models trained on Wikipedia's extensive history of human edits keep updates aligned with human editing behavior, outperforming open-source instruction-following baselines and closed-source large models (e.g., GPT-4o) in key information coverage and editing efficiency, and enabling timely factual updates on high-activity pages.

Link: https://arxiv.org/abs/2508.03728
Authors: Revanth Gangi Reddy,Tanay Dixit,Jiaxin Qin,Cheng Qian,Daniel Lee,Jiawei Han,Kevin Small,Xing Fan,Ruhi Sarikaya,Heng Ji
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia’s extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL’s ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

[NLP-82] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference DSN

【Quick Read】: This paper targets the inference-efficiency bottleneck of large language models (LLMs) caused by their autoregressive nature. Although methods such as speculative decoding and beam sampling improve performance, traditional implementations verify candidate sequences sequentially without prioritization, incurring unnecessary computation. The key to the solution is the Hierarchical Verification Tree (HVT), a framework that verifies high-likelihood candidates first and supports early pruning of low-quality paths, substantially improving inference efficiency without architecture changes or retraining. Theoretical analysis and a formal verification-pruning algorithm ensure correctness and efficiency, and experiments show HVT markedly reduces inference time and energy consumption across datasets and models while maintaining or improving output quality.

Link: https://arxiv.org/abs/2508.03726
Authors: Jaydip Sen,Harshitha Puvvala,Subhasis Dasgupta
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This paper was accepted for oral presentation and publication in the 3rd International Conference on Data Science and Network Engineering (ICDSNE 2025), organized at NIT, Agartala, India, from July 25 to 26, 2025. The paper is 12 pages long, and it contains 3 tables and 4 figures. This is NOT the final paper, which will be published in the Springer-published proceedings

Click to view abstract

Abstract:Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
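
To make the idea of prioritized verification with early pruning concrete, here is a schematic sketch (our own simplification, not the authors' algorithm): draft continuations are popped from a max-heap by likelihood, verified one at a time, and the loop stops early once the remaining drafts fall below a log-probability threshold.

```python
import heapq

def prioritized_verification(drafts, verify_fn, prune_threshold=-6.0):
    """Verify (log_prob, token_ids) drafts from a fast draft model,
    highest-likelihood first, instead of in arrival order."""
    heap = [(-lp, tokens) for lp, tokens in drafts]
    heapq.heapify(heap)
    while heap:
        neg_lp, tokens = heapq.heappop(heap)
        if -neg_lp < prune_threshold:
            break  # early pruning: every remaining draft is even less likely
        if verify_fn(tokens):  # one target-model check per draft
            return tokens
    return None  # fall back to normal autoregressive decoding

# Toy usage: accept a draft iff it starts with token 7.
drafts = [(-1.2, [7, 3]), (-0.4, [7, 9]), (-8.0, [2, 2])]
print(prioritized_verification(drafts, lambda t: t[0] == 7))  # -> [7, 9]
```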

[NLP-83] Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

【Quick Read】: This paper addresses the difficulty Indian farmers face in accessing timely, accessible, language-friendly agricultural advice, especially among low-literacy rural populations. The core of the solution is Krishi Sathi, an AI-powered agricultural chatbot that gradually collects user intent and context through a structured multi-turn dialogue flow, then performs Retrieval-Augmented Generation (RAG): it retrieves relevant data from a curated agricultural knowledge base and generates a personalized response with an instruction-finetuned transformer (IFT) model. The system supports English and Hindi text as well as speech input and output (ASR/TTS), markedly improving accessibility for users with low digital literacy. The approach achieves 97.53% query-response accuracy, 91.35% contextual relevance and personalization, and an average response time under 6 seconds, effectively improving the quality and reach of digital agricultural support.

Link: https://arxiv.org/abs/2508.03719
Authors: Abhay Vijayvargia,Ajay Nagpal,Kundeshwar Pundalik,Atharva Savarkar,Smita Gautam,Pankaj Singh,Rohit Saluja,Ganesh Ramakrishnan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system's intelligence stems from an IFT model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi languages, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions.
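
A minimal sketch of the intent-driven, multi-turn RAG flow described above, with toy stand-ins for the retriever and the instruction-tuned generator; the slot names and components are hypothetical, not the paper's code.

```python
def multi_turn_rag(turns, retriever, generator,
                   required_slots=("crop", "district", "issue")):
    """Collect intent slots over turns; retrieve and generate only once
    the query is fully specified, mirroring the described dialogue flow."""
    slots = {}
    for turn in turns:
        slots.update(turn)  # each turn contributes extracted slot values
        missing = [s for s in required_slots if s not in slots]
        if missing:
            print(f"Follow-up question about: {missing[0]}")
        else:
            docs = retriever(slots)        # fetch from curated agri KB
            return generator(slots, docs)  # instruction-tuned LM answers
    return None

# Toy components standing in for the real retriever / IFT model.
answer = multi_turn_rag(
    [{"crop": "wheat"}, {"district": "Pune"}, {"issue": "leaf rust"}],
    retriever=lambda s: ["Use resistant varieties; apply a fungicide."],
    generator=lambda s, d: f"For {s['crop']} {s['issue']}: {d[0]}",
)
print(answer)
```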

[NLP-84] Health Insurance Coverage Rule Interpretation Corpus: Law Policy and Medical Guidance for Health Insurance Coverage Understanding

【Quick Read】: This paper addresses the poor legal understanding and limited access to justice faced by vulnerable groups under the complexity of the U.S. health-insurance system; existing corpora lack the context needed to assess even simple insurance appeal cases. The key to the solution is collecting and releasing a corpus of reputable legal and medical text, designing an outcome prediction task for health-insurance appeals, and providing a labeled benchmark together with trained models, thereby supporting regulatory and patient self-help applications and improving the efficiency and equity of access to healthcare.

Link: https://arxiv.org/abs/2508.03718
Authors: Mike Gartner
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages, 7 figures

Click to view abstract

Abstract:U.S. health insurance is complex, and inadequate understanding and limited access to justice have dire implications for the most vulnerable. Advances in natural language processing present an opportunity to support efficient, case-specific understanding, and to improve access to justice and healthcare. Yet existing corpora lack context necessary for assessing even simple cases. We collect and release a corpus of reputable legal and medical text related to U.S. health insurance. We also introduce an outcome prediction task for health insurance appeals designed to support regulatory and patient self-help applications, and release a labeled benchmark for our task, and models trained on it.

[NLP-85] FeynTune: Large Language Models for High-Energy Theory

【Quick Read】: This paper targets the shortcomings of general-purpose large language models (LLMs) in theoretical high-energy physics: insufficient domain coverage, imprecise handling of specialized terminology, and limited reasoning. The key to the solution is building subdomain-specialized fine-tuned models: 20 variants of the 8-billion-parameter Llama-3.1 model, each fine-tuned on arXiv abstracts from different combinations of the hep-th, hep-ph, and gr-qc categories, using two Low-Rank Adaptation (LoRA) approaches to balance training efficiency and performance. Experiments show these specialized models clearly outperform the base model on hep-th abstract completion and hold up well against leading commercial models (ChatGPT, Claude, Gemini, DeepSeek), offering an effective path toward high-performance specialized language models for high-energy theory research.

Link: https://arxiv.org/abs/2508.03716
Authors: Paul Richmond,Prarit Agarwal,Borun Chowdhury,Vasilis Niarchos,Constantinos Papageorgakis
Affiliations: Queen Mary University of London; Amazon; Meta; University of Crete
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 16 pages

Click to view abstract

Abstract:We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.
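
The paper fine-tunes Llama-3.1 variants with Low-Rank Adaptation. A typical LoRA setup with the Hugging Face peft library looks like the sketch below; the rank, scaling, and target modules shown are illustrative defaults, not the paper's reported hyperparameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Gated model; assumes local access/credentials for the checkpoint.
base = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16,                                 # low-rank dimension (illustrative)
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of the 8B weights train
```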

[NLP-86] How Deep Is Representational Bias in LLM s? The Cases of Caste and Religion

【Quick Read】: This paper addresses the systematic problem of representational bias in large language models (LLMs), focusing on under-studied identity dimensions in the Global South such as religion and caste in India. Prior work mostly targets Global North-centric attributes like race and gender and relies on single-response evaluations, which cannot reveal the deeper structure of bias. Through a systematic audit of GPT-4 Turbo generating life-event stories for diverse Indian identities, the study quantifies the degree and "stickiness" of bias along religion and caste, finding that even under diversity-encouraging prompts the model substantially overrepresents culturally dominant groups, with a winner-take-all quality that exceeds the likely distribution bias of the training data. The key conclusion: diversifying training data alone cannot remove such bias; more fundamental changes to model architecture, training, and our understanding of how bias evolves are needed.

Link: https://arxiv.org/abs/2508.03712
Authors: Agrima Seth,Monojit Choudhary,Sunayana Sitaram,Kentaro Toyama,Aditya Vashistha,Kalika Bali
Affiliations: 1: University of Michigan; 2: Microsoft; 3: Google; 4: Meta
Subjects: Computation and Language (cs.CL)
Comments: Accepted to AIES 2025

Click to view abstract

Abstract:Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and “stickiness” of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: this https URL
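
The audit compares generated representation against census shares. A simple way to express that comparison is a per-group representation ratio, as in the toy computation below (the numbers are invented for illustration, not the paper's data).

```python
from collections import Counter

def representation_ratios(generated_groups, census_share):
    """Ratio of a group's frequency in generated stories to its census
    share. Values far above 1.0 indicate overrepresentation."""
    counts = Counter(generated_groups)
    total = sum(counts.values())
    return {g: (counts[g] / total) / share for g, share in census_share.items()}

print(representation_ratios(
    ["A"] * 90 + ["B"] * 10,      # 90% of stories feature group A
    {"A": 0.6, "B": 0.4},         # but A is only 60% of the population
))  # -> A: 1.5 (overrepresented), B: 0.25 (underrepresented)
```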

[NLP-87] A Social Data-Driven System for Identifying Estate-related Events and Topics

【Quick Read】: This paper addresses how to efficiently detect and classify estate-related events from social media content to support urban management and emergency response. The core challenges are extracting actionable information from massive unstructured text and localizing posts that lack explicit geotags. The key to the solution is a language-model-based hierarchical classification system: relevant posts are first filtered through a hierarchical framework and then categorized into actionable estate topics, while a transformer-based geolocation module infers point-of-interest-level locations for posts without geotags, yielding data-driven insights in both space and time.

Link: https://arxiv.org/abs/2508.03711
Authors: Wenchuan Mu,Menglin Li,Kwan Hui Lim
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted at ASONAM 2025

Click to view abstract

Abstract:Social media platforms such as Twitter and Facebook have become deeply embedded in our everyday life, offering a dynamic stream of localized news and personal experiences. The ubiquity of these platforms positions them as valuable resources for identifying estate-related issues, especially in the context of growing urban populations. In this work, we present a language model-based system for the detection and classification of estate-related events from social media content. Our system employs a hierarchical classification framework to first filter relevant posts and then categorize them into actionable estate-related topics. Additionally, for posts lacking explicit geotags, we apply a transformer-based geolocation module to infer posting locations at the point-of-interest level. This integrated approach supports timely, data-driven insights for urban management, operational response and situational awareness.
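
The hierarchical design reduces to a two-stage routing decision followed by optional geolocation; a deliberately minimal sketch with placeholder components (none of these names come from the paper):

```python
def classify_post(text, relevance_clf, topic_clf, geolocate):
    """Stage 1 filters relevance; stage 2 assigns an estate topic;
    a geolocation module fills in location when no geotag exists."""
    if not relevance_clf(text):
        return None  # irrelevant post, dropped early
    return {"topic": topic_clf(text), "location": geolocate(text)}

result = classify_post(
    "Lift broken again at Block 5",
    relevance_clf=lambda t: "lift" in t.lower(),
    topic_clf=lambda t: "facility-fault",
    geolocate=lambda t: "Block 5",
)
print(result)
```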

[NLP-88] Multilingual Source Tracing of Speech Deepfakes: A First Benchmark INTERSPEECH

【Quick Read】: This paper addresses source tracing of multilingual speech deepfakes, i.e., identifying the generation model behind synthetic speech when training and inference languages differ. The key to the solution is building the first benchmark covering both monolingual and cross-lingual scenarios and systematically comparing DSP- and SSL-based modeling; it further analyzes how SSL representations fine-tuned on different languages generalize across languages and to unseen languages and speakers, providing a quantitative evaluation framework and first insights for attributing speech generation models.

Link: https://arxiv.org/abs/2508.04143
Authors: Xi Xuan,Yang Xiao,Rohan Kumar Das,Tomi Kinnunen
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at Interspeech SPSC 2025 - 5th Symposium on Security and Privacy in Speech Communication (Oral)

Click to view abstract

Abstract:Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at this https URL.

[NLP-89] MD-LLM -1: A Large Language Model for Molecular Dynamics

【Quick Read】: This paper addresses the high computational cost of molecular dynamics (MD) when simulating biological macromolecular systems and the difficulty of covering broad conformational space. The key to the solution is the Molecular Dynamics Large Language Model (MD-LLM) framework, which fine-tunes a pretrained large language model (Mistral 7B) to learn protein conformational changes, enabling prediction of unseen conformational states from training data containing a single state and thus exploration of protein conformational landscapes.

Link: https://arxiv.org/abs/2508.03709
Authors: Mhd Hussein Murtada,Z. Faidon Brotzakis,Michele Vendruscolo
Affiliations: Centre for Misfolding Diseases; Yusuf Hamied Department of Chemistry; University of Cambridge
Subjects: Biomolecules (q-bio.BM); Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Molecular dynamics (MD) is a powerful approach for modelling molecular systems, but it remains computationally intensive on the spatial and time scales of many macromolecular systems of biological interest. To explore the opportunities offered by deep learning to address this problem, we introduce a Molecular Dynamics Large Language Model (MD-LLM) framework to illustrate how LLMs can be leveraged to learn protein dynamics and discover states not seen in training. By applying MD-LLM-1, the first implementation of this approach, obtained by fine-tuning Mistral 7B, to the T4 lysozyme and Mad2 protein systems, we show that training on one conformational state enables the prediction of other conformational states. These results indicate that MD-LLM-1 can learn the principles for the exploration of the conformational landscapes of proteins, although it does not yet explicitly model their thermodynamics and kinetics.

Computer Vision

[CV-0] Occupancy Learning with Spatiotemporal Memory STOC ICCV2025

【Quick Read】: This paper addresses efficient temporal aggregation for multi-frame 3D occupancy perception in autonomous driving, where the core challenges are high computational cost and the uncertainty and dynamics of voxels. The key to the solution is the ST-Occ framework with two core designs: a scene-level spatiotemporal memory that efficiently stores and integrates historical information through a scene-level representation, and a memory attention mechanism that conditions the current occupancy representation on that memory with awareness of uncertainty and dynamics. The method substantially improves temporal-dependency modeling across multi-frame inputs, achieving a 3 mIoU gain in 3D occupancy prediction and a 29% reduction in temporal inconsistency.

Link: https://arxiv.org/abs/2508.04705
Authors: Ziyang Leng,Jiawei Yang,Wenlong Yi,Bolei Zhou
Affiliations: University of California, Los Angeles; University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV2025. Project website: this https URL

Click to view abstract

Abstract:3D occupancy becomes a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%.
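
Stripped to its core, the memory attention idea is cross-attention from current-frame features to a stored scene-level memory. The sketch below shows that skeleton in PyTorch; the paper's actual module also models uncertainty and dynamics, which we omit here.

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Condition current occupancy features on a spatiotemporal memory
    via cross-attention (our simplification of the idea)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current_feats, memory_feats):
        # queries come from the current frame, keys/values from the memory
        fused, _ = self.attn(current_feats, memory_feats, memory_feats)
        return current_feats + fused  # residual update

feats = torch.randn(2, 1024, 128)   # (batch, voxels, channels)
memory = torch.randn(2, 1024, 128)
print(MemoryAttention()(feats, memory).shape)  # torch.Size([2, 1024, 128])
```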

[CV-1] BEVCon: Advancing Birds Eye View Perception with Contrastive Learning

【Quick Read】: This paper addresses insufficient representation learning in Bird's Eye View (BEV) perception for autonomous driving: existing work focuses on optimizing BEV encoders and task-specific heads while overlooking the potential of contrastive learning to strengthen BEV features. The key to the solution is the BEVCon framework with two contrastive modules: an instance feature contrast module that refines BEV features, and a perspective view contrast module that enhances the image backbone. Dense contrastive learning on top of the detection loss markedly improves feature representations in both the BEV encoder and the backbone, yielding up to +2.4% mAP on the nuScenes dataset and confirming the important role of representation learning in BEV perception.

Link: https://arxiv.org/abs/2508.04702
Authors: Ziyang Leng,Jiawei Yang,Zhicheng Ren,Bolei Zhou
Affiliations: University of California, Los Angeles; University of Southern California; Aurora Innovation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird’s Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.
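
Contrastive modules like those in BEVCon are typically built on an InfoNCE-style objective. For reference, a standard InfoNCE implementation (generic, not the authors' code), where matching rows of the two inputs form the positive pairs:

```python
import torch
import torch.nn.functional as F

def info_nce(queries, positives, temperature=0.07):
    """Standard InfoNCE: each query's positive is the same-index row of
    `positives`; all other rows in the batch serve as negatives."""
    q = F.normalize(queries, dim=1)
    k = F.normalize(positives, dim=1)
    logits = q @ k.t() / temperature      # pairwise cosine similarities
    labels = torch.arange(q.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(8, 256), torch.randn(8, 256)))
```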

[CV-2] MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics

【Quick Read】: This paper addresses the difficulty of simultaneously guaranteeing geometric consistency and perceptual validity when performance-based animation drives 3D stylized characters. The key to the solution is combining traditional blendshape animation with multiple machine learning models: the non-real-time system proposes a 3D emotion transfer network that generates style-consistent 3D rig parameters from 2D portraits, while the real-time system introduces a blendshape adaption network that keeps generated expressions geometrically consistent and temporally stable. The approach significantly improves the recognition, intensity, and attractiveness of expressions, outperforming the commercial product Faceware.

Link: https://arxiv.org/abs/2508.04687
Authors: Ye Pan,Ruisi Zhang,Jingying Wang,Nengfu Chen,Yilin Qiu,Yu Ding,Kenny Mitchell
Affiliations: Shanghai Jiaotong University; Netease Fuxi AI Lab; Edinburgh Napier University; Roblox
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE VR extended authors version of the article published in 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). This work was supported by the European Union's Horizon 2020 research and innovation programme under Grant 101017779

Click to view abstract

Abstract:Our purpose is to improve performance-based animation which can drive believable 3D stylized characters that are truly perceptual. By combining traditional blendshape animation techniques with multiple machine learning models, we present both non-real-time and real-time solutions which drive character expressions in a geometrically consistent and perceptually valid way. For the non-real-time system, we propose a 3D emotion transfer network that makes use of a 2D human image to generate stylized 3D rig parameters. For the real-time system, we propose a blendshape adaption network which generates the character rig parameter motions with geometric consistency and temporal stability. We demonstrate the effectiveness of our system by comparing to a commercial product, Faceware. Results reveal that ratings of the recognition, intensity, and attractiveness of expressions depicted for animated characters via our systems are statistically higher than for Faceware. Our results may be integrated into the animation pipeline, providing animators with a system for creating the expressions they wish to use more quickly and accurately.

[CV-3] urboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction ICCV2025

【Quick Read】: This paper addresses the challenges of end-to-end training for multi-agent perception and prediction models, including complex training pipelines that require extensive manual design and tuning, and gradient conflicts that limit multi-task learning. The key to the solution is TurboTrain, an efficient training framework with two core components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning to capture spatiotemporal features across agents, and a balanced multi-task learning strategy based on gradient conflict suppression to improve detection and prediction. The framework substantially streamlines training, reduces manual effort, and improves real-world performance.

Link: https://arxiv.org/abs/2508.04682
Authors: Zewei Zhou,Seth Z. Zhao,Tianhui Cai,Zhiyu Huang,Bolei Zhou,Jiaqi Ma
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Click to view abstract

Abstract:End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
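
One generic way to realize "gradient conflict suppression" is PCGrad-style projection: when two task gradients point in opposing directions, the conflicting component is removed. The paper's exact rule may differ; the sketch below only illustrates the concept.

```python
import torch

def suppress_conflict(g_task, g_ref):
    """PCGrad-style update: if the dot product of two task gradients is
    negative (a conflict), project out the component along the other."""
    dot = torch.dot(g_task, g_ref)
    if dot < 0:
        g_task = g_task - dot / g_ref.norm() ** 2 * g_ref
    return g_task

g1, g2 = torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 1.0])
g1 = suppress_conflict(g1, g2)
print(g1, torch.dot(g1, g2))  # conflicting component removed; dot -> 0
```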

[CV-4] Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions ICCV2025

【Quick Read】: This paper addresses two core problems with current AI assistants as general-purpose agents: existing datasets are limited to specific interaction categories and lack generality, and they overlook the key role of the egocentric modality in human-AI interaction. The key to the solution is the InterVLA dataset, containing 11.4 hours and 1.2M frames of multimodal data spanning 2 egocentric and 5 exocentric videos, accurate human and object motions, and verbal commands. By combining RGB imagery with motion capture (MoCap) in a vision-language-action framework, it builds hybrid interaction scenarios that support systematic study of egocentric human motion estimation, interaction synthesis, and interaction prediction.

Link: https://arxiv.org/abs/2508.04681
Authors: Liang Xu,Chengqun Yang,Zili Lin,Fei Xu,Yifan Liu,Congsheng Xu,Yiyi Zhang,Jie Qin,Xingdong Sheng,Yunhui Liu,Xin Jin,Yichao Yan,Wenjun Zeng,Xiaokang Yang
Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Nanjing University of Aeronautics and Astronautics; Lenovo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. Project Page: this https URL

Click to view abstract

Abstract:Learning action models from real-world human-centric interaction datasets is important towards building general-purpose intelligent assistants with efficiency. However, most existing datasets only offer specialist interaction categories and ignore that AI assistants perceive and act based on first-person acquisition. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.

[CV-5] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

【Quick Read】: This paper addresses the sensitivity of prompt tuning for vision-language models (VLMs) to weak semantic perturbations, such as subtle noise in images or text, which markedly degrades generalization to unseen classes. The key to the solution is the ANPrompt framework: weak-noise text features are built by fusing original and noise-perturbed text embeddings and clustered into noise prompts; these are combined with learnable prompt tokens to form anti-noise prompts injected into the deeper layers of both the image and text encoders. A Noise-Resistant Visual Prompt Prototype (NRVPP) is obtained by averaging the vision encoder's output prompt tokens, and a Weak semantic noise Alignment Loss (WALoss) is introduced to jointly optimize alignment, robustness, and anti-noise objectives, improving stability under noise and generalization to novel categories.

Link: https://arxiv.org/abs/2508.04677
Authors: Yansheng Gao,Yufei Zheng,Jinghan Qu,Zixi Zhu,Yukuan Zhang,Shengsheng Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations-such as subtle image or text noise-that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.

[CV-6] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

【速读】:该论文旨在解决当前生成式 AI(Generative AI)领域中,文本到图像扩散模型(text-to-image diffusion models, DMs)因参数规模庞大(8-11B)而难以在资源受限设备上进行高效推理的问题。其解决方案的核心在于提出一种名为 HierarchicalPrune 的分层压缩框架,关键创新点包括:(1) 基于模块功能层次的分层位置剪枝(Hierarchical Position Pruning),识别并移除对纹理细化贡献较小的后期模块;(2) 位置权重保留机制(Positional Weight Preservation),系统性保护早期模块以维持语义结构完整性;(3) 敏感度引导的知识蒸馏(Sensitivity-Guided Distillation),根据各模块敏感度差异动态调整知识迁移强度。该方法显著降低模型内存占用与推理延迟,同时保持图像质量与感知一致性,使百亿级扩散模型适用于边缘设备部署。

链接: https://arxiv.org/abs/2508.04663
作者: Young D. Kwon,Rui Li,Sijia Li,Da Li,Sourav Bhattacharya,Stylianos I. Venieris
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with a minimal drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.

[CV-7] PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment ICCV2025

【Quick Read】: This paper addresses coarse room layout estimation, i.e., recovering a cuboid-shaped 3D room layout to provide geometric cues for downstream tasks; prior state-of-the-art methods are mostly single-view and often assume panoramic input. The key to the solution is PixCuboid, an optimization-based method built on multi-view alignment of dense deep features. Training end-to-end through the optimization yields feature maps with large convergence basins and smooth loss landscapes, allowing the layout to be initialized with simple heuristics. The paper also proposes two new benchmarks based on ScanNet++ and 2D-3D-Semantics with manually verified 3D cuboid ground truth; experiments show significant gains over the competition, and the flexibility of the optimization-based design allows easy extension to multi-room settings such as larger apartments or offices.

Link: https://arxiv.org/abs/2508.04659
Authors: Gustav Hanning,Kalle Åström,Viktor Larsson
Affiliations: Lund University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the ICCV 2025 Workshop on Large Scale Cross Device Localization

Click to view abstract

Abstract:Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g. larger apartments or offices. Code and model weights are available at this https URL.

[CV-8] YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring

【Quick Read】: This paper addresses the inefficiency and error-proneness of manually observing chickens for disease in the poultry industry. The core of the solution is an AI-driven real-time detection system based on YOLO v8 (You Only Look Once, version 8), a deep learning model that analyzes high-resolution chicken images to automatically detect signs of illness such as abnormalities in behavior and appearance. The key is training the algorithm on a sizable annotated dataset, enabling accurate real-time identification of infected chickens and prompt alerts to farm operators, improving biosecurity and health management on farms while reducing the need for manual inspection.

Link: https://arxiv.org/abs/2508.04658
Authors: Akhil Saketh Reddy Sabbella,Ch.Lakshmi Prachothan,Eswar Kumar Panta
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 Pages, 9 Figures, 2 Tables

Click to view abstract

Abstract:In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. This study proposes an AI-based approach using YOLO v8, a deep learning model for real-time object recognition. By developing a system that analyzes high-resolution chicken photos, YOLO v8 detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset has been used to train the algorithm, which provides accurate real-time identification of infected chickens and prompt warnings to farm operators for timely action. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time features of YOLO v8 provide a scalable and effective method for improving farm management techniques.
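
For context, running inference with a fine-tuned YOLOv8 checkpoint via the ultralytics package follows the pattern below; the weights file, image path, and class names are hypothetical placeholders, not artifacts from the paper.

```python
from ultralytics import YOLO

# Hypothetical checkpoint: a YOLOv8 model fine-tuned on annotated
# poultry images loads like any custom weights file.
model = YOLO("poultry_disease_yolov8.pt")

results = model.predict("barn_camera_frame.jpg", conf=0.5)
for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]
    print(f"{cls_name}: confidence {float(box.conf):.2f}")
    if cls_name != "healthy":
        print("-> alert farm operator")
```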

[CV-9] X-SAM: From Segment Anything to Any Segmentation

【Quick Read】: This paper addresses the inherent deficiency of large language models (LLMs) in pixel-level perceptual understanding, the limitations of the Segment Anything Model (SAM) in multi-mask prediction and category-specific segmentation, and SAM's inability to unify all segmentation tasks in a single architecture. The key to the solution is X-SAM, a streamlined multimodal large language model (MLLM) framework that introduces a unified training strategy supporting co-training across multiple datasets and proposes the new Visual GrounDed (VGD) segmentation task, which segments all instance objects from interactive visual prompts and endows MLLMs with visually grounded, pixel-wise interpretive capabilities, markedly improving multimodal pixel-level visual understanding and generalization.

Link: https://arxiv.org/abs/2508.04655
Authors: Hao Wang,Limeng Qiao,Zequn Jie,Zhijian Huang,Chengjian Feng,Qingfang Zheng,Lin Ma,Xiangyuan Lan,Xiaodan Liang
Affiliations: 1. Tsinghua University; 2. Alibaba Group; 3. Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report

Click to view abstract

Abstract:Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at this https URL.

[CV-10] EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

【Quick Read】: This paper addresses the fact that rising benchmark scores for multimodal vision-language models (VLMs) on chart understanding do not fully reflect the breadth of visual reasoning required: existing benchmarks do not systematically evaluate the complex interaction between visual encodings and analytic tasks. The key is EncQA, a new benchmark informed by the visualization literature whose structured design provides balanced coverage of six visual encoding channels (position, length, area, quantitative color, nominal color, and shape) and eight analytic tasks (such as finding extrema, retrieving values, and correlating values), precisely exposing performance gaps on specific encoding-task pairs. Evaluation of 9 state-of-the-art VLMs shows that performance varies significantly across encodings and tasks and, contrary to expectation, does not generally improve with model size, highlighting the need to target specific visual-reasoning gaps rather than simply scaling models or data.

Link: https://arxiv.org/abs/2508.04650
Authors: Kushin Mukherjee,Donghao Ren,Dominik Moritz,Yannick Assogba
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

[CV-11] RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case ICCV2025

【Quick Read】: This paper addresses the poor performance of autonomous-driving systems in rare high-risk scenarios, long-tailed driving events, and complex interactions, which are hard to cover with real-world data collection. The key to the solution is the RoboTron-Sim framework with two innovations: first, the Hard-case Augmented Synthetic Scenarios (HASS) dataset covering 13 high-risk edge-case categories with balanced environmental conditions (day/night, sunny/rainy); second, Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder), which enable multimodal large language models to effectively learn challenging real-world driving skills from HASS while adapting to environment deviations and hardware differences between simulation and reality. Experiments on nuScenes show roughly 50% improvement in challenging scenarios, reaching the state of the art in real-world open-loop planning.

Link: https://arxiv.org/abs/2508.04642
Authors: Baihui Xiao,Chengjian Feng,Zhijian Huang,Feng yan,Yujie Zhong,Lin Ma
Affiliations: Meituan; Shenzhen Campus of Sun Yat-sen University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Click to view abstract

Abstract:Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose RoboTron-Sim that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments on nuScenes show that RoboTron-Sim improves driving performance in challenging scenarios by around 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of RoboTron-Sim in better managing rare high-risk driving scenarios. Project page: this https URL

[CV-12] FinMMR: Make Financial Numerical Reasoning More Multimodal Comprehensive and Challenging ICCV2025

【Quick Read】: This paper addresses the insufficient reasoning ability of current multimodal large language models (MLLMs) on financial numerical reasoning: existing benchmarks cover limited financial domains, fuse modalities only shallowly, and lack challenging multi-step numerical computation. The key is FinMMR, a bilingual multimodal financial benchmark with three core innovations: (1) multimodal construction that integrates financial images such as tables and charts with text; (2) coverage of 14 financial subdomains, greatly broadening financial knowledge; and (3) hard multi-step numerical reasoning tasks requiring models to combine financial expertise with understanding of complex images, providing a more realistic assessment that drives MLLM reasoning in real financial scenarios.

Link: https://arxiv.org/abs/2508.04625
Authors: Zichen Tang,Haihong E,Jiacheng Liu,Zhongjun Yang,Rongjin Li,Zihua Rong,Haoyang He,Zhuodi Hao,Xinyang Hu,Kun Ji,Ziyan Ma,Mengyuan Ji,Jun Zhang,Chenghao Ma,Qianhe Zheng,Yang Liu,Yiling Huang,Xinyi Hu,Qing Huang,Zijian Xie,Shiyao Peng
Affiliations: Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted by ICCV 2025

Click to view abstract

Abstract:We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the latest Chinese financial research reports. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 53.0% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.

[CV-13] How Does Bilateral Ear Symmetry Affect Deep Ear Features?

【Quick Read】: This paper addresses the limited attention paid to how bilateral ear symmetry affects the features learned by convolutional neural networks (CNNs) for ear recognition. The key to the solution is first building an ear side classifier that automatically labels ear images as left or right, then incorporating this side information during both training and testing; cross-dataset evaluations on five datasets show that treating left and right ears separately yields notable performance gains. Ablation studies on alignment strategies, input sizes, and hyperparameter settings further provide practical guidance for training high-accuracy CNN-based ear recognition systems on large-scale datasets.

Link: https://arxiv.org/abs/2508.04614
Authors: Kagan Ozturk,Deeksha Arun,Kevin W. Bowyer,Patrick Flynn
Affiliations: University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and testing. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale datasets to achieve higher verification rates.

[CV-14] OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment ICCV2025

【Quick Read】: This paper addresses the complementary yet disjoint nature of monocular and stereo depth estimation: monocular methods exploit rich contextual priors but lack geometric precision, while stereo methods rely on epipolar geometry yet suffer ambiguities on reflective or textureless surfaces. The key to the solution is the OmniDepth framework, whose core innovation is a novel cross-attentive alignment mechanism that iteratively and bidirectionally aligns the latent representations of the two paradigms, dynamically synchronizing monocular contextual cues with stereo hypothesis representations in a single network so that each corrects the other during inference: monocular structure priors resolve stereo ambiguities (e.g., specular surfaces), while stereo geometry sharpens the geometric accuracy of monocular depth.

Link: https://arxiv.org/abs/2508.04611
Authors: Tongfan Guan,Jiaxin Guo,Chen Wang,Yun-Hui Liu
Affiliations: The Chinese University of Hong Kong; Spatial AI & Robotics Lab, University at Buffalo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICCV 2025 Highlight

Click to view abstract

Abstract:Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by 40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Codes available at this https URL.

[CV-15] Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline IROS2025

【Quick Read】: This paper addresses the difficulty of incrementally recovering real-scale 3D geometry from a pose-free RGB stream: existing methods either struggle with long sequences or depend on slow test-time optimization and depth sensors. The key to the solution is bringing 3D Gaussian mapping into a SLAM system and adding a feed-forward recurrent prediction module that infers camera pose directly from optical flow, replacing slow test-time optimization with fast network inference to greatly accelerate tracking; a local graph rendering technique further improves the robustness of feed-forward pose prediction. Experiments on the Replica and TUM-RGBD datasets show performance on par with the state-of-the-art SplaTAM while cutting tracking time by more than 90%.

Link: https://arxiv.org/abs/2508.04597
Authors: Linqing Zhao,Xiuwei Xu,Yirui Wang,Hao Wang,Wenzhao Zheng,Yansong Tang,Haibin Yan,Jiwen Lu
Affiliations: Tsinghua University; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IROS 2025

Click to view abstract

Abstract:Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.

[CV-16] Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan ICASSP’26

【Quick Read】: This paper addresses face-voice association in multilingual environments, i.e., accurately matching a person's facial and vocal characteristics across languages. The key to the solution is the Multilingual Audio-Visual (MAV-Celeb) dataset, designed to reflect real-world bilingual and multilingual communication, together with baseline models for evaluating and advancing the task.

Link: https://arxiv.org/abs/2508.04592
Authors: Marta Moscati,Ahmed Abdullah,Muhammad Saad Saeed,Shah Nawaz,Rohan Kumar Das,Muhammad Zaigham Zaheer,Junaid Mir,Muhammad Haroon Yousaf,Khalid Malik,Markus Schedl
Affiliations: Johannes Kepler University Linz; National University of Computer and Emerging Sciences; University of Michigan; Fortemedia Singapore; Mohamed bin Zayed University of Artificial Intelligence; University of Engineering and Technology Taxila; Human-centered AI Group, AI Lab, Linz Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, ICASSP'26, SP Grand Challenge'26

Click to view abstract

Abstract:Advancements in technology have led to the use of multimodal systems in various real-world applications, among which audio-visual systems are some of the most widely used. In recent years, associating the face and voice of a person has gained attention due to the unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baseline models, and task details for the FAME Challenge.

[CV-17] Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis ICIP

【Quick Read】: This paper addresses the difficulty of accurate skin-disease classification caused by high inter-class similarity, large intra-class variability, and complex lesion textures. The key to the solution is a systematic evaluation of three image pre-processing techniques (standard RGB, CMY color-space transformation, and Contrast Limited Adaptive Histogram Equalization, CLAHE) combined with multiple deep models (the convolutional networks DenseNet201 and EfficientNetB5, and the transformer-based ViT, Swin Transformer, and DinoV2 Large). RGB pre-processing with DinoV2 achieves the highest accuracy (up to 93%) and F1-score, and Grad-CAM visualizations localize lesion regions precisely, improving both performance and interpretability.

Link: https://arxiv.org/abs/2508.04573
Authors: Enam Ahmed Taufik,Abdullah Khondoker,Antara Firoz Parsa,Seraj Al Mahmud Mostafa
Affiliations: European University of Bangladesh; BRAC University; University of Maryland, Baltimore County
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted in the 4th IEEE International Conference on Image Processing and Media Computing (ICIPMC) 2025

Click to view abstract

Abstract:Accurate skin disease classification is a critical yet challenging task due to high inter-class similarity, intra-class variability, and complex lesion textures. While deep learning-based computer-aided diagnosis (CAD) systems have shown promise in automating dermatological assessments, their performance is highly dependent on image pre-processing and model architecture. This study proposes a deep learning framework for multi-class skin disease classification, systematically evaluating three image pre-processing techniques: standard RGB, CMY color space transformation, and Contrast Limited Adaptive Histogram Equalization (CLAHE). We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, EfficientNetB5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large) using accuracy and F1-score as evaluation metrics. Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants. Grad-CAM visualizations applied to RGB inputs further reveal precise lesion localization, enhancing interpretability. These findings underscore the importance of effective pre-processing and model choice in building robust and explainable CAD systems for dermatology.
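
Of the three pre-processing options studied, CLAHE is the least standard; one common OpenCV recipe applies it to the lightness channel in LAB space so local contrast improves without distorting color. The clip limit, tile size, and file paths below are illustrative, not the paper's settings.

```python
import cv2

img = cv2.imread("lesion.jpg")                      # any RGB skin image
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
l_eq = clahe.apply(l)                               # equalize lightness only

enhanced = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("lesion_clahe.jpg", enhanced)
```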

[CV-18] Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

【Quick Read】: This paper addresses abnormality grounding in medical images, i.e., localizing clinical findings from textual descriptions. Generalist vision-language models (VLMs) underperform in the medical domain because rare, compositional, domain-specific terminology aligns poorly with visual patterns. The key is the Knowledge to Sight (K2Sight) framework, which introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes such as shape, density, and anatomical location; these attributes, distilled from domain ontologies, are encoded into concise instruction-style prompts that guide region-text alignment during training. Unlike conventional report-level supervision, this explicitly bridges domain knowledge and spatial structure, enabling data-efficient training: compact models with 0.23B and 2B parameters trained on only 1.5% of the data required by state-of-the-art medical VLMs match or surpass 7B+ medical VLMs, with up to 9.82% improvement in mAP_50.

Link: https://arxiv.org/abs/2508.04572
Authors: Jun Li,Che Liu,Wenjia Bai,Mingxuan Liu,Rossella Arcucci,Cosmin I. Bercea,Julia A. Schnabel
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Imperial College London; University of Trento; Helmholtz AI and Helmholtz Munich; King's College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose Knowledge to Sight (K2Sight), a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in mAP_50. Code and models: this https URL.

[CV-19] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling DATE

【Quick Read】: This paper addresses the accuracy and robustness of tractography in diffusion MRI (dMRI), where traditional methods are noise-sensitive in complex brain structures, struggle to maintain long-range consistency, and adapt poorly across scan protocols and populations. The key is DDTracking, a deep generative framework that formulates streamline propagation as a conditional denoising diffusion process: a dual-pathway encoding network jointly learns local spatial encoding (fine-grained structure at each streamline point) and global temporal dependencies (long-range consistency along the whole streamline), and a conditional diffusion module uses these embeddings to predict propagation directions end to end, yielding high-fidelity, generalizable tractography.

Link: https://arxiv.org/abs/2508.04568
Authors: Yijie Li,Wei Zhang,Xi Zhu,Ye Wu,Yogesh Rathi,Lauren J. O'Donnell,Fan Zhang
Affiliations: University of Electronic Science and Technology of China; Nanjing University of Science and Technology; Brigham and Women's Hospital, Harvard Medical School
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version. The content may be updated in the future

Click to view abstract

Abstract:This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking’s strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: this https URL

[CV-20] CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

【速读】:该论文旨在解决弱监督下的密集音视频事件定位(Weakly-supervised Dense Audio-Visual Event Localization, W-DAVEL)问题,即在仅提供视频级事件标签、缺乏精确时间边界信息的情况下,实现音频与视觉模态中同步发生事件的精准时序定位。解决方案的关键在于提出了一种基于跨模态显著锚点(cross-modal salient anchors)的方法:首先通过互事件一致性评估模块(Mutual Event Agreement Evaluation module)计算音频与视觉预测事件类别的差异得分以衡量模态间一致性;随后利用该得分在跨模态显著锚点识别模块(Cross-modal Salient Anchor Identification module)中提取全局和局部时间窗口内的可靠锚点特征;最后将融合后的锚点特征输入锚点驱动的时间传播模块(Anchor-based Temporal Propagation module),增强原始时序特征中的事件语义编码,从而提升弱监督条件下的定位性能。

链接: https://arxiv.org/abs/2508.04566
作者: Jinxing Zhou,Ziheng Zhou,Yanghao Zhou,Yuxin Mao,Zhangling Duan,Dan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:


Abstract:The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
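
结合摘要,下面给出“互事件一致性评分 → 显著锚点选取”这一核心步骤的最小草图(基于摘要理解的示意,非官方实现;张量形状、L1 差异度量与 top-k 选取方式均为假设):

```python
import torch

def agreement_scores(audio_probs: torch.Tensor, visual_probs: torch.Tensor) -> torch.Tensor:
    """audio_probs / visual_probs: [T, C],两模态的帧级事件类别概率。
    以两者差异(L1 距离)的负值作为一致性得分:差异越小,跨模态语义越一致。"""
    discrepancy = (audio_probs - visual_probs).abs().mean(dim=-1)  # [T]
    return -discrepancy

def select_salient_anchors(scores: torch.Tensor, k: int = 5) -> torch.Tensor:
    """取一致性得分最高的 k 个时间戳作为跨模态显著锚点。"""
    return scores.topk(k).indices

# 用法示意
T, C = 100, 25
anchors = select_salient_anchors(agreement_scores(torch.rand(T, C), torch.rand(T, C)))
```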

[CV-21] TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning AAAI2026

【速读】:该论文旨在解决正畸治疗中牙齿排列预测的精度问题,现有深度学习方法多依赖点对点几何约束来预测变换矩阵,但未能充分考虑人类口腔结构的解剖特性及其变换矩阵的潜在分布规律。解决方案的关键在于提出一种名为TAlignDiff的新方法,其核心创新是将基于点云的回归网络(PRN)与基于扩散模型的变换矩阵去噪模块(DTMD)相结合,通过几何约束损失引导点云级对齐,并利用临床数据学习变换矩阵的潜在分布特征,从而实现几何约束与扩散精修之间的双向反馈机制,提升牙齿自动排列的准确性与临床适用性。

链接: https://arxiv.org/abs/2508.04565
作者: Yunbi Liu,Enqi Tang,Shiyu Li,Lei Ma,Juncheng Li,Shu Lou,Yongchu Pan,Qingshan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to AAAI 2026


Abstract:Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients’ quality of life. Current deep learning approaches predominantly concentrate on predicting transformation matrices through imposing point-to-point geometric constraints for tooth alignment. Nevertheless, these matrices are likely associated with the anatomical structure of the human oral cavity and possess particular distribution characteristics that the deterministic point-to-point geometric constraints in prior work fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is supported by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. Extensive ablation and comparative experiments demonstrate the effectiveness and superiority of our method, highlighting its potential in orthodontic treatment.

[CV-22] Drone Detection with Event Cameras

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)扩散带来的安全与防护挑战,特别是传统基于帧的摄像机在检测小型、高机动性目标时因运动模糊和光照条件恶劣导致性能下降的问题。其解决方案的关键在于采用事件相机(event camera)驱动的事件视觉(event-based vision)技术:该技术通过稀疏、异步的数据输出有效抑制静态背景,消除运动模糊,并在极端光照条件下保持稳定响应,从而实现低延迟、高鲁棒性的无人机检测与跟踪,为下一代反无人机(counter-UAV)系统提供高效可靠的感知基础。

链接: https://arxiv.org/abs/2508.04564
作者: Gabriele Magrini,Lorenzo Berlincioni,Luca Cultrera,Federico Becattini,Pietro Pala
机构: University of Florence (佛罗伦萨大学); University of Siena (锡耶纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.
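
作为背景补充,事件相机输出的是稀疏异步事件流 (t, x, y, p),通常要先聚合成稠密表示才能接入常规检测网络;下面是最基础的“事件帧”累积方式的通用示意(综述涉及多种表示方法,此处仅为其中最简单的一类,与任何具体论文实现无关):

```python
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """events: [N, 4],每行 (t, x, y, p),极性 p ∈ {-1, +1}。
    将一个时间窗内的事件按极性累积为单通道事件帧;不产生事件的静态背景自然为 0。"""
    frame = np.zeros((height, width), dtype=np.float32)
    xs = events[:, 1].astype(int)
    ys = events[:, 2].astype(int)
    np.add.at(frame, (ys, xs), events[:, 3])  # 同一像素多次触发需用无缓冲累加
    return frame

# 用法示意:1000 个随机事件累积到 240x320 的帧上
ev = np.stack([np.sort(np.random.rand(1000)),
               np.random.randint(0, 320, 1000),
               np.random.randint(0, 240, 1000),
               np.random.choice([-1, 1], 1000)], axis=1)
frame = events_to_frame(ev, 240, 320)
```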

[CV-23] One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose

【速读】:该论文旨在解决当前基于扩散模型的虚拟试衣(virtual try-on)方法在实际应用中的两大核心限制:一是依赖展示服装(exhibition garments)和分割掩码(segmentation masks),二是难以处理灵活的姿态变化,导致无法实现跨人物的服装迁移或生成任意姿态下的试穿结果。解决方案的关键在于提出了一种名为OMFA(One Model For All)的统一扩散框架,其创新性地引入了部分扩散策略(partial diffusion strategy),该策略能够选择性地对联合输入中的不同组件(如服装、人体图像或面部)施加噪声与去噪操作,从而实现动态子任务控制和高效的双向服装-人体变换。OMFA完全无需掩码,仅需一张人物肖像和目标姿态作为输入,即可完成从“试穿”到“试脱”的双向操作,并支持任意姿态的生成,显著提升了方法的实用性与通用性。

链接: https://arxiv.org/abs/2508.04559
作者: Jinxi Liu,Zijian He,Guangrun Wang,Guanbin Li,Liang Lin
机构: 1. 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences); 2. 中国科学院大学 (University of Chinese Academy of Sciences)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios: for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses, even without access to multi-pose images of that person. OMFA is built upon a novel partial diffusion strategy that selectively applies noise and denoising to individual components of the joint input (such as the garment, the person image, or the face), enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: this https URL.
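
“部分扩散”(partial diffusion)的核心想法可以用如下草图说明(依据摘要复原的示意,非官方代码;此处的组件掩码仅表示联合输入中哪一部分参与加噪,与 OMFA 对用户输入“mask-free”的设定并不矛盾):

```python
import torch

def partial_forward_diffusion(x0, component_mask, t, alphas_cumprod):
    """x0: [B, C, H, W] 联合输入的潜变量;component_mask: [B, 1, H, W],1 表示该区域参与加噪
    (例如 try-on 时对人物侧的服装组件加噪,try-off 时对服装图组件加噪)。
    标准 DDPM 前向:x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps,仅作用于选中组件。"""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return component_mask * x_t + (1 - component_mask) * x0  # 未选中组件保持干净

# 用法示意
B = 2
x0 = torch.randn(B, 4, 64, 64)
mask = torch.zeros(B, 1, 64, 64)
mask[:, :, :, 32:] = 1.0                       # 假设右半部分是被加噪的组件
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
t = torch.randint(0, 1000, (B,))
xt = partial_forward_diffusion(x0, mask, t, alphas_cumprod)
```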

[CV-24] CONVERGE: A Multi-Agent Vision-Radio Architecture for xApps

【速读】:该论文旨在解决通信系统与计算机视觉领域长期独立发展所带来的协同效率低下问题,尤其是在高频无线链路(主要工作于视距传播)场景下,如何利用视觉信息实时感知和预测信道动态变化以提升网络性能。其核心解决方案是提出一种基于多智能体架构的新型系统设计,通过引入一个可生成障碍物信息的新视频功能模块,实现无线电与视频感知信息的融合,并将这些信息实时传递给O-RAN xApps(可编程应用),从而支持集成感知与通信(Integrated Sensing and Communications, ISC)。实验表明,该方案能将感知信息延迟控制在1 ms以内,且xApp可据此对5G/6G无线接入网(RAN)进行实时调控,显著增强了网络对环境变化的自适应能力。

链接: https://arxiv.org/abs/2508.04556
作者: Filipe B. Teixeira,Carolina Simões,Paulo Fidalgo,Wagner Pedrosa,André Coelho,Manuel Ricardo,Luis M. Pessoa
机构: INESC TEC (INESC TEC); Faculdade de Engenharia, Universidade do Porto (工程学院,波尔图大学)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures


Abstract:Telecommunications and computer vision have evolved independently. With the emergence of high-frequency wireless links operating mostly in line-of-sight, visual data can help predict the channel dynamics by detecting obstacles and help overcome them through beamforming or handover techniques. This paper proposes a novel architecture for delivering real-time radio and video sensing information to O-RAN xApps through a multi-agent approach, and introduces a new video function capable of generating blockage information for xApps, enabling Integrated Sensing and Communications. Experimental results show that the delay of sensing information remains under 1 ms and that an xApp can successfully use radio and video sensing information to control the 5G/6G RAN in real-time.

[CV-25] Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation MICCAI

【速读】:该论文旨在解决医学图像中心脏结构语义分割在域偏移(domain shift)下的性能下降问题,即当训练数据与测试数据来自不同分布时,深度学习模型难以保持稳定准确的分割效果。其核心解决方案包括两个关键策略:一是采用平衡联合训练方法(balanced joint training approach),在训练过程中等量融合来自不同源域的CT和MR数据,以增强模型对多模态数据的泛化能力;二是引入强强度与空间增强技术(strong intensity and spatial augmentation techniques),通过大幅扩充训练数据的多样性来缓解测试阶段遇到的新域带来的影响。该方法在CT和MR数据上分别实现了93.33% DSC和89.30% DSC的分割性能,表明其具备生成高精度患者特异性心脏数字孪生模型的潜力。

链接: https://arxiv.org/abs/2508.04552
作者: Franz Thaler,Darko Stern,Gernot Plank,Martin Urschler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for the MICCAI Challenge on Comprehensive Analysis and Computing of Real-World Medical Images 2024, 12 pages


Abstract:As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models which allows e.g. electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift – i.e. when training and test data are sampled from different data distributions – remains challenging. In order to perform well on domains known at training-time, we employ a (1) balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.
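
其中“(1) 源域等量混合 + (2) 强增强”的训练思路落到代码上大致如下(仅为示意草图,数据接口与具体增强族均为假设,以论文为准):

```python
import random

def balanced_batches(ct_samples, mr_samples, batch_size, steps):
    """每个 batch 从 CT 与 MR 两个源域各取一半,保证两种模态在训练中等量出现。"""
    half = batch_size // 2
    for _ in range(steps):
        batch = random.sample(ct_samples, half) + random.sample(mr_samples, half)
        random.shuffle(batch)
        yield batch

def strong_augment(volume):
    """强强度/空间增强的占位示意:实际可组合 gamma 变换、噪声、仿射/弹性形变等,
    此处仅示意随机强度缩放这一种(增强族为假设)。"""
    return volume * random.uniform(0.7, 1.3)
```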

[CV-26] Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis

【速读】:该论文旨在解决虚拟试穿(Virtual Try-On, VTON)与虚拟脱衣(Virtual Try-Off, VTOFF)任务之间的不对称性问题,即现有方法将二者视为孤立任务,忽视了其在服装重建中的互补对称关系。为填补这一空白,作者提出首个统一框架——双向服装迁移模型(Two-Way Garment Transfer Model, TWGTM),其关键在于通过双向特征解耦(bidirectional feature disentanglement)实现同时处理带掩码引导的VTON和无掩码的VTOFF,并引入分阶段训练范式以逐步弥合两者间的模态差异(mask dependency asymmetry)。该方案在DressCode和VITON-HD数据集上验证了其有效性与先进性。

链接: https://arxiv.org/abs/2508.04551
作者: Angang Zhang,Fang Deng,Hao Chen,Zhongjian Chen,Junyan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images to seamlessly bridge the dual tasks. On the other hand, to resolve the inherent mask dependency asymmetry between mask-guided VTON and mask-free VTOFF, we devise a phased training paradigm that progressively bridges this modality gap. Extensive qualitative and quantitative experiments conducted across the DressCode and VITON-HD datasets validate the efficacy and competitive edge of our proposed approach.

[CV-27] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

【速读】:该论文旨在解决海洋视频理解中因海洋物体动态性、环境变化、相机运动及水下场景复杂性所带来的挑战,现有视频描述数据集多聚焦于通用或以人类为中心的领域,难以有效泛化至复杂的海洋环境并获取海洋生物的相关洞察。其解决方案的关键在于提出一个两阶段的面向海洋物体的视频描述生成流程,并构建了一个基于视频、文本与分割掩码三元组的综合性视频理解基准,从而提升视觉定位与描述准确性,增强海洋视频的理解与分析能力;此外,通过视频切分策略识别显著目标过渡,有效丰富了描述内容的语义层次。

链接: https://arxiv.org/abs/2508.04549
作者: Quang-Trung Truong,Yuk-Kwan Wong,Vo Hoang Kim Tuyen Dang,Rinaldi Gotama,Duc Thanh Nguyen,Sai-Kit Yeung
机构: The Hong Kong University of Science and Technology (香港科技大学); Ho Chi Minh City University of Science (胡志明市科学大学); Indo Ocean Foundation (印度洋基金会); Deakin University (迪肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Published at ACMMM2025 (Dataset track)


Abstract:Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and gain insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at this https URL.

[CV-28] Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding ICCV2025

【速读】:该论文旨在解决在线视频时间定位(Online Video Temporal Grounding, OnVTG)任务中的两大挑战:一是模型在无法获取未来帧的情况下需实时做出预测,二是现有方法缺乏有效的事件建模能力且难以保留长期历史信息,导致性能受限。其解决方案的关键在于提出一种分层事件记忆机制(hierarchical event memory),通过事件级提案建模不同持续时间的事件信息,并结合层次化存储策略保留近期与长期的历史事件特征;同时引入一个未来预测分支,用于提前判断目标事件是否即将发生并回归其起始时间,从而实现高效、准确的在线预测。

链接: https://arxiv.org/abs/2508.04546
作者: Minghang Zheng,Yuxin Peng,Benyuan Sun,Yi Yang,Yang Liu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所); State Key Laboratory of General Artificial Intelligence, Peking University (通用人工智能国家重点实验室,北京大学); Central Media Technology Institute, Huawei (华为中央媒体技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025


Abstract:In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable the real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at this https URL.
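
“近期 + 长期”的层次化事件记忆可以抽象为如下数据结构草图(依据摘要理解;容量大小与压缩方式均为假设,论文中的具体实现以原文为准):

```python
from collections import deque

import torch

class HierarchicalEventMemory:
    """近期记忆原样保留最新的事件提案特征;被挤出的旧事件先压缩(此处简化为特征均值)
    再写入长期记忆,使模型在流式输入下同时访问短期细节与长期历史。"""

    def __init__(self, recent_capacity: int = 32, longterm_capacity: int = 128):
        self.recent = deque(maxlen=recent_capacity)
        self.longterm = deque(maxlen=longterm_capacity)

    def push(self, event_feature: torch.Tensor):
        if len(self.recent) == self.recent.maxlen:
            self.longterm.append(self._compress(self.recent[0]))  # 最旧的即将被挤出
        self.recent.append(event_feature)

    @staticmethod
    def _compress(feat: torch.Tensor) -> torch.Tensor:
        return feat.mean(dim=0, keepdim=True)  # 假设 feat: [L, D],压缩为 [1, D]

    def read(self):
        return list(self.longterm) + list(self.recent)

# 用法示意
mem = HierarchicalEventMemory()
for _ in range(100):
    mem.push(torch.randn(4, 256))  # 假设每步产生 4 个事件提案、256 维特征
context = mem.read()
```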

[CV-29] InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait

【速读】:该论文旨在解决帕金森病(Parkinson’s Disease, PD)严重程度评估中因步态动力学特征复杂且存在类别不平衡导致的分类精度不足问题。其解决方案的关键在于提出一种多信号神经框架 InceptoFormer,该框架融合了1D Inception结构(Inception1D)与Transformer机制:Inception1D通过并行不同卷积核大小的1D卷积滤波器提取多尺度时间特征,而Transformer则建模步态序列中的长程依赖关系,从而同时捕捉局部细微变化与全局动态模式;此外,作者还设计了一种基于过采样的数据预处理策略以缓解PD严重程度分级中的类别不平衡问题,显著提升了模型在Hoehn和Yahr(HY)量表上的分类性能,最终实现96.6%的准确率,优于现有最优方法。

链接: https://arxiv.org/abs/2508.04540
作者: Safwen Naimi,Arij Said,Wassim Bouachir,Guillaume-Alexandre Bilodeau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages; 5 figures. Published in the proceedings of the 2025 Canadian AI conference


Abstract:We present InceptoFormer, a multi-signal neural framework designed for Parkinson’s Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (HY) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables to capture fine-grained temporal variations and global dynamics in gait signal, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at this https URL
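
Inceptio­n1D 的“并行多核尺寸一维卷积”结构本身很直观,下面是一个 PyTorch 最小示意(通道数与核尺寸均为假设值,非论文确切配置):

```python
import torch
import torch.nn as nn

class Inception1D(nn.Module):
    """对步态时间序列并行施加不同核尺寸的 1D 卷积,再在通道维拼接,
    从而同时提取多个时间尺度上的特征。"""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        ])

    def forward(self, x):                      # x: [B, in_ch, T]
        feats = [branch(x) for branch in self.branches]
        return torch.cat(feats, dim=1)         # [B, out_ch * 分支数, T]

x = torch.randn(8, 19, 512)                   # 假设 19 个传感器通道、512 个时间步
y = Inception1D(19, 32)(x)                    # -> [8, 96, 512]
```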

[CV-30] TopKD: Top-scaled Knowledge Distillation

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)中长期被忽视的问题:教师模型 logits 分布中蕴含的重要信息未被充分挖掘,尤其是 Top-K 知识(Top-K knowledge)这一关键要素。现有方法多聚焦于特征层面的知识迁移,导致对 logits 中高置信度预测的利用不足。解决方案的关键在于提出一种简单、高效且与架构无关的框架——Top-scaled Knowledge Distillation (TopKD),其核心包含两个组件:(1) Top-K Scaling Module (TSM),通过自适应放大最具信息量的 top-K logits,增强重要信号;(2) Top-K Decoupled Loss (TDL),提供针对性更强的监督信号。该方法无需引入额外模块或修改网络结构,即可显著提升 logit-based distillation 的性能,并在多个数据集和模型架构(包括 Vision Transformers)上验证了其有效性,凸显了 logits 在知识蒸馏中的巨大潜力。

链接: https://arxiv.org/abs/2508.04539
作者: Qi Wang,Jinjia Zhou
机构: Hosei University (法政大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures, conference, 8 Tables


Abstract:Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher’s logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.
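
Top-K Scaling Module 对教师 logits 的处理可以写成如下草图(K 值与放大系数为假设;TDL 的具体形式以论文为准,这里暂以标准温度 KL 蒸馏代替,仅作示意):

```python
import torch
import torch.nn.functional as F

def topk_scale(logits: torch.Tensor, k: int = 5, alpha: float = 2.0) -> torch.Tensor:
    """放大每个样本 logits 中最大的 k 个条目(TSM 的简化版)。logits: [B, C]。"""
    scaled = logits.clone()
    topv, topi = logits.topk(k, dim=-1)
    scaled.scatter_(-1, topi, topv * alpha)
    return scaled

def topkd_loss(student_logits, teacher_logits, T: float = 4.0, k: int = 5):
    """对放大后的教师分布做温度蒸馏;论文中的 Top-K Decoupled Loss 以原文为准。"""
    t = F.softmax(topk_scale(teacher_logits, k) / T, dim=-1)
    s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

loss = topkd_loss(torch.randn(16, 100), torch.randn(16, 100))
```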

[CV-31] No Masks Needed: Explainable AI for Deriving Segmentation from Classification

【速读】:该论文旨在解决当前基于预训练模型的无监督图像分割方法在医学影像领域迁移效果不佳的问题,即现有计算机视觉技术虽在通用图像分割任务中表现优异,但在医疗场景下因数据分布差异和标注稀缺导致性能下降。其解决方案的关键在于:首先针对医学图像特性对预训练模型进行微调(fine-tuning),提升模型对医学图像特征的适应性;其次引入可解释人工智能(Explainable AI, XAI)生成相关性评分(relevance scores),以增强分割过程的可信度与准确性。该方法在CBIS-DDSM、NuInsSeg和Kvasir-SEG等医学图像数据集上实现了优于传统方法的分割性能。

链接: https://arxiv.org/abs/2508.04534
作者: Mosong Ma,Tania Stathaki,Michalis Lazarou
机构: 1: Imperial College London (帝国理工学院); 2: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICDIPV 2025


Abstract:Medical image segmentation is vital for modern healthcare and is a key element of computer-aided diagnosis. While recent advancements in computer vision have explored unsupervised segmentation using pre-trained models, these methods have not been translated well to the medical imaging domain. In this work, we introduce a novel approach that fine-tunes pre-trained models specifically for medical images, achieving accurate segmentation with extensive processing. Our method integrates Explainable AI to generate relevance scores, enhancing the segmentation process. Unlike traditional methods that excel in standard benchmarks but falter in medical applications, our approach achieves improved results on datasets like CBIS-DDSM, NuInsSeg and Kvasir-SEG.

[CV-32] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

【速读】:该论文旨在解决当前深度伪造(deepfake)检测方法在准确性和可解释性方面的双重挑战:一方面,现有检测模型多采用黑箱分类策略,缺乏对决策过程的透明解释;另一方面,基于大语言模型(LLM)的解释方法存在粒度粗、依赖人工标注等问题。解决方案的关键在于提出RAIDX框架,其创新性地融合了检索增强生成(Retrieval-Augmented Generation, RAG)与组相对策略优化(Group Relative Policy Optimization, GRPO),通过RAG引入外部知识提升检测准确性,利用GRPO自主生成细粒度文本解释和显著性图(saliency maps),从而实现高精度检测与可信赖解释的统一,且无需大量人工标注,是首个整合RAG与GRPO用于深度伪造检测与解释的端到端框架。

链接: https://arxiv.org/abs/2508.04524
作者: Tianxiao Li,Zhenglin Huang,Haiquan Wen,Yiwei He,Shuchang Lyu,Baoyuan Wu,Guangliang Cheng
机构: University of Liverpool(利物浦大学); Beihang University(北京航空航天大学); The Chinese University of Hong Kong Shenzhen(香港中文大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:


Abstract:The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX’s effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

[CV-33] Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation ICCV2025

【速读】:该论文旨在解决骨架序列(skeleton sequence)上无监督时序动作分割(unsupervised temporal action segmentation)的问题,即在缺乏标注数据的情况下,自动识别和划分视频中不同动作的起止时间。当前主流方法多依赖于监督学习,需大量人工标注,而现有无监督方法主要集中在视频帧数据,对骨架序列的研究较少,尽管后者在实际应用中具有鲁棒性和隐私保护优势。解决方案的关键在于提出一种基于序列到序列的时序自编码器(sequence-to-sequence temporal autoencoder),其能够在嵌入空间中保持各关节信息的解耦,并将潜在骨架序列划分为非重叠块后进行量化,生成独特的骨架运动词(skeleton motion words),从而驱动语义上有意义的动作聚类发现。实验在HuGaDB、LARa和BABEL三个主流骨架数据集上验证了该方法优于现有无监督方法。

链接: https://arxiv.org/abs/2508.04513
作者: Uzay Gökay,Federico Spurio,Dominik R. Bach,Juergen Gall
机构: University of Bonn, Germany(波恩大学, 德国); University College London, UK(伦敦大学学院, 英国); Lamarr Institute for Machine Learning and Artificial Intelligence, Germany(拉马尔机器学习与人工智能研究所, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025


Abstract:Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at this https URL .
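
“潜在骨架序列 → 非重叠块 → 离散运动词”的量化步骤可以用一个 k-means 码本来示意(块长与码本大小为假设;论文实际使用的量化方式以原文为准):

```python
import numpy as np
from sklearn.cluster import KMeans

def motion_words(latents: np.ndarray, patch_len: int = 8, vocab: int = 64) -> np.ndarray:
    """latents: [T, D],自编码器得到的潜在骨架序列。
    切成非重叠块并展平,再用 k-means 码本把每个块量化为一个离散“运动词”id。"""
    T, D = latents.shape
    n = T // patch_len
    patches = latents[: n * patch_len].reshape(n, patch_len * D)
    km = KMeans(n_clusters=vocab, n_init=10).fit(patches)
    return km.labels_  # 每个块对应的运动词 id,后续可据此做动作聚类

words = motion_words(np.random.randn(1024, 16))
```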

[CV-34] Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds

【速读】:该论文旨在解决当前多视图三维重建方法依赖精确相机标定与位姿估计的问题,此类方法通常需要复杂且耗时的预处理流程,限制了其实际部署效率。解决方案的关键在于提出一种端到端的前馈式方法Surf3R,其通过多分支多视图解码架构,利用多个参考视图协同引导重建过程,结合分支内处理、跨视图注意力机制和分支间融合策略,在无需相机位姿估计的前提下有效捕捉互补几何信息;同时引入基于显式三维高斯表示的D-Normal正则化项,将表面法向量与其他几何参数耦合优化,显著提升三维一致性与表面细节精度,从而在ScanNet++和Replica数据集上实现领先的重建性能与高效性。

链接: https://arxiv.org/abs/2508.04508
作者: Haodong Zhu,Changbai Li,Yangyang Ren,Zichao Feng,Xuhui Liu,Hanlin Chen,Xiantong Zhen,Baochang Zhang
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Current multi-view 3D reconstruction methods rely on accurate camera calibration and pose estimation, requiring complex and time-intensive pre-processing that hinders their practical deployment. To address this challenge, we introduce Surf3R, an end-to-end feedforward approach that reconstructs 3D surfaces from sparse views without estimating camera poses and completes an entire scene in under 10 seconds. Our method employs a multi-branch and multi-view decoding architecture in which multiple reference views jointly guide the reconstruction process. Through the proposed branch-wise processing, cross-view attention, and inter-branch fusion, the model effectively captures complementary geometric cues without requiring camera calibration. Moreover, we introduce a D-Normal regularizer based on an explicit 3D Gaussian representation for surface reconstruction. It couples surface normals with other geometric parameters to jointly optimize the 3D geometry, significantly improving 3D consistency and surface detail accuracy. Experimental results demonstrate that Surf3R achieves state-of-the-art performance on multiple surface reconstruction metrics on ScanNet++ and Replica datasets, exhibiting excellent generalization and efficiency.

[CV-35] MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos

【速读】:该论文旨在解决从单目视频中重建逼真且可动画化的穿衣服人类三维化身(3D human avatar)这一难题,其核心挑战在于单目输入信息有限导致几何结构恢复困难,以及衣物等非刚性部件的复杂形变难以准确建模。解决方案的关键在于提出了一种基于部件分解(part-based decomposition)的设计策略,将人体划分为身体、面部、双手和服装四个部分,并针对不同部件的重建难度与形变特性分别优化:对人脸和手部注重细节几何恢复,对服装则引入专用的布料模拟模块,利用时序运动线索与几何约束捕捉衣物动态变形。该设计不仅提升了重建质量与动画真实感,还支持服装迁移等扩展任务,体现出方法的灵活性与实用性。

链接: https://arxiv.org/abs/2508.04505
作者: Daisheng Jin,Ying He
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.

[CV-36] Learning Robust Intervention Representations with Delta Embeddings

【速读】:该论文旨在解决因果表示学习中干预(intervention)表征不足导致的分布外(out of distribution, OOD)鲁棒性差的问题。现有方法多聚焦于场景变量的因果建模,而忽视了对干预本身的有效表征。其解决方案的关键在于提出一种因果Delta嵌入(Causal Delta Embedding),该嵌入在潜在空间中对视觉场景不变,且在所影响的因果变量上具有稀疏性,从而能有效分离干预效应与场景变化。通过此机制,模型无需额外监督即可从图像对中学习因果表示,并在Causal Triplet挑战中显著优于基线方法,尤其在合成与真实世界OOD场景下表现突出。

链接: https://arxiv.org/abs/2508.04492
作者: Panagiotis Alimisis,Christos Diou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:


Abstract:Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs, have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out of distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.
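
“因果 Delta 嵌入”的基本构造可示意如下:以首末状态嵌入之差表示干预,并用 L1 正则促使 delta 只在少数(因果)维度上非零(网络结构与损失权重均为假设):

```python
import torch
import torch.nn as nn

class DeltaEmbedder(nn.Module):
    """干预的 Delta 嵌入:delta = z(end) - z(start)。
    对视觉场景不变的成分在相减中被抵消;稀疏正则进一步约束 delta
    只反映受干预影响的因果变量。"""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, img_start: torch.Tensor, img_end: torch.Tensor) -> torch.Tensor:
        return self.backbone(img_end) - self.backbone(img_start)

def sparsity_penalty(delta: torch.Tensor, weight: float = 1e-3) -> torch.Tensor:
    return weight * delta.abs().mean()

# 用法示意:用一个线性层充当编码器
embedder = DeltaEmbedder(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64)))
delta = embedder(torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32))
loss_reg = sparsity_penalty(delta)
```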

[CV-37] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

【速读】:该论文旨在解决扩散模型在真实世界视频超分辨率(VSR)任务中因计算速度慢和资源消耗高而难以实际部署的问题。现有量化方法在处理具有时序特性和高保真度要求的VSR模型时面临挑战。解决方案的关键在于提出QuantVSR,其核心创新包括:(1) 提出时空复杂度感知(Spatio-Temporal Complexity Aware, STCA)机制,通过校准数据集测量各层的空间与时间复杂度,并据此为低秩全精度(Full-Precision, FP)辅助分支分配层特定秩,实现FP与低比特分支的联合优化;(2) 设计可学习偏置对齐(Learnable Bias Alignment, LBA)模块,以减少量化误差带来的偏差。该方法在合成与真实数据集上均实现了与全精度模型相当的性能,显著优于当前主流低比特量化方法。

链接: https://arxiv.org/abs/2508.04485
作者: Bowen Chai,Zheng Chen,Libo Zhu,Wenbo Li,Yong Guo,Yulun Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Huawei Technologies Co., Ltd. (华为技术有限公司); 3. Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: this https URL.

[CV-38] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

【速读】:该论文旨在解决预训练文本到图像生成模型中概念擦除(Concept Erasure)任务的两个关键问题:一是现有方法因“非零对齐残差”导致擦除不彻底,尤其在复杂文本提示下更为明显;二是参数更新集中在少数深层网络层,引发生成质量下降。解决方案的关键在于提出一种新的闭式求解方法ErasePro,其核心创新包括两点:首先,在优化目标中引入严格的零残差约束,确保目标概念与锚点概念特征完全对齐,从而实现更完整的概念擦除;其次,采用渐进式的分层更新策略,从浅层到深层逐步将目标概念特征迁移至锚点概念特征,随着深度增加所需参数调整量递减,有效降低敏感深层的扰动,保障整体生成质量。

链接: https://arxiv.org/abs/2508.04472
作者: Hongxu Chen,Zhen Wang,Taoran Mei,Lin Li,Bowei Zhu,Runshi Li,Long Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:


Abstract:Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to “non-zero alignment residual”, especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.
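
为直观理解“零残差约束”与最小范数更新的关系,这里给出一个单概念情形下的示意推导(记号为本文补充,并非论文原始推导):设原权重为 $W$,目标概念与锚点概念的键向量分别为 $k_t$、$k_a$,零残差约束要求 $(W+\Delta W)\,k_t = W k_a$;在最小化 $\lVert\Delta W\rVert_F^2$ 的前提下,闭式解为

$$\Delta W = \frac{\big(W k_a - W k_t\big)\,k_t^{\top}}{k_t^{\top} k_t},$$

即把目标概念的输出特征精确搬移到锚点概念的输出上(残差严格为零),这正是零残差约束区别于最小二乘式“软对齐”的地方;论文的渐进式分层策略则是把这类更新从浅层到深层逐层施加。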

[CV-39] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

【速读】:该论文旨在解决高维视频数据(如4D视频)直接生成的复杂性问题,传统方法通常通过堆叠跨视图或时间注意力模块同时建模3D空间与时间特征,导致训练困难且生成质量受限。其解决方案的关键在于提出一种级联式视频扩散模型(4DVD),将4D内容生成解耦为两个子任务:首先生成粗粒度多视角布局以保证跨视图和时间一致性;随后基于此结构先验,引入结构感知的时空生成分支,融合输入单目视频的精细外观信息,从而生成高质量密集视图视频。该设计使得显式的4D表示(如4D Gaussian)能够被更准确优化,显著提升了新颖视图合成与4D生成的性能。

链接: https://arxiv.org/abs/2508.04467
作者: Shuzhou Yang,Xiaodong Cun,Xiaoyu Li,Yaowei Li,Jian Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross-view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. Benefiting from this, explicit 4D representations (such as 4D Gaussians) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is this https URL

[CV-40] Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion IJCAI2025

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在需要深度视觉感知的任务中表现不足的问题,特别是识别图像间细微差异的能力薄弱,其根源在于现有指令微调语料库中视觉知识的匮乏。解决方案的关键在于提出一种基于新型视觉知识密集型任务——因果驱动的视觉对象补全(Causality-driven Visual Object Completion, CVC)的自我改进框架。该任务要求LVLM根据可见信息与其掩蔽对象之间的因果关系进行推理,从而强化模型的视觉理解与推理能力;通过自动化实例构建管道低成本生成丰富训练样本,并利用这些样本进行试错式学习,实现LVLM的持续自我优化。实验表明,该方法在多个专业任务和综合基准上均取得显著性能提升。

链接: https://arxiv.org/abs/2508.04453
作者: Qingguo Hu,Ante Wang,Jia Song,Delai Qiu,Qingsong Liu,Jinsong Su
机构: Xiamen University (厦门大学); Xiamen Unisound Intelligence Technology Co., Ltd (厦门亿迅智能科技有限公司); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025


Abstract:Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at this https URL.

[CV-41] Benchmarking Foundation Models for Mitotic Figure Classification

【速读】:该论文旨在解决病理图像中标注数据稀缺导致深度学习模型性能受限的问题,特别是针对有丝分裂图像(mitotic figure)分类任务。其核心解决方案是利用自监督学习训练的大规模基础模型(foundation models),通过低秩适应(Low-Rank Adaptation, LoRA)技术对模型注意力机制进行微调,从而在仅使用10%训练数据的情况下实现接近全量数据训练的性能表现,并显著提升模型在未见肿瘤域上的泛化能力。关键创新在于采用LoRA替代传统线性探针(linear probing)策略,在保持高效参数更新的同时大幅改善跨域鲁棒性。

链接: https://arxiv.org/abs/2508.04441
作者: Jonas Ammeling,Jonathan Ganz,Emely Rosbach,Ludwig Lausser,Christof A. Bertram,Katharina Breininger,Marc Aubreville
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
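
文中用 LoRA 适配基础模型的注意力层,其通用形式是给冻结权重并联一个低秩旁路;下面是 LoRA 线性层的最小 PyTorch 示意(秩 r 与缩放 alpha 为假设值,非该论文的具体配置):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A x):基座权重 W 冻结,仅训练低秩矩阵 A、B,
    可训练参数量远小于全量微调,适合小数据场景。"""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # 初始时旁路输出为 0,不改变原模型行为
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```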

[CV-42] Composed Object Retrieval: Object-level Retrieval via Composed Expressions

【速读】:该论文旨在解决多模态系统中基于用户意图进行细粒度视觉内容检索的难题,现有组合图像检索(Composed Image Retrieval, CIR)方法仅限于图像级别的匹配,无法实现对特定目标对象的定位。为此,作者提出新的任务——组合对象检索(Composed Object Retrieval, COR),其核心在于从参考对象与检索文本的组合表达中精确识别并分割目标对象,从而实现对象级别的精准检索。解决方案的关键在于构建了首个大规模COR基准数据集COR127K(含127,166个检索三元组),并提出CORE模型,该模型通过参考区域编码、自适应视觉-文本交互机制以及区域级对比学习实现了端到端的联合优化,在基础类别和新类别上均显著优于现有方法,为细粒度多模态检索研究提供了有效基线。

链接: https://arxiv.org/abs/2508.04424
作者: Tong Wang,Guanyu Yang,Nian Liu,Zongyan Han,Jinxing Zhou,Salman Khan,Fahad Shahbaz Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.

[CV-43] Efficient Inter-Task Attention for Multitask Transformer Models ICONIP2025

【速读】:该论文旨在解决多任务学习(Multitask Learning)中Transformer架构因多头注意力(Multi-Head-Attention)机制导致的计算复杂度过高问题。由于注意力矩阵大小随任务数量呈二次增长(假设各任务查询数相近),在实际硬件限制下难以高效扩展。解决方案的关键在于提出一种新颖的可变形跨任务自注意力机制(Deformable Inter-Task Self-Attention),通过更高效的特征图间信息聚合方式,在显著降低浮点运算次数(FLOPs)和推理延迟的同时,提升各任务预测质量(最高达7.4%)。

链接: https://arxiv.org/abs/2508.04422
作者: Christian Bohn,Thomas Kurbiel,Klaus Friedrichs,Hasan Tercan,Tobias Meisen
机构: University of Wuppertal (伍珀塔尔大学); University of Applied Sciences (应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICONIP 2025


Abstract:In both Computer Vision and the wider Deep Learning field, the Transformer architecture is well-established as state-of-the-art for many applications. For Multitask Learning, however, where there may be many more queries necessary compared to single-task models, its Multi-Head-Attention often approaches the limits of what is computationally feasible considering practical hardware limitations. This is due to the fact that the size of the attention matrix scales quadratically with the number of tasks (assuming roughly equal numbers of queries for all tasks). As a solution, we propose our novel Deformable Inter-Task Self-Attention for Multitask models that enables the much more efficient aggregation of information across the feature maps from different tasks. In our experiments on the NYUD-v2 and PASCAL-Context datasets, we demonstrate an order-of-magnitude reduction in both FLOPs count and inference latency. At the same time, we also achieve substantial improvements by up to 7.4% in the individual tasks’ prediction quality metrics.

[CV-44] Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

【速读】:该论文旨在解决参考音频-视觉分割(Referring Audio-Visual Segmentation, Ref-AVS)任务中现有方法依赖强像素级监督且缺乏可解释性的问题。其解决方案的关键在于提出TGS-Agent框架,该框架将任务分解为“思考-定位-分割”(Think-Ground-Segment)三阶段流程,模拟人类推理过程:首先通过Ref-Thinker这一多模态语言模型对文本、视觉和听觉线索进行联合推理,生成显式的对象描述;随后利用该描述作为提示引导Grounding-DINO与SAM2完成粗粒度定位与精确分割,从而避免对像素级标注的依赖,并提升模型的可解释性。

链接: https://arxiv.org/abs/2508.04418
作者: Jinxing Zhou,Yanghao Zhou,Mingfei Han,Tong Wang,Xiaojun Chang,Hisham Cholakkal,Rao Muhammad Anwer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注: Project page: this https URL


Abstract:Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at this https URL.
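
“Think-Ground-Segment”三阶段流程可抽象为如下管线草图;其中 ref_thinker、grounder、segmenter 都是注入的可调用对象(真实系统分别对应 Ref-Thinker、Grounding-DINO 与 SAM2,此处刻意不绑定任何具体 API,函数名均为假设):

```python
def think_ground_segment(video_frames, audio, expression,
                         ref_thinker, grounder, segmenter):
    """1) Think:多模态推理,把指代表达解析成显式的目标物体描述;
       2) Ground:用该描述在每帧上做粗粒度定位,得到候选框;
       3) Segment:以候选框为提示做精确分割,无需像素级监督。"""
    object_desc = ref_thinker(video_frames, audio, expression)  # 例如 "the barking dog on the left"
    masks = []
    for frame in video_frames:
        boxes = grounder(frame, object_desc)
        masks.append(segmenter(frame, boxes) if boxes else None)
    return object_desc, masks
```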

[CV-45] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中面临的两个核心挑战:一是现有基于文本的思维链(Chain-of-Thought, CoT)推理方法存在跨模态交互有限、长视频场景下幻觉(hallucination)增强的问题;二是视频问答(Video Question Answering, VQA)与时间定位(Temporal Grounding)任务之间缺乏协同优化机制。解决方案的关键在于提出一个端到端的代理式视频推理框架VITAL(Video Intelligence via Tool-Augmented Learning),其核心创新包括:1)引入视觉工具箱(visual toolbox),使模型能按需密集采样视频帧并生成多模态CoT,从而提升长视频的精准推理能力;2)构建两个高质量多任务数据集MTVR-CoT-72k(监督微调)和MTVR-RL-110k(强化学习),实现VQA与时间定位任务的联合训练;3)设计难度感知组相对策略优化算法(Difficulty-aware Group Relative Policy Optimization, DGRPO),缓解多任务强化学习中的难度不平衡问题。实验表明,VITAL在11个视频理解基准上显著优于现有方法,尤其在长视频场景中表现突出。

链接: https://arxiv.org/abs/2508.04416
作者: Haoji Zhang,Xin Gu,Jiawen Li,Chixiang Ma,Sule Bai,Chubin Zhang,Bowen Zhang,Zhichao Zhou,Dongliang He,Yansong Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:


Abstract:The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data and model weight will be made publicly available.

[CV-46] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models

【速读】:该论文旨在解决现有建筑节能改造前期规划中,如何高效且准确地从图像数据中提取符合Level of Detail (LoD) 3标准的热工三维模型(包含窗户等关键几何特征)的问题。传统方法依赖图像分割与投影,存在视角失真和流程复杂等问题。其解决方案的关键在于提出了一种名为“可扩展图像到三维立面解析器”(Scalable Image-to-3D Facade Parser, SI3FP)的新型管道,该方法直接在正交投影图像平面上建模几何基元(geometric primitives),从而统一处理流程并减少透视畸变,同时兼容稀疏(如Google街景)与密集(如手持相机)数据源,在瑞典典型住宅建筑上实现了约5%的窗墙比估算误差,满足早期改造分析精度要求。

链接: https://arxiv.org/abs/2508.04406
作者: Yinan Yu,Alex Gonzalez-Caceres,Samuel Scheidegger,Sanjay Somanath,Alexander Hollberg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in Automation in Construction


Abstract:Renovating existing buildings is essential for climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.
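
论文以窗墙比(window-to-wall ratio, WWR)约 5% 的误差作为精度指标;在正射图像平面上以矩形基元表示窗与立面后,WWR 的计算本身非常直接(矩形表示与忽略窗户重叠均为此处的简化假设):

```python
def window_to_wall_ratio(windows, facade):
    """windows: [(x, y, w, h), ...] 正射平面上的窗户矩形;facade: (x, y, w, h) 立面矩形。
    WWR = 窗户总面积 / 立面面积(假设窗户互不重叠)。"""
    win_area = sum(w * h for _, _, w, h in windows)
    facade_area = facade[2] * facade[3]
    return win_area / facade_area

# 用法示意:两扇 1.2m x 1.4m 的窗,10m x 6m 的立面
wwr = window_to_wall_ratio([(1, 1, 1.2, 1.4), (4, 1, 1.2, 1.4)], (0, 0, 10, 6))
```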

[CV-47] ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition

【速读】:该论文旨在解决耳部生物特征识别(ear biometrics)在少量标注数据条件下性能受限的问题,主要挑战包括标注数据稀缺性和类内变异较大。解决方案的关键在于提出一种基于原型的少样本学习框架ProtoN,其核心创新是通过构建类特定图结构(class-specific graph),将每个样本印象表示为节点,并引入可学习的原型节点(prototype node)以编码身份级信息;进而设计原型图神经网络(Prototype Graph Neural Network, PGNN)层,利用双路径消息传递机制同时优化样本与原型表示,并结合跨图原型对齐策略增强类内紧凑性与类间可分性,最终通过混合损失函数平衡任务内与全局分类目标,从而显著提升嵌入空间的结构质量与识别准确率。

链接: https://arxiv.org/abs/2508.04381
作者: Santhoshkumar Peddi,Sadhvik Bathini,Arun Balasubramanian,Monalisa Sarma,Debasis Samanta
机构: Indian Institute of Technology Kharagpur (印度理工学院克哈格普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:


Abstract:Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing the effectiveness for few-shot ear recognition under limited data conditions.
zh
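
为便于理解上述双路径消息传递的思路,下面给出一段基于 PyTorch 的极简示意(非论文官方实现,网络结构与聚合方式均为说明性假设):同一身份的多个印象嵌入与一个可学习原型节点相互传递消息,原型先聚合印象信息,再把身份级信息回传给每个印象节点。

```python
import torch
import torch.nn as nn

class PrototypeMessagePassing(nn.Module):
    """示意:印象节点与原型节点之间的双路径消息传递(假设性实现)。"""
    def __init__(self, dim):
        super().__init__()
        self.to_proto = nn.Linear(dim, dim)  # 印象 -> 原型 路径
        self.to_inst = nn.Linear(dim, dim)   # 原型 -> 印象 路径

    def forward(self, inst, proto):
        # inst: (n, d) 同一身份的 n 个印象嵌入; proto: (d,) 可学习原型节点
        proto_new = proto + self.to_proto(inst).mean(dim=0)     # 聚合印象信息更新原型
        inst_new = inst + self.to_inst(proto_new).unsqueeze(0)  # 身份级信息回传给印象
        return inst_new, proto_new

layer = PrototypeMessagePassing(64)
inst = torch.randn(5, 64)               # 5 个印象
proto = nn.Parameter(torch.zeros(64))   # 可学习原型节点
inst_out, proto_out = layer(inst, proto)
print(inst_out.shape, proto_out.shape)  # torch.Size([5, 64]) torch.Size([64])
```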

[CV-48] VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones

【速读】:该论文旨在解决视觉模型(vision models)向时间序列预测(time series forecasting, TSF)跨模态迁移中的三大核心挑战:(1)图像数据与时间序列在结构和范围上的模态差异;(2)标准RGB三通道视觉模型难以处理任意数量变量的多变量时间序列问题;(3)视觉模型输出为确定性结果,而时间序列预测需具备不确定性感知的概率性输出能力。解决方案的关键在于提出VisionTS++,其包含三项创新:(1)基于视觉模型的过滤机制用于筛选高质量时间序列数据,缓解模态差异并提升预训练稳定性;(2)彩色多变量转换方法将多变量时间序列映射为多子图RGB图像,有效建模变量间复杂依赖关系;(3)采用多分位数预测策略,通过并行重建头生成不同分位数水平的预测结果,无需假设特定分布即可灵活逼近任意输出分布。该方法在多个分布内与分布外基准上均达到SOTA性能,验证了其作为通用时间序列基础模型(universal time series foundation model)的潜力。

链接: https://arxiv.org/abs/2508.04379
作者: Lefei Shen,Mouxiang Chen,Xu Liu,Han Fu,Xiaoxue Ren,Jianling Sun,Zhuo Li,Chenghao Liu
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); State Street Technology (Zhejiang) Ltd. (State Street Technology (浙江)有限公司); Salesforce Research Asia (Salesforce 研究亚洲)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, including 3 innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating modality gap and improving pre-training stability, (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
zh
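
论文提出的“彩色多变量转换”在摘要中只给出思路,未给出实现细节;下面是按该思路构造的一个假设性 NumPy 示意:把每个变量归一化后渲染成灰度条带,循环分配到 R/G/B 通道并纵向拼接为多子图 RGB 图像,供视觉主干按图像重建任务处理。

```python
import numpy as np

def multivariate_to_rgb(ts, h_per_var=32):
    """示意:将 (T, V) 多变量序列逐变量渲染为子图并拼成一幅 RGB 图(假设性实现)。"""
    T, V = ts.shape
    img = np.zeros((V * h_per_var, T, 3), dtype=np.float32)
    for v in range(V):
        x = ts[:, v]
        x = (x - x.min()) / (x.max() - x.min() + 1e-8)  # min-max 归一化到 [0, 1]
        strip = np.tile(x[None, :], (h_per_var, 1))      # 扩展为 (h, T) 的灰度条带
        img[v * h_per_var:(v + 1) * h_per_var, :, v % 3] = strip  # 循环分配通道
    return img

ts = np.cumsum(np.random.randn(96, 4), axis=0)  # T=96、V=4 的随机游走序列
print(multivariate_to_rgb(ts).shape)             # (128, 96, 3)
```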

[CV-49] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理长时视频输入时面临的两大核心挑战:一是受限于上下文长度和训练成本,需对视频进行稀疏帧采样;二是现有方法如均匀采样或关键帧搜索缺乏事件感知能力,难以捕捉重要事件,且难以通过训练优化。解决方案的关键在于提出一种基于强化学习的时序采样策略优化方法(Temporal Sampling Policy Optimization, TSPO),其核心创新包括:1)设计了一个可训练的事件感知时序代理(event-aware temporal agent),通过建模事件查询相关性实现概率化的关键帧选择;2)构建一个将关键帧选择与语言生成联合决策的强化学习框架,支持端到端的群体相对优化,并采用规则化奖励机制提升采样效率与准确性;3)开发了包含丰富时序数据和“大海捞针”(Needle-in-a-Haystack)式长视频数据的训练数据构建流程,从而有效优化采样策略并提升跨不同先进Video-MLLMs的迁移能力。

链接: https://arxiv.org/abs/2508.04369
作者: Canhui Tang,Zifan Han,Hongbo Sun,Sanping Zhou,Xuchong Zhang,Xin Wei,Ye Yuan,Jinglin Xu,Hao Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models’ event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for TSPO training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
zh

[CV-50] Continual Multiple Instance Learning for Hematologic Disease Diagnosis MICCAI2024

【速读】:该论文旨在解决多实例学习(Multiple Instance Learning, MIL)场景下模型在持续学习(continual learning)过程中因灾难性遗忘(catastrophic forgetting)导致性能下降的问题,尤其是在单细胞血液疾病诊断(如白血病检测)中,面对动态变化的数据分布时模型难以保持稳定性能。其解决方案的关键在于提出了一种专为MIL设计的基于回放(rehearsal-based)的持续学习方法:通过结合实例注意力得分(instance attention score)与实例到袋均值向量和类别均值向量的距离,智能选择来自不同袋(bag)中的代表性样本存储于示例集(exemplary sets),从而有效保留数据多样性并缓解遗忘问题。实验基于真实白血病实验室一个月的数据,在类增量场景下验证了该方法显著优于现有主流持续学习方法,首次实现了适用于MIL的持续学习框架。

链接: https://arxiv.org/abs/2508.04368
作者: Zahra Ebrahimi,Raheleh Salehi,Nassir Navab,Carsten Marr,Ario Sadafi
机构: Helmholtz Munich (慕尼黑亥姆霍兹研究中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
备注: Accepted for publication at MICCAI 2024 workshop on Efficient Medical AI

点击查看摘要

Abstract:The dynamic environment of laboratories and clinics, with streams of data arriving on a daily basis, requires regular updates of trained machine learning models for consistent performance. Continual learning is supposed to help train models without catastrophic forgetting. However, state-of-the-art methods are ineffective for multiple instance learning (MIL), which is often used in single-cell-based hematologic disease diagnosis (e.g., leukemia detection). Here, we propose the first continual learning method tailored specifically to MIL. Our method is rehearsal-based over a selection of single instances from various bags. We use a combination of the instance attention score and distance from the bag mean and class mean vectors to carefully select which samples and instances to store in exemplary sets from previous tasks, preserving the diversity of the data. Using the real-world input of one month of data from a leukemia laboratory, we study the effectiveness of our approach in a class incremental scenario, comparing it to well-known continual learning methods. We show that our method considerably outperforms state-of-the-art methods, providing the first continual learning approach for MIL. This enables the adaptation of models to shifting data distributions over time, such as those caused by changes in disease occurrence or underlying genetic alterations.
zh
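
示例集(exemplar)挑选是该方法的核心步骤。下面用 NumPy 给出一个极简示意;注意注意力得分与“到袋均值/类均值距离”的具体组合方式摘要未披露,此处的线性加权及系数 alpha 均为假设:

```python
import numpy as np

def select_exemplars(instances, attn, class_mean, k=8, alpha=0.5):
    """示意:综合注意力得分与两类距离挑选代表性实例(组合方式为假设)。
    instances: (n, d) 袋内实例特征; attn: (n,) 实例注意力得分。"""
    bag_mean = instances.mean(axis=0)
    d_bag = np.linalg.norm(instances - bag_mean, axis=1)    # 到袋均值的距离
    d_cls = np.linalg.norm(instances - class_mean, axis=1)  # 到类均值的距离
    score = attn - alpha * (d_bag + d_cls) / 2  # 注意力高且靠近两均值者得分高
    return np.argsort(-score)[:k]                # 返回存入示例集的实例索引

feats = np.random.randn(100, 128)
attn = np.random.rand(100)
print(select_exemplars(feats, attn, class_mean=np.zeros(128)))
```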

[CV-51] RotatedMVPS: Multi-view Photometric Stereo with Rotated Natural Light

【速读】:该论文旨在解决多视角光度立体(Multiview Photometric Stereo, MVPS)方法在自然光照条件下难以准确恢复表面形状与反射率的问题,尤其是现有方法通常依赖受控暗室环境来实现光照变化,且常忽略反射率和光源属性的联合恢复,从而限制了其在真实场景中的应用及下游逆渲染任务的性能。解决方案的关键在于提出RotatedMVPS框架,通过使用旋转平台实现自然光下不同相机和物体姿态下的光照一致性,有效减少复杂环境光带来的不确定性;同时,将现成的学习型单视角光度立体方法的数据先验引入MVPS框架,显著提升了形状与反射率恢复的精度。

链接: https://arxiv.org/abs/2508.04366
作者: Songyun Yang,Yufei Han,Jilong Zhang,Kongming Liang,Peng Yu,Zhaowei Qu,Heng Guo
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Multiview photometric stereo (MVPS) seeks to recover high-fidelity surface shapes and reflectances from images captured under varying views and illuminations. However, existing MVPS methods often require controlled darkroom settings for varying illuminations or overlook the recovery of reflectances and illuminations properties, limiting their applicability in natural illumination scenarios and downstream inverse rendering tasks. In this paper, we propose RotatedMVPS to solve shape and reflectance recovery under rotated natural light, achievable with a practical rotation stage. By ensuring light consistency across different camera and object poses, our method reduces the unknowns associated with complex environment light. Furthermore, we integrate data priors from off-the-shelf learning-based single-view photometric stereo methods into our MVPS framework, significantly enhancing the accuracy of shape and reflectance recovery. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach.
zh

[CV-52] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

【速读】:该论文旨在解决3D直线在相机定位与结构映射中参数化冗余的问题,特别是现有方法仅处理独立直线而忽略了人造环境中普遍存在的平行线结构规律。其解决方案的关键在于提出一种基于黎曼流形(Riemannian manifold)的统一最小参数化表示方法——RiemanLine,通过将每条直线分解为共享的消失方向(vanishing direction)和约束在正交子空间中的归一化方向向量,实现对单条直线及平行线组的紧凑编码;对于n条平行线,参数空间从传统的4n维压缩至2n+2维,无需显式约束即可自然嵌入平行性,同时集成到因子图框架中进行基于流形的联合优化,从而提升姿态估计精度、重建质量并增强收敛稳定性。

链接: https://arxiv.org/abs/2508.04335
作者: Yanyan Li,Ze Yang,Keisuke Tateno,Federico Tombari,Liang Zhao,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere S^2, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For n parallel lines, the proposed representation reduces the parameter space from 4n (orthonormal form) to 2n+2, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
zh
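
下面用 NumPy 演示 2n+2 维参数化的一种可能布局(示意性假设,非论文的实际实现):前 2 维编码 S^2 上共享消失方向的球面坐标,其后每条线 2 维,给出与该方向正交的子空间内缩放法向量的坐标。

```python
import numpy as np

def sphere_to_dir(theta, phi):
    """球面坐标 (theta, phi) -> S^2 上的单位方向向量(共 2 个自由度)。"""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def unpack_parallel_lines(params, n):
    """示意:把 2n+2 维参数解码为 n 条平行 3D 线(参数布局为本文假设)。"""
    assert params.shape[0] == 2 * n + 2
    d = sphere_to_dir(params[0], params[1])  # 共享消失方向
    # 构造与 d 正交的标准正交基 (e1, e2),张成法向量所在子空间
    tmp = np.array([1.0, 0.0, 0.0]) if abs(d[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(d, tmp); e1 /= np.linalg.norm(e1)
    e2 = np.cross(d, e1)
    lines = []
    for i in range(n):
        a, b = params[2 + 2 * i: 4 + 2 * i]
        lines.append((d, a * e1 + b * e2))  # (方向, 缩放法向量) 形式的线表示
    return lines

params = np.random.randn(2 * 3 + 2)          # n=3 条平行线仅需 8 个参数
for d, m in unpack_parallel_lines(params, 3):
    print(np.dot(d, m))                       # 均应接近 0:m 落在 d 的正交子空间
```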

[CV-53] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

【速读】:该论文旨在解决流模型(flow model)在文本到图像生成任务中与强化学习(reinforcement learning, RL)结合时,难以实现人类偏好对齐的问题,尤其是现有方法因时间均匀性假设导致的信用分配(credit assignment)效率低下,从而影响细粒度奖励优化和收敛性能。解决方案的关键在于提出 TempFlow-GRPO 框架,其核心创新包括:(i) 轨迹分支机制(trajectory branching mechanism),通过在指定分支点集中引入随机性以提供过程奖励(process rewards),实现无需专用中间奖励模型的精确信用分配;(ii) 噪声感知加权策略(noise-aware weighting scheme),根据每个时间步的内在探索潜力动态调节策略优化强度,在高影响力早期阶段优先学习,后期则确保稳定微调,从而实现尊重生成动力学的时间感知优化。

链接: https://arxiv.org/abs/2508.04324
作者: Xiaoxuan He,Siming Fu,Yuke Zhao,Wanli Li,Jian Yang,Dacheng Yin,Fengyun Rao,Bo Zhang
机构: Zhejiang University (浙江大学); WeChat Vision, Tencent Inc (腾讯公司微信视觉团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.
zh
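
噪声感知加权的具体函数形式摘要未披露;下面给出一个按“噪声越大、探索潜力越高、权重越大”这一直觉构造的假设性 PyTorch 示意,把各去噪步的策略梯度损失按 softmax 归一化的噪声权重加权:

```python
import torch

def noise_aware_weights(sigmas, temperature=1.0):
    """示意:按各时间步噪声水平分配优化权重(假设性方案),softmax 归一化。"""
    return torch.softmax(sigmas / temperature, dim=0)

def weighted_policy_loss(logprobs, advantages, sigmas):
    """logprobs/advantages: (T,) 每个去噪步的策略对数概率与组相对优势。"""
    w = noise_aware_weights(sigmas)
    return -(w * logprobs * advantages).sum()

T = 10
sigmas = torch.linspace(1.0, 0.05, T)          # 由高到低的噪声日程
logprobs = torch.randn(T, requires_grad=True)
advantages = torch.randn(T)
loss = weighted_policy_loss(logprobs, advantages, sigmas)
loss.backward()                                 # 梯度主要来自高噪声的早期步
print(loss.item())
```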

[CV-54] A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks

【速读】:该论文旨在解决分布式声学传感(Distributed Acoustic Sensing, DAS)信号识别中因异构感知环境导致的数据分布差异问题,以及由此引发的跨域泛化能力弱和标注训练数据稀缺问题。解决方案的关键在于提出一种基于掩码自编码器(Masked Autoencoder, MAE)的DAS信号基础模型MAEPD,通过在包含多种场景(如步态时空信号、周界安防二维图像、管道泄漏时频图像及开放数据如鲸鱼鸣叫和地震活动)的635,860样本数据集上进行自监督掩码重建预训练,以提取深层语义特征;同时采用视觉提示微调(Visual Prompt Tuning, VPT)策略,在冻结预训练主干参数的前提下,仅对少量可学习的视觉提示向量进行微调,从而实现高效且高精度的下游任务迁移,例如在室内步态识别任务中仅用0.322%参数微调即达到96.94%准确率,显著优于全参数微调方法。

链接: https://arxiv.org/abs/2508.04316
作者: Kun Gui,Hongliang Ren,Shang Shi,Jin Lu,Changqiu Yu,Quanjun Cao,Guomin Gu,Qi Xuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
zh
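
VPT-Deep 的核心做法是冻结主干、只训练插入各 Transformer 层的少量提示向量。下面是一个基于 PyTorch 的极简示意(提示数量、初始化范围等超参数为假设):

```python
import torch
import torch.nn as nn

class VPTBlock(nn.Module):
    """示意:在冻结的 Transformer 层前拼接可学习提示向量(VPT-Deep 风格,假设性实现)。"""
    def __init__(self, frozen_layer, dim, n_prompts=8):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False            # 冻结预训练主干参数
        self.prompts = nn.Parameter(torch.zeros(n_prompts, dim))
        nn.init.uniform_(self.prompts, -0.02, 0.02)

    def forward(self, x):                       # x: (B, N, D) token 序列
        B = x.size(0)
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)
        out = self.layer(torch.cat([p, x], dim=1))
        return out[:, self.prompts.size(0):]    # 丢弃本层提示,只向后传递原 token

dim = 64
enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
block = VPTBlock(enc, dim)
y = block(torch.randn(2, 16, dim))
print(y.shape)  # torch.Size([2, 16, 64]);可训练参数仅为 8x64 的提示向量
```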

[CV-55] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding

【速读】:该论文旨在解决时序句子定位(Temporal Sentence Grounding, TSG)任务中因缺乏显式监督导致查询(query)角色重叠、产生冗余预测的问题。解决方案的关键在于引入长度感知的Transformer架构(Length-Aware Transformer, LATR),通过将查询分组并赋予其对应的时间长度先验(短、中、长段落),并在训练中增加长度分类辅助任务,抑制不匹配长度的预测,从而引导每个查询专注于特定时间跨度的定位任务,提升模型的判别能力和定位精度。

链接: https://arxiv.org/abs/2508.04299
作者: Yifan Wang,Ziyi Liu,Xiaolong Sun,Jiawei Wang,Hongmin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal sentence grounding (TSG) is a highly challenging task aiming to localize the temporal segment within an untrimmed video corresponding to a given natural language description. Benefiting from the design of learnable queries, the DETR-based models have achieved substantial advancements in the TSG task. However, the absence of explicit supervision often causes the learned queries to overlap in roles, leading to redundant predictions. Therefore, we propose to improve TSG by making each query fulfill its designated role, leveraging the length priors of the video-description pairs. In this paper, we introduce the Length-Aware Transformer (LATR) for TSG, which assigns different queries to handle predictions based on varying temporal lengths. Specifically, we divide all queries into three groups, responsible for segments with short, middle, and long temporal durations, respectively. During training, an additional length classification task is introduced. Predictions from queries with mismatched lengths are suppressed, guiding each query to specialize in its designated function. Extensive experiments demonstrate the effectiveness of our LATR, achieving state-of-the-art performance on three public benchmarks. Furthermore, the ablation studies validate the contribution of each component of our method and the critical role of incorporating length priors into the TSG task.
zh
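
下面用 PyTorch 给出“按长度分组并抑制不匹配查询”这一训练逻辑的示意(分组阈值为假设值,真实模型中的分组与匹配机制更完整):

```python
import torch

def suppress_mismatched(pred_spans, query_groups, gt_len, bounds=(0.15, 0.4)):
    """示意:依据 GT 片段长度屏蔽长度组不匹配的查询预测(阈值为假设)。
    pred_spans: (Q, 2) 归一化的 (center, length) 预测;
    query_groups: (Q,) 每个查询所属组,0=短 / 1=中 / 2=长。"""
    if gt_len < bounds[0]:
        target_group = 0
    elif gt_len < bounds[1]:
        target_group = 1
    else:
        target_group = 2
    mask = (query_groups == target_group)  # 只保留匹配组的查询参与匹配与损失
    return pred_spans[mask], mask

Q = 9
pred = torch.rand(Q, 2)
groups = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2])
kept, mask = suppress_mismatched(pred, groups, gt_len=0.55)
print(mask.tolist())  # 只有长片段组(2)的查询被保留
```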

[CV-56] MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction ICCV2025

【速读】:该论文旨在解决多视角图像合成(novel view synthesis)中因输入视图基线(baseline)差异较大时导致的重建质量下降问题,尤其在稀疏视图下小基线与大基线场景中表现不稳定。其核心解决方案是提出Multi-Baseline Gaussian Splatting (MuRF),通过融合多视角立体视觉(Multi-View Stereo, MVS)和单目深度估计(Monocular Depth Estimation, MDE)特征以增强泛化能力;设计投影与采样机制实现深度特征的深层融合,并构建精细的概率体素引导特征图回归;同时引入参考视图损失(reference-view loss)提升几何一致性与优化效率;最终利用3D高斯表示加速训练与推理过程,在保持高质量渲染的同时显著提升跨基线设置下的鲁棒性。

链接: https://arxiv.org/abs/2508.04297
作者: Yaopeng Lou,Liao Shen,Tianqi Liu,Jiaqi Li,Zihao Huang,Huiqiang Sun,Zhiguo Cao
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by ICCV 2025

点击查看摘要

Abstract:We present Multi-Baseline Gaussian Splatting (MuRF), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference time while enhancing rendering quality. MuRF achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets.
zh

[CV-57] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space

【速读】:该论文旨在解决点云配准(Point Cloud Registration)中因相似变换(平移、缩放、旋转)、非均匀密度、随机噪声点以及几何结构缺失等因素导致的局部最优解问题。解决方案的关键在于提出了一种基于预肯德尔形状空间(Pre-Kendall Shape Space, PKSS)的形状特征相似性度量方法,该方法不依赖于点到点或点到面的度量,而是通过流形度量方式实现对不同表示形式下点云的鲁棒匹配,从而直接生成变换矩阵,且无需数据训练或复杂特征编码,显著提升了算法的效率与实用性。

链接: https://arxiv.org/abs/2508.04286
作者: Chenlei Lv,Hui Huang
机构: Shenzhen University (深圳大学); College of Computer Science and Software Engineering (计算机科学与软件工程学院); Visual Computing Research Center (视觉计算研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 15 figures, and will be published in IEEE TVCG

点击查看摘要

Abstract:Point cloud registration is a classical topic in the field of 3D Vision and Computer Graphics. Generally, the implementation of registration is typically sensitive to similarity transformations (translation, scaling, and rotation), noisy points, and incomplete geometric structures. Especially, the non-uniform scales and defective parts of point clouds increase the probability of getting stuck in local optima during registration. In this paper, we propose a robust point cloud registration method, PKSS-Align, that can handle various influences, including similarity transformations, non-uniform densities, random noisy points, and defective parts. The proposed method measures shape feature-based similarity between point clouds on the Pre-Kendall shape space (PKSS), which is a shape measurement-based scheme and doesn't require a point-to-point or point-to-plane metric. The employed measurement can be regarded as a manifold metric that is robust to various representations in the Euclidean coordinate system. Benefiting from the measurement, the transformation matrix can be directly generated for point clouds with the mentioned influences at the same time. The proposed method does not require data training or complex feature encoding. Based on a simple parallel acceleration, it can achieve significant improvement in efficiency and feasibility in practice. Experiments demonstrate that our method outperforms the relevant state-of-the-art methods.
zh
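
Kendall 预形状的基本操作是去质心并归一化尺度,再在旋转对齐后度量残差。下面的 NumPy 示意演示了这一点(假设两组点一一对应、且未处理反射,仅作说明,并非论文的流形度量实现):

```python
import numpy as np

def preshape(X):
    """点云 -> 预形状:去质心 + 归一化到单位 Frobenius 范数,消除平移与尺度。"""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / (np.linalg.norm(Xc) + 1e-12)

def procrustes_distance(X, Y):
    """示意:预形状上旋转对齐后的残差距离(假设两组点一一对应)。"""
    A, B = preshape(X), preshape(Y)
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt                         # 正交 Procrustes 的最优旋转
    return np.linalg.norm(A - B @ R)

X = np.random.randn(500, 3)
Rm = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Y = (2.5 * X) @ Rm.T + 10.0            # X 的相似变换副本(旋转+缩放+平移)
print(procrustes_distance(X, Y))       # 应接近 0:相似变换被完全消除
```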

[CV-58] Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval ACM-MM2025

【速读】:该论文针对视频时刻检索(Video Moment Retrieval, VMR)任务中现有方法忽视音频模态、或对多模态信息进行粗粒度融合的问题展开研究。具体而言,当前多数方法仅依赖视觉与文本模态,而少数尝试融合音频的方法未考虑音频质量差异(如背景噪声干扰),且缺乏细粒度的跨模态交互机制,导致性能受限。解决方案的关键在于提出一种重要性感知的多粒度融合模型(Importance-aware Multi-Granularity fusion model, IMG):首先设计伪标签监督的音频重要性预测器,动态评估音频贡献并加权抑制噪声干扰;其次构建多粒度音频融合模块,在局部、事件和全局三个层次上自适应融合音频与视觉模态,充分挖掘其互补语义信息;此外引入跨模态知识蒸馏策略以应对推理阶段缺失音频的情况。该方法显著提升了VMR在含音频场景下的准确性与鲁棒性。

链接: https://arxiv.org/abs/2508.04273
作者: Junan Lin,Daizong Liu,Xianke Chen,Xiaoye Qu,Xun Yang,Jixiang Zhu,Sanyuan Zhang,Jianfeng Dong
机构: Zhejiang University(浙江大学); Peking University(北京大学); Zhejiang Gongshang University(浙江工商大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); University of Science and Technology of China(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to ACM MM 2025

点击查看摘要

Abstract:Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are counter-practical as: Not all audios are helpful for video moment retrieval, and the audio of some videos may be complete noise or background sound that is meaningless to the moment determination. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model’s capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art with audio-video fusion in VMR methods. Our code is available at this https URL.
zh
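
音频重要性加权的一个极简 PyTorch 示意如下(打分网络结构与融合方式为假设;论文实际在局部/事件/全局三个粒度上做融合,训练时可用伪标签对得分施加监督):

```python
import torch
import torch.nn as nn

class AudioImportanceGate(nn.Module):
    """示意:预测音频重要性得分并据此加权音频特征(假设性实现)。"""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, audio, visual):
        # audio / visual: (B, D) 池化后的模态特征
        score = torch.sigmoid(self.scorer(torch.cat([audio, visual], dim=-1)))
        fused = visual + score * audio   # 噪声音频的 score 趋近 0,从而被抑制
        return fused, score.squeeze(-1)

gate = AudioImportanceGate(256)
fused, s = gate(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape, s.shape)  # torch.Size([4, 256]) torch.Size([4])
```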

[CV-59] DSNNs: Competitive Topographic Deep Spiking Neural Networks for Visual Cortex Modeling

【速读】:该论文旨在解决传统深度人工神经网络(Artificial Neural Networks, ANNs)在模拟灵长类视觉皮层(topographic organization)时忽视关键时间动态性的问题,这一缺陷导致模型在物体识别等任务中性能下降且生物合理性不足。解决方案的关键在于引入脉冲神经网络(Spiking Neural Networks, SNNs),并设计一种新颖的时空约束损失函数(Spatio-Temporal Constraints, STC),从而在深度脉冲神经网络中实现从低级感觉输入到高级抽象表征的层次化空间功能组织。该方法不仅保持了ImageNet Top-1准确率无损(相比当前最优拓扑ANN TopoNet的3%下降),还显著提升了模型与大脑结构的相似性,并揭示了拓扑组织通过脉冲机制促进稳定高效的时序信息处理,增强了模型鲁棒性。

链接: https://arxiv.org/abs/2508.04270
作者: Deming Zhou,Yuetong Fang,Zhaorui Wang,Renjing Xu
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The primate visual cortex exhibits topographic organization, where functionally similar neurons are spatially clustered, a structure widely believed to enhance neural processing efficiency. While prior works have demonstrated that conventional deep ANNs can develop topographic representations, these models largely neglect crucial temporal dynamics. This oversight often leads to significant performance degradation in tasks like object recognition and compromises their biological fidelity. To address this, we leverage spiking neural networks (SNNs), which inherently capture spike-based temporal dynamics and offer enhanced biological plausibility. We propose a novel Spatio-Temporal Constraints (STC) loss function for topographic deep spiking neural networks (TDSNNs), successfully replicating the hierarchical spatial functional organization observed in the primate visual cortex from low-level sensory input to high-level abstract representations. Our results show that STC effectively generates representative topographic features across simulated visual cortical areas. While introducing topography typically leads to significant performance degradation in ANNs, our spiking architecture exhibits a remarkably small performance drop (No drop in ImageNet top-1 accuracy, compared to a 3% drop observed in TopoNet, which is the best-performing topographic ANN so far) and outperforms topographic ANNs in brain-likeness. We also reveal that topographic organization facilitates efficient and stable temporal information processing via the spike mechanism in TDSNNs, contributing to model robustness. These findings suggest that TDSNNs offer a compelling balance between computational performance and brain-like features, providing not only a framework for interpreting neural science phenomena but also novel insights for designing more efficient and robust deep learning models.
zh

[CV-60] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models

【速读】:该论文旨在解决持续语义分割(Continual Semantic Segmentation, CSS)中因灾难性遗忘(catastrophic forgetting)导致模型性能下降的问题。尽管先前研究普遍认为直接微调(Direct Fine-Tuning, DFT)作为最简单的策略,存在严重遗忘现象并被视为性能下界,但本文通过系统性实验发现,预训练视觉模型(Pre-trained Vision Models, PVMs)本身具有较强的抗遗忘能力,现有方法低估了其内在稳定性。关键洞察在于:遗忘主要源于分类器(classifier)偏离PVM特征空间,而非骨干网络(backbone)表示退化。基于此,作者提出DFT*,一种轻量级增强方案,核心包括冻结PVM骨干和已学习分类器、预分配未来分类器等策略,在显著减少可训练参数和训练时间的同时,超越或媲美16种先进CSS方法的性能表现。

链接: https://arxiv.org/abs/2508.04267
作者: Duzhen Zhang,Yong Ren,Wei Cong,Junhao Zheng,Qiaoyi Su,Shuncheng Jia,Zhong-Zhi Li,Xuanle Zhao,Ye Bai,Feilong Chen,Qi Tian,Tielin Zhang
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Chinese Academy of Sciences (中国科学院); Shenyang Institute of Automation, Chinese Academy of Sciences (中国科学院沈阳自动化研究所); South China University of Technology (华南理工大学); Huawei Inc. (华为公司); Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences (中国科学院脑科学与智能技术卓越创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier’s drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.
zh
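
DFT* 的三个策略可以用几行 PyTorch 表达。下面是一个示意性实现(以 1x1 卷积头代表分类器、以占位卷积代表 PVM 主干,均为说明性假设,非官方代码):

```python
import torch
import torch.nn as nn

class DFTStarSegmenter(nn.Module):
    """示意:DFT* 的三个要点:冻结 PVM 主干、冻结已学分类器、预分配未来分类器。"""
    def __init__(self, backbone, feat_dim, total_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False             # 要点 1:冻结预训练主干
        # 要点 3:为全部(含未来任务的)类别一次性预分配 1x1 卷积分类头
        self.classifiers = nn.ModuleList(
            [nn.Conv2d(feat_dim, 1, kernel_size=1) for _ in range(total_classes)])
        self.learned = set()

    def start_task(self, new_class_ids):
        for i in self.learned:                   # 要点 2:冻结先前任务已学的分类器
            for p in self.classifiers[i].parameters():
                p.requires_grad = False
        self.learned.update(new_class_ids)

    def forward(self, x):
        feat = self.backbone(x)                  # (B, C, H, W) 特征图
        return torch.cat([head(feat) for head in self.classifiers], dim=1)

backbone = nn.Conv2d(3, 64, 3, padding=1)        # 占位主干,实际应为 ResNet/Swin 等 PVM
model = DFTStarSegmenter(backbone, feat_dim=64, total_classes=21)
model.start_task([0, 1, 2])                      # 第一个任务:学习类别 0-2
print(model(torch.randn(2, 3, 64, 64)).shape)    # torch.Size([2, 21, 64, 64])
```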

[CV-61] Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark

【速读】:该论文旨在解决当前预训练大模型(如Segment Anything Model, SAM)在车辆部件细粒度分割任务中的局限性问题,具体表现为:SAM缺乏文本提示分割功能的公开访问权限,且默认模式生成的掩码区域缺少语义标签,难以满足结构化、类别特定的分割需求。为此,作者提出SAV框架,其核心创新在于三个关键组件:一是基于SAM的编码器-解码器结构以利用通用分割能力;二是引入车辆部件知识图谱(Vehicle Part Knowledge Graph),通过结构化本体显式建模部件间的空间与几何关系,嵌入先验结构知识;三是设计上下文样本检索编码模块,从训练集中识别视觉相似的车辆实例,提供丰富的上下文先验以增强泛化性能。该方案有效提升了车辆部件分割的准确性与鲁棒性。

链接: https://arxiv.org/abs/2508.04260
作者: Xiao Wang,Ziwen Wang,Wentao Wu,Anjie Wang,Jiashu Wu,Yantao Pan,Chenglong Li
机构: Anhui University (安徽大学); Chery (奇瑞)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison. Both the dataset and source code of this paper will be released on this https URL.
zh

[CV-62] From eye to AI: studying rodent social behavior in the era of machine Learning

【速读】:该论文旨在解决传统啮齿类动物社会行为研究中依赖人工观察所带来的主观偏差及难以捕捉复杂社交互动的问题。其解决方案的关键在于整合人工智能(AI)、计算机视觉与行为学(ethology)和神经科学的方法,通过计算模型对行为进行更细致、客观的量化分析,从而提升社会神经科学领域的研究深度与准确性。文中还系统梳理了当前可用工具及其优劣,并提出针对常见技术障碍的实用对策,以帮助青年研究人员有效采用这些先进方法,并推动相关工具在科研应用中的持续优化。

链接: https://arxiv.org/abs/2508.04255
作者: Giuseppe Chindemi,Camilla Bellone,Benoit Girard
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: 28 pages, 7 figures, 4 tables, methodological review

点击查看摘要

Abstract:The study of rodent social behavior has shifted in the last years from relying on direct human observation to more nuanced approaches integrating computational methods in artificial intelligence (AI) and machine learning. While conventional approaches introduce bias and can fail to capture the complexity of rodent social interactions, modern approaches bridging computer vision, ethology and neuroscience provide more multifaceted insights into behavior which are particularly relevant to social neuroscience. Despite these benefits, the integration of AI into social behavior research also poses several challenges. Here we discuss the main steps involved and the tools available for analyzing rodent social behavior, examining their advantages and limitations. Additionally, we suggest practical solutions to address common hurdles, aiming to guide young researchers in adopting these methods and to stimulate further discussion among experts regarding the evolving requirements of these tools in scientific applications.
zh

[CV-63] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

【速读】:该论文旨在解决图像拼接(Image Stitching)中因场景深度变化和较大相机基线导致的显著视差(Parallax)问题,此类情况会使得传统拼接方法难以生成无缝、几何一致的全景图像。其解决方案的关键在于提出了一种名为PIS3R的新方法,该方法基于深度三维重建(Deep 3D Reconstruction)思想:首先利用基于视觉几何约束的Transformer网络从两张存在大视差的输入图像中联合估计相机的内参(Intrinsic Parameters)、外参(Extrinsic Parameters)及稠密三维场景结构;随后通过重投影重建的稠密点云至参考视图实现像素级对齐,生成初始拼接结果;最后引入一种点条件图像扩散模块(Point-conditioned Image Diffusion Module)对初始结果进行优化,有效消除孔洞或噪声等伪影,从而在保持全部像素几何完整性的前提下,为后续三维视觉任务(如SfM)提供可直接使用的高质量拼接图像。

链接: https://arxiv.org/abs/2508.04236
作者: Muhua Zhu,Xinhao Jin,Chengbo Wang,Yongcong Zhang,Yifei Xue,Tie Ji,Yizhen Lao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image stitching aims to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs, meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax based on the novel concept of deep 3D reconstruction. First, we apply a visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject the reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined result. Compared with existing methods, our solution is tolerant to very large parallax and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.
zh

[CV-64] A machine learning approach for image classification in synthetic aperture RADAR

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中地表目标的识别与分类问题,具体包括几何形状分类(如物体形状识别)和环境类型分类(如冰类识别)。其解决方案的关键在于采用卷积神经网络(Convolutional Neural Networks, CNNs)对SAR数据进行处理,分别基于模拟的SAR数据和从该数据重建的图像进行分类,并对比两种方法的效果;同时考察不同天线高度对分类性能的影响。实验表明,CNN在两类任务中均实现了≥75%的高分类准确率,验证了其在利用SAR数据进行几何与环境分类任务中的有效性。

链接: https://arxiv.org/abs/2508.04234
作者: Romina Gaburro,Patrick Healy,Shraddha Naidu,Clifford Nolan
机构: University of Limerick (利默里克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 22 pages

点击查看摘要

Abstract:We consider the problem in Synthetic Aperture RADAR (SAR) of identifying and classifying objects located on the ground by means of Convolutional Neural Networks (CNNs). Specifically, we adopt a single scattering approximation to classify the shape of the object using both simulated SAR data and reconstructed images from this data, and we compare the success of these approaches. We then identify ice types in real SAR imagery from the satellite Sentinel-1. In both experiments we achieve a promising high classification accuracy (≥ 75%). Our results demonstrate the effectiveness of CNNs in using SAR data for both geometric and environmental classification tasks. Our investigation also explores the effect of SAR data acquisition at different antenna heights on our ability to classify objects successfully.
zh

[CV-65] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

【速读】:该论文旨在解决文档图像分类模型决策过程缺乏透明性与可解释性的问题,尤其是在高风险应用场景中,模型可能因偏见或虚假相关性导致严重后果。现有方法虽尝试通过特征重要性图(feature-importance maps)解释模型决策,但这些图难以理解且无法揭示模型学习到的全局特征。为填补这一研究空白,作者提出DocVCE——一种基于潜在扩散模型(latent diffusion models)结合分类器引导(classifier guidance)生成可解释的文档反事实样本(document counterfactuals)的新方法。其关键创新在于:首先生成分布内(in-distribution)的视觉反事实解释,再通过分层块级(patch-wise)精细化优化,找到最接近目标真实图像的反事实样本,从而提供具有行动意义的解释。

链接: https://arxiv.org/abs/2508.04233
作者: Saifullah Saifullah,Stefan Agne,Andreas Dengel,Sheraz Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model’s decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets – RVL-CDIP, Tobacco3482, and DocLayNet – and 3 different models – ResNet, ConvNeXt, and DiT – using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors’ knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
zh
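
分类器引导(classifier guidance)是该方法依赖的标准技术:在去噪方向上叠加目标类别对数概率关于潜变量的梯度。下面给出一个省略了噪声系数缩放的简化 PyTorch 示意(unet 与 classifier 的接口均为占位假设,真实 DocVCE 流程还包含分层块级精细化):

```python
import torch

def classifier_guided_eps(z, t, unet, classifier, target, scale=2.0):
    """示意:潜空间扩散中的分类器引导一步(简化版,假设性接口)。
    unet(z, t) 预测噪声;classifier(z) 输出类别 logits。"""
    z = z.detach().requires_grad_(True)
    logp = torch.log_softmax(classifier(z), dim=-1)
    sel = logp[torch.arange(z.size(0)), target]
    grad = torch.autograd.grad(sel.sum(), z)[0]  # 即 grad_z log p(y|z)
    eps = unet(z, t)
    return eps - scale * grad                     # 把去噪方向推向目标(反事实)类别

# 玩具组件:随机噪声预测器 + 线性分类器,仅用于演示该步可运行
unet = lambda z, t: torch.randn_like(z)
W = torch.randn(4 * 8 * 8, 10)
classifier = lambda z: z.flatten(1) @ W
z = torch.randn(2, 4, 8, 8)                       # (B, C, H, W) 潜变量
eps = classifier_guided_eps(z, t=500, unet=unet, classifier=classifier,
                            target=torch.tensor([3, 7]))
print(eps.shape)  # torch.Size([2, 4, 8, 8])
```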

[CV-66] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction ITSC

【速读】:该论文旨在解决人群轨迹预测中因人类行为固有的多模态性和不确定性而导致的准确率难题,尤其针对现有基于扩散模型的方法未能显式建模行人运动意图、从而影响预测可解释性与精度的问题。其解决方案的关键在于提出一种融合行人运动意图的扩散模型框架,将运动意图分解为横向(lateral)和纵向(longitudinal)分量,并引入专门的意图识别模块以有效捕捉这些特征;同时设计了一种高效的引导机制,提升生成轨迹的可解释性与合理性。

链接: https://arxiv.org/abs/2508.04229
作者: Yu Liu,Zhijie Liu,Xiao Ren,You-Fu Li,He Kong
机构: Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology (广东省全驱动系统控制理论与技术重点实验室); Southern University of Science and Technology (南方科技大学); Department of Mechanical Engineering, City University of Hong Kong (香港城市大学机械工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be presented at the 28th IEEE International Conference on Intelligent Transportation Systems (ITSC), 2025

点击查看摘要

Abstract:Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians’ motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.
zh

[CV-67] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation KR

【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)生成中多对象运动轨迹控制难题,尤其是在多个运动物体存在时,现有方法往往因语义冲突导致轨迹相交区域性能显著下降。其解决方案的关键在于提出LayerT2V,一种基于分层合成的视频生成方法,通过将背景与前景对象逐层生成并分别置于独立“层”上,实现多对象的灵活集成与协同控制,从而提升复杂多对象场景下的生成一致性与可控性。

链接: https://arxiv.org/abs/2508.04228
作者: Kangrui Cen,Baixuan Zhao,Yi Xin,Siqi Luo,Guangtao Zhai,Xiaohong Liu
机构: Shanghai Jiao Tong University (上海交通大学); Nanjing University (南京大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct “layer” and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at this https URL .
zh

[CV-68] Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在持续学习(Continual Learning, CL)过程中面临的独特挑战,包括跨模态特征漂移(cross-modal feature drift)、由于共享架构导致的参数干扰(parameter interference)以及零样本能力退化(zero-shot capability erosion)。其解决方案的关键在于提出一个以问题驱动的分类体系,将现有方法归纳为三类:(1) 多模态回放策略(Multi-Modal Replay Strategies),通过显式或隐式记忆机制缓解跨模态漂移;(2) 跨模态正则化(Cross-Modal Regularization),在更新过程中保持模态间对齐;(3) 参数高效适应(Parameter-Efficient Adaptation),利用模块化或低秩更新减少参数干扰。该分类体系不仅系统梳理了当前VLM持续学习的研究路径,也为未来构建具备长期泛化能力的视觉-语言系统提供了诊断性参考。

链接: https://arxiv.org/abs/2508.04227
作者: Yuyang Liu,Qiuhe Hong,Linlan Huang,Alexandra Gomez-Villa,Dipam Goswami,Xialei Liu,Joost van de Weijer,Yonghong Tian
机构: Peking University (北京大学); Nankai University (南开大学); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: this https URL.
zh

[CV-69] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition

【速读】:该论文旨在解决单目视频中动态三维场景重建的难题,即如何从有限观测中联合推断运动、结构与外观。现有基于高斯点绘(Gaussian Splatting)的方法常将静态与动态元素混杂在同一表征中,导致运动泄露(motion leakage)、几何畸变和时间闪烁等问题。其根本原因在于几何与外观在时间维度上的耦合建模,损害了重建的稳定性和可解释性。解决方案的关键在于提出SplitGaussian框架,通过显式分解场景表示为静态与动态两个独立分支,解耦运动建模与背景几何,并仅允许动态分支随时间变形,从而避免静态区域出现运动伪影,同时支持视图和时间相关的外观优化,显著提升时序一致性、重建精度及收敛速度。

链接: https://arxiv.org/abs/2508.04224
作者: Jiahui Li,Shengeng Tang,Jingxuan He,Gang Huang,Zhangye Wang,Yantao Pan,Lechao Cheng
机构: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic 3D scenes from monocular video remains fundamentally challenging due to the need to jointly infer motion, structure, and appearance from limited observations. Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortions, and temporal flickering. We identify that the root cause lies in the coupled modeling of geometry and appearance across time, which hampers both stability and interpretability. To address this, we propose SplitGaussian, a novel framework that explicitly decomposes scene representations into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, our method prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. This disentangled design not only enhances temporal consistency and reconstruction fidelity but also accelerates convergence. Extensive experiments demonstrate that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.
zh

[CV-70] What Holds Back Open-Vocabulary Segmentation? ICCV25

【速读】:该论文旨在解决开放词汇(Open-vocabulary)分割模型在识别训练数据中未包含概念时性能受限的问题,即标准分割设置无法泛化到训练类别之外的语义理解任务。其解决方案的关键在于提出新颖的“oracle组件”,通过利用真实标签(groundtruth)信息来识别并解耦导致性能瓶颈的多个因素,从而揭示开放词汇模型失效的根本原因,并为未来研究提供可操作的改进方向。

链接: https://arxiv.org/abs/2508.04211
作者: Josip Šarić,Ivan Martinović,Matej Kristan,Siniša Šegvić
机构: University of Ljubljana (卢布尔雅那大学); University of Zagreb (萨格勒布大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at ICCV 25 Workshop: What is Next in Multimodal Foundation Models?

点击查看摘要

Abstract:Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of the groundtruth information. The presented validation experiments deliver important empirical findings that provide a deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock the future research.
zh

[CV-71] Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification

【速读】:该论文旨在解决医学疾病诊断中因小病灶误诊导致的挑战,尤其是医学影像与电子健康记录(Electronic Health Record, EHR)数据在维度上的差异对有效特征对齐与融合造成的困难。解决方案的关键在于提出一种多模态多尺度交叉注意力融合网络(Multimodal Multiscale Cross-Attention Fusion Network, MMCAF-Net),该模型结合特征金字塔结构与高效的3D多尺度卷积注意力模块,从3D医学图像中提取病灶特异性特征,并引入多尺度交叉注意力机制以解决模态间维度不一致问题,从而实现更有效的跨模态特征融合,显著提升了诊断准确率。

链接: https://arxiv.org/abs/2508.04205
作者: Jianxun Yu,Ruiquan Ge,Zhipeng Wang,Cheng Yang,Chenyu Lin,Xianjun Fu,Jikui Liu,Ahmed Elazab,Changmiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at this https URL
zh
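
跨模态维度不一致的常见处理方式是先把两种模态投影到统一维度、再做交叉注意力。下面是单尺度版本的 PyTorch 示意(以 EHR 特征为查询、影像 token 为键/值;维度与头数均为假设,论文实际为多尺度结构):

```python
import torch
import torch.nn as nn

class ImageEHRCrossAttention(nn.Module):
    """示意:影像与 EHR 的单尺度交叉注意力融合(假设性实现)。"""
    def __init__(self, img_dim, ehr_dim, dim=128, heads=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)  # 两种模态投影到同一维度
        self.ehr_proj = nn.Linear(ehr_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, ehr_feat):
        # img_tokens: (B, N, img_dim) 展平的 3D 影像特征; ehr_feat: (B, ehr_dim)
        q = self.ehr_proj(ehr_feat).unsqueeze(1)  # (B, 1, dim) EHR 作为查询
        kv = self.img_proj(img_tokens)            # (B, N, dim) 影像作为键/值
        fused, _ = self.attn(q, kv, kv)
        return fused.squeeze(1)                   # (B, dim) 融合后的诊断特征

m = ImageEHRCrossAttention(img_dim=256, ehr_dim=32)
out = m(torch.randn(2, 512, 256), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 128])
```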

[CV-72] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

【速读】:该论文旨在解决视觉语言模型(Visual-Language Model, VLM)在推理过程中出现的假阳性(False Positive, FP)问题,即模型虽然得出正确答案但推理路径错误,导致推理不可靠。现有方法依赖特定多步推理数据集和强化学习策略,存在训练成本高、泛化能力差的问题。解决方案的关键在于提出一个通用框架ViFP,其核心机制包括:1)基于视觉推理的核心维度(如目标定位、特征描述、目标发现)构建子问题模板,减少对特定数据集的依赖;2)通过多轮问答(multi-turn QA)生成有效的推理链;3)动态分析推理路径一致性以识别潜在FP,并引入自适应Chain-of-Thought(CoT)机制,针对性引导FP与非FP样本;4)设计可靠性评估指标VoC,量化整合答案准确率与FP率,从而实现推理准确性与逻辑可靠性的协同提升。实验表明,ViFP在多个闭源VLM上显著降低FP率并提升准确率,验证了其有效性。

链接: https://arxiv.org/abs/2508.04201
作者: Ben Zhang,LuLu Yu,Lei Gao,Jing Liu,QuanJiang Guo,Hui Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In visual-language model (VLM) reasoning, false positive (FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods rely on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of the reasoning path to identify potential FPs, and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples, thereby reducing logical errors in the reasoning path while preserving accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly, but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.
zh

[CV-73] Bootstrap Deep Spectral Clustering with Optimal Transport

【速读】:该论文旨在解决传统谱聚类(Spectral Clustering)中存在的两个关键问题:一是各阶段优化过程相互独立、缺乏联合优化机制,二是表示能力受限导致聚类性能不佳。解决方案的核心在于提出一种端到端的深度谱聚类模型 BootSC,通过单一神经网络联合学习谱聚类的三个核心步骤——亲和矩阵构建、谱嵌入(Spectral Embedding)和 k-均值聚类(k-means Clustering)。BootSC 利用基于最优传输(Optimal Transport)的有效且高效的监督信号来引导亲和矩阵和聚类分配矩阵的初始化与优化,并引入语义一致的正交重参数化技术对谱嵌入进行正交约束,显著提升嵌入表示的判别能力,从而在多个数据集上实现当前最优的聚类性能。

链接: https://arxiv.org/abs/2508.04200
作者: Wengang Guo,Wei Ye,Chunchun Chen,Xin Sun,Christian Böhm,Claudia Plant,Susanto Rahardja
机构: Tongji University (同济大学); Shanghai Institute of Intelligent Science and Technology, Tongji University (同济大学智能科学与技术研究院); Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University (同济大学智能 autonomous 系统研究院); City University of Macau (澳门城市大学); University of Vienna (维也纳大学); Singapore Institute of Technology (新加坡科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering – affinity matrix construction, spectral embedding, and k-means clustering – using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at this https URL.
zh
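
基于最优传输的自举监督常用 Sinkhorn-Knopp 迭代把相似度矩阵变成行归一、列均衡的软分配,作为伪目标。下面是一个 PyTorch 示意(做法类似 SwAV;是否与 BootSC 的具体推导一致属于假设):

```python
import torch

def sinkhorn(scores, eps=0.05, n_iters=3):
    """示意:Sinkhorn-Knopp 迭代,将 (B, K) 的样本-簇相似度
    变为行和为 1、列近似均衡的软分配矩阵(假设性实现)。"""
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    B, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K  # 列归一:各簇规模均衡
        Q = Q / Q.sum(dim=1, keepdim=True) / B  # 行归一:每个样本一个分布
    return Q * B                                 # 使每行之和为 1

scores = torch.randn(8, 4)
Q = sinkhorn(scores)
print(Q.sum(dim=1))  # 每行 ≈ 1
print(Q.sum(dim=0))  # 每列 ≈ B/K = 2,体现簇规模均衡约束
```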

[CV-74] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective ACM-MM

【速读】:该论文旨在解决视频文本视觉问答(Video TextVQA)任务中现有帧级框架存在的冗余文本实体和隐式关系建模问题,从而提升准确率与推理效率。其解决方案的关键在于提出一种面向实例的新型模型GAT(Gather and Trace),通过两个核心模块实现:首先,设计上下文聚合的实例收集模块(context-aggregated instance gathering module),将相关实体的视觉特征、布局特性及文本内容统一为结构化表示;其次,引入实例聚焦的轨迹追踪模块(instance-focused trajectory tracing module),建立视频流中文本实例间的时空关联并推断最终答案。此方法显著提升了模型在多个公开数据集上的性能与推理速度。

链接: https://arxiv.org/abs/2508.04197
作者: Yan Zhang,Gangyan Zeng,Daiqing Wu,Huawen Shen,Binbin Li,Yu Zhou,Can Ma,Xiaojun Bi
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); School of Cyber Science and Engineering, Nanjing University of Science and Technology (南京理工大学网络科学与工程学院); College of Computer Science, Nankai University (南开大学计算机学院); Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China (教育部民族语言智能分析与安全治理重点实验室,中央民族大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by 2025 ACM MM

点击查看摘要

Abstract:Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework, which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at this https URL.
zh

[CV-75] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models

【速读】:该论文旨在解决生物医学多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练数据中可能包含隐私信息或错误知识而导致的隐私泄露与输出不安全问题。传统方法通过重新处理训练集并从头训练模型虽可行,但计算成本过高,难以应用于大规模模型。为此,论文提出了一种基于新型数据生成管道的首个基准测试集——多模态大语言模型生物医学去学习基准(MLLMU-Med),该管道可有效将合成隐私数据和事实性错误注入训练集,从而模拟两类关键场景:1)隐私保护场景,即患者敏感信息被误纳入训练集导致推理时泄露;2)错误知识清除场景,即来自不可靠来源的错误知识嵌入模型引发不安全响应。解决方案的关键在于引入一种新的“去学习效率评分”(Unlearning Efficiency Score),能够量化评估不同子集上的整体去学习性能,并在此基础上系统评测了五种去学习方法,发现现有技术在移除生物医学MLLMs中的有害知识方面效果有限,揭示了该领域亟待改进的研究空间。

链接: https://arxiv.org/abs/2508.04192
作者: Dunyuan Xu,Xikai Yang,Yaoqian Li,Jinpeng Li,Pheng-Ann Heng
机构: 1. The Chinese University of Hong Kong (香港中文大学); 2. The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet, this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem, which avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, there exist no available datasets to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med) built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.
zh

[CV-76] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

【速读】:该论文旨在解决传统鲁棒主成分分析(Robust Principal Component Analysis, RPCA)在实际应用中面临的三大问题:计算负担重、对超参数敏感以及先验假设僵化导致的动态场景适应性差。其解决方案的关键在于提出RPCANet++,一个将RPCA的可解释性与深度网络高效架构融合的稀疏目标分割框架。该框架通过结构化设计将松弛后的RPCA模型展开为三个模块:背景近似模块(Background Approximation Module, BAM)、对象提取模块(Object Extraction Module, OEM)和图像恢复模块(Image Restoration Module, IRM),并引入记忆增强模块(Memory-Augmented Module, MAM)以缓解BAM阶段特征传输损失,以及深度对比先验模块(Deep Contrast Prior Module, DCPM)利用显著性线索加速对象提取,从而在保持理论优势的同时显著提升效率与适应性。

链接: https://arxiv.org/abs/2508.04190
作者: Fengyi Wu,Yimian Dai,Tianfang Zhang,Yixuan Ding,Jian Yang,Ming-Ming Cheng,Zhenming Peng
机构: University of Electronic Science and Technology of China (电子科技大学); Nankai University (南开大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Webpage: this https URL

点击查看摘要

Abstract:Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage this https URL.
zh
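RPCANet++所"展开"(unfold)的对象是经典RPCA分解 D ≈ L(低秩背景)+ S(稀疏目标)。作为背景知识,下面用NumPy给出一个朴素的交替近端迭代示意;步长mu与正则系数lam取常用启发式数值,属假设,并非论文的网络化实现:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_decompose(D, lam=None, n_iters=100):
    """交替最小化 mu*||L||_* + lam*mu*||S||_1 + 0.5*||D-L-S||_F^2 的示意实现。"""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    L, S = np.zeros_like(D), np.zeros_like(D)
    mu = 0.25 * np.abs(D).mean()    # 步长启发式(假设)
    for _ in range(n_iters):
        # 低秩步:对 D-S 做奇异值软阈值(SVT)
        U, sig, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U * soft_threshold(sig, mu)) @ Vt
        # 稀疏步:对残差做逐元素软阈值
        S = soft_threshold(D - L, lam * mu)
    return L, S
```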

[CV-77] Deeper Inside Deep ViT

【速读】:该论文旨在解决大规模视觉模型(如ViT-22B)在实际训练中的稳定性问题及其在图像生成任务中的适用性不足。其关键解决方案在于:首先,通过改进模型结构以增强训练稳定性,使从头训练的ViT-22B在相同参数规模下优于原始ViT;其次,提出一种基于ViT的图像生成架构,并对比分析ViT与ViT-22B在图像生成任务中的性能表现,从而确定更适合图像生成任务的模型结构。

链接: https://arxiv.org/abs/2508.04181
作者: Sungrae Hong
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:There have been attempts to create large-scale structures in vision models similar to LLMs, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure reacts and trains in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, overall outperformed ViT under the same parameter size. Additionally, we venture into the task of image generation, which has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is a more suitable structure for image generation.
zh
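作为背景,ViT-22B原论文报告的两项关键结构改动是并行的Attention/MLP分支与QK归一化;本文作者具体采用了哪些稳定化修改,摘要并未展开。下面的PyTorch示意按上述公开结构还原,超参与细节均为假设:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """ViT-22B 风格的 Transformer block:注意力与 MLP 并行作用于同一归一化输入,
    并对 query/key 先做 LayerNorm 以抑制注意力 logit 爆炸。
    注意:此处 QK 归一化作用在投影之前,是对原设计(投影后归一化)的近似。"""
    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        h = self.norm(x)
        attn_out, _ = self.attn(self.q_norm(h), self.k_norm(h), h)
        return x + attn_out + self.mlp(h)   # 两个残差分支并行相加
```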

[CV-78] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement

【速读】:该论文旨在解决低光照图像增强中因特征表示内在不确定性导致的模型可靠性下降与因果推理失效问题,尤其在极端暗光条件下,梯度消失和噪声主导严重影响了增强效果。解决方案的关键在于提出U2CLLIE框架,其核心创新为:(1) 引入基于熵的不确定性感知机制,设计Uncertainty-Aware Dual-domain Denoise (UaD)模块,通过Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF)实现频域去噪与熵驱动表征优化,提升空间纹理提取能力并抑制噪声;(2) 构建分层因果感知架构,结合Luminance Enhancement Network (LEN)进行粗粒度亮度增强,并在编码-解码阶段引入Neighborhood Correlation State Space (NeCo)与Adaptive Spatial-Color Calibration (AsC)两个不对称因果关联建模模块,协同构建层次化因果约束,从而重建邻域结构并强化特征空间中的颜色一致性,显著提升模型鲁棒性与泛化性能。

链接: https://arxiv.org/abs/2508.04176
作者: Jin Kuang,Dong Liu,Yukuang Zhang,Shengsheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing low-light image enhancement approaches primarily focus on architectural innovations, while often overlooking the intrinsic uncertainty within feature representations, particularly under extremely dark conditions where degraded gradients and noise dominance severely impair model reliability and causal reasoning. To address these issues, we propose U2CLLIE, a novel framework that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling. From the perspective of entropy-based uncertainty, our framework introduces two key components: (1) An Uncertainty-Aware Dual-domain Denoise (UaD) Module, which leverages Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF) to suppress frequency-domain noise and optimize entropy-driven representations. This module enhances spatial texture extraction and frequency-domain noise suppression/structure refinement, effectively mitigating gradient vanishing and noise dominance. (2) A hierarchical causality-aware framework, where a Luminance Enhancement Network (LEN) first performs coarse brightness enhancement on dark regions. Then, during the encoder-decoder phase, two asymmetric causal correlation modeling modules, Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC), collaboratively construct hierarchical causal constraints. These modules reconstruct and reinforce neighborhood structure and color consistency in the feature space. Extensive experiments demonstrate that U2CLLIE achieves state-of-the-art performance across multiple benchmark datasets, exhibiting robust performance and strong generalization across various scenes.
zh

[CV-79] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在专业化异常检测(Anomaly Detection, AD)任务中因领域适应(domain adaptation)困难而导致性能受限的问题。现有基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的方法存在两大瓶颈:一是当模型产生一致响应时训练数据利用率低;二是缺乏对推理过程的充分监督,导致模型倾向于做出即时二分类决策而忽视深度分析。其解决方案的关键在于两个协同创新:一是提出多阶段反思式推理机制(multi-stage deliberative reasoning process),引导模型从区域识别逐步过渡到聚焦检查,生成多样化的响应模式以支持GRPO优化,并实现对分析流程的结构化监督;二是设计细粒度奖励机制(fine-grained reward mechanism),融合分类准确率与定位监督信号,将二元反馈转化为连续信号,从而区分真实的分析洞见与表面正确的偶然结果。该框架显著提升了通用视觉-语言模型在工业异常检测场景下的适应效率和精度。

链接: https://arxiv.org/abs/2508.04175
作者: Jingyi Liao,Yongyi Su,Rong-Cheng Tu,Zhao Jin,Wenhao Sun,Yiting Li,Dacheng Tao,Xun Xu,Xulei Yang
机构: 1. Hong Kong Polytechnic University (香港理工大学); 2. Alibaba Group (阿里巴巴集团); 3. Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.
zh
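摘要中"细粒度奖励+GRPO"的要点是:把二值反馈变为连续信号(分类正确性+定位IoU),并在同一输入采样出的一组回答内做相对归一化、无需价值网络。下面是一个最小示意;权重w_cls、w_loc为假设值,且要求组大小G>1:

```python
import torch

def grpo_advantages(cls_correct: torch.Tensor, loc_iou: torch.Tensor,
                    w_cls: float = 1.0, w_loc: float = 1.0) -> torch.Tensor:
    """对一张图采样出的 G 个回答计算组相对优势。
    cls_correct: (G,),取值 {0,1};loc_iou: (G,),取值 [0,1]。"""
    rewards = w_cls * cls_correct.float() + w_loc * loc_iou
    # 组内相对归一化:以组均值为基线,标准差做尺度
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```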

[CV-80] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

【速读】:该论文旨在解决流媒体人脸视频在传输或存储过程中因压缩、模糊或低分辨率等复杂退化导致的质量下降问题,尤其关注视觉与音频特征之间内在关联的利用不足,尤其是在嘴部区域。其解决方案的关键在于提出一种通用的音频辅助人脸视频恢复网络(General Audio-assisted face Video restoration Network, GAVN),通过身份特征(identity features)和时序特征(temporal features)的互补学习机制实现多类型退化的联合恢复:首先在低分辨率空间中捕获帧间时序信息以粗略重建;随后在高分辨率空间中结合音频信号与人脸关键点提取帧内身份特征以增强面部细节;最终通过重构模块融合两类特征生成高质量人脸视频。

链接: https://arxiv.org/abs/2508.04161
作者: Yuqin Cao,Yixuan Gao,Wei Sun,Xiaohong Liu,Yulun Zhang,Xiongkuo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.
zh

[CV-81] DRIVE-T: A Methodology for Discriminative and Representative Data Viz Item Selection for Literacy Construct and Assessment

【速读】:该论文试图解决数据可视化素养(Data Visualization Literacy)测量工具设计中因难度层级刻画不充分而导致的测量表达力受限问题,即现有评估测试在设计和复用过程中难以准确反映不同难度层次的能力差异。其解决方案的关键在于提出并验证了DRIVE-T(Discriminating and Representative Items for Validating Expressive Tests)方法论,该方法通过三个步骤实现:首先对基于任务的数据可视化项目进行标签化;其次由独立评分者对项目难度进行评级;最后利用多facet Rasch测量模型分析评分者的原始分数,从而识别出具有区分度与代表性、可有序排列为多facet难度层级的任务项。此方法实现了从数据可视化内容出发,基于符号学(语法、语义、语用)理论构建可操作化的测量构念,并在后设计阶段以归纳方式支持形成性测量构念的 emergent(涌现)过程,有效提升了测量工具的表达力与可复用性。

链接: https://arxiv.org/abs/2508.04160
作者: Angela Locoro,Silvia Golia,Davide Falessi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The underspecification of progressive levels of difficulty in measurement construct design and assessment tests for data visualization literacy may hinder the expressivity of measurements in both test design and test reuse. To mitigate this problem, this paper proposes DRIVE-T (Discriminating and Representative Items for Validating Expressive Tests), a methodology designed to drive the construction and evaluation of assessment items. Given a data visualization, DRIVE-T supports the identification of task-based items' discriminability and representativeness for measuring levels of data visualization literacy. DRIVE-T consists of three steps: (1) tagging task-based items associated with a set of data visualizations; (2) rating them by independent raters for their difficulty; (3) analysing raters' raw scores through a Many-Facet Rasch Measurement model. In this way, we can observe the emergence of difficulty levels of the measurement construct, derived from the discriminability and representativeness of task-based items for each data visualization, ordered into Many-Facets construct levels. In this study, we show and apply each step of the methodology to an item bank, which models the difficulty levels of a measurement construct approximating a latent construct for data visualization literacy. This measurement construct is drawn from semiotics, i.e., based on the syntax, semantics and pragmatics knowledge that each data visualization may require to be mastered by people. The DRIVE-T methodology operationalises an inductive approach, observable in a post-design phase of the items preparation, for formative-style and practice-based measurement construct emergence. A pilot study with items selected through the application of DRIVE-T is also presented to test our approach.
zh
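摘要提到用多侧面Rasch测量模型(Many-Facet Rasch Measurement, MFRM)分析评分者原始分。作为参考,经典MFRM的评分尺度形式如下;符号为该模型的通用约定,论文的具体参数化以原文为准:

```latex
\log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```

其中 B_n 为被测对象(此处可理解为某一数据可视化)的量度,D_i 为任务条目难度,C_j 为评分者严厉度,F_k 为从评分类别 k-1 跨到 k 的阈值。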

[CV-82] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

【速读】:该论文旨在解决预训练低秩适配(Low-Rank Adaptation, LoRA)模型在多任务适应中因权重冲突导致的灾难性遗忘问题,尤其是在长尾分布数据下难以实现良好泛化能力的挑战。现有方法通过分解权重矩阵进行融合,虽能共享相似参数,但易引发跨任务优化方向冲突,进而损害模型性能。其解决方案的关键在于提出一种基于元学习与上下文自适应协同的框架——In-Context Meta LoRA Fusion (ICM-Fusion),核心创新是引入任务向量算术(task vector arithmetic),通过学习的流形投影动态平衡不同域间的冲突优化方向,并利用自设计的融合变分自编码器(Fusion VAE, F-VAE)重构多任务LoRA,从而在潜在空间中获得最优任务向量方向,显著降低多任务损失并在少样本场景下实现任务增强。

链接: https://arxiv.org/abs/2508.04153
作者: Yihua Shao,Xiaofeng Lin,Xinwei Long,Siyu Chen,Minxi Yan,Yang Liu,Ziyang Yan,Ao Ma,Hao Tang,Jingcai Guo
机构: Hong Kong Polytechnic University (香港理工大学); Tsinghua University (清华大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学); Peking University (北京大学); Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to the current pre-trained LoRA fusion method, ICM-Fusion fused LoRA can significantly reduce the multi-tasking loss and can even achieve task enhancement in few-shot scenarios.
zh
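摘要中的"任务向量算术"沿用task vector的一般定义:微调权重减去基座权重;多任务融合即对若干任务向量加权求和。下面的Python示意只展示这一算术骨架——ICM-Fusion中系数/方向由元学习得到并经F-VAE重建,此处使用固定标量系数纯属演示:

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    """任务向量 = 微调后权重 - 基座权重(逐参数张量相减)。"""
    return {k: finetuned[k] - base[k] for k in base}

def fuse_task_vectors(base: dict, task_vectors: list, coeffs: list) -> dict:
    """按固定系数把多个任务向量叠加回基座权重,得到融合模型(示意)。"""
    fused = {k: v.clone() for k, v in base.items()}
    for tv, c in zip(task_vectors, coeffs):
        for k in fused:
            fused[k] = fused[k] + c * tv[k]
    return fused
```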

[CV-83] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

【速读】:该论文旨在解决现有RGB-D视频生成方法中RGB图像与深度图之间几何一致性不足、相机轨迹控制精度低的问题。传统方法通常将RGB和深度生成过程独立处理,导致帧间空间对齐不一致,难以实现精确的相机运动控制。其解决方案的关键在于提出IDC-Net(Image-Depth Consistency Network),这是一个统一的几何感知扩散模型,通过联合学习RGB图像与对应深度图的生成,强化帧间的空间与几何一致性;同时构建了一个具有度量对齐的RGB视频、深度图和准确相机位姿的多模态数据集,提供高保真几何监督,并引入几何感知Transformer模块以实现细粒度的相机控制,从而显著提升生成序列的视觉质量和几何一致性,且生成结果可直接用于下游三维场景重建任务。

链接: https://arxiv.org/abs/2508.04147
作者: Lijuan Liu,Wenfa Li,Dongbo Zhang,Shuo Wang,Shaohui Jiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly fed into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at this https URL.
zh

[CV-84] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

【速读】:该论文旨在解决少样本细粒度视觉分类(Few-shot Fine-grained Visual Classification, FGVC)中因数据稀缺导致的过拟合与泛化能力弱的问题。现有方法多依赖于微调预训练视觉语言模型(Vision-Language Models, VLMs),但难以在小样本场景下保持稳定性能。其解决方案的关键在于提出一种无需训练的通用框架UniFGVC,将FGVC任务重构为多模态检索问题:首先设计类别判别性视觉描述生成器(Category-Discriminative Visual Captioner, CDV-Captioner),利用多模态大语言模型(Multimodal Large Language Models, MLLMs)的开放世界知识,结合思维链提示(chain-of-thought prompting)和视觉相似参考图像,生成结构化且高区分度的文本描述,从而构建包含细粒度属性特征的图像-描述对;随后使用现成的视觉与文本编码器嵌入查询和模板对,在联合语义空间中通过最近邻检索完成分类。该方法不依赖特定模型微调,具备跨不同MLLMs和编码器的良好兼容性与泛化能力。

链接: https://arxiv.org/abs/2508.04136
作者: Hongyu Guo,Kuan Zhu,Xiangzhao Hao,Haiyun Guo,Ming Tang,Jinqiao Wang
机构: Beijing Jiaotong University (北京交通大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetuned the pre-trained visual language models to achieve performance gain, yet suffering from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.
zh
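UniFGVC的分类阶段本质上是一次多模态最近邻检索:查询图像与其MLLM生成的描述分别编码后融合,再与少样本模板库比较余弦相似度、取最近模板的类别。下面的PyTorch示意中,融合权重w_img为假设值:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_by_retrieval(img_emb, txt_emb, tmpl_img, tmpl_txt, tmpl_label,
                          w_img: float = 0.5) -> int:
    """img_emb/txt_emb: (D,) 查询嵌入;tmpl_*: (M, D) 模板嵌入;tmpl_label: (M,)。"""
    q = F.normalize(w_img * img_emb + (1 - w_img) * txt_emb, dim=-1)
    t = F.normalize(w_img * tmpl_img + (1 - w_img) * tmpl_txt, dim=-1)
    sims = t @ q                      # 与每个模板的余弦相似度
    return int(tmpl_label[sims.argmax()])
```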

[CV-85] DS2Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

【速读】:该论文旨在解决现有深度监督网络在医学图像分割中仅单独监督粗粒度语义特征或细粒度细节特征的问题,忽略了二者之间的关键关联性。解决方案的核心在于提出一种细节-语义深度监督网络(DS²Net),通过细节增强模块(Detail Enhance Module, DEM)和语义增强模块(Semantic Enhance Module, SEM)分别对低层细节特征和高层语义特征进行引导,生成对应的细节掩码和语义掩码以实现多视角特征监督;此外,引入基于不确定性的监督损失函数,依据不同尺度特征的不确定性自适应调整其监督强度,从而替代以往依赖启发式设计的固定监督策略,显著提升模型性能。

链接: https://arxiv.org/abs/2508.04131
作者: Zhaohong Huang,Yuxin Zhang,Mingbao Lin,Taojian Zhou,Guorong Cai,Rongrong Ji
机构: Xiamen University (厦门大学); Rakuten (乐天); Jimei University (集美大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which compromises the fact that these two types of features hold vital relationships in medical image analysis. We advocate the powers of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS²Net). DS²Net navigates both low-level detailed and high-level semantic feature supervision through Detail Enhance Module (DEM) and Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS²Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, or microscopy, we demonstrate that DS²Net consistently outperforms state-of-the-art methods for medical image analysis.
zh
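对摘要中"基于不确定性的监督损失",一个直观近似是:用每个尺度侧输出的平均熵衡量其不确定性,并据此缩放该尺度的监督强度。以下PyTorch示意仅表达这一思想;具体加权形式摘要未给出,属假设:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_ds_loss(side_outputs, target):
    """side_outputs: 各尺度 (B,1,H_s,W_s) 的 logit 列表;target: (B,1,H,W) 二值掩码。"""
    total = 0.0
    for logits in side_outputs:
        p = torch.sigmoid(logits)
        ent = -(p * (p + 1e-8).log() + (1 - p) * (1 - p + 1e-8).log())
        weight = (1.0 - ent.mean()).detach()   # 越确定的尺度,监督越强
        tgt = F.interpolate(target, size=logits.shape[-2:], mode="nearest")
        total = total + weight * F.binary_cross_entropy_with_logits(logits, tgt)
    return total
```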

[CV-86] SVC 2025: the First Multimodal Deception Detection Challenge ACM-MM2025

【速读】:该论文旨在解决当前欺骗检测(deception detection)模型在跨域场景下性能显著下降的问题,尤其是在多模态数据(如音频、视频和文本)中因领域偏移(domain shift)导致的泛化能力不足。其解决方案的关键在于提出SVC 2025多模态欺骗检测挑战赛这一新基准,要求参赛模型不仅在单一数据域内表现优异,还需具备跨多个异构数据集的良好泛化能力,从而推动可适应、可解释且实用的多模态欺骗检测系统的发展。

链接: https://arxiv.org/abs/2508.04129
作者: Xun Lin,Xiaobao Guo,Taorui Wang,Yingjie Ma,Jiajian Huang,Jiayu Zhang,Junzhe Cao,Zitong Yu
机构: Great Bay University(大湾区大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Workshop SVC of ACM MM 2025

点击查看摘要

Abstract:Deception detection is a critical task in real-world applications such as security screening, fraud prevention, and credibility assessment. While deep learning methods have shown promise in surpassing human-level performance, their effectiveness often depends on the availability of high-quality and diverse deception samples. Existing research predominantly focuses on single-domain scenarios, overlooking the significant performance degradation caused by domain shifts. To address this gap, we present the SVC 2025 Multimodal Deception Detection Challenge, a new benchmark designed to evaluate cross-domain generalization in audio-visual deception detection. Participants are required to develop models that not only perform well within individual domains but also generalize across multiple heterogeneous datasets. By leveraging multimodal data, including audio, video, and text, this challenge encourages the design of models capable of capturing subtle and implicit deceptive cues. Through this benchmark, we aim to foster the development of more adaptable, explainable, and practically deployable deception detection systems, advancing the broader field of multimodal learning. By the conclusion of the workshop competition, a total of 21 teams had submitted their final results. See this https URL for more information.
zh

[CV-87] Learning Using Privileged Information for Litter Detection

【速读】:该论文旨在解决全球范围内垃圾污染日益严重背景下,如何高效、准确地实现自动化垃圾检测的问题。其核心挑战在于小尺寸垃圾目标的识别以及被植被或石块部分遮挡物体的检测难题。解决方案的关键在于首次将特权信息(privileged information)与深度学习目标检测相结合,并提出一种将边界框(bounding box)信息编码为二值掩码(binary mask)的方法,以此作为额外指导信号输入检测模型,从而在不增加模型复杂度或额外层数的前提下,显著提升检测精度和泛化能力。

链接: https://arxiv.org/abs/2508.04124
作者: Matthias Bartolo,Konstantinos Makantasis,Dylan Seychell
机构: University of Malta (马耳他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Performance (cs.PF)
备注: This paper was accepted at the 13th European Workshop on Visual Information Processing (EUVIP 2025)

点击查看摘要

Abstract:As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. In addition, a key contribution of our work is a method for encoding bounding box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through experiments on both within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
zh
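论文的一个关键做法是把真值边界框编码为二值掩码,在训练期作为特权信息通道输入检测器。下面是一个与具体检测框架无关的NumPy示意,框格式假设为像素坐标(x1, y1, x2, y2):

```python
import numpy as np

def boxes_to_mask(boxes, height: int, width: int) -> np.ndarray:
    """把一组边界框栅格化为 (H, W) 的二值掩码,可叠加为额外输入通道。"""
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        mask[y1:y2, x1:x2] = 1.0
    return mask
```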

[CV-88] Excavate the potential of Single-Scale Features: A Decomposition Network for Water-Related Optical Image Enhancement

【速读】:该论文旨在解决水下图像增强(Underwater Image Enhancement, UIE)中因光吸收和散射效应导致的颜色失真、模糊和对比度低等问题,传统方法普遍依赖多尺度特征提取(Multi-Scale Feature Extraction, MSFE)机制进行多分辨率特征融合以提升重建质量。然而,作者通过大量实验发现,高质量图像重建并不必然依赖于多尺度特征融合,单尺度特征提取即可达到甚至超越多尺度方法的性能,从而显著降低模型复杂度。解决方案的关键在于提出一种新颖的单尺度分解网络(Single-Scale Decomposition Network, SSD-Net),其核心创新是引入不对称分解机制,将输入图像解耦为干净层(包含场景内在信息)与退化层(编码介质干扰信息);并通过两个关键模块实现高效特征解耦与互补融合:1)并行特征分解块(Parallel Feature Decomposition Block, PFDB),利用高效注意力操作和自适应稀疏Transformer实现双分支特征空间解耦;2)双向特征通信块(Bidirectional Feature Communication Block, BFCB),通过跨层残差交互促进互补特征挖掘与融合。该设计在保持特征分解独立性的同时建立动态跨层信息通道,有效增强了退化因素的解耦能力。

链接: https://arxiv.org/abs/2508.04123
作者: Zheng Cheng,Wenri Wang,Guangyong Chen,Yakun Ju,Yihua Cheng,Zhisong Liu,Yanda Meng,Jintao Song
机构: Qingdao University (青岛大学); Fuzhou University (福州大学); University of Leicester (莱斯特大学); University of Birmingham (伯明翰大学); Lappeenranta-Lahti University of Technology (拉彭兰塔-拉赫蒂理工大学); University of Exeter (埃克塞特大学); Shandong Technology and Business University (山东工商学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Underwater image enhancement (UIE) techniques aim to improve visual quality of images captured in aquatic environments by addressing degradation issues caused by light absorption and scattering effects, including color distortion, blurring, and low contrast. Current mainstream solutions predominantly employ multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction quality through multi-resolution feature fusion. However, our extensive experiments demonstrate that high-quality image reconstruction does not necessarily rely on multi-scale feature fusion. Contrary to popular belief, our experiments show that single-scale feature extraction alone can match or surpass the performance of multi-scale methods, significantly reducing complexity. To comprehensively explore single-scale feature potential in underwater enhancement, we propose an innovative Single-Scale Decomposition Network (SSD-Net). This architecture introduces an asymmetrical decomposition mechanism that disentangles input image into clean layer along with degradation layer. The former contains scene-intrinsic information and the latter encodes medium-induced interference. It uniquely combines CNN’s local feature extraction capabilities with Transformer’s global modeling strengths through two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing dual-branch feature space decoupling via efficient attention operations and adaptive sparse transformer; 2) Bidirectional Feature Communication Block (BFCB), enabling cross-layer residual interactions for complementary feature mining and fusion. This synergistic design preserves feature decomposition independence while establishing dynamic cross-layer information pathways, effectively enhancing degradation decoupling capacity.
zh

[CV-89] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation ICCV2025

【速读】:该论文旨在解决零样本实例分割(zero-shot instance segmentation)问题,即在不依赖目标域标注数据的情况下,准确识别并分割图像中的各个物体实例。其解决方案的关键在于提出了一种面向对象中心预测的扩散模型(object-centric diffusion model),称为OC-DiT,该模型通过在扩散过程的潜在空间中条件化于对象模板和图像特征,实现对物体实例的有效解耦。具体而言,模型包含两个变体:粗粒度生成器用于生成初始实例提案,精炼器则并行优化所有提案,从而在无需微调的情况下达到先进性能。

链接: https://arxiv.org/abs/2508.04122
作者: Maximilian Ulmer,Wout Boerdijk,Rudolph Triebel,Maximilian Durner
机构: German Aerospace Center (DLR); Karlsruhe Institute of Technology; Technical University of Munich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model’s latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.
zh

[CV-90] CLIPVehicle: A Unified Framework for Vision-based Vehicle Search

【速读】:该论文旨在解决车辆搜索任务中检测与重识别(Re-Identification, Re-ID)分离导致的资源消耗大、实用性差的问题,即现有方法需先独立完成车辆检测并存储所有候选区域,再进行重识别,流程冗余且效率低。其解决方案的关键在于提出一个统一框架CLIPVehicle,包含两个核心组件:一是双粒度语义区域对齐模块(dual-granularity semantic-region alignment module),利用视觉语言模型(Vision-Language Models, VLMs)建模车辆判别性特征;二是多层级车辆身份学习策略(multi-level vehicle identification learning strategy),从全局、实例和特征三个层次联合学习身份表示,从而在端到端系统中协同优化检测与重识别任务。

链接: https://arxiv.org/abs/2508.04120
作者: Likai Wang,Ruize Han,Xiangqun Zhang,Wei Feng
机构: Tianjin University (天津大学); Chinese Academy of Sciences (中国科学院); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vehicles are among the most common and significant objects in the real world, and research on them using computer vision technologies has made remarkable progress in tasks such as vehicle detection and vehicle re-identification. To search for a vehicle of interest in surveillance videos, existing methods first pre-detect and store all vehicle patches, and then apply vehicle re-identification models, which is resource-intensive and not very practical. In this work, we aim to achieve joint detection and re-identification for vehicle search. However, the conflicting objectives between detection that focuses on shared vehicle commonness and re-identification that focuses on individual vehicle uniqueness make it challenging for a model to learn in an end-to-end system. For this problem, we propose a new unified framework, namely CLIPVehicle, which contains a dual-granularity semantic-region alignment module to leverage the VLMs (Vision-Language Models) for vehicle discrimination modeling, and a multi-level vehicle identification learning strategy to learn the identity representation from global, instance and feature levels. We also construct a new benchmark, including a real-world dataset CityFlowVS, and two synthetic datasets SynVS-Day and SynVS-All, for vehicle search. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods of both vehicle Re-ID and person search tasks.
zh

[CV-91] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

【速读】:该论文旨在解决参考表达分割(Reference Expression Segmentation, RES)任务中,多模态大模型(Multimodal Large Models, MLLMs)在像素级密集预测能力不足的问题。现有方法要么依赖参数量庞大的Segment Anything Model(SAM),导致计算成本高;要么采用无SAM的轻量化方案,但牺牲了分割精度。其解决方案的关键在于提出一种名为MLLMSeg的新框架:首先充分利用MLLM视觉编码器中已有的视觉细节特征,无需引入额外视觉编码器;其次设计了一个细节增强且语义一致的特征融合模块(Detail-enhanced and Semantic-consistent Feature Fusion, DSFF),将视觉细节特征与LLM输出的语义特征深度融合;最后构建一个仅含34M参数的轻量级掩码解码器,高效整合空间细节与语义信息以实现精准掩码预测,从而在性能与计算成本之间取得更优平衡。

链接: https://arxiv.org/abs/2508.04107
作者: Jingchao Wang,Zhijian Wu,Dingjiang Huang,Yefeng Zheng,Hong Wang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Referring Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at this https URL.
zh

[CV-92] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models

【速读】:该论文旨在解决计算机视觉(Computer Vision, CV)模型性能评估中缺乏有效人类感知研究方法的问题。传统的人类感知实验往往依赖复杂、难以扩展的端到端系统,导致研究效率低下且难以规模化。其解决方案的关键在于提出ARCADE平台,该平台利用增强现实(Augmented Reality, AR)的丰富上下文和交互特性,支持跨平台AR数据采集、通过插件式模型推理实现自定义实验协议,并提供AR流媒体用于用户研究,从而显著提升CV模型的人类中心化评估效率与灵活性。

链接: https://arxiv.org/abs/2508.04102
作者: Ashkan Ganj,Yiqin Zhao,Tian Guo
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human perception studies can provide complementary insights to qualitative evaluation for understanding computer vision (CV) model performance. However, conducting human perception studies remains a non-trivial task: it often requires complex, end-to-end system setups that are time-consuming and difficult to scale. In this paper, we explore the unique opportunity presented by augmented reality (AR) for helping CV researchers to conduct perceptual studies. We design ARCADE, an evaluation platform that allows researchers to easily leverage AR's rich context and interactivity for human-centered CV evaluation. Specifically, ARCADE supports cross-platform AR data collection, custom experiment protocols via pluggable model inference, and AR streaming for user studies. We demonstrate ARCADE using two types of CV models, depth and lighting estimation, and show that AR tasks can be effectively used to elicit human perceptual judgments of model quality. We also evaluate the system's usability and performance across different deployment and study settings, highlighting its flexibility and effectiveness as a human-centered evaluation platform.
zh

[CV-93] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在医学图像分析中因领域差距(domain gap)导致性能受限的问题。现有方法如提示学习和单向模态交互虽能引入领域知识,但常引发模态错位(modality misalignment),难以充分发挥VLM的潜力。解决方案的关键在于提出NEARL-CLIP框架,其核心创新包括:(1) 统一协同嵌入Transformer(Unified Synergy Embedding Transformer, USEformer),通过动态生成跨模态查询促进多模态知识相互增强;(2) 正交交叉注意力适配器(Orthogonal Cross-Attention Adapter, OCA),利用正交性约束将新知识解耦为真正新颖信息与增量知识两部分,从而隔离干扰、聚焦学习,进一步强化模态交互并释放VLM能力。该方案仅引入1.46M可学习参数,实现高效且有效的跨模态交互。

链接: https://arxiv.org/abs/2508.04101
作者: Zelin Peng,Yichen Zhao,Yu Huang,Piao Yang,Feilong Tang,Zhengqin Xu,Xiaokang Yang,Wei Shen
机构: Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces 1.46M learnable parameters.
zh
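OCA的正交性约束可以用"惩罚两组投影权重的互Gram矩阵"来直观理解:当两者的行空间正交时惩罚为零,从而把"真正新颖"与"增量"两类知识解耦到不同子空间。摘要未给出OCA的精确公式,下面的PyTorch写法只是实现此类约束的常见思路之一:

```python
import torch

def orthogonality_penalty(W_new: torch.Tensor, W_inc: torch.Tensor) -> torch.Tensor:
    """W_new, W_inc: (d_out, d_in) 两个分量的权重;返回互 Gram 矩阵的 Frobenius 范数平方。"""
    cross = W_new @ W_inc.T          # (d_out, d_out),行空间正交时为零矩阵
    return (cross ** 2).sum()
```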

[CV-94] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting

【速读】:该论文旨在解决稀疏视图条件下3D Gaussian Splatting(3DGS)在几何重建精度不足的问题,尤其针对现有方法依赖非局部深度正则化导致细节结构丢失、对深度估计噪声敏感,以及传统平滑策略忽视语义边界从而破坏关键边缘和纹理的问题。解决方案的关键在于提出一个统一的深度与边缘感知正则化框架——DET-GS,其核心创新包括:(1) 设计分层几何深度监督机制,自适应地施加多层级几何一致性约束,提升结构保真度并增强对深度噪声的鲁棒性;(2) 基于Canny边缘检测获得的语义掩码构建边缘感知深度正则化项,以保护场景边界;(3) 引入RGB引导的边缘保持总变差损失(Total Variation loss),在平滑均匀区域的同时严格保留高频细节与纹理信息。

链接: https://arxiv.org/abs/2508.04099
作者: Zexu Huang,Min Xu,Stuart Perry
机构: University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.
zh
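摘要中"RGB引导的边缘保持全变差(TV)损失"与单目深度估计常用的edge-aware smoothness同源:在图像梯度大的位置削弱平滑惩罚,从而保住边缘与纹理。下面的PyTorch示意以渲染深度为被平滑对象,边缘敏感度gamma为假设值:

```python
import torch

def edge_preserving_tv_loss(depth: torch.Tensor, rgb: torch.Tensor,
                            gamma: float = 10.0) -> torch.Tensor:
    """depth: (B,1,H,W);rgb: (B,3,H,W)。在 RGB 梯度大处按指数衰减 TV 惩罚。"""
    d_dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    d_dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    i_dx = (rgb[..., :, 1:] - rgb[..., :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (rgb[..., 1:, :] - rgb[..., :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-gamma * i_dx)).mean() + \
           (d_dy * torch.exp(-gamma * i_dy)).mean()
```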

[CV-95] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework ICCV2025

【速读】:该论文旨在解决三维场景重建中高分辨率图像生成时的视觉质量与3D一致性难以兼顾的问题。现有方法如图像超分辨率或视频超分辨率通常无法保证多视角下的3D结构一致性,或仅以隐式方式引入3D约束。其解决方案的关键在于提出一种基于3D高斯点绘(3D Gaussian-splatting)的超分辨率框架——3DSR,该框架利用现成的扩散模型进行2D超分,并通过显式的3D高斯点绘表示强制跨视角的3D一致性,从而在无需额外微调的情况下提升重建场景的空间连贯性与视觉质量。

链接: https://arxiv.org/abs/2508.04090
作者: Yi-Ting Chen,Ting-Hsuan Liao,Pengsheng Guo,Alexander Schwing,Jia-Bin Huang
机构: University of Maryland, College Park (马里兰大学学院市分校); Carnegie Mellon University (卡内基梅隆大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don’t consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.
zh

[CV-96] RLGS: Reinforcement Learning-Based Adaptive Hyperparameter Tuning for Gaussian Splatting

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)中超参数调优(hyperparameter tuning)依赖人工经验、耗时且易导致重建结果不一致和次优的问题。其解决方案的关键在于提出一种即插即用的强化学习框架RLGS,通过轻量级策略模块动态调整关键超参数(如学习率和密集化阈值),实现对3DGS训练过程的自适应优化;该框架具有模型无关性,无需修改现有3DGS架构即可无缝集成,并在多个先进3DGS变体(如Taming-3DGS和3DGS-MCMC)及多样化数据集上验证了其泛化能力和鲁棒性,显著提升渲染质量(例如在Tanks and Temples数据集上使Taming-3DGS的PSNR提升0.7dB)。

链接: https://arxiv.org/abs/2508.04078
作者: Zhan Li,Huangying Zhan,Changyang Li,Qingan Yan,Yi Xu
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:Hyperparameter tuning in 3D Gaussian Splatting (3DGS) is a labor-intensive and expert-driven process, often resulting in inconsistent reconstructions and suboptimal results. We propose RLGS, a plug-and-play reinforcement learning framework for adaptive hyperparameter tuning in 3DGS through lightweight policy modules, dynamically adjusting critical hyperparameters such as learning rates and densification thresholds. The framework is model-agnostic and seamlessly integrates into existing 3DGS pipelines without architectural modifications. We demonstrate its generalization ability across multiple state-of-the-art 3DGS variants, including Taming-3DGS and 3DGS-MCMC, and validate its robustness across diverse datasets. RLGS consistently enhances rendering quality. For example, it improves Taming-3DGS by 0.7dB PSNR on the Tanks and Temples (TNT) dataset, under a fixed Gaussian budget, and continues to yield gains even when baseline performance saturates. Our results suggest that RLGS provides an effective and general solution for automating hyperparameter tuning in 3DGS training, bridging a gap in applying reinforcement learning to 3DGS.
zh

[CV-97] FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中后门攻击的灵活性与隐蔽性不足问题,现有方法通常依赖固定模式或单一目标触发器(trigger),导致攻击方式僵化且易被检测。其解决方案的关键在于提出FLAT(FL Arbitrary-Target Attack),通过引入潜变量驱动的条件自编码器(latent-driven conditional autoencoder),动态生成多样化的、目标特定的触发器,从而实现任意目标选择而无需重新训练,并有效规避传统检测机制。该方法在统一框架内实现了攻击成功率、隐蔽性和触发器多样性,显著提升了后门攻击的灵活性与隐蔽性。

链接: https://arxiv.org/abs/2508.04064
作者: Tuan Nguyen,Khoa D Doan,Kok-Seng Wong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.
zh

[CV-98] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation

【速读】:该论文旨在解决现有遥感图像分割网络中全局上下文依赖关系建模不足的问题,尤其是当前基于UNet架构的模型在解码阶段多聚焦于单尺度内的局部特征交互,忽视了跨多分辨率的全局信息融合。其解决方案的关键在于提出一种仅使用卷积和加法操作的分层融合机制—— Terrace Convolutional Decoder Network (TNet),通过逐步将低分辨率特征(富含全局语义信息)融合至高分辨率特征(富含局部细节),使模型能够学习空间感知的卷积核,在每个解码阶段自然地整合全局与局部信息,从而实现高效且准确的多尺度特征协同优化。

链接: https://arxiv.org/abs/2508.04061
作者: Chengqian Dai,Yonghong Guo,Hongzhao Xiang,Yigui Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global-local feature interactions within decoder stages. However, these enhancements typically focus on intra-scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low-resolution features (rich in global context) into higher-resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially-aware convolutional kernels that naturally blend global and local information in a stage-wise manner. We implement TNet with a ResNet-18 encoder (TNet-R) and evaluate it on three benchmark datasets. TNet-R achieves competitive performance with a mean Intersection-over-Union (mIoU) of 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA, while maintaining high computational efficiency. Code is publicly available.
zh
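按摘要描述,TNet的解码器只用卷积与加法:把最粗(全局语义)特征逐级上采样、卷积后与更细(局部细节)特征相加。下面的PyTorch模块是依此还原的最小骨架,通道数与类别数均为假设:

```python
import torch.nn as nn
import torch.nn.functional as F

class TerraceDecoder(nn.Module):
    """feats 按细->粗排列(如 stride 4/8/16/32 的编码器特征)。"""
    def __init__(self, channels=(64, 128, 256, 512), num_classes=6):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, channels[0], 3, padding=1) for c in channels])
        self.head = nn.Conv2d(channels[0], num_classes, 1)

    def forward(self, feats):
        x = self.proj[-1](feats[-1])              # 最粗一级:全局上下文
        for f, proj in zip(feats[-2::-1], self.proj[-2::-1]):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = proj(f) + x                       # 仅用"卷积+加法"融合
        return self.head(x)
```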

[CV-99] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在遮挡感知(occlusion perception)任务上表现不足的问题,即如何有效整合视觉识别与推理能力以实现人类水平的空间理解。其解决方案的关键在于构建了首个专门针对遮挡感知的视觉问答(Visual Question Answering, VQA)基准测试集 O-Bench:基于 SA-1B 数据集,采用新颖的分层合成方法生成 1,365 张语义一致的遮挡场景图像,并在此基础上标注了 4,588 对问题-答案对,涵盖五个定制化任务,通过可靠且半自动化的流程确保数据质量。该基准揭示了现有 MLLMs 与人类在遮挡感知上的显著性能差距,并识别出三种典型失败模式,为未来提升 MLLMs 的视觉智能提供了重要方向。

链接: https://arxiv.org/abs/2508.04059
作者: Zhaochen Liu,Kaiwen Gao,Shuyi Liang,Bin Xiao,Limeng Qiao,Lin Ma,Tingting Jiang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. National University of Defense Technology (国防科技大学); 3. Alibaba Group (阿里巴巴集团); 4. Tsinghua University (清华大学); 5. Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or thinking processes. We further identify three typical failure patterns, including an overly conservative bias, a fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only provide a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs for better visual intelligence. Our benchmark will be made publicly available upon paper publication.
zh

[CV-100] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

【Quick Read】: This paper targets two limitations of Transformer-based medical image segmentation: computational complexity that grows quadratically with input sequence length, hindering use on high-resolution medical images; and the feed-forward network (FFN) modules of standard Transformers, which rely on fully connected layers and therefore struggle to capture local context and multi-scale features, hurting segmentation accuracy. The proposed TCSAFormer rests on two key designs. First, a Compressed Attention (CA) module combines token compression with pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query, pruning globally irrelevant tokens and merging redundant ones; this substantially reduces computation while strengthening cross-token relationship modeling. Second, a Dual-Branch Feed-Forward Network (DBFFN) replaces the standard FFN to jointly extract local context and multi-scale features, enhancing representation capability. Experiments on several public datasets show that TCSAFormer outperforms state-of-the-art methods at lower computational cost, achieving a favorable efficiency-accuracy trade-off.

Link: https://arxiv.org/abs/2508.04058
Authors: Zunhui Xia,Hongxing Li,Libin Lan
Affiliations: College of Computer Science and Engineering, Chongqing University of Technology (重庆理工大学计算机科学与工程学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 4 tables; The code is available at this https URL

Abstract:In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models’ ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model’s ability to capture relationships between tokens. Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model’s feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
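
The paper does not spell out the DBFFN layer configuration in the abstract, but the general idea of a dual-branch feed-forward block, one branch keeping the pointwise MLP and another adding depthwise convolutions for local, multi-scale context, can be sketched as below. This is a minimal PyTorch sketch under our own assumptions (branch structure, kernel sizes, residual fusion); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchFFN(nn.Module):
    """Sketch of a dual-branch FFN: a pointwise MLP branch plus a
    depthwise-conv branch for local, multi-scale context (assumed design)."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        # Branch 1: standard pointwise (fully connected) FFN.
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1)
        )
        # Branch 2: depthwise convs at two kernel sizes for multi-scale local cues.
        self.local3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.local5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        local = self.fuse(self.local3(x) + self.local5(x))
        return x + self.mlp(x) + local  # residual fusion of both branches

feat = torch.randn(2, 64, 32, 32)
print(DualBranchFFN(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```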

[CV-101] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

【Quick Read】: This paper addresses the complexity and poor extensibility of document restoration systems that model each task independently, as well as the limited scalability and under-exploited inter-task synergy of existing unified approaches. The key is Uni-DocDiff, a unified, highly scalable diffusion-based document restoration framework: a learnable task prompt design provides strong scalability across diverse tasks; a novel Prior Pool, a simple yet comprehensive mechanism, fuses local high-frequency and global low-frequency features; and a Prior Fusion Module (PFM) adaptively selects the most relevant prior information for each specific task, enhancing multi-task capability while mitigating task interference.

Link: https://arxiv.org/abs/2508.04055
Authors: Fangmin Zhao,Weichao Zeng,Zhenhang Li,Dongbao Yang,Binbin Li,Xiaojun Bi,Yu Zhou
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Binghamton University (宾汉姆顿大学); Minzu University of China (中央民族大学); Nankai University (南开大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel \textbfPrior \textbfPool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the \textbfPrior \textbfFusion \textbfModule (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior performance compared with task-specific expert models, and simultaneously holds the task scalability for seamless adaptation to new tasks.
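
The abstract describes the Prior Pool as combining local high-frequency with global low-frequency features. One common way to obtain such a split, shown here purely as an illustrative assumption about the mechanism, is a Fourier low-pass/high-pass decomposition of a feature map:

```python
import torch

def frequency_split(x: torch.Tensor, radius: int = 8):
    """Split a feature map (B, C, H, W) into global low-frequency and local
    high-frequency parts with an FFT low-pass mask. Illustrative only; the
    paper's Prior Pool may build its priors differently."""
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    low_mask = ((yy ** 2 + xx ** 2) <= radius ** 2).float()
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1))).real
    return low, x - low  # (global low-frequency, local high-frequency residual)

low, high = frequency_split(torch.randn(1, 16, 64, 64))
print(low.shape, high.shape)
```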

[CV-102] Towards Globally Predictable k-Space Interpolation: A White-box Transformer Approach

【Quick Read】: This paper tackles k-space interpolation for accelerated MRI. Existing approaches such as CNN-based deep learning mainly exploit local predictability and overlook the global dependencies inherent in k-space. The key is GPI-WT, a white-box Transformer framework built on Globally Predictable Interpolation (GPI): from the perspective of annihilation, k-space is formulated as a structured low-rank (SLR) model whose global annihilation filters are treated as learnable parameters; the subgradients of the SLR model naturally induce a learnable attention mechanism, and unfolding the subgradient-based optimization algorithm into a cascaded network yields the first interpretable Transformer designed specifically for accelerated MRI, improving interpolation accuracy while making the results substantially more interpretable.

Link: https://arxiv.org/abs/2508.04051
Authors: Chen Luo,Qiyu Jin,Taofeng Xie,Xuemei Wang,Huayu Wang,Congcong Liu,Liming Tang,Guoqing Chen,Zhuo-Xu Cui,Dong Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments:

Abstract:Interpolating missing data in k-space is essential for accelerating imaging. However, existing methods, including convolutional neural network-based deep learning, primarily exploit local predictability while overlooking the inherent global dependencies in k-space. Recently, Transformers have demonstrated remarkable success in natural language processing and image analysis due to their ability to capture long-range dependencies. This inspires the use of Transformers for k-space interpolation to better exploit its global structure. However, their lack of interpretability raises concerns regarding the reliability of interpolated data. To address this limitation, we propose GPI-WT, a white-box Transformer framework based on Globally Predictable Interpolation (GPI) for k-space. Specifically, we formulate GPI from the perspective of annihilation as a novel k-space structured low-rank (SLR) model. The global annihilation filters in the SLR model are treated as learnable parameters, and the subgradients of the SLR model naturally induce a learnable attention mechanism. By unfolding the subgradient-based optimization algorithm of SLR into a cascaded network, we construct the first white-box Transformer specifically designed for accelerated MRI. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in k-space interpolation accuracy while providing superior interpretability.
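
For readers unfamiliar with annihilation-based structured low-rank (SLR) k-space models, the standard formulation from the SLR MRI literature, which the abstract builds on, looks roughly as follows; the exact model and objective used by GPI-WT may differ.

```latex
% Standard annihilation/SLR view of k-space (illustrative, not the paper's
% exact formulation). A filter h annihilates the k-space data k:
\[
  (k \ast h)(\omega) \approx 0
  \;\Longleftrightarrow\;
  \mathcal{H}(k)\,h \approx 0 ,
\]
% where \mathcal{H}(k) is the block-Hankel matrix lifted from k, so
% \mathcal{H}(k) is (approximately) low-rank. Interpolation of the missing
% samples then solves, e.g.,
\[
  \min_{k}\ \operatorname{rank}\bigl(\mathcal{H}(k)\bigr)
  \quad \text{s.t.} \quad P_\Omega(k) = P_\Omega(k_0),
\]
% with P_\Omega the sampling mask and k_0 the acquired measurements.
```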

[CV-103] DOMR: Establishing Cross-View Segmentation via Dense Object Matching ACM-MM2025

【Quick Read】: This paper addresses cross-view object correspondence, i.e., establishing dense object matches between egocentric (first-person) and exocentric (third-person) views, a key challenge in visual understanding. The core of the proposed Dense Object Matching and Refinement (DOMR) framework is the Dense Object Matcher (DOM) module, which jointly models multiple objects and exploits both positional and semantic relationships among them, instead of directly matching individual object masks to image features. DOM integrates proposal generation with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships, and is combined with a mask refinement head that improves the completeness and accuracy of the predicted masks. The full framework achieves state-of-the-art results on the Ego-Exo4D benchmark.

Link: https://arxiv.org/abs/2508.04050
Authors: Jitong Liao,Yulu Gao,Shaofei Huang,Jialin Gao,Jie Lei,Ronghua Liang,Si Liu
Affiliations: Hangzhou International Innovation Institute, Beihang University (北京航空航天大学杭州国际创新研究院); Faculty of Science and Technology, University of Macau (澳门科技大学科技学院); Meituan (美团); College of Computer Science and Technology, Zhejiang University of Technology (浙江工业大学计算机科学与技术学院); School of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025

Abstract:Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego. These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.

[CV-104] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation

【Quick Read】: This paper targets two key problems in sign language video generation: excessive reliance on signer-specific data and poor generalization. The solution is a two-phase synthesis framework that decouples motion semantics from signer identity. First, a signer-independent multimodal motion lexicon is built, storing each gloss as identity-agnostic pose, gesture, and 3D mesh sequences and requiring only one recording per sign. Second, a discrete-to-continuous motion synthesis stage converts retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. By treating motion as a portable "choreography layer", the method markedly improves both synthesis quality and flexibility in signer personalization.

Link: https://arxiv.org/abs/2508.04049
Authors: Jiayi He,Xu Wang,Shengeng Tang,Yaxiong Wang,Lechao Cheng,Dan Guo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable “choreography layer” that can be visually realized through different human appearances. Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.

[CV-105] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation

【Quick Read】: This paper addresses the weakness of semi-supervised learning (SSL) in medical image segmentation when tumors are small or numerous, together with the under-explored potential of data augmentation on both labeled and unlabeled data. The key is Iterative Pseudo-labeling based Adaptive Copy-Paste Supervision (IPA-CP): a two-way uncertainty-based adaptive augmentation mechanism injects tumor uncertainties from the mean teacher architecture into the augmentation process, while an iterative pseudo-label transition strategy produces more robust and informative pseudo labels for unlabeled samples, significantly improving tumor segmentation accuracy.

Link: https://arxiv.org/abs/2508.04044
Authors: Qiangguo Jin,Hui Cui,Junbo Wang,Changming Sun,Yimiao He,Ping Xuan,Linlin Wang,Cong Cong,Leyi Wei,Ran Su
Affiliations: Northwestern Polytechnical University (西北工业大学); La Trobe University (拉特罗布大学); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61); Shantou University (汕头大学); Shandong First Medical University and Shandong Academy of Medical Sciences (山东第一医科大学和山东省医学科学院); Macquarie University (麦考瑞大学); Shandong University (山东大学); Macao Polytechnic University (澳门理工学院); Tianjin University (天津大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised learning (SSL) has attracted considerable attention in medical image processing. The latest SSL methods use a combination of consistency regularization and pseudo-labeling to achieve remarkable success. However, most existing SSL studies focus on segmenting large organs, neglecting the challenging scenarios where there are numerous tumors or tumors of small volume. Furthermore, the extensive capabilities of data augmentation strategies, particularly in the context of both labeled and unlabeled data, have yet to be thoroughly investigated. To tackle these challenges, we introduce a straightforward yet effective approach, termed iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), for tumor segmentation in CT scans. IPA-CP incorporates a two-way uncertainty based adaptive augmentation mechanism, aiming to inject tumor uncertainties present in the mean teacher architecture into adaptive augmentation. Additionally, IPA-CP employs an iterative pseudo-label transition strategy to generate more robust and informative pseudo labels for the unlabeled samples. Extensive experiments on both in-house and public datasets show that our framework outperforms state-of-the-art SSL methods in medical image segmentation. Ablation study results demonstrate the effectiveness of our technical contributions.
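
As a concrete picture of uncertainty-guided copy-paste, here is a NumPy sketch that pastes a tumor-containing cube from a labeled CT volume into the most uncertain region of an unlabeled one. The block size, entropy as the uncertainty measure, and the paste-location rule are all our assumptions, not the paper's exact recipe.

```python
import numpy as np

def entropy_map(prob: np.ndarray) -> np.ndarray:
    """Per-voxel predictive entropy from softmax probabilities, shape (C, D, H, W)."""
    return -(prob * np.log(prob + 1e-8)).sum(axis=0)

def uncertainty_guided_copy_paste(src_img, src_lbl, dst_img, dst_unc, b=32, seed=0):
    """Copy a b^3 cube around a random tumor voxel of a labeled volume and paste
    it at the most uncertain location of an unlabeled volume (assumed recipe)."""
    rng = np.random.default_rng(seed)
    tumor = np.argwhere(src_lbl > 0)                   # (N, 3) tumor voxel coords
    cz, cy, cx = tumor[rng.integers(len(tumor))]       # random tumor center
    uz, uy, ux = np.unravel_index(dst_unc.argmax(), dst_unc.shape)  # hardest spot

    def cube(center, shape):
        return tuple(slice(max(c - b // 2, 0), min(c + b // 2, s))
                     for c, s in zip(center, shape))

    s_sl, d_sl = cube((cz, cy, cx), src_img.shape), cube((uz, uy, ux), dst_img.shape)
    # Crop both cubes to a common size so the paste always fits.
    size = [min(s.stop - s.start, d.stop - d.start) for s, d in zip(s_sl, d_sl)]
    s_sl = tuple(slice(s.start, s.start + n) for s, n in zip(s_sl, size))
    d_sl = tuple(slice(d.start, d.start + n) for d, n in zip(d_sl, size))
    out_img = dst_img.copy()
    out_lbl = np.zeros_like(dst_img, dtype=src_lbl.dtype)
    out_img[d_sl] = src_img[s_sl]                      # paste image content
    out_lbl[d_sl] = src_lbl[s_sl]                      # carry the label with it
    return out_img, out_lbl

src = np.random.rand(64, 64, 64); lbl = (src > 0.99).astype(np.uint8)
dst = np.random.rand(64, 64, 64); unc = np.random.rand(64, 64, 64)
aug_img, aug_lbl = uncertainty_guided_copy_paste(src, lbl, dst, unc)
print(aug_img.shape, aug_lbl.sum())
```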

[CV-106] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

【Quick Read】: This paper addresses the sim-to-real gap, limited task complexity, and incomplete reasoning coverage of existing visual transformation reasoning (VTR) benchmarks, which limit their practical value. The key contribution is VisualTrans, the first comprehensive VTR benchmark for real-world human-object interaction scenarios: it covers 12 semantically diverse manipulation tasks and systematically evaluates three core reasoning dimensions, spatial, procedural, and quantitative, through six well-defined subtask types. Data construction uses a scalable pipeline over first-person manipulation videos that integrates task selection, image-pair extraction, automated metadata annotation with large multimodal models, and structured question generation, with human verification ensuring quality and interpretability, providing a reliable evaluation platform for building VTR systems with stronger temporal modeling and causal reasoning.

Link: https://arxiv.org/abs/2508.04043
Authors: Yuheng Ji,Yipu Wang,Yuyang Liu,Xiaoshuai Hao,Yue Liu,Yuting Zhao,Huaihai Lyu,Xiaolong Zheng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, and thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at this https URL.

[CV-107] SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration

【Quick Read】: This paper targets three efficiency bottlenecks of current dark image restoration methods: (1) the computational burden and error-correction cost of relying on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) the excessive global computation caused by frequency-domain methods that treat all frequency components indiscriminately. The proposed SPJFNet responds with two key ideas: a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from within the network, removing the dependence on external priors and cutting inference latency; and, based on a careful analysis of different frequency-domain characteristics, a reconstruction of multi-stage operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based enhancement of advantageous frequency bands, sharply reducing parameters. On top of this, a Dual-Frequency Guidance Framework (DFGF) deploys separate high-frequency (wavelet-domain enhancement) and low-frequency (Fourier-domain restoration) branches, decoupling frequency processing and substantially lowering computational complexity.

Link: https://arxiv.org/abs/2508.04041
Authors: Tongshun Zhang,Pingling Liu,Zijian Zhang,Qiuzhan Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity. Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at this https URL.
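
To make the dual-frequency idea tangible, the toy sketch below boosts wavelet high-frequency subbands and lifts the Fourier low-frequency amplitude of a grayscale image, then fuses the two branches. The gains, Haar wavelet, mask radius, and averaging fusion are illustrative assumptions, not SPJFNet's learned operators.

```python
import numpy as np
import pywt

def dual_frequency_enhance(img: np.ndarray, hf_gain=1.5, lf_gain=1.2):
    """Toy dual-frequency pipeline: amplify wavelet detail bands (high freq)
    and scale the central Fourier amplitudes (low freq). Illustrative only."""
    # High-frequency branch: lossless Haar decomposition, boost detail bands.
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    hf = pywt.idwt2((cA, (cH * hf_gain, cV * hf_gain, cD * hf_gain)), "haar")
    # Low-frequency branch: lift the central (low) Fourier frequencies.
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    yy, xx = np.ogrid[:H, :W]
    mask = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2) <= (min(H, W) // 8) ** 2
    F[mask] *= lf_gain
    lf = np.fft.ifft2(np.fft.ifftshift(F)).real
    return 0.5 * (hf[:H, :W] + lf)  # naive fusion of the two branches

out = dual_frequency_enhance(np.random.rand(128, 128))
print(out.shape)
```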

[CV-108] CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion

【Quick Read】: This paper addresses core challenges of unsupervised domain adaptation (UDA) in person re-identification (ReID), vehicle ReID, and, more broadly, object ReID, namely the performance degradation caused by cross-domain feature distribution shifts. The key components are: in pre-training, CycleGAN-synthesized diverse data that bridges image-characteristic gaps across domains; and in fine-tuning, an advanced ensemble fusion mechanism combining an Efficient Channel Attention Block (ECAB) with a Simplified Efficient Channel Attention Block (SECAB) to strengthen both local and global feature representations and to reduce pseudo-label ambiguity for target-domain samples. On widely used UDA Person ReID and Vehicle ReID datasets the framework significantly outperforms state-of-the-art methods in mAP and Rank-k accuracy (Top-1/5/10), while supporting lightweight backbones such as ResNet18 and ResNet34 to balance performance and efficiency.

Link: https://arxiv.org/abs/2508.04036
Authors: Trinh Quoc Nguyen,Oky Dicky Ardiansyah Prima,Syahid Al Irfan,Hindriyanto Dwi Purnomo,Radius Tanone
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AI Sens. 2025, Submission received: 8 May 2025 / Revised: 4 June 2025 / Accepted: 30 June 2025 / Published: 4 July 2025. 3042-5999/1/1/4

Abstract:This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our codes and models are available at this https URL.
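
The Efficient Channel Attention idea that ECAB builds on is compact enough to show in full. The block below follows the published ECA-Net design (a 1D convolution over channel descriptors instead of a squeeze-excite MLP); the paper's ECAB and SECAB variants presumably add their own modifications on top.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (ECA-Net style): a k-sized 1D conv over the
    pooled channel descriptor replaces the fully connected SE bottleneck."""
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, k_size, padding=k_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        y = self.pool(x)                        # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)     # (B, 1, C): channels as a sequence
        y = self.conv(y).transpose(-1, -2).unsqueeze(-1)
        return x * torch.sigmoid(y)             # channel-wise reweighting

print(ECA(64)(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```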

[CV-109] Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation IROS

【Quick Read】: This paper addresses non-line-of-sight (NLoS) blind spots caused by roadside parking in urban environments, particularly pedestrians suddenly emerging from between parked vehicles. Existing methods mostly rely on predefined spatial information or assume simple wall reflections, so they fail to adapt to dynamically changing parking layouts, causing false and missed detections. The key is to fuse monocular camera images with 2D radar point cloud (PCD) data: parked vehicles are first identified via image segmentation and their depth estimated to infer approximate spatial characteristics, which are then refined with the 2D radar PCD to precisely localize pedestrians in NLoS regions, improving early warning capability and road safety.

Link: https://arxiv.org/abs/2508.04033
Authors: Hee-Yeun Kim,Byeonggyu Park,Byonghyok Choi,Hansang Cho,Byungkwan Kim,Soomok Lee,Mingu Jeon,Seung-Woo Seo,Seong-Woo Kim
Affiliations: Seoul National University (首尔国立大学); Samsung Electro-Mechanics Co., Ltd. (三星电子机械有限公司); Chungnam National University (忠南国立大学); Ajou University (亚洲大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025. 8 pages, 3 figures

Abstract:The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera image with 2D radar point cloud (PCD) data. The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at this https URL.

[CV-110] Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

【Quick Read】: This paper addresses the difficulty of adapting pre-trained vision-language models (VLMs) to downstream image-text retrieval (ITR), where matching accuracy suffers because models struggle to distinguish fine-grained attributes and similar subcategories. The key is Dual Prompt Learning with Joint Category-Attribute Reweighting (DCAR), which dynamically adjusts prompt vectors along both semantic and visual dimensions: (1) at the attribute level, the weights of attribute descriptions are updated dynamically based on text-image mutual-information correlation; (2) at the category level, negative samples from multiple perspectives are introduced with category-matching weighting to learn subcategory distinctions. By jointly optimizing attribute and category features, DCAR markedly improves CLIP's fine-grained representation learning on downstream ITR.

Link: https://arxiv.org/abs/2508.04028
Authors: Yifan Wang,Tao Wang,Chenwei Tang,Caiyang Yu,Zhengqing Zang,Mengmi Zhang,Shudong Huang,Jiancheng Lv
Affiliations: Sichuan University (四川大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 10 pages, 7 figures

Abstract:Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.

[CV-111] Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation

【Quick Read】: This paper addresses incomplete segmentation of ground objects in remote sensing imagery caused by high intra-class variance and high inter-class similarity. Traditional methods struggle to unify class representations and discriminate similar features, while existing class-guided methods are limited by coarse class-prototype representations and neglect of target structure. The proposed Prototype-Driven Structure Synergy Network (PDSSNet) builds on the idea that a complete ground object is jointly defined by invariant class semantics and variant spatial structure, realized through three modules: an Adaptive Prototype Extraction Module (APEM) that encodes the ground truth to extract unbiased class prototypes, ensuring semantic accuracy at the source; a Semantic-Structure Coordination Module (SSCM) that follows a hierarchical semantics-first, structure-second principle, first establishing global semantic cognition and then using structural information to constrain and refine the semantic representation for completeness; and a Channel Similarity Adjustment Module (CSAM) with a dynamic step-size adjustment mechanism that focuses on discriminative inter-class features, together yielding significant accuracy gains.

Link: https://arxiv.org/abs/2508.04022
Authors: Junyi Wang,Jinjiang Li,Guodong Fan,Yakun Ju,Xiang Fang,Alex C. Kot
Affiliations: Shandong Technology and Business University (山东工商学院); University of Leicester (莱斯特大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we have designed three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the designed Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle. This involves first establishing a global semantic cognition, then leveraging structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at this https URL.
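
APEM "encodes the ground truth to extract unbiased class prototypes"; a standard way to do this, shown here as an assumption about the mechanism rather than the paper's exact design, is masked average pooling of the feature map under each class's ground-truth mask:

```python
import torch
import torch.nn.functional as F

def class_prototypes(feat: torch.Tensor, mask: torch.Tensor, num_classes: int):
    """Masked average pooling: one prototype vector per class.
    feat: (B, C, H, W) features; mask: (B, h, w) integer ground-truth labels."""
    mask = F.interpolate(mask[:, None].float(), size=feat.shape[-2:], mode="nearest")
    onehot = F.one_hot(mask[:, 0].long(), num_classes).permute(0, 3, 1, 2).float()
    # Sum features under each class mask, then normalize by the mask area.
    num = torch.einsum("bchw,bkhw->bkc", feat, onehot)
    den = onehot.sum(dim=(-1, -2)).clamp(min=1.0)[..., None]
    return num / den  # (B, K, C) prototypes

protos = class_prototypes(torch.randn(2, 64, 32, 32),
                          torch.randint(0, 6, (2, 128, 128)), num_classes=6)
print(protos.shape)  # torch.Size([2, 6, 64])
```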

[CV-112] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

【Quick Read】: This paper examines whether large multimodal models (LMMs) can actively detect and scrutinize faulty inputs: LMMs are known to passively accept defective inputs, reasoning futilely over invalid prompts, but their capacity for proactive input verification has remained unexplored. The key contribution is the Input Scrutiny Ability Evaluation Framework (ISEval), covering seven categories of flawed premises and three evaluation metrics, applied to ten advanced LMMs. Key findings: most models need explicit prompts before they can identify premise errors; error type matters, with models strong on logical fallacies but weak on surface-level linguistic errors and certain conditional flaws; and modality trust varies, with Gemini 2.5 Pro and Claude Sonnet 4 balancing visual and textual information while aya-vision-8b over-relies on text in conflicts. These findings underscore the urgent need to strengthen LMMs' proactive validation of input and offer new perspectives on mitigating the problem.

Link: https://arxiv.org/abs/2508.04017
Authors: Haiqi Yang,Jinzhe Li,Gengxu Li,Yi Chang,Yuan Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures

Abstract:Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the critical question of whether LMMs can actively detect and scrutinize erroneous inputs still remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs has identified key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust also varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text in conflicts. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity and shed new light on mitigating the problem. The code is available at this https URL.

[CV-113] S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

【Quick Read】: This paper addresses the quantization difficulties of video diffusion models (V-DMs), whose joint spatial-temporal modeling produces extremely long token sequences and hence high calibration variance and learning challenges. The core solution is the S²Q-VDiT post-training quantization framework, built on two innovations: Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by jointly considering the diffusion and quantization characteristics of V-DMs; and Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize the tokens most influential to the model output, reducing training difficulty and improving quantized performance. Under W4A6 quantization the method achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration.

Link: https://arxiv.org/abs/2508.04016
Authors: Weilun Feng,Haotong Qin,Chuanguang Yang,Xiangqi Li,Han Yang,Yuqi Li,Zhulin An,Libo Huang,Michele Magno,Yongjun Xu
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); ETH Zürich (苏黎世联邦理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose S²Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, S²Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Code will be available at this https URL.
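
For orientation, "W4A6" means 4-bit weights and 6-bit activations. The generic symmetric fake-quantizer that post-training quantization frameworks build on looks like the sketch below; this is standard PTQ machinery under our own simplifications (per-tensor scales), not the paper's calibration or distillation logic.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake quantization: snap values to a low-precision
    grid and dequantize, simulating low-bit storage in full precision."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax   # per-tensor scale (simplified)
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

w, a = torch.randn(256, 256), torch.randn(8, 256)
w4, a6 = fake_quant(w, 4), fake_quant(a, 6)         # the "W4A6" setting
print((w - w4).abs().mean().item(), (a - a6).abs().mean().item())
```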

[CV-114] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

【Quick Read】: This paper addresses two practical obstacles for Text-to-CAD systems: CAD models are slow to render, and using vision-language models (VLMs) to review generated CAD models is expensive and prone to reward hacking that degrades the system. The key is CAD-Judge, a novel verifiable reward system for efficient and effective CAD preference grading and grammatical validation, with two modules: a Compiler-as-a-Judge Module (CJM) that uses the compiler directly as a fast, accurate reward signal, optimizing model alignment by maximizing generative utility through prospect theory; and a Compiler-as-a-Review Module (CRM) that, paired with a simple yet effective agentic CAD generation approach, efficiently verifies generated CAD models at test time so the system can refine them accordingly. Experiments on challenging CAD datasets show state-of-the-art performance with superior efficiency.

Link: https://arxiv.org/abs/2508.04002
Authors: Zheyuan Zhou,Jiayi Han,Liang Du,Naiyu Fang,Lemiao Qiu,Shuyou Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.
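
The compiler-as-judge idea reduces to "reward = does the program compile". The sketch below illustrates that loop; `cad-compile` is a placeholder command and `generate`/`refine` are hypothetical model calls, so this is a reading of the abstract, not the authors' system.

```python
import subprocess
import tempfile

def compiler_reward(cad_code: str, compiler_cmd: str = "cad-compile") -> float:
    """Compiler-as-a-Judge sketch: reward 1.0 if the CAD program compiles
    cleanly, 0.0 otherwise. `cad-compile` is a placeholder, not a real tool."""
    with tempfile.NamedTemporaryFile("w", suffix=".cad", delete=False) as f:
        f.write(cad_code)
        path = f.name
    try:
        result = subprocess.run([compiler_cmd, path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return 0.0

def refine_until_valid(prompt, generate, refine, max_rounds=3):
    """Test-time review loop in the spirit of the CRM: regenerate until the
    compiler accepts the program (generate/refine are hypothetical callables)."""
    code = generate(prompt)
    for _ in range(max_rounds):
        if compiler_reward(code) == 1.0:
            break
        code = refine(prompt, code)  # ask the model to fix compile errors
    return code
```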

[CV-115] JanusNet: Hierarchical Slice-Block Shuffle and Displacement for Semi-Supervised 3D Multi-Organ Segmentation

【Quick Read】: This paper addresses the limits of weakly supervised medical image segmentation under scarce training samples and annotations: randomly mixing volumetric blocks, while strong, breaks the anatomical continuity of 3D medical images along orthogonal axes, causing severe structural inconsistencies and insufficient training on hard regions such as small organs. The key is JanusNet, a data augmentation framework that models anatomical continuity globally while focusing locally on hard-to-segment regions: a Slice-Block Shuffle step performs aligned shuffling of same-index slice blocks across volumes along a random axis while preserving anatomical context on planes perpendicular to the perturbation axis; a Confidence-Guided Displacement step then uses prediction reliability to replace blocks within each slice, amplifying signals from difficult areas. The dual-stage, axis-aligned framework is plug-and-play for most teacher-student schemes and yields, for example, a 4% DSC gain on Synapse with only 20% labeled data.

Link: https://arxiv.org/abs/2508.03997
Authors: Zheng Zhang,Tianzhuzi Tan,Guanchun Yin,Bo Zhang,Xiuzhuang Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Limited by the scarcity of training samples and annotations, weakly supervised medical image segmentation often employs data augmentation to increase data diversity, while randomly mixing volumetric blocks has demonstrated strong performance. However, this approach disrupts the inherent anatomical continuity of 3D medical images along orthogonal axes, leading to severe structural inconsistencies and insufficient training in challenging regions, such as small-sized organs, etc. To better comply with and utilize human anatomical information, we propose JanusNet, a data augmentation framework for 3D medical data that globally models anatomical continuity while locally focusing on hard-to-segment regions. Specifically, our Slice-Block Shuffle step performs aligned shuffling of same-index slice blocks across volumes along a random axis, while preserving the anatomical context on planes perpendicular to the perturbation axis. Concurrently, the Confidence-Guided Displacement step uses prediction reliability to replace blocks within each slice, amplifying signals from difficult areas. This dual-stage, axis-aligned framework is plug-and-play, requiring minimal code changes for most teacher-student schemes. Extensive experiments on the Synapse and AMOS datasets demonstrate that JanusNet significantly surpasses state-of-the-art methods, achieving, for instance, a 4% DSC gain on the Synapse dataset with only 20% labeled data.
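
The Slice-Block Shuffle step is easy to picture in NumPy: partition every volume in a batch into blocks of consecutive slices along a random axis, then, per block index, permute which volume contributes that block, applying the same permutation to the labels. The block size and uniform axis choice below are our assumptions.

```python
import numpy as np

def slice_block_shuffle(imgs: np.ndarray, lbls: np.ndarray, block: int = 8, seed=None):
    """Aligned slice-block shuffle across a batch of 3D volumes.
    imgs, lbls: (N, D, H, W). Same-index blocks swap owners across volumes,
    so anatomy on planes perpendicular to the chosen axis stays coherent."""
    rng = np.random.default_rng(seed)
    axis = int(rng.integers(1, 4))            # one of the three spatial axes
    imgs, lbls = imgs.copy(), lbls.copy()
    length = imgs.shape[axis]
    for start in range(0, length, block):
        perm = rng.permutation(imgs.shape[0])  # which volume supplies this block
        sl = [slice(None)] * 4
        sl[axis] = slice(start, min(start + block, length))
        sl = tuple(sl)
        imgs[sl], lbls[sl] = imgs[sl][perm], lbls[sl][perm]
    return imgs, lbls

x = np.random.rand(4, 32, 64, 64)
y = (x > 0.9).astype(np.uint8)
xs, ys = slice_block_shuffle(x, y, block=8, seed=0)
print(xs.shape, ys.shape)
```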

[CV-116] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images

【Quick Read】: This paper studies nutritional content estimation from 2D food images alone, a task made hard by variability in food presentation and lighting and by the lack of depth information for inferring volume and mass, and one whose reproducibility suffers because existing state-of-the-art methods rely on proprietary pre-training datasets. The key is a systematic evaluation of how the scale and characteristics of pre-training data affect deep learning models: Vision Transformer (ViT) models pre-trained on the public ImageNet and COYO datasets are fine-tuned and compared against baseline CNNs (InceptionV2, ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset, all evaluated on Nutrition5k. Results show that JFT-300M pre-training significantly outperforms public-dataset pre-training, while, unexpectedly, COYO underperforms ImageNet on this regression task, demonstrating that the scale, domain relevance, and curation quality of pre-training data are decisive for transfer learning in 2D nutritional estimation.

Link: https://arxiv.org/abs/2508.03996
Authors: Michele Andrade,Guilherme A. L. Silva,Valéria Santos,Gladston Moreira,Eduardo Luz
Affiliations: Universidade Federal de Ouro Preto (联邦大学奥罗普雷托分校)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Abstract:Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.

[CV-117] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

【Quick Read】: This paper addresses the weak generalization and poor robustness of AI-generated image detection: existing methods mostly rely on low-level artifacts or model-specific features and adapt poorly to diverse generators and image degradations. The key is RAVID, the first framework to bring visual retrieval-augmented generation (RAG) to image detection: a fine-tuned CLIP image encoder (RAVID CLIP) enhanced with category-related prompts improves representation learning; relevant reference images are dynamically retrieved from a database for each query; and a vision-language model (VLM) fuses the retrieved images with the query to form an enriched input, raising both accuracy and robustness. On the UniversalFakeDetect benchmark, which covers 19 generative models, RAVID achieves a state-of-the-art 93.85% average accuracy, and under degradations such as Gaussian blur and JPEG compression it maintains 80.27% average accuracy versus 63.44% for the state-of-the-art C2P-CLIP.

Link: https://arxiv.org/abs/2508.03967
Authors: Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Abdenour Hadid
Affiliations: Laboratory of IEMN, Univ. Polytechnique Hauts-de-France (IEMN 实验室,上法兰西理工大学); KU 6G Research Center, Khalifa University (Khalifa 大学 6G 研究中心); Sorbonne Center for Artificial Intelligence, Sorbonne University, Abu Dhabi, UAE (索邦人工智能中心,索邦大学,阿联酋阿布扎比)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:

Abstract:In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.

[CV-118] Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

【Quick Read】: This paper addresses the scalability bottleneck of audio-synchronized visual animation: existing methods depend heavily on expensive manual curation of high-quality, class-specific training videos, limiting their reach across diverse open-world audio-video classes. The key is an efficient two-stage training paradigm: stage one automatically curates large-scale noisy videos for pretraining, letting the model learn diverse but imperfect audio-video alignments; stage two fine-tunes on a small set of manually curated high-quality examples, greatly reducing human effort. Synchronization is further improved by giving each frame access to rich audio context via multi-feature conditioning and window attention, and by building on a pretrained text-to-video generator and audio encoders, audio conditioning is learned with only 1.9% additional trainable parameters while preserving the generator's prior knowledge.

Link: https://arxiv.org/abs/2508.03955
Authors: Lin Zhang,Zefan Cai,Yufan Zhou,Shentong Mo,Jinhong Lin,Cheng-En Wu,Yibing Wei,Yijing Zhang,Ruiyi Zhang,Wen Xiao,Tong Sun,Junjie Hu,Pedro Morgado
Affiliations: University of Wisconsin Madison (威斯康星大学麦迪逊分校); Carnegie Mellon University (卡内基梅隆大学); Luma AI; Adobe Research (Adobe 研究院); Microsoft (微软)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio-conditioning capability without compromising the generator's prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3× more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10×, while generalizing to many open classes.

[CV-119] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation

【Quick Read】: This paper addresses the under-use of multiparametric image information in medical image segmentation, which degrades prostate cancer segmentation accuracy, especially in the presence of challenging pathology. The key is a policy-network-based recommender system: through a dynamic decision-making process it suggests both the optimal imaging modality and the specific image sections of interest, guiding the segmentation model to iteratively focus on the local regions most likely to contain tumors. During training, a pre-trained segmentation network mimics radiologist inspection of the modality and section combinations chosen by the policy network, and the locally segmented regions feed the next iteration until all cancers are best localized, improving annotation efficiency and segmentation accuracy.

Link: https://arxiv.org/abs/2508.03953
Authors: Xiangcen Wu,Shaheer U. Saeed,Yipei Wang,Ester Bonmati Coll,Yipeng Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommend system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections - selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.

[CV-120] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

【Quick Read】: This paper addresses the absence of point-correspondence modeling in current deep generative models for point-based shapes: traditional statistical shape models treat correspondences extensively, but modern deep methods focus on unordered point clouds, so generated shapes lack structural consistency with one another. The key is a diffusion model that generates realistic point-based shape representations while preserving the point correspondences present in the training data. Experiments on hippocampal shape representations derived from the OASIS-3 dataset show markedly more realistic generation than existing methods, and the model supports downstream applications such as conditional generation of healthy and AD subjects and counterfactual prediction of morphological changes along disease progression.

Link: https://arxiv.org/abs/2508.03925
Authors: Shen Zhu,Yinzhu Jin,Ifrah Zawar,P. Thomas Fletcher
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

[CV-121] Deep learning framework for crater detection and identification on the Moon and Mars

【Quick Read】: This paper addresses automated detection and identification of impact craters on planetary surfaces, aiming to improve the efficiency and precision of geomorphological analysis in planetary science. The key is a two-stage deep learning framework: the first stage performs crater identification with a classic CNN, ResNet-50, and YOLO; the second stage uses YOLO-based detection for precise crater localization. Applied to remote sensing data from selected regions of Mars and the Moon, the results indicate that YOLO offers the most balanced overall detection performance, while ResNet-50 excels at identifying large craters with high precision, providing an efficient and accurate tool for crater mapping of Martian and lunar regions.

Link: https://arxiv.org/abs/2508.03920
Authors: Yihan Ma,Zeyang Yu,Rohitash Chandra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Impact craters are among the most prominent geomorphological features on planetary surfaces and are of substantial significance in planetary science research. Their spatial distribution and morphological characteristics provide critical information on planetary surface composition, geological history, and impact processes. In recent years, the rapid advancement of deep learning models has fostered significant interest in automated crater detection. In this paper, we apply advancements in deep learning models for impact crater detection and identification. We use novel models, including Convolutional Neural Networks (CNNs) and variants such as YOLO and ResNet. We present a framework that features a two-stage approach where the first stage features crater identification using simple classic CNN, ResNet-50 and YOLO. In the second stage, our framework employs YOLO-based detection for crater localisation. Therefore, we detect and identify different types of craters and present a summary report with remote sensing data for a selected region. We consider selected regions for craters and identification from Mars and the Moon based on remote sensing data. Our results indicate that YOLO demonstrates the most balanced crater detection performance, while ResNet-50 excels in identifying large craters with high precision.

[CV-122] HPSv3: Towards Wide-Spectrum Human Preference Score ICCV2025

【Quick Read】: This paper addresses the misalignment between text-to-image model evaluation and human perception: existing human-centric metrics suffer from limited data coverage, suboptimal feature extraction, and inefficient loss functions. The solution, Human Preference Score v3 (HPSv3), has two core components: (1) HPDv3, the first wide-spectrum human preference dataset, integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons; and (2) a VLM-based preference model trained with an uncertainty-aware ranking loss for fine-grained ranking, together with Chain-of-Human-Preference (CoHP), an iterative image refinement method that uses HPSv3 to select the best image at each step, improving generation quality without extra data.

Link: https://arxiv.org/abs/2508.03789
Authors: Yuhang Ma,Xiaoshi Wu,Keqiang Sun,Hongsheng Li
Affiliations: Mizzen AI; CUHK MMLab; King's College London; Shanghai AI Laboratory; CPII, InnoHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025

Abstract:Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.
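
The abstract names an "uncertainty-aware ranking loss" without giving its form; one plausible instantiation, stated here as our assumption rather than the paper's definition, is a pairwise probit-style loss where the model predicts a mean and a variance per image and the margin is scaled by the combined uncertainty. The CoHP loop below likewise uses hypothetical `generate` and `hps_v3_score` stand-ins.

```python
import torch

def uncertainty_ranking_loss(mu_w, var_w, mu_l, var_l):
    """Pairwise ranking loss with predicted uncertainty (assumed form):
    maximize P(score_winner > score_loser) under Gaussian score beliefs."""
    z = (mu_w - mu_l) / torch.sqrt(var_w + var_l + 1e-8)
    p_win = torch.distributions.Normal(0.0, 1.0).cdf(z)
    return -torch.log(p_win + 1e-8).mean()

print(uncertainty_ranking_loss(torch.tensor([1.2]), torch.tensor([0.1]),
                               torch.tensor([0.7]), torch.tensor([0.1])))

def cohp(prompt, generate, hps_v3_score, steps=3, candidates=4):
    """Chain-of-Human-Preference style selection loop: at each step, sample
    candidates and keep the one the reward model scores highest."""
    best = None
    for _ in range(steps):
        pool = [generate(prompt, init=best) for _ in range(candidates)]
        best = max(pool, key=hps_v3_score)
    return best
```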

[CV-123] 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

【Quick Read】: This paper addresses a key bottleneck in preprocessing high-throughput four-dimensional scanning transmission electron microscopy (4D-STEM) data: pervasive noise, beam-center drift, and elliptical distortion systematically bias diffraction patterns and thus quantitative measurements, while conventional correction algorithms are typically material-specific and lack robustness and generality. The key is 4D-PreNet, an end-to-end deep learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. Trained on large simulated datasets spanning wide ranges of noise levels, drift magnitudes, and distortion types, the model generalizes well across experimental conditions, markedly improving the reliability and accuracy of real-time analysis.

Link: https://arxiv.org/abs/2508.03775
Authors: Mingyu Liu (1), Zian Mao (1 and 2), Zhu Liu (1 and 3), Haoran Zhang (1 and 2), Jintao Guo (1), Xiaoya He (1 and 2), Xi Huang (1), Shufen Chu (1), Chun Cheng (1), Jun Ding (4), Yujun Xie (1) ((1) Global Institute of Future Technology of Shanghai Jiao Tong University, (2) University of Michigan Shanghai Jiao Tong University Joint Institute, (3) School of Chemistry and Chemical Engineering of Shanghai Jiao Tong University, (4) Center for Alloy Innovation and Design, State Key Laboratory for Mechanical Behavior of Materials, Xian Jiaotong University)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures

Abstract:Automated experimentation with real-time data analysis in scanning transmission electron microscopy (STEM) often requires an end-to-end framework. Four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by a critical bottleneck arising from data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are benchmarked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating reliable, high-throughput, real-time 4D-STEM analysis for automated characterization.

[CV-124] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

【Quick Read】: This paper addresses two gaps in existing reinforcement fine-tuning (RFT) approaches to image quality assessment (IQA): rule-based rewards supervise only the model's outputs, leaving the correctness and efficacy of the "think" (quality reasoning) process uncontrolled; and fine-tuning directly on downstream IQA tasks does not explicitly strengthen the model's native low-level visual quality perception, capping its performance. The key is the multi-stage RFT IQA framework Refine-IQA: Stage 1 builds the Refine-Perception-20K dataset and designs multi-task reward functions to strengthen visual quality perception; Stage 2, targeting the quality scoring task, introduces a probability-difference reward strategy to supervise the "think" process, jointly improving quality scoring and quality-interpretation capability.

Link: https://arxiv.org/abs/2508.03763
Authors: Ziheng Jia,Jiaying Qian,Zicheng Zhang,Zijian Chen,Xiongkuo Min
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model’s rollouts but provide no reward supervision for the “think” process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model’s visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a probability difference reward involved strategy for “think” process supervision. The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks-and, notably, our paradigm activates a robust “think” (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

[CV-125] LRTuckerRep: Low-rank Tucker Representation Model for Multi-dimensional Data Completion

【速读】:该论文旨在解决多维数据补全(multi-dimensional data completion)问题,该问题在计算机视觉、信号处理和科学计算等领域具有重要意义。现有方法通常依赖全局低秩近似或局部平滑正则化,但各自存在局限:低秩方法计算复杂度高且易破坏数据内在结构,而基于平滑性的方法则需大量人工调参且泛化能力差。论文提出一种新颖的低秩Tucker表示模型(Low-Rank Tucker Representation, LRTuckerRep),其核心在于将全局低秩性和局部平滑性统一建模于Tucker分解框架内:通过因子矩阵上的自适应加权核范数与稀疏Tucker核心编码低秩性,同时利用无参数的拉普拉斯正则化捕获因子空间中的平滑特性。为高效求解由此产生的非凸优化问题,作者设计了两种具有收敛性保证的迭代算法,实验表明LRTuckerRep在高缺失率下显著优于基线方法,在图像修复和交通数据插补任务中均展现出更高的精度与鲁棒性。

链接: https://arxiv.org/abs/2508.03755
作者: Wenwu Gong,Lili Yang
机构: Southern University of Science and Technology (南方科技大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Multi-dimensional data completion is a critical problem in computational sciences, particularly in domains such as computer vision, signal processing, and scientific computing. Existing methods typically leverage either global low-rank approximations or local smoothness regularization, but each suffers from notable limitations: low-rank methods are computationally expensive and may disrupt intrinsic data structures, while smoothness-based approaches often require extensive manual parameter tuning and exhibit poor generalization. In this paper, we propose a novel Low-Rank Tucker Representation (LRTuckerRep) model that unifies global and local prior modeling within a Tucker decomposition. Specifically, LRTuckerRep encodes low rankness through a self-adaptive weighted nuclear norm on the factor matrices and a sparse Tucker core, while capturing smoothness via a parameter-free Laplacian-based regularization on the factor spaces. To efficiently solve the resulting nonconvex optimization problem, we develop two iterative algorithms with provable convergence guarantees. Extensive experiments on multi-dimensional image inpainting and traffic data imputation demonstrate that LRTuckerRep achieves superior completion accuracy and robustness under high missing rates compared to baselines.
zh
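The core proximal operation behind a weighted nuclear norm penalty on a factor matrix can be sketched as weighted singular-value thresholding; the 1/(sigma + eps) weights below are a common adaptive choice and not necessarily the paper's exact rule.

```python
import numpy as np

def weighted_svt(M: np.ndarray, tau: float, eps: float = 1e-6) -> np.ndarray:
    """Proximal step of a weighted nuclear norm: soft-threshold singular
    values with weights ~ 1/(sigma + eps), so small (noise-dominated)
    components shrink more than large ones."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau / (s + eps), 0.0)
    return (U * s_shrunk) @ Vt  # scale columns of U by shrunk values

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
noisy = low_rank + 0.1 * rng.normal(size=(50, 40))
# Noise components are suppressed; the dominant rank-3 structure survives.
print(np.linalg.matrix_rank(weighted_svt(noisy, tau=5.0), tol=1e-3))
```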

[CV-126] Generating Synthetic Invoices via Layout-Preserving Content Replacement

【速读】:该论文旨在解决自动化发票处理中机器学习模型性能受限于高质量、多样化标注数据稀缺的问题,尤其在隐私法规限制和人工标注成本高昂的背景下。解决方案的关键在于构建一个端到端的合成发票生成流水线:首先通过光学字符识别(Optical Character Recognition, OCR)提取原始发票的文本内容与空间布局;随后利用大语言模型(Large Language Model, LLM)生成语义合理且上下文一致的合成字段内容;最后采用图像修复(inpainting)技术替换原始文本并保持字体与版式特征不变,从而产出视觉逼真且结构化数据对齐的合成发票图像与JSON文件。此方法实现了小规模私有数据集的高效扩增,为训练更鲁棒的文档智能模型提供了可扩展、自动化的数据生成方案。

链接: https://arxiv.org/abs/2508.03754
作者: Bevin V,Ananthakrishnan P V,Ragesh KR,Sanjay M,Vineeth S,Bibin Wilson
机构: BEO AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.
zh
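The layout-preserving replacement step can be pictured as swapping field text while keeping bounding boxes fixed; the sketch below (with a lambda standing in for the LLM call, and the image inpainting/rendering step omitted) shows the paired image-annotation bookkeeping that yields the aligned JSON output.

```python
from dataclasses import dataclass

@dataclass
class Field:
    text: str
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in pixels, from OCR

def synthesize_invoice(fields: list[Field], replace_keys: set[int],
                       generate_fn) -> tuple[list[Field], dict]:
    """Layout-preserving content replacement: swap the text of selected
    fields while keeping their bounding boxes (hence layout) fixed.
    `generate_fn` is a stand-in for the LLM call."""
    out, record = [], {}
    for i, f in enumerate(fields):
        new_text = generate_fn(f.text) if i in replace_keys else f.text
        out.append(Field(new_text, f.bbox))
        record[i] = new_text  # becomes the aligned structured-data file
    return out, record

fields = [Field("INV-2024-001", (40, 30, 220, 55)),
          Field("Total: $1,250.00", (40, 400, 260, 425))]
synthetic, json_record = synthesize_invoice(
    fields, {0, 1}, generate_fn=lambda t: f"<synthetic:{t}>")
print(json_record)
```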

[CV-127] Modular Transformer Architecture for Precision Agriculture Imaging

【速读】:该论文旨在解决无人机视频中杂草分割的效率与准确性问题,特别是在图像因噪声和模糊等退化因素导致质量下降时的表现瓶颈。其解决方案的关键在于提出了一种质量感知的模块化深度学习框架,通过分析图像的退化类型(如噪声和模糊),动态地将输入数据路由至针对特定退化优化的预处理和视觉Transformer(Vision Transformer, ViT)模型:对于清晰图像使用基准ViT,对噪声图像采用嵌入Fisher Vector编码的改进ViT进行去噪处理,对模糊图像则使用带有Unrolled Lucy-Robinson解卷积模块的ViT进行校正。这种基于质量条件的智能路由机制显著提升了分割精度与计算效率,优于传统卷积神经网络(Convolutional Neural Network, CNN)方法。

链接: https://arxiv.org/abs/2508.03751
作者: Brian Gopalan(1),Nathalia Nascimento(1),Vishal Monga(1) ((1) The Pennsylvania State University)
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint of paper submitted to IEEE-AIOT 2025

点击查看摘要

Abstract:This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions, such as blur and noise, and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Robinson decoder to correct blur. This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.
zh
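The routing logic the abstract describes, a Laplacian response for blur and Mean Absolute Deviation for noise, can be sketched as follows; the thresholds and the exact MAD definition are illustrative assumptions, not the paper's values.

```python
import numpy as np
from scipy.ndimage import laplace

def route(image: np.ndarray, blur_thresh: float = 100.0,
          noise_thresh: float = 8.0) -> str:
    """Quality-aware routing: a low Laplacian variance flags blur, a high
    Mean Absolute Deviation flags noise; clean inputs go to the baseline."""
    img = image.astype(np.float64)
    sharpness = laplace(img).var()               # blur indicator
    noise = np.abs(img - np.median(img)).mean()  # MAD around the median
    if sharpness < blur_thresh:
        return "deblur_vit"    # ViT with the unrolled deconvolution decoder
    if noise > noise_thresh:
        return "denoise_vit"   # ViT with Fisher Vector encoding
    return "baseline_vit"

rng = np.random.default_rng(0)
print(route(rng.integers(0, 255, (128, 128)).astype(float)))
```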

[CV-128] GlaBoost: A multimodal Structured Framework for Glaucoma Risk Stratification

【速读】:该论文旨在解决青光眼(glaucoma)早期准确检测中因依赖单一模态数据且缺乏可解释性而导致的临床实用性不足问题。其解决方案的关键在于提出GlaBoost框架,该框架通过融合结构化临床特征、眼底图像嵌入(fundus image embeddings)与专家标注的文本描述(expert-curated textual descriptions),构建多模态特征空间,并利用改进的XGBoost模型进行分类,从而实现高精度且可解释的青光眼风险预测。

链接: https://arxiv.org/abs/2508.03750
作者: Cheng Huang,Weizheng Xie,Karanjit Kooner,Tsengdar Lee,Jui-Kai Wang,Jia Zhang
机构: Southern Methodist University (南方卫理公会大学); University of Texas Southwestern Medical Center (德克萨斯大学西南医学中心); National Aeronautics and Space Administration (美国国家航空航天局)
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Early and accurate detection of glaucoma is critical to prevent irreversible vision loss. However, existing methods often rely on unimodal data and lack interpretability, limiting their clinical utility. In this paper, we present GlaBoost, a multimodal gradient boosting framework that integrates structured clinical features, fundus image embeddings, and expert-curated textual descriptions for glaucoma risk prediction. GlaBoost extracts high-level visual representations from retinal fundus photographs using a pretrained convolutional encoder and encodes free-text neuroretinal rim assessments using a transformer-based language model. These heterogeneous signals, combined with manually assessed risk scores and quantitative ophthalmic indicators, are fused into a unified feature space for classification via an enhanced XGBoost model. Experiments conducted on a real-world annotated dataset demonstrate that GlaBoost significantly outperforms baseline models, achieving a validation accuracy of 98.71%. Feature importance analysis reveals clinically consistent patterns, with cup-to-disc ratio, rim pallor, and specific textual embeddings contributing most to model decisions. GlaBoost offers a transparent and scalable solution for interpretable glaucoma diagnosis and can be extended to other ophthalmic disorders.
zh
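A minimal sketch of the fusion-then-boost pattern the abstract describes, with placeholder feature dimensions and random data standing in for the real encoders and patient cohort:

```python
import numpy as np
from xgboost import XGBClassifier

# Illustrative fusion: concatenate image embeddings, text embeddings, and
# structured clinical features into one table for the boosted classifier.
# All dimensions and data below are placeholders, not the paper's.
n = 256
img_emb  = np.random.randn(n, 512)   # from a pretrained CNN fundus encoder
txt_emb  = np.random.randn(n, 384)   # from a transformer language model
clinical = np.random.randn(n, 12)    # e.g. cup-to-disc ratio, rim pallor
X = np.hstack([img_emb, txt_emb, clinical])
y = np.random.randint(0, 2, size=n)  # glaucoma risk label

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))
```

Feature importances from the fitted booster are the kind of signal the abstract inspects to recover clinically consistent patterns.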

[CV-129] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation

【速读】:该论文旨在解决城市轨道交通站台客流密度实时估算难题,以提升运营决策的精准性、安全性与乘客体验。传统方法依赖于自动售检票数据或人工观察等间接代理指标,存在精度不足和时效性差的问题。论文提出的关键解决方案是利用闭路电视(CCTV)图像数据,结合三种先进的计算机视觉技术:基于YOLOv11、RT-DETRv2和APGCC的目标检测与计数方法、基于自定义训练的Vision Transformer(Crowd-ViT)的群体级分类方法,以及DeepLabV3语义分割方法,并创新性地引入一种线性优化算法从分割图中提取乘客数量,同时考虑图像中物体深度信息对乘客沿站台分布的影响。实验在与华盛顿都会区交通局(WMATA)合作构建的隐私保护数据集上进行,涵盖超过600小时视频,结果表明仅使用CCTV数据即可实现高精度的实时客流估计,为平台拥挤管理提供及时响应依据。

链接: https://arxiv.org/abs/2508.03749
作者: Riccardo Fiorista,Awad Abdelhalim,Anson F. Stewart,Gabriel L. Pincus,Ian Thistle,Jinhua Zhao
机构: Massachusetts Institute of Technology (麻省理工学院); Washington Metropolitan Area Transit Authority (华盛顿都会区交通局)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 26 pages, 17 figures, 4 tables

点击查看摘要

Abstract:Accurately estimating urban rail platform occupancy can enhance transit agencies’ ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.
zh
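The paper extracts counts from segmentation maps with a linear program that accounts for object depth; the toy closed form below conveys only the underlying intuition, dividing each segmented pixel by an assumed per-person pixel footprint that varies with image row, and is not the paper's optimization.

```python
import numpy as np

def count_from_segmentation(person_mask: np.ndarray,
                            px_area_per_person: np.ndarray) -> float:
    """Toy depth-aware counting: weight each segmented pixel by the inverse
    of the expected pixel footprint of one person at that location, so
    distant (smaller-looking) passengers are not undercounted."""
    contrib = person_mask.astype(np.float64) / px_area_per_person
    return float(contrib.sum())

h, w = 120, 160
mask = np.zeros((h, w)); mask[70:90, 40:60] = 1   # one segmented blob
# Assumed footprint: grows toward the bottom (nearer) image rows.
rows = np.linspace(300.0, 1200.0, h)[:, None]
area = np.repeat(rows, w, axis=1)
print(round(count_from_segmentation(mask, area), 2))
```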

[CV-130] Tobler's First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision

【速读】:该论文旨在解决当前地理空间人工智能(GeoAI)研究中面临的两大核心挑战:一是训练数据稀缺,二是现有深度学习模型在设计时忽视了空间原理与空间效应,从而限制了AI与地理空间科学的深度融合。其解决方案的关键在于提出一种基于Tobler第一地理定律(Tobler’s first law of geography)的显式空间建模方法,仅使用弱标签即可实现对象检测,尤其适用于自然特征识别;同时通过引入注意力图(attention maps)并设计多阶段训练策略,显著提升了模型性能。该方法已在火星撞击坑自动检测任务中验证有效,且具备泛化至地球及其他行星表面自然与人造地物的能力,推动了GeoAI理论与方法论的发展。

链接: https://arxiv.org/abs/2508.03745
作者: Wenwen Li,Chia-Yu Hsu,Maosheng Hu
机构: Arizona State University (亚利桑那州立大学); China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent interest in geospatial artificial intelligence (GeoAI) has fostered a wide range of applications using artificial intelligence (AI), especially deep learning, for geospatial problem solving. However, major challenges such as a lack of training data and the neglect of spatial principles and spatial effects in AI model design remain, significantly hindering the in-depth integration of AI with geospatial research. This paper reports our work in developing a deep learning model that enables object detection, particularly of natural features, in a weakly supervised manner. Our work makes three contributions: First, we present a method of object detection using only weak labels. This is achieved by developing a spatially explicit model based on Tobler’s first law of geography. Second, we incorporate attention maps into the object detection pipeline and develop a multistage training strategy to improve performance. Third, we apply this model to detect impact craters on Mars, a task that previously required extensive manual effort. The model generalizes to both natural and human-made features on the surfaces of Earth and other planets. This research advances the theoretical and methodological foundations of GeoAI.
zh

[CV-131] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission

【速读】:该论文旨在解决语义特征数字化过程中如何在压缩为离散符号的同时保持连续性和上下文信息,并确保对信道退化具有鲁棒性的问题。解决方案的关键在于提出一种基于向量量化(Vector Quantization, VQ)的数字语义通信系统(VQ-DeepISC),其核心包括:1)采用Swin Transformer骨干网络进行分层语义特征提取,并通过VQ模块将特征映射至离散潜在空间,实现基于索引的高效传输;2)设计注意力机制驱动的信道自适应模块以动态优化索引传输策略;3)引入Kullback-Leibler散度(KLD)正则化与指数移动平均(EMA)技术,防止训练过程中的码本坍塌问题,保障码字使用分布均衡和训练稳定性。最终结合IEEE 802.11a标准的QPSK调制与OFDM多载波传输方案,实现了高质量图像重建性能。

链接: https://arxiv.org/abs/2508.03740
作者: Jianqiao Chen,Tingting Zhu,Huishi Song,Nan Ma,Xiaodong Xu
机构: ZGC Institute of Ubiquitous-X Innovation and Applications (ZGC ubiquitous-x 创新与应用研究所); State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications (北京邮电大学网络与交换技术国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. Secondly, to counteract codebook collapse during the training process, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods.
zh
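A stripped-down numpy sketch of one VQ step with the two stabilizers the abstract names, EMA usage tracking and a KL term against a uniform prior (computed here as a collapse monitor rather than a training loss); this is a simplification, not the full DJSCC transceiver.

```python
import numpy as np

def vq_step(z, codebook, ema_counts, decay=0.99):
    """One vector-quantization step. z: (N, D) features, codebook: (K, D).
    Returns quantized features, the indices to transmit, updated EMA usage
    counts, and KL(usage || uniform) as a codebook-collapse indicator."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)                                     # indices to transmit
    counts = np.bincount(idx, minlength=codebook.shape[0])
    ema_counts = decay * ema_counts + (1.0 - decay) * counts    # EMA usage
    p = ema_counts / ema_counts.sum()
    kld = float((p * np.log(p * p.size + 1e-12)).sum())         # KL(p || uniform)
    return codebook[idx], idx, ema_counts, kld

rng = np.random.default_rng(0)
z, cb = rng.normal(size=(32, 8)), rng.normal(size=(16, 8))
zq, idx, ema, kld = vq_step(z, cb, np.ones(16))
print(idx[:8], round(kld, 4))
```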

[CV-132] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

【速读】:该论文旨在解决智能城市环境中建筑地图构建的准确性问题,传统方法如卫星图像、激光雷达扫描和人工标注存在成本高、可访问性差及精度不足等局限;同时,开源地图平台虽被广泛用于提供地面真值数据,但其易受人为误差和现实环境动态变化的影响,从而对训练神经网络造成偏差。解决方案的关键在于提出一种基于深度学习的方法,融合DINOv2视觉Transformer架构与多源无线电信号(RF)数据(来自多个用户设备和基站),在统一框架内联合处理地图与射频模态信息,有效捕捉空间依赖性和结构先验,提升建筑映射精度。实验结果表明,该方法在合成数据集上实现了65.3%的宏观交并比(macro IoU),显著优于错误地图基线(40.1%)、纯RF方法(37.3%)以及自研的非AI融合基线(42.2%)。

链接: https://arxiv.org/abs/2508.03736
作者: Rafayel Mkrtchyan,Armen Manukyan,Hrant Khachatrian,Theofanis P. Raptis
机构: Yerevan State University (叶里温州立大学); Italian Institute of Technology (意大利技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Work partly supported by the RA Science Committee grant No. 22rl-052 (DISTAL) and the EU under Italian National Recovery and Resilience Plan of NextGenerationEU on “Telecommunications of the Future” (PE00000001 - program “RESTART”)

点击查看摘要

Abstract:Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipments and base stations. Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For the evaluation purposes, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) The Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed which yields 42.2%.
zh
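The three evaluation metrics are standard; for concreteness, minimal implementations of the Jaccard index and the Chamfer distance follow (the Hausdorff distance replaces the means below with maxima).

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (intersection over union) between binary building
    masks, the first of the paper's three metrics."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 2) point sets, e.g.
    building-outline pixels."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

pred = np.zeros((64, 64), bool); pred[10:30, 10:30] = True
gt   = np.zeros((64, 64), bool); gt[12:32, 10:30]  = True
print(round(iou(pred, gt), 3))  # overlap of two shifted squares
```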

[CV-133] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

【速读】:该论文旨在解决使用文本到图像扩散模型生成连贯视觉故事时,如何在不同场景中保持主体一致性的问题(subject consistency)。现有方法通常依赖于微调或重新训练模型,但存在计算成本高、耗时长且可能破坏模型原有能力的缺点。其解决方案的关键在于提出一种无需训练的方法:通过引入掩码交叉图像注意力共享(masked cross-image attention sharing)动态对齐一批图像中的主体特征,并结合区域特征和谐化(Regional Feature Harmonization)优化视觉相似细节,从而在不损害扩散模型创造力的前提下显著提升主体一致性。

链接: https://arxiv.org/abs/2508.03735
作者: Gopalji Gaur,Mohammadreza Zolfaghari,Thomas Brox
机构: University of Freiburg (弗莱堡大学); Zebracat AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 10 figures, GCPR

点击查看摘要

Abstract:Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model’s pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.
zh

[CV-134] What is Beneath Misogyny: Misogynous Memes Classification and Explanation

【速读】:该论文旨在解决如何有效检测、分类并解释图像与文本相结合的迷因(meme)中隐含的厌女主义(misogyny)内容这一挑战,其核心难点在于迷因的多模态特性(图像和文本)以及厌女思想在不同社会语境下的微妙表现形式。解决方案的关键在于提出了一种新颖的多模态框架MM-Misogyny,该框架分别处理图像和文本模态,并通过交叉注意力机制(cross-attention mechanism)将二者融合为统一的多模态上下文,进而利用分类器和大语言模型(Large Language Model, LLM)实现精准标注、细粒度分类及可解释性分析。该方法不仅提升了检测与分类性能,还揭示了厌女主义在家庭、职场、领导力和消费等生活领域中的具体运作机制。

链接: https://arxiv.org/abs/2508.03732
作者: Kushal Kanwar,Dushyant Singh Chauhan,Gopendra Vikram Singh,Asif Ekbal
机构: Jaypee University of Information Technology (贾伊皮大学信息与技术学院); CortexTor Labs; Amity Center for Artificial Intelligence (阿米蒂人工智能中心); Indian Institute of Technology Jodhpur (印度理工学院焦德尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Memes are popular in the modern world and are distributed primarily for entertainment. However, harmful ideologies such as misogyny can be propagated through innocent-looking memes. The detection and understanding of why a meme is misogynous is a research challenge due to its multimodal nature (image and text) and its nuanced manifestations across different societal contexts. We introduce a novel multimodal approach, namely MM-Misogyny, to detect, categorize, and explain misogynistic content in memes. MM-Misogyny processes text and image modalities separately and unifies them into a multimodal context through a cross-attention mechanism. The resulting multimodal context is then easily processed for labeling, categorization, and explanation via a classifier and Large Language Model (LLM). The evaluation of the proposed model is performed on a newly curated dataset (What’s Beneath Misogynous Stereotyping (WBMS)) created by collecting misogynous memes from cyberspace and categorizing them into four categories, namely Kitchen, Leadership, Working, and Shopping. The model not only detects and classifies misogyny, but also provides a granular understanding of how misogyny operates in domains of life. The results demonstrate the superiority of our approach compared to existing methods. The code and dataset are available at this https URL.
zh

[CV-135] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization ICRA2025

【速读】:该论文旨在解决热红外成像(Thermal Infrared Imaging, TIR)图像中普遍存在的非均匀固定模式噪声(non-uniform fixed-pattern noise)问题,此类噪声会显著干扰机器人感知任务如目标检测、定位与建图的性能。解决方案的关键在于提出一种基于扩散模型(diffusion-based)的TIR图像去噪框架,其核心创新包括:利用预训练稳定扩散模型(Stable Diffusion)进行潜空间(latent-space)优化,并引入离散小波变换(Discrete Wavelet Transform, DWT)与双树复小波变换(Dual-Tree Complex Wavelet Transform, DTCWT)损失函数以增强高频细节保留;同时设计级联精化阶段(cascaded refinement stage)进一步提升去噪结果的保真度,从而在基准数据集上实现优于现有方法的性能,并展现出对多样真实世界TIR场景的零样本泛化能力。

链接: https://arxiv.org/abs/2508.03727
作者: Tai Hyoung Rhee,Dong-guw Lee,Ayoung Kim
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Accepted at Thermal Infrared in Robotics (TIRO) Workshop, ICRA 2025

点击查看摘要

Abstract:Thermal infrared imaging exhibits considerable potential for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.
zh
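Only the wavelet-domain term is sketched below, in numpy with PyWavelets rather than the differentiable form used for fine-tuning; the haar wavelet and equal subband weighting are illustrative assumptions.

```python
import numpy as np
import pywt

def dwt_l1_loss(pred: np.ndarray, target: np.ndarray,
                wavelet: str = "haar") -> float:
    """L1 distance between the DWT subbands of a denoised and a clean TIR
    image; penalizing the high-frequency bands (cH, cV, cD) targets
    stripe-like fixed-pattern noise directly."""
    cA_p, (cH_p, cV_p, cD_p) = pywt.dwt2(pred, wavelet)
    cA_t, (cH_t, cV_t, cD_t) = pywt.dwt2(target, wavelet)
    pairs = zip((cA_p, cH_p, cV_p, cD_p), (cA_t, cH_t, cV_t, cD_t))
    return sum(float(np.abs(p - t).mean()) for p, t in pairs)

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 64))
noisy = clean + np.tile(rng.normal(size=(1, 64)), (64, 1))  # column stripes
print(round(dwt_l1_loss(noisy, clean), 4))
```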

[CV-136] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding

【速读】:该论文旨在解决集成电路(Integrated Circuit, IC)印刷电路板(Printed-Circuit-board, PCB)封装几何标注的自动化难题,即如何从IC机械图纸中准确提取并结构化描述引脚数量、中心坐标及尺寸等几何信息。当前大型多模态模型(Large Multimodal Models, LMMs)在几何感知方面存在显著不足,难以胜任此类任务。解决方案的关键在于提出一种名为LLM4-IC8K的新框架,其将IC机械图视为图像输入,并通过分步推理策略模拟工程师的思维过程:首先识别引脚数量,再计算每个引脚中心坐标,最后估计单个引脚尺寸;该框架采用两阶段训练机制,先在合成数据上预训练以掌握基础几何推理能力,再在真实数据手册图像上微调以提升实际场景下的鲁棒性和准确性,同时构建了包含8608个标注样本的多模态数据集ICGeo8K用于支持研究与验证。

链接: https://arxiv.org/abs/2508.03725
作者: Yida Wang,Taiting Lu,Runze Liu,Lanqing Yang,Yifan Yang,Zhe Chen,Yuehai Wang,Yixin Liu,Kaiyuan Lin,Xiaomeng Chen,Dian Ding,Yijie Li,Yi-Chao Chen,Yincheng Jin,Mahanth Gowda
机构: Shanghai Jiao Tong University (上海交通大学); Pennsylvania State University (宾夕法尼亚州立大学); Microsoft Research (微软研究院); National University of Singapore (新加坡国立大学); Binghamton University (布林顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.
zh

[CV-137] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

【速读】:该论文旨在系统性地梳理和总结音频-视觉分割(Audio-Visual Segmentation, AVS)领域的研究进展,解决当前多模态感知中对声音产生物体进行细粒度识别与分割的挑战。其关键解决方案在于全面分析AVS方法的核心组成部分:包括单模态与多模态编码架构、音频-视觉融合策略、解码器设计,以及从全监督到弱监督乃至无训练范式的多种学习方式;并通过在标准基准上的广泛对比,揭示不同架构选择、融合机制和训练范式对性能的影响,从而为未来研究提供清晰的方向指引。

链接: https://arxiv.org/abs/2508.03724
作者: Jia Li,Yapeng Tian
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through self- and weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.
zh

[CV-138] Multimodal Video Emotion Recognition with Reliable Reasoning Priors

【速读】:该论文旨在解决多模态情感识别(Multimodal Emotion Recognition, MER)中因类别不平衡导致的性能瓶颈问题,并提升模型对来自大语言模型(Multimodal Large Language Models, MLLMs)先验推理知识的信任度与利用效率。其关键解决方案在于:首先,利用Gemini生成细粒度且模态可分离的推理轨迹(reasoning traces),将其作为先验知识注入融合阶段以增强跨模态交互;其次,提出平衡双对比学习(Balanced Dual-Contrastive Learning)损失函数,同时优化类间与类内分布的平衡性,从而缓解数据类别不平衡问题。该方法在MER2024基准上显著提升了性能,验证了MLLM先验知识与轻量级融合网络之间协同增强的有效性。

链接: https://arxiv.org/abs/2508.03722
作者: Zhepeng Wang,Yingjian Zhu,Guanghao Dong,Hongzhu Yi,Feng Chen,Xinming Wang,Jun Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class-imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.
zh

[CV-139] Enhancing Diameter Measurement Accuracy in Machine Vision Applications

【速读】:该论文旨在解决相机测量系统中因机械和软件因素导致的测量误差问题,尤其是在使用相同测量设置对不同直径工件进行检测时出现的精度下降问题。解决方案的关键在于利用少量已知尺寸的参考件,提出两种创新校正方法:一是基于转换因子的方法,通过参考件估算转换系数以计算未知工件直径(mm);二是基于像素的方法,直接利用参考件的像素直径信息建立映射关系来估计目标直径。实验表明,这两种方法可将原本13–114微米的测量误差显著降低至1–2微米,从而在不增加硬件成本的前提下大幅提升测量精度与可靠性。

链接: https://arxiv.org/abs/2508.03721
作者: Ahmet Gokhan Poyraz,Ahmet Emir Dirik,Hakan Gurkan,Mehmet Kacmaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Preprint

点击查看摘要

Abstract:In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera’s field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
zh
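The conversion-factor approach reduces to a one-parameter least-squares fit through the origin; the reference values below are made up for illustration only.

```python
import numpy as np

def fit_conversion_factor(ref_px: np.ndarray, ref_mm: np.ndarray) -> float:
    """Least-squares mm-per-pixel factor from a few reference parts (the
    paper's first approach). ref_px: measured pixel diameters, ref_mm:
    certified diameters in mm."""
    return float((ref_px @ ref_mm) / (ref_px @ ref_px))

refs_px = np.array([812.4, 1625.1, 2437.9])  # illustrative measurements
refs_mm = np.array([3.000, 6.000, 9.000])    # illustrative certified sizes
k = fit_conversion_factor(refs_px, refs_mm)
print(round(k * 2030.5, 4))  # estimated diameter (mm) of an unknown part
```

The second, pixel-based approach described in the abstract would instead map the unknown part's pixel diameter directly against the references' pixel diameters, without an intermediate factor.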

[CV-140] Outlier Detection Algorithm for Circle Fitting

【速读】:该论文旨在解决工业场景中因点集噪声导致圆拟合算法性能下降的问题,特别是在高精度测量如工业垫圈直径检测中的应用。其解决方案的关键在于提出一种基于极坐标系的异常值检测(Polar Coordinate-Based Outlier Detection, PCOD)算法:首先将原始点集转换至极坐标空间,随后计算局部与全局标准差,通过比较局部均值与全局标准差来识别并剔除异常点,从而提升后续圆拟合的精度和鲁棒性。实验表明,该方法在多种圆拟合算法和异常值检测策略对比中表现最优,具有良好的工业适用性。

链接: https://arxiv.org/abs/2508.03720
作者: Ahmet Gökhan Poyraz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Preprint, not peer-reviewed

点击查看摘要

Abstract:Circle fitting methods are extensively utilized in various industries, particularly in quality control processes and design applications. The effectiveness of these algorithms can be significantly compromised when the point sets to be predicted are noisy. To mitigate this issue, outlier detection and removal algorithms are often applied before the circle fitting procedure. This study introduces the Polar Coordinate-Based Outlier Detection (PCOD) algorithm, which can be effectively employed in circle fitting applications. In the proposed approach, the point set is first transformed into polar coordinates, followed by the calculation of both local and global standard deviations. Outliers are then identified by comparing local mean values with the global standard deviation. The practicality and efficiency of the proposed method are demonstrated by focusing on the high-precision diameter measurement of industrial washer parts. Images from a machine vision system are processed through preprocessing steps, including sub-pixel edge detection. The resulting sub-pixel edge points are then cleaned using the proposed outlier detection and removal algorithm, after which circle fitting is performed. A comparison is made using ten different circle fitting algorithms and five distinct outlier detection methods. The results indicate that the proposed method outperforms the other approaches, delivering the best performance in terms of accuracy within the dataset, thereby demonstrating its potential for enhancing circle fitting applications in industrial environments.
zh
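A compact reading of the PCOD procedure from the abstract, with circular padding so the local angular window wraps around the contour; the window size and threshold multiplier are illustrative assumptions.

```python
import numpy as np

def pcod(points: np.ndarray, window: int = 15, k: float = 2.0) -> np.ndarray:
    """Polar Coordinate-Based Outlier Detection sketch: map edge points to
    polar radii about their centroid, then drop points whose radius deviates
    from the local (angular-window) mean by more than k global standard
    deviations."""
    c = points.mean(axis=0)
    d = points - c
    theta, r = np.arctan2(d[:, 1], d[:, 0]), np.hypot(d[:, 0], d[:, 1])
    order = np.argsort(theta)
    r_sorted = r[order]
    pad = window // 2  # circular padding: the angle wraps around
    r_ext = np.concatenate([r_sorted[-pad:], r_sorted, r_sorted[:pad]])
    local_mean = np.convolve(r_ext, np.ones(window) / window, mode="valid")
    keep = np.empty(len(points), dtype=bool)
    keep[order] = np.abs(r_sorted - local_mean) < k * r.std()
    return points[keep]

ang = np.linspace(0, 2 * np.pi, 400, endpoint=False)
pts = np.stack([50 * np.cos(ang), 50 * np.sin(ang)], axis=1)
pts[::40] *= 1.2                   # inject 10 radial outliers
print(len(pts) - len(pcod(pts)))   # ~10 points removed before circle fitting
```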

[CV-141] Tell Me Without Telling Me: Two-Way Prediction of Visualization Literacy and Visual Attention IEEE-VIS

【速读】:该论文旨在解决现有可视化设计中忽视个体差异(特别是视觉素养水平)对视觉注意力行为影响的问题,从而提升可视化解释的有效性。其关键解决方案在于提出两个基于视觉注意力模式与视觉素养水平关联性的计算模型:一是Lit2Sal,一种能够根据用户视觉素养水平预测其注意力分布的新型视觉显著性模型;二是Sal2Lit,一种仅需单张注意力图即可以86%准确率反推用户视觉素养水平的模型。这两个模型共同构建了一个可个性化调整的视觉数据传达框架,为增强个体理解提供了新路径。

链接: https://arxiv.org/abs/2508.03713
作者: Minsuk Chang,Yao Wang,Huichen Will Wang,Yuanhong Zhou,Andreas Bulling,Cindy Xiong Bearfield
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures, Accepted to 2025 IEEE VIS (Visualization and Visual Analytics)

点击查看摘要

Abstract:Accounting for individual differences can improve the effectiveness of visualization design. While the role of visual attention in visualization interpretation is well recognized, existing work often overlooks how this behavior varies based on visual literacy levels. Based on data from a 235-participant user study covering three visualization tests (mini-VLAT, CALVI, and SGL), we show that distinct attention patterns in visual data exploration can correlate with participants’ literacy levels: While experts (high-scorers) generally show a strong attentional focus, novices (low-scorers) focus less and explore more. We then propose two computational models leveraging these insights: Lit2Sal – a novel visual saliency model that predicts observer attention given their visualization literacy level, and Sal2Lit – a model to predict visual literacy from human visual attention data. Our quantitative and qualitative evaluation demonstrates that Lit2Sal outperforms state-of-the-art saliency models with literacy-aware considerations. Sal2Lit predicts literacy with 86% accuracy using a single attention map, providing a time-efficient supplement to literacy assessment that only takes less than a minute. Taken together, our unique approach to consider individual differences in salience models and visual attention in literacy assessments paves the way for new directions in personalized visual data communication to enhance understanding.
zh

[CV-142] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)培训应用开发中面临的核心挑战,即创建准确且具有吸引力的教学内容所需的时间、专业知识和资源投入过大。为应对这一问题,论文提出了一种基于大语言模型(Large Language Models, LLMs)的自动化方法,其关键在于构建一个由两个核心模块组成的系统:一是LLM模块,用于从文本输入中提取任务相关的信息;二是智能模块,负责将这些信息转化为VR环境中的动画演示和视觉提示。其中,指令生成器通过改变虚拟对象的颜色并创建动画来呈现操作流程,从而显著提升培训效果并降低开发成本,使VR培训更具可扩展性和适应性。

链接: https://arxiv.org/abs/2508.03699
作者: Subin Raj Peter
机构: University College Dublin (都柏林大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 7 pages, 7 figures, conference

点击查看摘要

Abstract:Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module interprets the information extracted by the LLM module, and an instruction generator then creates training content using relevant data from a database, conveying each instruction by changing the color of virtual objects and creating animations that illustrate the task. This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.
zh

[CV-143] PLA: Prompt Learning Attack against Text-to-Image Generative Models ICCV2025

【速读】:该论文旨在解决生成式 AI(Generative AI)中文本到图像(Text-to-Image, T2I)模型在黑盒设置下被恶意利用生成不安全内容(Not-Safe-For-Work, NSFW)的问题,特别是如何通过对抗性提示(adversarial prompts)绕过其内置的安全机制。解决方案的关键在于提出一种新颖的提示学习攻击框架(Prompt Learning Attack, PLA),该框架通过利用多模态相似性设计了适用于黑盒T2I模型的梯度驱动训练策略,从而在无模型内部结构和参数访问的情况下,有效提升对抗性提示的生成能力与攻击成功率。

链接: https://arxiv.org/abs/2508.03696
作者: Xinqi Lyu,Yihao Liu,Yanjie Li,Bin Xiao
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, and published to ICCV2025

点击查看摘要

Abstract:Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.
zh

[CV-144] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

【速读】:该论文旨在解决森林LiDAR三维点云中个体树木与语义分割的难题,尤其针对自然森林环境中点云数据的复杂性和多样性带来的挑战。其解决方案的关键在于提出了一种统一且端到端的框架ForestFormer3D,该框架融合了ISA-guided查询点选择机制、基于分数的块合并策略(score-based block merging strategy)以及一对一多关联机制(one-to-many association mechanism),从而在新提出的FOR-instanceV2数据集上实现了个体树木分割的最先进性能,并展现出对未见测试集(如Wytham Woods和LAUTx)的良好泛化能力,证明了模型在不同森林条件和传感器模态下的鲁棒性。

链接: https://arxiv.org/abs/2506.16991
作者: Binbin Xiang,Maciej Wielgosz,Stefano Puliti,Kamil Král,Martin Krůček,Azim Missarov,Rasmus Astrup
机构: Norwegian Institute of Bioeconomy Research (挪威生物经济研究所); Silva Tarouca Research Institute for Landscape and Ornamental Gardening (Silva Tarouca 景观与观赏园艺研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code are publicly available at this https URL.
zh

[CV-145] Super Resolved Imaging with Adaptive Optics ICCV2025

【速读】:该论文旨在解决天文望远镜中视场(Field of View, FoV)与图像分辨率之间的固有权衡问题:增大视场会导致科学相机对光学场的欠采样,从而降低图像质量。解决方案的关键在于利用现有自适应光学(Adaptive Optics, AO)系统中的变形镜,施加一系列可学习且精确控制的波前畸变,生成一组具有显著高频亚像素位移的图像序列,进而通过联合超分辨重建算法实现最终图像的超分辨率恢复。该方法能够在不干扰AO核心功能(即实时校正由地球大气引起的快速变化波前畸变)的前提下完成超分辨率处理,其创新性体现在端到端优化变形镜畸变模式与超分辨算法,充分考虑望远镜特定光学特性及大气波前的时间统计特性,实验和仿真结果表明,在仅使用现有硬件、无需额外改造的情况下,信噪比(SNR)提升可达12 dB。

链接: https://arxiv.org/abs/2508.04648
作者: Robin Swanson,Esther Y. H. Lin,Masen Lamb,Suresh Sivanandam,Kiriakos N. Kutulakos
机构: University of Toronto (多伦多大学); Dunlap Institute for Astronomy & Astrophysics (邓拉普天文与天体物理研究所); International Gemini Observatory (国际盖姆尼天文台); University of Victoria (维多利亚大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025 (IEEE/CVF International Conference on Computer Vision)

点击查看摘要

Abstract:Astronomical telescopes suffer from a tradeoff between field of view (FoV) and image resolution: increasing the FoV leads to an optical field that is under-sampled by the science camera. This work presents a novel computational imaging approach to overcome this tradeoff by leveraging the existing adaptive optics (AO) systems in modern ground-based telescopes. Our key idea is to use the AO system’s deformable mirror to apply a series of learned, precisely controlled distortions to the optical wavefront, producing a sequence of images that exhibit distinct, high-frequency, sub-pixel shifts. These images can then be jointly upsampled to yield the final super-resolved image. Crucially, we show this can be done while simultaneously maintaining the core AO operation–correcting for the unknown and rapidly changing wavefront distortions caused by Earth’s atmosphere. To achieve this, we incorporate end-to-end optimization of both the induced mirror distortions and the upsampling algorithm, such that telescope-specific optics and temporal statistics of atmospheric wavefront distortions are accounted for. Our experimental results with a hardware prototype, as well as simulations, demonstrate significant SNR improvements of up to 12 dB over non-AO super-resolution baselines, using only existing telescope optics and no hardware modifications. Moreover, by using a precise bench-top replica of a complete telescope and AO system, we show that our methodology can be readily transferred to an operational telescope. Project webpage: this https URL
zh

[CV-146] LA-CaRe-CNN: Cascading Refinement CNN for Left Atrial Scar Segmentation MICCAI

【速读】:该论文旨在解决心房颤动(Atrial Fibrillation, AF)患者个性化射频消融治疗中,如何准确分割左心房(Left Atrial, LA)及其瘢痕组织的问题,以支持患者特异性心脏数字孪生模型的构建。其关键解决方案是提出一种两级级联卷积神经网络(Left Atrial Cascading Refinement CNN, LA-CaRe-CNN),该模型在三维空间中端到端训练,第一阶段预测左心房区域,第二阶段结合原始图像信息对左心房内瘢痕组织进行精细化分割;同时通过强强度和空间增强策略提升模型对未知域的鲁棒性,从而实现高精度分割——在左心房分割上达到89.21% Dice相似系数(DSC)和1.6969 mm平均表面距离(ASSD),在更具挑战性的左心房瘢痕分割上达到64.59% DSC和91.80% G-DSC,为个性化靶向消融治疗提供了可靠基础。

链接: https://arxiv.org/abs/2508.04553
作者: Franz Thaler,Darko Stern,Gernot Plank,Martin Urschler
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for the MICCAI Challenge on Comprehensive Analysis and Computing of Real-World Medical Images 2024, 12 pages

点击查看摘要

Abstract:Atrial fibrillation (AF) represents the most prevalent type of cardiac arrhythmia for which treatment may require patients to undergo ablation therapy. In this surgery cardiac tissues are locally scarred on purpose to prevent electrical signals from causing arrhythmia. Patient-specific cardiac digital twin models show great potential for personalized ablation therapy, however, they demand accurate semantic segmentation of healthy and scarred tissue typically obtained from late gadolinium enhanced (LGE) magnetic resonance (MR) scans. In this work we propose the Left Atrial Cascading Refinement CNN (LA-CaRe-CNN), which aims to accurately segment the left atrium as well as left atrial scar tissue from LGE MR scans. LA-CaRe-CNN is a 2-stage CNN cascade that is trained end-to-end in 3D, where Stage 1 generates a prediction for the left atrium, which is then refined in Stage 2 in conjunction with the original image information to obtain a prediction for the left atrial scar tissue. To account for domain shift towards domains unknown during training, we employ strong intensity and spatial augmentation to increase the diversity of the training dataset. Our proposed method based on a 5-fold ensemble achieves great segmentation results, namely, 89.21% DSC and 1.6969 mm ASSD for the left atrium, as well as 64.59% DSC and 91.80% G-DSC for the more challenging left atrial scar tissue. Thus, segmentations obtained through LA-CaRe-CNN show great potential for the generation of patient-specific cardiac digital twin models and downstream tasks like personalized targeted ablation therapy to treat AF.
zh

[CV-147] Conditional Fetal Brain Atlas Learning for Automatic Tissue Segmentation MICCAI

【速读】:该论文旨在解决胎儿脑部磁共振成像(MRI)在评估过程中因胎龄(Gestational Age, GA)不确定性、大脑发育变异及成像协议差异导致的定量分析困难问题。其核心解决方案是提出一种基于深度学习的新型框架,用于生成连续且年龄特异性的胎儿脑部图谱,通过结合直接配准模型与条件判别器,在21至37周胎龄范围内实现高精度图像对齐与实时组织分割;该方法在六类脑组织上的平均Dice相似性系数(DSC)达86.3%,并能刻画神经典型生长轨迹,支持个体化发育评估与临床应用。

链接: https://arxiv.org/abs/2508.04522
作者: Johannes Tischer,Patric Kienast,Marlene Stümpflen,Gregor Kasprian,Georg Langs,Roxane Licandro
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 4 figures, MICCAI Workshop on Perinatal Imaging, Placental and Preterm Image analysis

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) of the fetal brain has become a key tool for studying brain development in vivo. Yet, its assessment remains challenging due to variability in brain maturation, imaging protocols, and uncertain estimates of Gestational Age (GA). To overcome these challenges, brain atlases provide a standardized reference framework that facilitates objective evaluation and comparison across subjects by aligning the atlas and subjects in a common coordinate system. In this work, we introduce a novel deep-learning framework for generating continuous, age-specific fetal brain atlases for real-time fetal brain tissue segmentation. The framework combines a direct registration model with a conditional discriminator, and is trained on a curated dataset of 219 neurotypical fetal MRIs spanning 21 to 37 weeks of gestation. The method achieves high registration accuracy, captures dynamic anatomical changes with sharp structural detail, and delivers robust segmentation performance with an average Dice Similarity Coefficient (DSC) of 86.3% across six brain tissues. Furthermore, volumetric analysis of the generated atlases reveals detailed neurotypical growth trajectories, providing valuable insights into the maturation of the fetal brain. This approach enables individualized developmental assessment with minimal pre-processing and real-time performance, supporting both research and clinical applications. The model code is available at this https URL
zh

[CV-148] OpenDCVCs: A PyTorch Open Source Implementation and Performance Evaluation of the DCVC series Video Codecs

【速读】:该论文旨在解决深度上下文视频压缩(Deep Contextual Video Compression, DCVC)领域中可复现性差的问题。现有公开代码多仅提供评估工具,缺乏训练支持,导致研究难以复现、基准测试不一致且后续开发受阻。解决方案的关键在于提出 OpenDCVCs——一个基于 PyTorch 的开源实现框架,统一提供了四种代表性 DCVC 模型(DCVC、DCVC-TCM、DCVC-HEM 和 DCVC-DC)的端到端训练与评估能力,配套详尽文档、标准化评估协议及跨数据集的基准结果,从而为社区提供透明、一致且可扩展的研究基础。

链接: https://arxiv.org/abs/2508.04491
作者: Yichi Zhang,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present OpenDCVCs, an open-source PyTorch implementation designed to advance reproducible research in learned video compression. OpenDCVCs provides unified and training-ready implementations of four representative Deep Contextual Video Compression (DCVC) models–DCVC, DCVC with Temporal Context Modeling (DCVC-TCM), DCVC with Hybrid Entropy Modeling (DCVC-HEM), and DCVC with Diverse Contexts (DCVC-DC). While the DCVC series achieves substantial bitrate reductions over both classical codecs and advanced learned models, previous public code releases have been limited to evaluation codes, presenting significant barriers to reproducibility, benchmarking, and further development. OpenDCVCs bridges this gap by offering a comprehensive, self-contained framework that supports both end-to-end training and evaluation for all included algorithms. The implementation includes detailed documentation, evaluation protocols, and extensive benchmarking results across diverse datasets, providing a transparent and consistent foundation for comparison and extension. All code and experimental tools are publicly available at this https URL, empowering the community to accelerate research and foster collaboration.
zh

[CV-149] TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration

【速读】:该论文旨在解决现有图像配准方法在多器官联合配准任务中通用性不足的问题,即大多数现有方法仅针对单一器官设计,难以直接应用于其他解剖区域。其解决方案的关键在于提出了一种名为TotalRegistrator的新型配准框架,该框架基于标准UNet结构并引入一种新颖的场分解策略(field decomposition strategy),能够同时对多个解剖区域进行配准,且模型轻量(训练仅需11GB GPU内存)。该方法在大规模纵向全腹盆部CT数据集上进行了训练与验证,并在多个外部独立数据集上展现出良好的泛化性能,证明了其跨器官、跨场景的适应能力。

链接: https://arxiv.org/abs/2508.04450
作者: Xuan Loc Pham,Gwendolyn Vuurberg,Marjan Doppen,Joey Roosen,Tip Stille,Thi Quynh Ha,Thuy Duong Quach,Quoc Vu Dang,Manh Ha Luu,Ewoud J. Smit,Hong Son Mai,Mattias Heinrich,Bram van Ginneken,Mathias Prokop,Alessa Hering
机构: Radboudumc (奈梅亨大学医学中心); Fraunhofer MEVIS (弗劳恩霍夫MEVIS研究所); University of Lübeck (吕贝克大学); Hospital 108 (108医院); Thai Nguyen National Hospital (太原国家医院); Tam Anh Hospital (Tam Anh医院); Vietnam National University, University of Engineering and Technology (越南国家大学,工程技术大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image registration is a fundamental technique in the analysis of longitudinal and multi-phase CT images within clinical practice. However, most existing methods are tailored for single-organ applications, limiting their generalizability to other anatomical regions. This work presents TotalRegistrator, an image registration framework capable of aligning multiple anatomical regions simultaneously using a standard UNet architecture and a novel field decomposition strategy. The model is lightweight, requiring only 11GB of GPU memory for training. To train and evaluate our method, we constructed a large-scale longitudinal dataset comprising 695 whole-body (thorax-abdomen-pelvic) paired CT scans from individual patients acquired at different time points. We benchmarked TotalRegistrator against a generic classical iterative algorithm and a recent foundation model for image registration. To further assess robustness and generalizability, we evaluated our model on three external datasets: the public thoracic and abdominal datasets from the Learn2Reg challenge, and a private multiphase abdominal dataset from a collaborating hospital. Experimental results on the in-house dataset show that the proposed approach generally surpasses baseline methods in multi-organ abdominal registration, with a slight drop in lung alignment performance. On out-of-distribution datasets, it achieved competitive results compared to leading single-organ models, despite not being fine-tuned for those tasks, demonstrating strong generalizability. The source code will be publicly available at: this https URL.
zh

[CV-150] Unmasking Interstitial Lung Diseases: Leveraging Masked Autoencoders for Diagnosis

【速读】:该论文旨在解决扩散性肺部疾病(diffused lung disease)诊断中因标注影像数据稀缺而导致模型训练困难的问题。解决方案的关键在于利用掩码自编码器(Masked Autoencoders, MAEs)在无标签数据上的预训练能力,通过在超过5000张胸部CT扫描图像上进行预训练,这些图像包括来自机构内部及公开来源(如COVID-19和细菌性肺炎)具有相似影像学特征的数据,从而学习到鲁棒且具有临床意义的特征表示;随后在下游分类任务中微调模型,显著提升了在小样本标注数据下的诊断性能。

链接: https://arxiv.org/abs/2508.04429
作者: Ethan Dack,Lorenzo Brigato,Vasilis Dedousis,Janine Gote-Schniering,Cheryl,Hanno Hoppe,Aristomenis Exadaktylos,Manuela Funke-Chambour,Thomas Geiser,Andreas Christe,Lukas Ebner,Stavroula Mougiakakou
机构: University of Bern (伯尔尼大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Masked autoencoders (MAEs) have emerged as a powerful approach for pre-training on unlabelled data, capable of learning robust and informative feature representations. This is particularly advantageous in diffused lung disease research, where annotated imaging datasets are scarce. To leverage this, we train an MAE on a curated collection of over 5,000 chest computed tomography (CT) scans, combining in-house data with publicly available scans from related conditions that exhibit similar radiological patterns, such as COVID-19 and bacterial pneumonia. The pretrained MAE is then fine-tuned on a downstream classification task for diffused lung disease diagnosis. Our findings demonstrate that MAEs can effectively extract clinically meaningful features and improve diagnostic performance, even in the absence of large-scale labelled datasets. The code and the models are available here: this https URL.
zh

[CV-151] PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography

【速读】:该论文旨在解决当前视觉语言模型(VLMs)在分子影像学领域,特别是正电子发射断层成像(PET)图像报告生成任务中表现不足的问题。现有研究多集中于结构影像模态(如CT或MRI),而忽视了PET特有的代谢信息描述需求。解决方案的关键在于构建首个专门面向PET图像报告生成的大规模基准数据集PET2Rep,其包含覆盖全身多个器官的完整图像-报告对,并引入临床效率指标以量化评估放射性示踪剂摄取模式在关键器官中的描述准确性,从而系统性地推动医学VLM在PET场景下的发展与优化。

链接: https://arxiv.org/abs/2508.04062
作者: Yichi Zhang,Wenbo Zhang,Zehui Ling,Gang Feng,Sisi Peng,Deshu Chen,Yuchen Liu,Hongwei Zhang,Shuqi Wang,Lanlan Li,Limei Han,Yuan Cheng,Zixin Hu,Yuan Qi,Le Xue
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficiency metrics to evaluate the quality of radiotracer uptake pattern description in key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.
zh

[CV-152] UNISELF: A Unified Network with Instance Normalization and Self-Ensembled Lesion Fusion for Multiple Sclerosis Lesion Segmentation

【Quick Read】: This paper targets the difficulty that deep learning (DL) methods for automated multiple sclerosis (MS) lesion segmentation have in optimizing in-domain accuracy and out-of-domain generalization at the same time, especially when trained on a single source with limited data. The key to UNISELF is twofold: a test-time self-ensembled lesion fusion that raises in-domain segmentation accuracy, and test-time instance normalization (TTIN) of latent features, which counters the domain shifts and missing contrasts caused by differences in acquisition protocols, scanner types, and imaging artifacts, markedly improving robustness and generalization across diverse out-of-domain test datasets.

Link: https://arxiv.org/abs/2508.03982
Authors: Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Yihao Liu, Savannah P. Hays, Dzung L. Pham, Ellen M. Mowry, Scott D. Newsome, Peter A. Calabresi, Aaron Carass, Jerry L. Prince
Institutions: Johns Hopkins University; Vanderbilt University; Johns Hopkins School of Medicine; Uniformed Services University of the Health Sciences
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated segmentation of multiple sclerosis (MS) lesions using multicontrast magnetic resonance (MR) images improves efficiency and reproducibility compared to manual delineation, with deep learning (DL) methods achieving state-of-the-art performance. However, these DL-based methods have yet to simultaneously optimize in-domain accuracy and out-of-domain generalization when trained on a single source with limited data, or their performance has been unsatisfactory. To fill this gap, we propose a method called UNISELF, which achieves high accuracy within a single training domain while demonstrating strong generalizability across multiple out-of-domain test datasets. UNISELF employs a novel test-time self-ensembled lesion fusion to improve segmentation accuracy, and leverages test-time instance normalization (TTIN) of latent features to address domain shifts and missing input contrasts. Trained on the ISBI 2015 longitudinal MS segmentation challenge training dataset, UNISELF ranks among the best-performing methods on the challenge test dataset. Additionally, UNISELF outperforms all benchmark methods trained on the same ISBI training data across diverse out-of-domain test datasets with domain shifts and missing contrasts, including the public MICCAI 2016 and UMCL datasets, as well as a private multisite dataset. These test datasets exhibit domain shifts and/or missing contrasts caused by variations in acquisition protocols, scanner types, and imaging artifacts arising from imperfect acquisition. Our code is available at this https URL.
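The TTIN idea fits in a few lines: at test time, each channel of a latent feature map is re-standardized with statistics computed from the test input itself, discarding training-set statistics that no longer hold under a domain shift. This is a minimal sketch of the concept only; where the paper places the normalization and whether learned affine terms are reused are details not reproduced here.

```python
import torch

def instance_norm_at_test(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Re-standardize each channel of a latent feature map using the
    statistics of the test image itself.  feat: (B, C, H, W)."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (feat - mu) / torch.sqrt(var + eps)

# usage: normalize bottleneck features of a segmentation network at test time
feat = torch.randn(1, 64, 32, 32)          # latent features for one test slice
print(instance_norm_at_test(feat).shape)   # torch.Size([1, 64, 32, 32])
```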

[CV-153] Fast Magnetic Resonance Simulation Using Combined Update with Grouped Isochromats

【Quick Read】: This paper attacks a computational bottleneck built into conventional magnetic resonance (MR) simulators: the assumption that every isochromat must be simulated individually. The proposed method groups isochromats that behave identically before simulation (for example, isochromats sharing the same x-axis position, T1, T2, and field-inhomogeneity values), so that parts of the computation can be shared within each group while a pulse sequence is processed; sequences whose gradients act along the x-axis only benefit especially. Experiments show speedups of 3 to 72 times over the conventional approach: with 27.5 million isochromats, SIMD instructions, and multi-threading, fast spin echo (FSE) and echo-planar imaging (EPI) simulations drop from 208.4 and 66.4 seconds to 38.1 and 7.1 seconds, respectively.

Link: https://arxiv.org/abs/2508.03960
Authors: Hidenori Takeshima
Institutions: Canon Medical Systems Corporation
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:This work aims to overcome an assumption of conventional MR simulators: Individual isochromats should be simulated individually. To reduce the computational times of MR simulation, a new simulation method using grouped isochromats is proposed. When multiple isochromats are grouped before simulations, some parts of the simulation can be shared in each group. For a certain gradient type, the isochromats in the group can be easily chosen for ensuring that they behave the same. For example, the group can be defined as the isochromats whose locations along x-axis, T1, T2 and magnetic field inhomogeneity values are the same values. In such groups, simulations can be combined when a pulse sequence with the magnetic field gradient along x-axis only are processed. The processing times of the conventional and proposed methods were evaluated with several sequences including fast spin echo (FSE) and echo-planar imaging (EPI) sequences. The simulation times of the proposed method were 3 to 72 times faster than those of the conventional methods. In the cases of 27.5 million isochromats using single instruction multiple data (SIMD) instructions and multi-threading, the conventional method simulated FSE and EPI sequences in 208.4 and 66.4 seconds, respectively. In the same cases, the proposed method simulated these sequences in 38.1 and 7.1 seconds, respectively.
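A toy numpy sketch of the grouping idea follows: isochromats that share the same quantized x position and T2 accrue identical phase under an x-gradient, so the complex rotation is computed once per unique group and weighted by the group size when summing the signal. Units, parameter values, and the single-step "sequence" are deliberately simplified assumptions, not the paper's simulator.

```python
import numpy as np

# Isochromats sharing (x, T2) behave identically under an x-gradient,
# so one computation per unique pair serves the whole group.
rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 256, n)                 # quantized x positions -> many duplicates
t2 = rng.choice([50e-3, 80e-3], n)          # two tissue T2 values (s)

phase_per_x = 0.01                          # toy phase increment per unit x (rad)

# conventional: one complex rotation and decay per isochromat
sig_naive = np.exp(1j * phase_per_x * x) * np.exp(-10e-3 / t2)

# grouped: unique (x, T2) pairs computed once, then weighted by group size
keys, counts = np.unique(np.stack([x, t2]), axis=1, return_counts=True)
sig_group = np.exp(1j * phase_per_x * keys[0]) * np.exp(-10e-3 / keys[1])

assert np.allclose(sig_naive.sum(), (sig_group * counts).sum())
```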

[CV-154] When Deep Learning Fails: Limitations of Recurrent Models on Stroke-Based Handwriting for Alzheimer's Disease Detection

【Quick Read】: This paper asks whether handwriting analysis can enable non-invasive Alzheimer's disease (AD) detection, since AD detection otherwise relies on expensive neuroimaging or invasive procedures that limit accessibility. The key finding concerns deep models (LSTM, GRU, RNN) that process feature vectors extracted from discrete strokes rather than raw temporal signals: because these recurrent architectures are designed to capture a continuous temporal flow while the handwriting data is inherently discrete and built from ambiguously segmented strokes, they show poor specificity and high variance, and traditional ensemble methods significantly outperform every deep architecture on accuracy and balanced metrics. This exposes a fundamental mismatch between architectural assumptions and data representation and points out directions for future work.

Link: https://arxiv.org/abs/2508.03773
Authors: Emanuele Nardone, Tiziana D'Alessandro, Francesco Fontanella, Claudio De Stefano
Institutions: University of Cassino and Southern Lazio; Department of Electrical and Information Engineering (DIEI)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Alzheimer’s disease detection requires expensive neuroimaging or invasive procedures, limiting accessibility. This study explores whether deep learning can enable non-invasive Alzheimer’s disease detection through handwriting analysis. Using a dataset of 34 distinct handwriting tasks collected from healthy controls and Alzheimer’s disease patients, we evaluate and compare three recurrent neural architectures (LSTM, GRU, RNN) against traditional machine learning models. A crucial distinction of our approach is that the recurrent models process pre-extracted features from discrete strokes, not raw temporal signals. This violates the assumption of a continuous temporal flow that recurrent networks are designed to capture. Results reveal that they exhibit poor specificity and high variance. Traditional ensemble methods significantly outperform all deep architectures, achieving higher accuracy with balanced metrics. This demonstrates that recurrent architectures, designed for continuous temporal sequences, fail when applied to feature vectors extracted from ambiguously segmented strokes. Despite their complexity, deep learning models cannot overcome the fundamental disconnect between their architectural assumptions and the discrete, feature-based nature of stroke-level handwriting data. Although performance is limited, the study highlights several critical issues in data representation and model compatibility, pointing to valuable directions for future research.

[CV-155] Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global Multiethnic Population (Study Protocol)

【Quick Read】: This paper addresses the reader dependence and inconsistency of prostate cancer MRI interpretation, focusing on accurate, reproducible detection of Gleason grade group ≥2 (clinically significant) prostate cancer. The key is PI-CAI-2B, an efficient next-generation AI model trained and externally validated on an intercontinental, multicenter retrospective cohort of 22,481 MRI examinations, with training data from two EU Horizon projects and independent centers in Europe, North America, Asia, and Africa, and external testing on population-based screening and primary diagnostic cohorts. The primary endpoint is the proportion of AI assessments that agree with the standard of care (diagnoses by expert uropathologists on histopathology, or by at least two expert urogenital radiologists in consensus), under a prespecified statistical hypothesis of diagnostic interchangeability to ensure generalizability and reliability across populations.

Link: https://arxiv.org/abs/2508.03762
Authors: Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B.C.D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman (on behalf of the PI-CAI, ProCAncer-I, COMFORT, STHLM3-MRI and PRIME consortia)
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group \geq 2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South Americas, Asia and Australia, are used for external testing. Primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group \geq 2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS \geq 3 (primary diagnosis) or \geq 4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).

[CV-156] Assessing the Impact of Image Super Resolution on White Blood Cell Classification Accuracy

【Quick Read】: This paper addresses the accuracy problem that low-resolution microscopy poses for white blood cell classification, which limits the effectiveness of deep learning models in medical diagnostics. The key to the solution is to enhance the raw images with modern image super-resolution techniques and to include the enhanced high-resolution images in the training process, so the model can pick up subtler morphological cues; the study then measures how this enhancement, and the trade-offs between ordinary and enhanced images, affect classification accuracy.

Link: https://arxiv.org/abs/2508.03759
Authors: Tatwadarshi P. Nagarhalli, Shruti S. Pawar, Soham A. Dahanukar, Uday Aswalekar, Ashwini M. Save, Sanket D. Patil
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Accurately classifying white blood cells from microscopic images is essential to identify several illnesses and conditions in medical diagnostics. Many deep learning technologies are being employed to quickly and automatically classify images. However, most of the time, the resolution of these microscopic pictures is quite low, which might make it difficult to classify them correctly. Some picture improvement techniques, such as image super-resolution, are being utilized to improve the resolution of the photos to get around this issue. The suggested study uses large image dimension upscaling to investigate how picture-enhancing approaches affect classification performance. The study specifically looks at how deep learning models may be able to understand more complex visual information by capturing subtler morphological changes when image resolution is increased using cutting-edge techniques. The model may learn from standard and augmented data since the improved images are incorporated into the training process. This dual method seeks to comprehend the impact of image resolution on model performance and enhance classification accuracy. A well-known model for picture categorization is used to conduct extensive testing and thoroughly evaluate the effectiveness of this approach. This research intends to create more efficient image identification algorithms customized to a particular dataset of white blood cells by understanding the trade-offs between ordinary and enhanced images.

[CV-157] FUTransUNet-GradCAM: A Hybrid Transformer-U-Net with Self-Attention and Explainable Visualizations for Foot Ulcer Segmentation

【Quick Read】: This paper tackles the difficulty of automatically segmenting diabetic foot ulcers (DFUs) in clinical photographs, where irregular ulcer morphology, complex backgrounds, and heterogeneous appearance make the task hard. Traditional convolutional neural networks (CNNs) such as U-Net localize well but, constrained by small receptive fields, struggle to model long-range spatial dependencies. The key to the solution is FUTransUNet, a hybrid architecture that embeds the global attention of Vision Transformers (ViTs) into the U-Net framework: it extracts global context while retaining the fine spatial resolution provided by skip connections and an effective decoding path, yielding more accurate, robust, and explainable DFU segmentation.

Link: https://arxiv.org/abs/2508.03758
Authors: Akwasi Asare, Mary Sagoe, Justice Williams Asare
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated segmentation of diabetic foot ulcers (DFUs) plays a critical role in clinical diagnosis, therapeutic planning, and longitudinal wound monitoring. However, this task remains challenging due to the heterogeneous appearance, irregular morphology, and complex backgrounds associated with ulcer regions in clinical photographs. Traditional convolutional neural networks (CNNs), such as U-Net, provide strong localization capabilities but struggle to model long-range spatial dependencies due to their inherently limited receptive fields. To address this, we propose FUTransUNet, a hybrid architecture that integrates the global attention mechanism of Vision Transformers (ViTs) into the U-Net framework. This combination allows the model to extract global contextual features while maintaining fine-grained spatial resolution through skip connections and an effective decoding pathway. We trained and validated FUTransUNet on the public Foot Ulcer Segmentation Challenge (FUSeg) dataset. FUTransUNet achieved a training Dice Coefficient of 0.8679, an IoU of 0.7672, and a training loss of 0.0053. On the validation set, the model achieved a Dice Coefficient of 0.8751, an IoU of 0.7780, and a validation loss of 0.009045. To ensure clinical transparency, we employed Grad-CAM visualizations, which highlighted model focus areas during prediction. These quantitative outcomes clearly demonstrate that our hybrid approach successfully integrates global and local feature extraction paradigms, thereby offering a highly robust, accurate, explainable, interpretable, and clinically translatable solution for automated foot ulcer analysis. The approach offers a reliable, high-fidelity solution for DFU segmentation, with implications for improving real-world wound assessment and patient care.
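For readers unfamiliar with the visualization step, here is a generic Grad-CAM sketch in PyTorch: gradients of the class score are averaged over space to weight the activation maps of a chosen convolutional layer. A stock VGG16 stands in for the paper's hybrid network; the model, layer choice, and input are placeholder assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def grad_cam(model, layer, x, class_idx=None):
    """Classic Grad-CAM: weight a conv layer's activation maps by the
    spatially averaged gradient of the target class score, then ReLU."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(x)
    idx = int(logits.argmax(1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)           # channel importance
    cam = F.relu((w * acts[0]).sum(dim=1, keepdim=True))  # weighted activation sum
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

model = vgg16(weights=None).eval()
heatmap = grad_cam(model, model.features[28], torch.randn(1, 3, 224, 224))
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```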

[CV-158] Classification non supervisées dacquisitions hyperspectrales codées : quelles vérités terrain ?

【Quick Read】: This paper addresses unsupervised classification from a small number of coded acquisitions produced by a compressive DD-CASSI hyperspectral imager, where the core tension is between the information lost to compression and the fuzziness of class definitions. The key is a simple model of intra-class spectral variability: with it, spectrally more coherent regions can be identified, classes recovered, and reference spectra estimated despite tenfold data compression, while the analysis also exposes weaknesses of the usual ground truths (no clear definition of a class, high intra-class variability, outright labeling errors) and argues that the evaluation of classification methods, especially unsupervised ones, needs rethinking.

Link: https://arxiv.org/abs/2508.03753
Authors: Trung-tin Dinh (IRAP, LAAS-PHOTO, UT3, LAAS), Hervé Carfantan (IRAP), Antoine Monmayrant (LAAS-PHOTO), Simon Lacroix (LAAS-RIS)
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments: In French. 30e Colloque sur le traitement du signal et des images, GRETSI - Groupe de Recherche en Traitement du Signal et des Images, GRETSI, Aug 2025, Strasbourg, France

Abstract:We propose an unsupervised classification method using a limited number of coded acquisitions from a DD-CASSI hyperspectral imager. Based on a simple model of intra-class spectral variability, this approach allows us to identify classes and estimate reference spectra, despite data compression by a factor of ten. Here, we highlight the limitations of the ground truths commonly used to evaluate this type of method: lack of a clear definition of the notion of class, high intra-class variability, and even classification errors. Using the Pavia University scene, we show that with simple assumptions, it is possible to detect regions that are spectrally more coherent, highlighting the need to rethink the evaluation of classification methods, particularly in unsupervised scenarios.

[CV-159] M3HL: Mutual Mask Mix with High-Low Level Feature Consistency for Semi-Supervised Medical Image Segmentation MICCAI2025

【Quick Read】: This paper addresses the rigidity of current CutMix-style augmentation in semi-supervised medical image segmentation and its neglect of feature-level consistency constraints. The key is M³HL (Mutual Mask Mix with High-Low level feature consistency), with two parts: M³, a masking-strategy augmentation inspired by Masked Image Modeling (MIM) that advances conventional CutMix with dynamically adjustable masks, generating spatially complementary image pairs for collaborative training and fusing information between labeled and unlabeled images; and HL, a hierarchical consistency regularization framework that enforces high- and low-level feature consistency between unlabeled and mixed images so the model captures more discriminative features. The method reaches state-of-the-art results on mainstream benchmarks including ACDC and LA.

Link: https://arxiv.org/abs/2508.03752
Authors: Yajun Liu, Zenghui Zhang, Jiang Yue, Weiwei Guo, Dongying Li
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025

Abstract:Data augmentation methods inspired by CutMix have demonstrated significant potential in recent semi-supervised medical image segmentation tasks. However, these approaches often apply CutMix operations in a rigid and inflexible manner, while paying insufficient attention to feature-level consistency constraints. In this paper, we propose a novel method called Mutual Mask Mix with High-Low level feature consistency (M³HL) to address the aforementioned challenges, which consists of two key components: 1) M³: An enhanced data augmentation operation inspired by the masking strategy from Masked Image Modeling (MIM), which advances conventional CutMix through dynamically adjustable masks to generate spatially complementary image pairs for collaborative training, thereby enabling effective information fusion between labeled and unlabeled images. 2) HL: A hierarchical consistency regularization framework that enforces high-level and low-level feature consistency between unlabeled and mixed images, enabling the model to better capture discriminative feature representations. Our method achieves state-of-the-art performance on widely adopted medical image segmentation benchmarks including the ACDC and LA datasets. Source code is available at this https URL
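The complementary-mask idea is easy to state in code. The sketch below samples one coarse block mask M and forms the two mixtures M·x_l + (1−M)·x_u and (1−M)·x_l + M·x_u; the block size and mask ratio are illustrative stand-ins for the paper's dynamically adjustable masks.

```python
import torch

def mutual_mask_mix(x_l, x_u, mask_ratio=0.5, block=32):
    """Build two spatially complementary mixtures of a labeled image x_l and
    an unlabeled image x_u from one random block mask M.  A coarse grid is
    sampled and upsampled so masked regions are contiguous patches."""
    B, C, H, W = x_l.shape
    grid = (torch.rand(B, 1, H // block, W // block) < mask_ratio).float()
    m = torch.nn.functional.interpolate(grid, size=(H, W), mode="nearest")
    return m * x_l + (1 - m) * x_u, (1 - m) * x_l + m * x_u

x_l, x_u = torch.randn(4, 1, 256, 256), torch.randn(4, 1, 256, 256)
x1, x2 = mutual_mask_mix(x_l, x_u)  # train on both views; mix pseudo-labels the same way
```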

[CV-160] Do We Need Pre-Processing for Deep Learning Based Ultrasound Shear Wave Elastography?

【Quick Read】: This paper addresses the limited reproducibility and standardization of ultrasound shear wave elastography (SWE) caused by differing image-processing pipelines: SWE conventionally relies on pre-processing steps such as beamforming and filtering that can bias results and hinder consistency across devices and algorithms. The key to the solution is a deep learning approach built around a 3D convolutional neural network that predicts shear wave velocity directly from spatio-temporal ultrasound data, studying input degrees of pre-processing ranging from fully beamformed and filtered images down to raw radiofrequency (RF) data. The experiments show that the network reliably separates gelatin phantoms of different elasticity levels even with no pre-processing at all, which promises faster and more consistent clinical elasticity assessment.

Link: https://arxiv.org/abs/2508.03744
Authors: Sarah Grube, Sören Grünhagen, Sarah Latus, Michael Meyling, Alexander Schlaefer
Institutions: Hamburg University of Technology; SustAInLivWork Center of Excellence
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CURAC conference 2025

Abstract:Estimating the elasticity of soft tissue can provide useful information for various diagnostic applications. Ultrasound shear wave elastography offers a non-invasive approach. However, its generalizability and standardization across different systems and processing pipelines remain limited. Considering the influence of image processing on ultrasound based diagnostics, recent literature has discussed the impact of different image processing steps on reliable and reproducible elasticity analysis. In this work, we investigate the need of ultrasound pre-processing steps for deep learning-based ultrasound shear wave elastography. We evaluate the performance of a 3D convolutional neural network in predicting shear wave velocities from spatio-temporal ultrasound images, studying different degrees of pre-processing on the input images, ranging from fully beamformed and filtered ultrasound images to raw radiofrequency data. We compare the predictions from our deep learning approach to a conventional time-of-flight method across four gelatin phantoms with different elasticity levels. Our results demonstrate statistically significant differences in the predicted shear wave velocity among all elasticity groups, regardless of the degree of pre-processing. Although pre-processing slightly improves performance metrics, our results show that the deep learning approach can reliably differentiate between elasticity groups using raw, unprocessed radiofrequency data. These results show that deep learning-based approaches could reduce the need for and the bias of traditional ultrasound pre-processing steps in ultrasound shear wave elastography, enabling faster and more reliable clinical elasticity assessments.

[CV-161] Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

【Quick Read】: This paper addresses the semantic density gap between medical images and diagnostic reports that stems from their differing signal-to-noise ratios and leads to visual alignment bias in medical vision-language pre-training. The key idea is to boost vision semantic density from two sides: disease-level vision contrastive learning strengthens the model's ability to separate normal from abnormal samples for each anatomical structure, while an anatomical normality modeling method uses a VQ-VAE to reconstruct normal vision embeddings in the latent space, so the distribution shift of abnormal samples amplifies abnormal signals and sharpens the model's perception and discrimination of abnormal attributes. The enhanced visual representation captures diagnosis-relevant semantics and supports more efficient and accurate alignment with reports across tasks.

Link: https://arxiv.org/abs/2508.03742
Authors: Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Institutions: Zhejiang University; DAMO Academy, Alibaba Group; The First Affiliated Hospital of College of Medicine, Zhejiang University; Hupan Lab
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model’s ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in the latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model’s perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-CT69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9% across 54 diseases in 15 organs, significantly surpassing existing methods. Additionally, we demonstrate the superior transfer learning capabilities of our pre-trained model. Code is available at this https URL.

[CV-162] A Modified VGG19-Based Framework for Accurate and Interpretable Real-Time Bone Fracture Detection

【Quick Read】: This paper addresses the delays and misdiagnoses in early bone fracture detection: X-ray reading is slow and error-prone where radiology expertise is scarce, and current deep learning methods misclassify and offer no interpretable explanation, which blocks clinical trust. The key is a modified VGG-19 framework that couples advanced pre-processing, including Contrast Limited Adaptive Histogram Equalization (CLAHE), Otsu's thresholding, and Canny edge detection, to sharpen images and aid feature extraction, with Grad-CAM (Gradient-weighted Class Activation Mapping), an explainable AI method that renders decision heatmaps so clinicians can follow the model's reasoning, supporting trust and clinical validation. Deployed as a real-time web application, the system returns diagnostic feedback within 0.5 seconds and attains 99.78% classification accuracy with an AUC of 1.00, yielding a fast, reliable, and interpretable fracture detection solution.

Link: https://arxiv.org/abs/2508.03739
Authors: Md. Ehsanul Haque, Abrar Fahim, Shamik Dey, Syoda Anamika Jahan, S. M. Jahidul Islam, Sakib Rokoni, Md Sakib Morshed
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented at the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT), held at IIT Indore, Madhya Pradesh, India

Abstract:Early and accurate detection of the bone fracture is paramount to initiating treatment as early as possible and avoiding any delay in patient treatment and outcomes. Interpretation of X-ray image is a time consuming and error prone task, especially when resources for such interpretation are limited by lack of radiology expertise. Additionally, deep learning approaches used currently, typically suffer from misclassifications and lack interpretable explanations to clinical use. In order to overcome these challenges, we propose an automated framework of bone fracture detection using a VGG-19 model modified to our needs. It incorporates sophisticated preprocessing techniques that include Contrast Limited Adaptive Histogram Equalization (CLAHE), Otsu’s thresholding, and Canny edge detection, among others, to enhance image clarity as well as to facilitate the feature extraction. Therefore, we use Grad-CAM, an Explainable AI method that can generate visual heatmaps of the model’s decision making process, as a type of model interpretability, for clinicians to understand the model’s decision making process. It encourages trust and helps in further clinical validation. It is deployed in a real time web application, where healthcare professionals can upload X-ray images and get the diagnostic feedback within 0.5 seconds. The performance of our modified VGG-19 model attains 99.78% classification accuracy and AUC score of 1.00, making it exceptionally good. The framework provides a reliable, fast, and interpretable solution for bone fracture detection that reasons more efficiently for diagnoses and better patient care.
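The pre-processing chain named in the abstract maps directly onto standard OpenCV calls. The sketch below, with illustrative parameter values, stacks the CLAHE, Otsu, and Canny outputs into a three-channel input for the classifier; this is one plausible way to combine the cues, not necessarily the paper's exact wiring.

```python
import cv2
import numpy as np

def preprocess_xray(path: str) -> np.ndarray:
    """CLAHE for local contrast, Otsu's threshold to separate bone from
    background, Canny for edge emphasis; the three maps are stacked so a
    CNN sees intensity, region, and edge cues together."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)          # (H, W) uint8
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)
    _, otsu = cv2.threshold(clahe, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(clahe, 50, 150)
    return np.stack([clahe, otsu, edges], axis=-1)        # (H, W, 3) pseudo-RGB
```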

[CV-163] Improve Retinal Artery/Vein Classification via Channel Coupling

【Quick Read】: This paper addresses a task-independence problem in automated retinal vessel segmentation and artery/vein (A/V) classification: existing methods treat vessel, artery, and vein segmentation as three separate binary tasks and ignore the intrinsic anatomical coupling among these structures, yielding inconsistent predictions and limiting classification accuracy. The key is a new Channel-Coupled Vessel Consistency Loss that enforces coherence and consistency among the vessel, artery, and vein predictions, so the network is not biased toward three isolated binary tasks; an intra-image pixel-level contrastive regularization term additionally extracts more discriminative fine-grained feature representations for accurate A/V classification. State-of-the-art (SOTA) results are achieved on three public datasets: RITE, LES-AV, and HRF.

Link: https://arxiv.org/abs/2508.03738
Authors: Shuang Zeng, Chee Hong Lee, Kaiwen Li, Boxu Xie, Ourui Fu, Hangzhou He, Lei Zhu, Yanye Lu, Fangxiao Cheng
Institutions: Peking University Health Science Center, Peking University; National Biomedical Imaging Center, Peking University; Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Retinal vessel segmentation plays a vital role in analyzing fundus images for the diagnosis of systemic and ocular diseases. Building on this, classifying segmented vessels into arteries and veins (A/V) further enables the extraction of clinically relevant features such as vessel width, diameter and tortuosity, which are essential for detecting conditions like diabetic and hypertensive retinopathy. However, manual segmentation and classification are time-consuming, costly and inconsistent. With the advancement of Convolutional Neural Networks, several automated methods have been proposed to address this challenge, but there are still some issues. For example, the existing methods all treat artery, vein and overall vessel segmentation as three separate binary tasks, neglecting the intrinsic coupling relationships between these anatomical structures. Considering artery and vein structures are subsets of the overall retinal vessel map and should naturally exhibit prediction consistency with it, we design a novel loss named Channel-Coupled Vessel Consistency Loss to enforce the coherence and consistency between vessel, artery and vein predictions, avoiding biasing the network toward three simple binary segmentation tasks. Moreover, we also introduce a regularization term named intra-image pixel-level contrastive loss to extract more discriminative feature-level fine-grained representations for accurate retinal A/V classification. SOTA results have been achieved across three public A/V classification datasets including RITE, LES-AV and HRF. Our code will be available upon acceptance.
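The anatomical coupling can be expressed as soft constraints on the three predicted probability maps. The sketch below is one plausible formulation for illustration, not the paper's exact Channel-Coupled Vessel Consistency Loss: artery and vein probabilities should not exceed the overall vessel probability, and every vessel pixel should be explained by artery or vein.

```python
import torch
import torch.nn.functional as F

def vessel_consistency_loss(p_art, p_vein, p_vessel):
    """Couple the three sigmoid probability maps, each (B, 1, H, W):
    artery/vein are subsets of the vessel map, and the vessel map
    should be covered by the union of artery and vein."""
    subset = F.relu(p_art - p_vessel).mean() + F.relu(p_vein - p_vessel).mean()
    cover = F.relu(p_vessel - torch.maximum(p_art, p_vein)).mean()
    return subset + cover

p_art, p_vein, p_vessel = (torch.sigmoid(torch.randn(2, 1, 64, 64)) for _ in range(3))
print(vessel_consistency_loss(p_art, p_vein, p_vessel))
```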

[CV-164] A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models

【Quick Read】: This survey addresses how multimodal deep learning can raise diagnostic accuracy in ophthalmology, where visual impairment is a global health challenge. Its key contribution is a systematic review, up to 2025, of two families of methods: task-specific multimodal approaches for applications such as lesion detection, disease diagnosis, and image synthesis, which integrate modalities including color fundus photography, optical coherence tomography (OCT), and angiography; and large-scale multimodal foundation models that combine vision-language architectures with large language models pretrained on diverse ophthalmic datasets, enabling cross-modal understanding, automated clinical report generation, and decision support. Methodological innovations covered include self-supervised learning, attention-based fusion, and contrastive alignment, pointing toward intelligent, interpretable, and clinically applicable AI systems.

Link: https://arxiv.org/abs/2508.03734
Authors: Xiaoling Luo, Ruli Zheng, Qiaojian Zheng, Zibo Du, Shuo Yang, Meidan Ding, Qihao Xu, Chengliang Liu, Linlin Shen
Institutions: Shenzhen University; Harbin Institute of Technology, Shenzhen; Hong Kong; School of Artificial Intelligence, Shenzhen University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual impairment represents a major global health challenge, with multimodal imaging providing complementary information that is essential for accurate ophthalmic diagnosis. This comprehensive survey systematically reviews the latest advances in multimodal deep learning methods in ophthalmology up to the year 2025. The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. Task-specific approaches are designed for particular clinical applications such as lesion detection, disease diagnosis, and image synthesis. These methods utilize a variety of imaging modalities including color fundus photography, optical coherence tomography, and angiography. On the other hand, foundation models combine sophisticated vision-language architectures and large language models pretrained on diverse ophthalmic datasets. These models enable robust cross-modal understanding, automated clinical report generation, and decision support. The survey critically examines important datasets, evaluation metrics, and methodological innovations including self-supervised learning, attention-based fusion, and contrastive alignment. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations. Finally, the survey outlines promising future directions that emphasize the use of ultra-widefield imaging and reinforcement learning-based reasoning frameworks to create intelligent, interpretable, and clinically applicable AI systems for ophthalmology.

[CV-165] Technical specification of a framework for the collection of clinical images and data

【Quick Read】: This report addresses the difficulties of systematically collecting, ethically governing, and sharing the clinical images and data needed to train and validate artificial intelligence (AI) tools, with particular attention to keeping datasets current and representative. The underlying problem is that conventional collection is hard to keep aligned with present clinical practice, so a model validated on old data reflects the state of the imaging units when that data was generated rather than what the tool would see if deployed today. The key is a framework for automated, ongoing collection that also covers safe acquisition (ethics and information governance processes) and data sharing with other groups (infrastructure and agreements), so that training and validation sets combine older cases with long-term follow-up, which gives accurate outcomes, with current data that mirrors the deployment setting.

Link: https://arxiv.org/abs/2508.03723
Authors: Alistair Mackenzie (1), Mark Halling-Brown (1), Ruben van Engen (2), Carlijn Roozemond (2), Lucy Warren (1), Dominic Ward (1), Nadia Smith (1) ((1) Royal Surrey NHS Foundation Trust, Guildford, UK, (2) Dutch Expert Centre for Screening (LRCB), Nijmegen, The Netherlands)
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 58 pages, 4 figures

Abstract:In this report a framework for the collection of clinical images and data for use when training and validating artificial intelligence (AI) tools is described. The report contains not only information about the collection of the images and clinical data, but the ethics and information governance processes to consider ensuring the data is collected safely, and the infrastructure and agreements required to allow for the sharing of data with other groups. A key characteristic of the main collection framework described here is that it can enable automated and ongoing collection of datasets to ensure that the data is up-to-date and representative of current practice. This is important in the context of training and validating AI tools as it is vital that datasets have a mix of older cases with long term follow-up such that the clinical outcome is as accurate as possible, and current data. Validations run on old data will provide findings and conclusions relative to the status of the imaging units when that data was generated. It is important that a validation dataset can assess the AI tools with data that it would see if deployed and active now. Other types of collection frameworks, which do not follow a fully automated approach, are also described. Whilst the fully automated method is recommended for large scale, long-term image collection, there may be reasons to start data collection using semi-automated methods and indications of how to do that are provided.

Artificial Intelligence

[AI-0] From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario

【Quick Read】: This paper addresses the limited real-world deployment of multi-agent robotic systems (MARS): despite advanced multi-agent frameworks, running them on physical robots remains rare, which holds back MARS research in practice. The key is two empirical studies of the performance trade-offs of hierarchical multi-agent frameworks in a simulated real-world multi-robot healthcare scenario. Study 1 uses CrewAI to iteratively refine the knowledge base and to identify and categorize coordination failures that contextual knowledge alone cannot fix (e.g., tool access violations, lack of timely handling of failure reports); Study 2 uses AutoGen to evaluate a redesigned bidirectional communication structure and to measure the trade-offs between reasoning and non-reasoning models in the same robot team. The findings stress the tension between autonomy and stability and the value of edge-case testing for reliability and safety in future deployments.

Link: https://arxiv.org/abs/2508.04691
Authors: Yuanchen Bai, Zijian Ding, Shaoyue Wen, Xiang Chang, Angelique Taylor
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints, increasing the complexity of action execution and agent coordination. However, despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited, hindering the advancement of MARS research in practice. To bridge this gap, we conducted two studies to investigate performance trade-offs of hierarchical multi-agent frameworks in a simulated real-world multi-robot healthcare scenario. In Study 1, using CrewAI, we iteratively refine the system’s knowledge base, to systematically identify and categorize coordination failures (e.g., tool access violations, lack of timely handling of failure reports) not resolvable by providing contextual knowledge alone. In Study 2, using AutoGen, we evaluate a redesigned bidirectional communication structure and further measure the trade-offs between reasoning and non-reasoning models operating within the same robotic team setting. Drawing from our empirical findings, we emphasize the tension between autonomy and stability and the importance of edge-case testing to improve system reliability and safety for future real-world deployment. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: this https URL.

[AI-1] How are CS students using resources and AI tools for coding tasks?

【Quick Read】: This paper examines which resources and AI tools students lean on while learning to program, and how coding experience shapes their help-seeking. The key result, from a survey of 26 computer science (CS) students, is that AI coding assistants are used mainly for writing code (second only to online searches), while AI chatbots are the top resource for debugging; across experience levels, participants prefer online help over direct human help from peers and instructors, underscoring the central role AI tools now play in coding tasks.

Link: https://arxiv.org/abs/2508.04667
Authors: Natalia Echeverry, Arun Lekshmi Narayanan
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:A survey of 26 CS students reveals that AI coding assistants are mainly used for writing code (second to online searches) while AI chatbots are the top resource for debugging. Participants with different coding experience prefer online help over direct human help from peers and instructors.

[AI-2] LLM Collaboration With Multi-Agent Reinforcement Learning

【Quick Read】: This paper addresses the lack of effective coordination when large language models (LLMs) collaborate: LLMs are pretrained independently and not optimized for cooperation, and existing fine-tuning frameworks rely on individual rewards, which demand intricate per-agent reward design to encourage collaboration. The key is to model LLM collaboration as a cooperative multi-agent reinforcement learning (MARL) problem and solve it with a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), which builds on current RL approaches for LLMs as well as MARL techniques; group-relative rewards let fine-tuned multi-agent systems generate high-quality responses efficiently through cooperation on writing and coding tasks.

Link: https://arxiv.org/abs/2508.04652
Authors: Shuo Liu, Zeyu Liang, Xueguang Lyu, Christopher Amato
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:A large amount of work has been done in Multi-Agent Systems (MAS) for modeling and solving problems with multiple interacting agents. However, most LLMs are pretrained independently and not specifically optimized for coordination. Existing LLM fine-tuning frameworks rely on individual rewards, which require complex reward designs for each agent to encourage collaboration. To address these challenges, we model LLM collaboration as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. We develop a multi-agent, multi-turn algorithm, Multi-Agent Group Relative Policy Optimization (MAGRPO), to solve it, building on current RL approaches for LLMs as well as MARL techniques. Our experiments on LLM writing and coding collaboration demonstrate that fine-tuning MAS with MAGRPO enables agents to generate high-quality responses efficiently through effective cooperation. Our approach opens the door to using other MARL methods for LLMs and highlights the associated challenges.
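The "group relative" ingredient of GRPO-style training is compact enough to show directly: each of G sampled responses to the same prompt is scored against its own group's mean and standard deviation instead of a learned value baseline. The multi-agent, multi-turn machinery that MAGRPO adds on top is not reproduced in this sketch.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    """GRPO-style advantages: normalize each response's reward by the
    statistics of its own group of G rollouts.  rewards: (num_prompts, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

adv = group_relative_advantages(torch.tensor([[0.2, 0.9, 0.5, 0.4]]))
# these advantages then weight token log-probabilities in a clipped policy-gradient loss
print(adv)
```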

[AI-3] A Scalable Pretraining Framework for Link Prediction with Efficient Adaptation KDD2025

【Quick Read】: This paper addresses three obstacles graph neural networks (GNNs) face in link prediction (LP): weak supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. The key is a systematic pretraining recipe: a late-fusion strategy combines the outputs of node-level and edge-level modules, whose transferability is studied here for the first time; a Mixture-of-Experts (MoE) framework captures distinct patterns in separate experts so diverse pretraining data does not cause negative transfer; and a parameter-efficient tuning strategy adapts the pretrained model to unseen datasets at minimal computational cost. Across 16 datasets in two domains, this yields state-of-the-art low-resource link prediction and results competitive with end-to-end training at over 10,000x lower computational overhead.

Link: https://arxiv.org/abs/2508.04645
Authors: Yu Song, Zhigang Hua, Harry Shomer, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2025 Research Track

Abstract:Link Prediction (LP) is a critical task in graph machine learning. While Graph Neural Networks (GNNs) have significantly advanced LP performance recently, existing methods face key challenges including limited supervision from sparse connectivity, sensitivity to initialization, and poor generalization under distribution shifts. We explore pretraining as a solution to address these challenges. Unlike node classification, LP is inherently a pairwise task, which requires the integration of both node- and edge-level information. In this work, we present the first systematic study on the transferability of these distinct modules and propose a late fusion strategy to effectively combine their outputs for improved performance. To handle the diversity of pretraining data and avoid negative transfer, we introduce a Mixture-of-Experts (MoE) framework that captures distinct patterns in separate experts, facilitating seamless application of the pretrained model on diverse downstream datasets. For fast adaptation, we develop a parameter-efficient tuning strategy that allows the pretrained model to adapt to unseen datasets with minimal computational overhead. Experiments on 16 datasets across two domains demonstrate the effectiveness of our approach, achieving state-of-the-art performance on low-resource link prediction while obtaining competitive results compared to end-to-end trained methods, with over 10,000x lower computational overhead.
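As an illustration of the late-fusion idea, the sketch below keeps a node-level scorer and an edge-level scorer separate and combines only their final outputs with a learned weight. The particular heads, feature choices, and fusion rule are assumptions made for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class LateFusionLinkPredictor(nn.Module):
    """Score a candidate edge (u, v) from two independent signals: a
    node-level score from the Hadamard product of the node embeddings,
    and an edge-level score from precomputed structural features
    (e.g. common-neighbour counts), fused only at the output."""
    def __init__(self, dim, edge_feat_dim):
        super().__init__()
        self.node_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.edge_head = nn.Sequential(nn.Linear(edge_feat_dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight

    def forward(self, h_u, h_v, edge_feats):
        s_node = self.node_head(h_u * h_v)
        s_edge = self.edge_head(edge_feats)
        return torch.sigmoid(self.alpha * s_node + (1 - self.alpha) * s_edge)

model = LateFusionLinkPredictor(dim=64, edge_feat_dim=8)
p = model(torch.randn(32, 64), torch.randn(32, 64), torch.randn(32, 8))  # edge probabilities
```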

[AI-4] HiD-VAE: Interpretable Generative Recommendation via Hierarchical and Disentangled Semantic IDs

【Quick Read】: This paper addresses two flaws in the semantic IDs that current generative recommenders obtain from unsupervised tokenization: the IDs are semantically flat and uninterpretable, lacking any coherent hierarchy, and their representations are entangled ("ID collisions"), which hurts recommendation accuracy and diversity. The key innovations of HiD-VAE are, first, a hierarchically supervised quantization process that aligns discrete codes with multi-level item tags, producing more uniform, disentangled IDs whose trained codebooks can predict hierarchical tags and thus give every recommendation a traceable, interpretable semantic path; and second, a novel uniqueness loss that directly penalizes latent-space overlap, resolving ID collisions and improving diversity by using the representation space more fully.

Link: https://arxiv.org/abs/2508.04618
Authors: Dengzhao Fang, Jingtong Gao, Chengcheng Zhu, Yu Li, Xiangyu Zhao, Yi Chang
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommender systems are indispensable for helping users navigate the immense item catalogs of modern online platforms. Recently, generative recommendation has emerged as a promising paradigm, unifying the conventional retrieve-and-rank pipeline into an end-to-end model capable of dynamic generation. However, existing generative methods are fundamentally constrained by their unsupervised tokenization, which generates semantic IDs suffering from two critical flaws: (1) they are semantically flat and uninterpretable, lacking a coherent hierarchy, and (2) they are prone to representation entanglement (i.e., ``ID collisions’'), which harms recommendation accuracy and diversity. To overcome these limitations, we propose HiD-VAE, a novel framework that learns hierarchically disentangled item representations through two core innovations. First, HiD-VAE pioneers a hierarchically-supervised quantization process that aligns discrete codes with multi-level item tags, yielding more uniform and disentangled IDs. Crucially, the trained codebooks can predict hierarchical tags, providing a traceable and interpretable semantic path for each recommendation. Second, to combat representation entanglement, HiD-VAE incorporates a novel uniqueness loss that directly penalizes latent space overlap. This mechanism not only resolves the critical ID collision problem but also promotes recommendation diversity by ensuring a more comprehensive utilization of the item representation space. These high-quality, disentangled IDs provide a powerful foundation for downstream generative models. Extensive experiments on three public benchmarks validate HiD-VAE’s superior performance against state-of-the-art methods. The code is available at this https URL.
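One plausible shape for a uniqueness loss, sketched below, penalizes any pair of item representations whose cosine similarity exceeds a margin, which directly discourages latent-space overlap ("ID collisions"). The margin and the exact form are assumptions for illustration; the paper defines its own variant.

```python
import torch
import torch.nn.functional as F

def uniqueness_loss(z: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Push the latent codes of different items apart.  z: (N, D)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t()                                      # pairwise cosine similarity
    off_diag = sim - torch.eye(len(z), device=z.device)  # zero out self-similarity
    return F.relu(off_diag - margin).mean()              # penalize overly similar pairs

print(uniqueness_loss(torch.randn(16, 32)))
```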

[AI-5] Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning

【Quick Read】: This paper addresses the difficulty traditional network intrusion detection systems (NIDS) have in continual-learning settings: adapting to novel attacks without catastrophic forgetting. The key is a spiking neural network (SNN) architecture inspired by the brain's hierarchical processing and energy efficiency: an efficient static SNN first flags potential intrusions and then activates an adaptive dynamic SNN that classifies the attack type using Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms let the network learn new threats incrementally while preserving existing knowledge, reaching 85.3% overall accuracy on the UNSW-NB15 benchmark in a continual-learning setting, with high operational sparsity that suits low-power deployment on neuromorphic hardware.

Link: https://arxiv.org/abs/2508.04610
Authors: Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Inspired by the brain’s hierarchical processing and energy efficiency, this paper presents a Spiking Neural Network (SNN) architecture for lifelong Network Intrusion Detection System (NIDS). The proposed system first employs an efficient static SNN to identify potential intrusions, which then activates an adaptive dynamic SNN responsible for classifying the specific attack type. Mimicking biological adaptation, the dynamic classifier utilizes Grow When Required (GWR)-inspired structural plasticity and a novel Adaptive Spike-Timing-Dependent Plasticity (Ad-STDP) learning rule. These bio-plausible mechanisms enable the network to learn new threats incrementally while preserving existing knowledge. Tested on the UNSW-NB15 benchmark in a continual learning setting, the architecture demonstrates robust adaptation, reduced catastrophic forgetting, and achieves 85.3 % overall accuracy. Furthermore, simulations using the Intel Lava framework confirm high operational sparsity, highlighting the potential for low-power deployment on neuromorphic hardware.

[AI-6] GraphProp: Training the Graph Foundation Models using Graph Properties

【Quick Read】: This paper addresses the weak structural cross-domain generalization of graph foundation models (GFMs) on graph-level tasks such as graph classification: traditional GFMs transfer node features into a unified space while neglecting graph structure, even though structure carries more domain-consistent information than node features or labels. The key is GraphProp, trained in two phases: first, a structural GFM is trained to predict graph invariants, properties that depend only on the abstract structure, not on particular labellings, yielding discriminative graph representations comparable across domains; second, those representations serve as positional encodings for training a comprehensive GFM that uses domain-specific node attributes and graph labels to further improve cross-domain node-feature generalization. Experiments show clear gains in supervised and few-shot learning, especially on graphs without node attributes.

Link: https://arxiv.org/abs/2508.04594
Authors: Ziheng Sun, Qi Feng, Lehao Lin, Chris Ding, Jicong Fan
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.
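To ground the phrase "predicting graph invariants", here are a few label-free structural properties of the kind such a pretraining target could use, computed with networkx. Which invariants GraphProp actually predicts is specified in the paper; these are merely common examples.

```python
import networkx as nx
import numpy as np

def graph_invariants(g: nx.Graph) -> np.ndarray:
    """Properties that depend only on the abstract structure of the graph,
    so they are comparable across domains regardless of node labels."""
    degs = [d for _, d in g.degree()]
    return np.array([
        g.number_of_nodes(),
        g.number_of_edges(),
        np.mean(degs),
        nx.density(g),
        nx.transitivity(g),               # global clustering coefficient
        nx.number_connected_components(g),
    ], dtype=float)

print(graph_invariants(nx.erdos_renyi_graph(50, 0.1, seed=0)))
```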

[AI-7] ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges

【Quick Read】: This paper addresses the missing reliability check on the step-level confidence scores produced by process judges built on multimodal large language models (MLLMs): existing benchmarks cover step-correctness classification and reasoning-process search but ignore whether the confidence scores themselves can be trusted. The key is ConfProBench, the first systematic benchmark for this question: it constructs three kinds of adversarially perturbed reasoning steps (synonym substitution, syntactic transformation, and image perturbation) to stress-test confidence robustness, and introduces three new metrics, the Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), to quantify robustness, sensitivity, and calibration. Evaluating 14 state-of-the-art MLLMs, both proprietary and open-source, reveals clear limitations in current confidence behavior and provides competitive baselines for future work.

Link: https://arxiv.org/abs/2508.04576
Authors: Yue Zhou, Yi Chang, Yuan Wu
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three types of adversarially perturbed reasoning steps: Synonym Substitution, Syntactic Transformation, and Image Perturbation, to test the robustness of MPJ confidence under perturbations. In addition, we introduce three novel evaluation metrics: Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which evaluate robustness, sensitivity, and calibration, respectively. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Experiments reveal limitations in current MPJs’ confidence performance and offer competitive baselines to support future research.
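As background for the calibration dimension, the sketch below computes a generic expected-calibration-error style score: the judge's step-level confidences are binned, and each bin's mean confidence is compared with its empirical accuracy. This illustrates the idea only; CCS itself is defined by the benchmark.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin confidence scores and accumulate the gap between mean confidence
    and empirical accuracy, weighted by how many samples land in each bin."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```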

[AI-8] SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

【Quick Read】: This paper addresses the absence of an effective benchmark for judging whether large language models (LLMs) can deliver guided instruction in complex problem-solving, specifically the multi-turn Socratic dialogue that interdisciplinary STEM education needs to foster knowledge integration and transfer. The key is SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in such dialogues. Its contributions are a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema capturing deep pedagogical features, and a new suite of metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to conduct guided dialogues that lead students to knowledge integration and transfer, underscoring the benchmark's value for developing more pedagogically aware LLMs.

Link: https://arxiv.org/abs/2508.04563
Authors: Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, Aimin Zhou
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, 20 figures

Abstract:Fostering students’ abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically-aware LLMs.

[AI-9] Argumentative Debates for Transparent Bias Detection [Technical Report]

【Quick Read】: This paper addresses biases that AI systems pick up from data or learn in models, which can systematically disadvantage specific groups. The literature offers many notions of (un)fairness with matching detection and mitigation algorithms, but almost all ignore transparency, even though fairness, as a human-oriented concern, demands interpretability and explainability more than most algorithmic problems. The key is a novel interpretable, explainable bias-detection method in which the presence of bias against an individual is established through debates over the values of protected features for that individual and others in their neighbourhood. Built on formal and computational argumentation, the debates argue about bias within and across neighbourhoods, and the method is evaluated formally, quantitatively, and qualitatively, showing strong performance against baselines alongside its interpretability and explainability.

Link: https://arxiv.org/abs/2508.04511
Authors: Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As the use of AI systems in society grows, addressing potential biases that emerge from data or are learned by models is essential to prevent systematic disadvantages against specific groups. Several notions of (un)fairness have been proposed in the literature, alongside corresponding algorithmic methods for detecting and mitigating unfairness, but, with very few exceptions, these tend to ignore transparency. Instead, interpretability and explainability are core requirements for algorithmic fairness, even more so than for other algorithmic solutions, given the human-oriented nature of fairness. In this paper, we contribute a novel interpretable, explainable method for bias detection relying on debates about the presence of bias against individuals, based on the values of protected features for the individuals and others in their neighbourhoods. Our method builds upon techniques from formal and computational argumentation, whereby debates result from arguing about biases within and across neighbourhoods. We provide formal, quantitative, and qualitative evaluations of our method, highlighting its strengths in performance against baselines, as well as its interpretability and explainability.

[AI-10] PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

【Quick Read】: This paper addresses the heavy computation, limited frequency diversity, and large parameter budgets of existing Transformer- and CNN-based models for multivariate time-series classification. The key is PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel; this multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly shrinking model size and complexity. Paired with lightweight classification heads, PRISM matches or outperforms leading CNN and Transformer baselines on human-activity, sleep-stage, and biomedical benchmarks while using roughly an order of magnitude fewer parameters and FLOPs.

Link: https://arxiv.org/abs/2508.04503
Authors: Federico Zucchi, Thomas Lampert
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multivariate time-series classification is pivotal in domains ranging from wearable sensing to biomedical monitoring. Despite recent advances, Transformer- and CNN-based models often remain computationally heavy, offer limited frequency diversity, and require extensive parameter budgets. We propose PRISM (Per-channel Resolution-Informed Symmetric Module), a convolutional-based feature extractor that applies symmetric finite-impulse-response (FIR) filters at multiple temporal scales, independently per channel. This multi-resolution, per-channel design yields highly frequency-selective embeddings without any inter-channel convolutions, greatly reducing model size and complexity. Across human-activity, sleep-stage and biomedical benchmarks, PRISM, paired with lightweight classification heads, matches or outperforms leading CNN and Transformer baselines, while using roughly an order of magnitude fewer parameters and FLOPs. By uniting classical signal processing insights with modern deep learning, PRISM offers an accurate, resource-efficient solution for multivariate time-series classification.
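The per-channel symmetric-FIR idea can be written as a depthwise 1-D convolution whose kernel is built by mirroring a learned half, so only half the taps are stored and the filter is symmetric by construction (hence linear-phase). PRISM applies such filters at multiple temporal scales; one scale with illustrative sizes is sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricFIRPerChannel(nn.Module):
    """Depthwise (per-channel) 1-D convolution with symmetric kernels:
    the full kernel [h0..h7, h6..h0] is mirrored from one learned half,
    so no cross-channel mixing occurs and parameters are halved."""
    def __init__(self, channels: int, half_taps: int = 8):
        super().__init__()
        self.half = nn.Parameter(torch.randn(channels, 1, half_taps) * 0.1)

    def forward(self, x):                       # x: (B, C, T)
        w = torch.cat([self.half, self.half.flip(-1)[..., 1:]], dim=-1)  # odd, symmetric
        return F.conv1d(x, w, padding=w.shape[-1] // 2, groups=x.shape[1])

layer = SymmetricFIRPerChannel(channels=6)
y = layer(torch.randn(2, 6, 128))               # (2, 6, 128), channels kept independent
```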

[AI-11] Hierarchical Scoring for Machine Learning Classifier Error Impact Evaluation

【Quick Read】: This paper addresses the limitation of pass/fail scoring in conventional machine learning (ML) evaluation: every misclassification counts the same, ignoring the semantic or hierarchical relationships among classes. The key is a family of hierarchical scoring metrics, of varying complexity, built on scoring trees: a tree encoding the relationships among class labels (or the operator's valuation of misclassifications) turns each prediction's score into a function of its distance from the ground truth in the tree, granting partial credit and a finer-grained view of error impact. Demonstrations with scoring trees representing three weighting strategies show that the metrics capture errors at finer granularity and that the trees enable tuning, so models are ranked not just by how many errors they make but by the kind and impact of those errors.

Link: https://arxiv.org/abs/2508.04489
Authors: Erin Lanus, Daniel Wolodkin, Laura J. Freeman
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:A common use of machine learning (ML) models is predicting the class of a sample. Object detection is an extension of classification that includes localization of the object via a bounding box within the sample. Classification, and by extension object detection, is typically evaluated by counting a prediction as incorrect if the predicted label does not match the ground truth label. This pass/fail scoring treats all misclassifications as equivalent. In many cases, class labels can be organized into a class taxonomy with a hierarchical structure to either reflect relationships among the data or operator valuation of misclassifications. When such a hierarchical structure exists, hierarchical scoring metrics can return the model performance of a given prediction related to the distance between the prediction and the ground truth label. Such metrics can be viewed as giving partial credit to predictions instead of pass/fail, enabling a finer-grained understanding of the impact of misclassifications. This work develops hierarchical scoring metrics varying in complexity that utilize scoring trees to encode relationships between class labels and produce metrics that reflect distance in the scoring tree. The scoring metrics are demonstrated on an abstract use case with scoring trees that represent three weighting strategies and evaluated by the kind of errors discouraged. Results demonstrate that these metrics capture errors with finer granularity and the scoring trees enable tuning. This work demonstrates an approach to evaluating ML performance that ranks models not only by how many errors are made but by the kind or impact of errors. Python implementations of the scoring metrics will be available in an open-source repository at time of publication.
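A minimal worked example of tree-based partial credit follows, using a hypothetical three-level taxonomy: the score decays with the number of tree edges between the predicted and true labels. The paper's metrics and weighting strategies are richer; this shows only the core mechanism.

```python
# Hypothetical class taxonomy; partial credit decays with tree distance.
parent = {
    "husky": "dog", "beagle": "dog", "tabby": "cat",
    "dog": "animal", "cat": "animal", "animal": None,
}

def path_to_root(label):
    path = []
    while label is not None:
        path.append(label)
        label = parent.get(label)
    return path

def hierarchical_score(pred, truth):
    """1.0 for an exact match, decreasing with the number of tree edges
    between prediction and ground truth: score = 1 / (1 + distance)."""
    p, t = path_to_root(pred), path_to_root(truth)
    common = next(a for a in p if a in t)        # lowest common ancestor
    dist = p.index(common) + t.index(common)
    return 1.0 / (1.0 + dist)

print(hierarchical_score("husky", "husky"))   # 1.0   (exact match)
print(hierarchical_score("husky", "beagle"))  # ~0.33 (wrong breed, right species)
print(hierarchical_score("husky", "tabby"))   # 0.2   (wrong species)
```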

[AI-12] Small transformer architectures for task switching ICANN2025

【Quick Read】: This paper asks whether attention-based architectures still beat traditional approaches (multi-layer perceptrons or recurrent networks) on small-scale problems, studied here through task switching: models process ongoing token sequences whose current subtask is set by stochastically interspersed control tokens. On a reference model built from finite-domain arithmetic with increment / addition / reverse copy / context (IARC) subtasks, standard transformers, LSTMs, and plain MLPs all reach similar, only modest accuracy. The key finding is that combining a non-translation-invariant extension of the transformer, the cisformer, with an alternative attention mechanism, extensive attention, is the only configuration that reaches a considerable performance level of around 95%, suggesting that attention can be understood better, and even improved, by comparing qualitatively different formulations in task-switching settings.

Link: https://arxiv.org/abs/2508.04461
Authors: Claudius Gros
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICANN 2025, in press

Abstract:The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of ‘task switching’. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.
zh

[AI-13] From “Aha Moments” to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理过程中因缺乏内在调控机制而导致的“过度思考”问题,即模型在已得出可靠结论后仍持续生成冗余推理内容,从而造成计算资源浪费和延迟增加。解决方案的关键在于提出元认知推理框架(Meta-cognitive Reasoning Framework, MERA),其核心创新是显式地将推理过程解耦为独立的推理模块与控制模块,并通过三阶段策略实现对控制行为的有效优化:首先利用基于接管的数据构建机制识别关键决策点并由辅助大语言模型(LLM)生成高质量控制信号;其次通过监督微调实现推理与控制的结构化分离,使模型具备初始元认知控制能力;最后采用控制片段策略优化(Control-Segment Policy Optimization, CSPO)方法,在最小化无关内容干扰的前提下提升控制策略的学习效率。

链接: https://arxiv.org/abs/2508.04460
作者: Rui Ha,Chaozhuo Li,Rui Pu,Sen Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have demonstrated a latent capacity for complex reasoning by spontaneously exhibiting cognitive behaviors such as step-by-step reasoning, reflection, and backtracking, commonly referred to as “Aha Moments”. However, such emergent behaviors remain unregulated and uncontrolled, often resulting in overthinking, where the model continues generating redundant reasoning content even after reaching reliable conclusions. This leads to excessive computational costs and increased latency, limiting the practical deployment of LRMs. The root cause lies in the absence of intrinsic regulatory mechanisms, as current models are unable to monitor and adaptively manage their reasoning process to determine when to continue, backtrack, or terminate. To address this issue, we propose the Meta-cognitive Reasoning Framework (MERA), which explicitly decouples the thinking process into distinct reasoning and control components, thereby enabling the independent optimization of control strategies. Specifically, MERA incorporates a takeover-based data construction mechanism that identifies critical decision points during reasoning and delegates the creation of control signals to auxiliary LLMs, thereby enabling the construction of high-quality reasoning-control data. Additionally, a structured reasoning-control separation is implemented via supervised fine-tuning, enabling the model to generate explicit traces and acquire initial meta-cognitive control capabilities. Finally, MERA employs Control-Segment Policy Optimization (CSPO), which combines segment-wise Group Relative Policy Optimization (GRPO) with a control-masking mechanism to optimize control behavior learning while minimizing interference from irrelevant content. Experiments on various reasoning benchmarks demonstrate that models trained with MERA enhance both reasoning efficiency and accuracy.
zh
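
CSPO 中“控制掩码”的核心思想可以用几行 PyTorch 示意:策略梯度只保留控制 token 的贡献,推理内容 token 不参与控制策略的更新(假设性片段,优势值与掩码均为随机占位,非论文实现):

```python
# 示意性片段:控制掩码,仅让控制 token 参与策略梯度(假设性实现)
import torch

logp = torch.randn(8, requires_grad=True)      # 假想的各 token 对数概率
advantage = torch.randn(8)                     # 假想的分段相对优势(GRPO 风格)
control_mask = torch.tensor([0, 0, 1, 0, 0, 1, 0, 1], dtype=torch.bool)

# 非控制 token 被掩蔽,不产生梯度,从而减少无关内容对控制行为学习的干扰
loss = -(logp * advantage)[control_mask].mean()
loss.backward()
print(logp.grad)   # 仅控制 token 位置有非零梯度
```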

[AI-14] Automatic LLM Red Teaming

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)红队测试中自动化方法存在的局限性问题,即现有方法依赖脆弱的提示模板或单轮攻击策略,无法捕捉真实世界对抗对话中复杂的交互特性。其解决方案的关键在于将红队测试建模为马尔可夫决策过程(Markov Decision Process, MDP),并采用分层强化学习(Hierarchical Reinforcement Learning, HRL)框架,通过细粒度的token级伤害奖励机制,使生成式代理能够学习连贯的多轮攻击策略,从而发现传统基线方法所遗漏的隐蔽漏洞。这一方法从根本上将LLM红队测试重构为一种动态的、基于轨迹的过程,显著提升了漏洞挖掘能力与模型鲁棒性评估的有效性。

链接: https://arxiv.org/abs/2508.04451
作者: Roman Belaire,Arunesh Sinha,Pradeep Varakantham
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically 'break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.
zh

[AI-15] Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

【速读】:该论文旨在解决生成式自动编码器(Generative Autoencoder)在建模复杂数据分布时存在的重建质量不高、潜在空间结构不清晰以及生成样本多样性不足的问题。传统方法通常依赖标准高斯先验和经典散度度量,导致重构样本趋于同质化。其解决方案的关键在于将云模型(Cloud Model)引入Wasserstein自动编码器(WAE)框架,通过云模型的特征函数(Characteristic Function)作为正则项来约束潜在空间,从而实现更灵活且符合实际的数据分布建模,显著提升了生成样本的质量与多样性。

链接: https://arxiv.org/abs/2508.04447
作者: Biao Hu,Guoyin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.
zh
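
“用特征函数正则化潜在空间”的思路可以示意如下:在随机频率点上比较编码分布与先验分布的经验特征函数。下面是假设性草图(先验以高斯样本占位,论文中应为云模型采样;n_freq 等超参为虚构):

```python
# 示意性草图:经验特征函数差异作为潜在空间正则项(假设性实现)
import torch

def empirical_cf(z, t):
    """经验特征函数 phi(t) = E_z[exp(i<t, z>)],返回实部与虚部。"""
    proj = z @ t.T                       # (batch, n_freq)
    return torch.cos(proj).mean(0), torch.sin(proj).mean(0)

def cf_regularizer(z_encoded, z_prior, n_freq=128):
    t = torch.randn(n_freq, z_encoded.shape[1])   # 随机频率点
    re_q, im_q = empirical_cf(z_encoded, t)
    re_p, im_p = empirical_cf(z_prior, t)
    return ((re_q - re_p) ** 2 + (im_q - im_p) ** 2).mean()

# 用法示意:total_loss = recon_loss + lam * cf_regularizer(z_enc, z_prior)
z_enc = torch.randn(64, 8) * 1.5    # 假想的编码器输出
z_prior = torch.randn(64, 8)        # 假想的先验样本(论文中为云模型)
print(cf_regularizer(z_enc, z_prior).item())
```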

[AI-16] SimInstruct: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices

【速读】:该论文旨在解决高质量多轮教学对话数据稀缺的问题,这类数据对于训练能够支持教学、学习和决策的AI系统至关重要。其核心挑战在于真实教学场景中涉及隐私保护和求助行为的脆弱性,导致难以收集专家与新手之间的结构化互动数据。解决方案的关键在于提出SimInstruct——一种“专家在环”的可扩展工具,通过大语言模型(LLM)模拟不同性格特征(如外向或内向)的新手教师角色,使人类专家能够在无真实新手参与的情况下提供多轮反馈、推理和指导,从而生成具有教育学意义且认知深度高的对话数据。该方法不仅提升了数据采集效率与质量,还增强了专家的专业反思能力,并最终用于微调出优于GPT-4o的专家型模型。

链接: https://arxiv.org/abs/2508.04428
作者: Si Chen,Izzy Molnar,Ting Hua,Peiyu Li,Le Huy Khiem,G. Alex Ambrose,Jim Lang,Ronald Metoyer,Nitesh V. Chawla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding – the process by which an expert supports a novice’s thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and the LLMs’ persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o’s limitations, including weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.
zh

[AI-17] Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

【速读】:该论文旨在解决多模态人工智能(Multimodal AI)模型在决策过程中的可解释性不足问题,尤其是在复杂模态交互和评估方法不一致的情况下。其解决方案的关键在于系统性地分析2020年1月至2024年初发表的多模态可解释人工智能(XAI)研究,识别出当前研究主要集中在视觉-语言和纯语言模型上,且以注意力机制为基础的解释方法占主导地位;同时指出这些方法难以全面捕捉模态间交互,并强调现有评估手段缺乏系统性、一致性与对模态特异性认知因素的考量。基于此,论文提出一套标准化、透明且严谨的评估与报告建议,推动未来多模态XAI研究向更可解释、可问责和负责任的方向发展。

链接: https://arxiv.org/abs/2508.04427
作者: Md Raisul Kibria,Sébastien Lafond,Janan Arslan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that the majority of studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. Based on these findings, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.
zh

[AI-18] GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

【速读】:该论文旨在解决图形用户界面视觉定位(GUI-VG)任务中依赖监督微调(SFT)所面临的高数据标注成本与训练开销问题,尤其是在多模态大语言模型(MLLM)预训练阶段已覆盖GUI领域的情况下,传统SFT的必要性受到质疑。其解决方案的关键在于提出一种基于强化学习的GUI-VG方法GuirlVG,通过系统性实证研究识别出规则驱动的强化微调(RFT)在GUI-VG中的优化路径:首先分解并重构RFT的核心组件以获得最优形式;其次引入一种新型对抗性KL因子(Adversarial KL Factor),动态稳定训练过程以防止奖励过优化;最后探索最佳训练配置以提升效果。实验表明,GuirlVG仅用5.2K样本即超越使用超1000万样本训练的SFT基线,在多个基准测试中实现显著性能提升。

链接: https://arxiv.org/abs/2508.04389
作者: Weitai Kang,Bin Lei,Gaowen Liu,Caiwen Ding,Yan Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Despite this promise, the optimal manner of applying RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning-based GUI-VG method built on a systematic empirical study and a novel stabilization technique. We find that naive application of RFT underperforms the SFT baseline, motivating a deeper exploration. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore the training configurations of RFT to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a 7.7% improvement on ScreenSpot, a 17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.
zh

[AI-19] Artificial Consciousness as Interface Representation

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)系统是否具备意识(consciousness)这一长期争议性问题,其核心挑战在于主观体验(subjective experience)的定义与可操作化难题。解决方案的关键在于提出一套名为SLP-test的实证可检验框架,包含三个评价标准:S(主观语言类)、L(潜在涌现类)和P(现象结构类),用以评估AI系统是否具备能促进类意识属性的接口表征(interface representations)。该框架基于范畴论(category theory)将接口表征建模为关系基底(relational substrates, RS)与可观测行为之间的映射,从而将主观体验从物理系统的内在属性重构为一种功能性接口,使其成为可量化、可验证的科学问题。

链接: https://arxiv.org/abs/2508.04383
作者: Robert Prentner
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 12 pages

点击查看摘要

Abstract:Whether artificial intelligence (AI) systems can possess consciousness is a contentious question because of the inherent challenges of defining and operationalizing subjective experience. This paper proposes a framework to reframe the question of artificial consciousness into empirically tractable tests. We introduce three evaluative criteria - S (subjective-linguistic), L (latent-emergent), and P (phenomenological-structural) - collectively termed SLP-tests, which assess whether an AI system instantiates interface representations that facilitate consciousness-like properties. Drawing on category theory, we model interface representations as mappings between relational substrates (RS) and observable behaviors, akin to specific types of abstraction layers. The SLP-tests collectively operationalize subjective experience not as an intrinsic property of physical systems but as a functional interface to a relational entity.
zh

[AI-20] OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

【速读】:该论文旨在解决当前多模态基础模型(如Gemini和GPT-4o)在动态交互环境中评估不足的问题,尤其是现有基准测试在静态场景下缺乏代理(agent)的主动性,而交互式基准又存在严重的模态瓶颈,忽视了听觉与时间维度等关键信息。为弥合这一评估鸿沟,作者提出OmniPlay——一个诊断性基准平台,其核心创新在于通过五种游戏环境系统性地构建模态协同与冲突场景,强制代理进行真正的跨模态推理。解决方案的关键在于“模态互依性”(modality interdependence)的设计理念,即通过模拟真实世界中多感官信息的融合与竞争关系,揭示模型在感知融合机制上的脆弱性,并发现“少即是多”的反直觉现象:去除部分感官输入反而能提升性能,这表明当前模型的融合策略过于脆弱,未来实现鲁棒通用人工智能(AGI)需超越单纯规模扩展,聚焦于协同融合机制的优化。

链接: https://arxiv.org/abs/2508.04361
作者: Fuqing Bie,Shiyu Huang,Xijia Tao,Zhiqin Fang,Leyi Pan,Junzhe Chen,Min Ren,Liuyu Xiang,Zhaofeng He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive “less is more” paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at this https URL.
zh

[AI-21] LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content

【速读】:该论文旨在解决视频内容分析中如何量化片段与用户指定主题之间语义相关性的难题,尤其关注于动态变化的叙事结构对相关性评估的影响。解决方案的关键在于提出了一种名为“学习用户显著性追踪器”(Learned User Significance Tracker, LUST)的多模态框架,其核心创新是采用两级层次化相关性评分机制:首先通过大型语言模型(Large Language Models, LLMs)计算每个视频片段的直接相关性分数(S_d,i),基于视觉帧和语音识别(Automatic Speech Recognition, ASR)提取的文本信息;随后引入上下文相关性分数(S_c,i),利用前序片段的相关性得分序列进行时序建模,从而捕捉主题在时间维度上的演变,实现对用户定义显著性的精细化、时序感知建模。

链接: https://arxiv.org/abs/2508.04353
作者: Anderson de Lima Luiz
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: 5 pages and 4 figures

点击查看摘要

Abstract:This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial “direct relevance” score, S_d,i , assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a “contextual relevance” score, S_c,i , that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving narratives. The LUST framework aims to provide a nuanced, temporally-aware measure of user-defined significance, outputting an annotated video with visualized relevance scores and comprehensive analytical logs.
zh
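
摘要未给出上下文相关性分数 S_c,i 的显式公式;作为理解参考,下面用一个假设性的指数平滑示意“上下文分数如何吸收前序片段的主题演变”(alpha 为虚构超参,非论文设定):

```python
# 示意性草图:由直接相关性分数 S_d,i 递推上下文相关性分数 S_c,i(假设性公式)
def contextual_scores(direct_scores, alpha=0.7):
    ctx, prev = [], None
    for s_d in direct_scores:
        prev = s_d if prev is None else alpha * s_d + (1 - alpha) * prev
        ctx.append(prev)
    return ctx

# 孤立的高分片段会被前文低分拉低,持续的高分段落则稳步上升
print(contextual_scores([0.1, 0.8, 0.9, 0.2]))
```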

[AI-22] Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在逻辑推理中因语义启发式与决定性证据冲突而陷入认知陷阱(cognitive traps)的问题,即模型常依赖表面相关性而非严谨逻辑推导。其解决方案的核心是提出一种新的推理范式—— deliberative reasoning network (DRN),将逻辑推理从概率最大化重构为不确定性最小化:DRN 不再寻求“最可能的答案”,而是识别“内部证据最一致的假设”。通过显式追踪信念状态并利用迭代证据合成过程量化竞争假设的主观认知不确定性(epistemic uncertainty),DRN 实现了内在可解释性,并在无需额外训练的情况下展现出强零样本泛化能力,显著提升推理准确率与可信度。

链接: https://arxiv.org/abs/2508.04339
作者: Anran Xu,Jincheng Wang,Baigen Cai,Tao Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Large language models often fail at logical reasoning when semantic heuristics conflict with decisive evidence - a phenomenon we term cognitive traps. To address this fundamental limitation, we introduce the Deliberative Reasoning Network (DRN), a novel paradigm that reframes logical reasoning from probability maximization to uncertainty minimization. Instead of asking “Which answer is most likely?”, DRN asks “Which hypothesis has the most internally consistent evidence?”. DRN achieves intrinsic interpretability by explicitly tracking belief states and quantifying epistemic uncertainty for competing hypotheses through an iterative evidence synthesis process. We validate our approach through two complementary architectures - a bespoke discriminative model that embodies the core uncertainty minimization principle, and a lightweight verification module that enhances existing generative LLMs. Evaluated on LCR-1000, our new adversarial reasoning benchmark designed to expose cognitive traps, the bespoke DRN achieves up to 15.2% improvement over standard baselines. When integrated as a parameter-efficient verifier with Mistral-7B, our hybrid system boosts accuracy from 20% to 80% on the most challenging problems. Critically, DRN demonstrates strong zero-shot generalization, improving TruthfulQA performance by 23.6% without additional training, indicating that uncertainty-driven deliberation learns transferable reasoning principles. We position DRN as a foundational, verifiable System 2 reasoning component for building more trustworthy AI systems.
zh

[AI-23] Compressing Large Language Models with PCA Without Performance Loss

【速读】:该论文旨在解决神经网络模型参数冗余与信息利用率不足的问题,即如何在不牺牲性能的前提下实现模型的极致压缩。其解决方案的关键在于采用结构化主成分分析(Principal Component Analysis, PCA),对输入数据进行高效降维:要么在极坐标空间中处理图像特征,要么分段应用于token序列,从而显著减少模型参数量并保留关键语义信息。实验表明,该方法可在多种任务和架构中实现轻量化建模,同时保持高精度与强表征一致性。

链接: https://arxiv.org/abs/2508.04307
作者: Magnus Bengtsson
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 23 pages. 4 figures, submitted to journal

点击查看摘要

Abstract:We demonstrate that Principal Component Analysis (PCA), when applied in a structured manner, either to polar-transformed images or segment-wise to token sequences, enables extreme compression of neural models without sacrificing performance. Across three case studies, we show that a one-layer classifier trained on PCA-compressed polar MNIST achieves over 98 percent accuracy using only 840 parameters. A two-layer transformer trained on 70-dimensional PCA-reduced MiniLM embeddings reaches 76.62 percent accuracy on the 20 Newsgroups dataset with just 81000 parameters. A decoder-only transformer generates coherent token sequences from 70-dimensional PCA embeddings while preserving over 97 percent cosine similarity with full MiniLM representations, using less than 17 percent of the parameter count of GPT-2. These results highlight PCA-based input compression as a general and effective strategy for aligning model capacity with information content, enabling lightweight architectures across multiple modalities.
zh
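
论文的核心管线(对嵌入做 PCA 压缩,再接小模型)可以用 scikit-learn 直接示意。以下草图中的数据为随机占位,仅演示流程形状,并非复现论文结果:

```python
# 示意性草图:PCA 压缩句向量 + 极小分类器(假设性数据)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 384))    # 假想的 384 维 MiniLM 句向量
y = rng.integers(0, 20, size=2000)  # 假想的 20 类标签(对应 20 Newsgroups 设定)

pca = PCA(n_components=70)           # 摘要中的 70 维 PCA 压缩
X70 = pca.fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(X70, y)
print("压缩后维度:", X70.shape[1])
print("保留方差比例: %.3f" % pca.explained_variance_ratio_.sum())
```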

[AI-24] Comparative Analysis of Novel NIRMAL Optimizer Against Adam and SGD with Momentum

【速读】:该论文旨在解决深度学习模型训练中优化算法收敛速度慢、泛化能力弱以及在复杂数据集上性能不稳定的问题。其解决方案的关键在于提出一种名为NIRMAL(Novel Integrated Robust Multi-Adaptation Learning)的新型优化算法,该算法融合了梯度下降(gradient descent)、动量(momentum)、随机扰动(stochastic perturbations)、自适应学习率(adaptive learning rates)和非线性变换(non-linear transformations)等多种策略,灵感来源于国际象棋棋子的运动模式,从而实现更鲁棒且高效的参数更新机制。实验表明,NIRMAL在多个图像分类基准数据集上均展现出优异的性能,尤其在CIFAR-100等高难度数据集上显著优于Adam,并接近SGD with Momentum的表现,验证了其在复杂任务中的强大泛化能力和稳定收敛特性。

链接: https://arxiv.org/abs/2508.04293
作者: Nirmal Gaud,Surej Mouli,Preeti Katiyar,Vaduguru Venkata Ramya
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 9 pages, 12 figures

点击查看摘要

Abstract:This study proposes NIRMAL (Novel Integrated Robust Multi-Adaptation Learning), a novel optimization algorithm that combines multiple strategies inspired by the movements of chess pieces. These strategies include gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations. We carefully evaluated NIRMAL against two widely used and successful optimizers, Adam and SGD with Momentum, on four benchmark image classification datasets: MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. A custom convolutional neural network (CNN) architecture is applied to each dataset. The experimental results show that NIRMAL achieves competitive performance, particularly on the more challenging CIFAR-100 dataset, where it achieved a test accuracy of 45.32% and a weighted F1-score of 0.4328. This performance surpasses Adam (41.79% accuracy, 0.3964 F1-score) and closely matches SGD with Momentum (46.97% accuracy, 0.4531 F1-score). Also, NIRMAL exhibits robust convergence and strong generalization capabilities, especially on complex datasets, as evidenced by stable loss and accuracy curves during training. These findings underscore NIRMAL’s potential as a versatile and effective optimizer for various deep learning tasks.
zh
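
NIRMAL 的具体更新公式摘要未给出;下面是把“动量 + 自适应学习率 + 随机扰动”组合进一个 PyTorch 优化器的假设性草图,仅示意这类混合更新的形态,并非论文算法本身:

```python
# 示意性草图:动量 + 类 RMSProp 自适应项 + 随机扰动的混合更新(假设性实现)
import torch

class NirmalLikeSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2, momentum=0.9, noise=1e-4, eps=1e-8):
        super().__init__(params, dict(lr=lr, momentum=momentum, noise=noise, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                buf = state.setdefault("momentum_buf", torch.zeros_like(p))
                sq = state.setdefault("sq_avg", torch.zeros_like(p))
                g = p.grad
                buf.mul_(group["momentum"]).add_(g)           # 动量累积
                sq.mul_(0.99).addcmul_(g, g, value=0.01)      # 梯度平方滑动平均
                direction = buf / (sq.sqrt() + group["eps"])  # 自适应缩放
                direction = direction + group["noise"] * torch.randn_like(p)  # 随机扰动
                p.add_(direction, alpha=-group["lr"])

model = torch.nn.Linear(4, 1)
opt = NirmalLikeSGD(model.parameters())
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()
```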

[AI-25] Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

【速读】:该论文旨在解决当前记忆增强型强化学习(Reinforcement Learning, RL)算法在部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)环境中评估时面临的挑战:现有基准测试虽然模拟了复杂现实问题,但缺乏对记忆模型所面临难度的可控性。为实现更细致、严谨的评估,论文提出了一种基于合成环境的解决方案,其关键在于三点:一是构建了一个以记忆需求结构(Memory Demand Structure, MDS)和转移不变性为基础的理论分析框架;二是设计了一套利用线性过程动态、状态聚合与奖励重分配的方法论,用于生成具有预设特性的定制化POMDP环境;三是通过实证验证了一系列难度递增的POMDP环境,从而明确记忆增强RL在处理不同复杂度任务时的表现边界,并为记忆模型的选择与环境设计提供指导。

链接: https://arxiv.org/abs/2508.04282
作者: Yongyi Wang,Lingfeng Li,Bozhou Chen,Ang Li,Hanyu Liu,Qirui Zheng,Xionghui Yang,Wenxin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllability over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions: 1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts; 2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties; 3. Empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights. Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.
zh

[AI-26] Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在交互式多模态代理中缺乏将原始视觉观测转化为语言条件动作序列的能力这一关键问题。现有强化学习(Reinforcement Learning, RL)方法虽理论上可赋予VLM此类技能,但普遍存在泛化能力不足、依赖敏感超参数调优或仅适用于高状态变异性环境的局限性。解决方案的核心在于提出一种轻量级、无需超参数调整的强化学习算法——视觉语言解耦演员-评论家(Vision-Language Decoupled Actor-Critic, VL-DAC),其创新性地在动作标记(action tokens)层面应用PPO更新,同时仅在环境步长级别学习价值函数,从而避免不稳定的权重项并实现更快更可靠的收敛。实验表明,仅在单一低成本仿真环境中训练单个VLM即可显著提升其在真实图像代理任务(如BALROG、VSI-Bench和VisualWebBench)上的表现,验证了该方法的有效性与泛化潜力。

链接: https://arxiv.org/abs/2508.04280
作者: George Bredis,Stanislav Dereka,Viacheslav Sinii,Ruslan Rakhimov,Daniil Gavrilov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions – a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
zh

[AI-27] Large Language Models Multi-Capability Alignment in Biomedical Domain

【速读】:该论文旨在解决领域特定人工智能对齐(domain-specific AI alignment)中多能力集成时的能力干扰问题,特别是在生物医学场景下实现高效、安全的推理。其核心挑战在于如何在参数效率的前提下保持各能力模块(如领域知识、推理、指令遵循与综合能力)之间的独立性与协同性,避免梯度空间冲突导致性能下降或安全隐患。解决方案的关键在于提出BalancedBio框架,其理论基础为生物医学多能力收敛定理(Biomedical Multi-Capability Convergence Theorem),证明正交梯度空间是防止能力干扰的必要条件;具体创新包括:(1) 医学知识引导的合成生成(Medical Knowledge Grounded Synthetic Generation, MKGSG),通过临床工作流约束和医学本体验证确保生成内容的事实准确性和安全性;(2) 能力感知的组相对策略优化(Capability Aware Group Relative Policy Optimization),设计基于规则与模型评分融合的奖励机制,动态调整混合奖励权重以维持强化学习中的正交性,从而实现帕累托最优收敛。该方法在多个生物医学基准上达到当前同类参数规模下的最先进性能,并提供可量化的安全边界保障。

链接: https://arxiv.org/abs/2508.04278
作者: Wentao Wu,Linqing Chen,Hanmeng Zhong,Weilei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:BalancedBio is a theoretically grounded framework for parameter-efficient biomedical reasoning, addressing multi-capability integration in domain-specific AI alignment. It establishes the Biomedical Multi-Capability Convergence Theorem, proving orthogonal gradient spaces are essential to prevent capability interference for safe deployment. Key innovations include: (1) Medical Knowledge Grounded Synthetic Generation (MKGSG), extending Source2Synth with clinical workflow constraints and medical ontology validation for factual accuracy and safety; and (2) Capability Aware Group Relative Policy Optimization, deriving optimal hybrid reward weighting to maintain orthogonality in RL, using a reward model with rule-based and model-based scores adapted to biomedical tasks. Mathematical analysis proves Pareto-optimal convergence, preserving performance across capabilities. It achieves state-of-the-art results in its parameter class: domain expertise (80.95% BIOMED-MMLU, +15.32% over baseline), reasoning (61.94%, +7.75%), instruction following (67.95%, +6.44%), and integration (86.7%, +18.5%). Theoretical safety guarantees include bounds on capability preservation and clinical accuracy. Real-world deployment yields 78% cost reduction, 23% improved diagnostic accuracy, and 89% clinician acceptance. This work provides a principled methodology for biomedical AI alignment, enabling efficient reasoning with essential safety and reliability, with the 0.5B model version to be released.
zh

[AI-28] A Visual Tool for Interactive Model Explanation using Sensitivity Analysis

【速读】:该论文旨在解决机器学习(Machine Learning, ML)模型在实际应用中缺乏可解释性与透明度的问题,尤其是在模型决策过程难以被人类理解的情况下。解决方案的关键在于提出SAInT工具,它通过集成局部和全局敏感性分析(sensitivity analysis),实现对ML模型行为的可视化探索与理解。该工具支持人机协同(Human-in-the-Loop, HITL)工作流,允许用户无需编程即可配置、训练、评估和解释模型;其核心功能包括自动化模型训练与选择、基于方差的全局特征归因(global feature attribution),以及利用LIME和SHAP提供单样本层面的解释(per-instance explanation)。实验在泰坦尼克号生存预测任务上验证了该方法在指导特征选择和数据优化方面的有效性。

链接: https://arxiv.org/abs/2508.04269
作者: Manuela Schuler
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, This work is a preprint version of a paper currently in preparation with co-authors

点击查看摘要

Abstract:We present SAInT, a Python-based tool for visually exploring and understanding the behavior of Machine Learning (ML) models through integrated local and global sensitivity analysis. Our system supports Human-in-the-Loop (HITL) workflows by enabling users - both AI researchers and domain experts - to configure, train, evaluate, and explain models through an interactive graphical interface without programming. The tool automates model training and selection, provides global feature attribution using variance-based sensitivity analysis, and offers per-instance explanation via LIME and SHAP. We demonstrate the system on a classification task predicting survival on the Titanic dataset and show how sensitivity information can guide feature selection and data refinement.
zh

[AI-29] SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中梯度泄露攻击导致的敏感用户信息泄露问题,尤其是在非独立同分布(non-IID)数据和客户端能力异构环境下,现有防御机制如差分隐私(Differential Privacy, DP)和同态加密(Homomorphic Encryption, HE)常面临隐私保护、模型效用与系统开销之间的权衡难题。解决方案的关键在于提出一种轻量级混合防御框架 SelectiveShield,其核心创新是通过 Fisher 信息量化参数敏感性,使客户端能够本地识别关键参数;随后通过协作协商协议确定需用同态加密保护的共享敏感参数,而个性化重要参数保留在本地,其余非关键参数则采用自适应差分隐私噪声进行保护,从而在保障模型性能的同时显著降低梯度泄露风险。

链接: https://arxiv.org/abs/2508.04265
作者: Borui Li,Li Yan,Jianmin Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training on decentralized data but remains vulnerable to gradient leakage attacks that can reconstruct sensitive user information. Existing defense mechanisms, such as differential privacy (DP) and homomorphic encryption (HE), often introduce a trade-off between privacy, model utility, and system overhead, a challenge that is exacerbated in heterogeneous environments with non-IID data and varying client capabilities. To address these limitations, we propose SelectiveShield, a lightweight hybrid defense framework that adaptively integrates selective homomorphic encryption and differential privacy. SelectiveShield leverages Fisher information to quantify parameter sensitivity, allowing clients to identify critical parameters locally. Through a collaborative negotiation protocol, clients agree on a shared set of the most sensitive parameters for protection via homomorphic encryption. Parameters that are uniquely important to individual clients are retained locally, fostering personalization, while non-critical parameters are protected with adaptive differential privacy noise. Extensive experiments demonstrate that SelectiveShield maintains strong model utility while significantly mitigating gradient leakage risks, offering a practical and scalable defense mechanism for real-world federated learning deployments.
zh
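
“以 Fisher 信息量化参数敏感性”是 SelectiveShield 客户端本地的第一步,常用做法是用梯度平方的期望近似对角 Fisher。以下为通用草图(假设性代码,模型与数据加载器由调用方提供,ratio 为虚构超参):

```python
# 示意性草图:对角 Fisher 信息估计参数敏感性,并选出 top 比例的加密候选
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, n_batches=10):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # 对角 Fisher ≈ 梯度平方的期望
    return {n: f / n_batches for n, f in fisher.items()}

def top_sensitive_mask(fisher, ratio=0.1):
    """选出敏感性最高的 ratio 比例参数,作为同态加密保护的候选集。"""
    flat = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(ratio * flat.numel()))
    thresh = flat.topk(k).values.min()
    return {n: (f >= thresh) for n, f in fisher.items()}
```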

[AI-30] Automated ultrasound doppler angle estimation using deep learning

【速读】:该论文旨在解决多普勒超声临床工作中角度估计不准确导致血流速度测量误差的问题(Doppler ultrasound angle estimation error)。其解决方案的关键在于提出一种基于深度学习的自动化角度估计方法:利用2100张人类颈动脉超声图像(含数据增强)训练模型,通过五个预训练网络提取图像特征,并将这些特征输入至自定义的浅层网络进行角度预测。实验表明,最优模型的平均绝对误差(MAE)低于临床可接受的多普勒角度误差阈值,从而有效避免将正常血流速度误判为狭窄,具备在商用超声设备中集成应用的潜力。

链接: https://arxiv.org/abs/2508.04243
作者: Nilesh Patil,Ajay Anand
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract images features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.
zh
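
摘要描述的流程是“预训练网络提特征,再接自定义浅层网络回归角度”。以下为该管线的假设性骨架(骨干与维度均为示例,未必与论文所用五种预训练模型一致):

```python
# 示意性骨架:预训练骨干提特征 + 浅层回归头预测多普勒角度(假设性实现)
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=None)   # 示例骨干;论文比较了五种预训练网络
backbone.fc = nn.Identity()                # 去掉分类头,输出 512 维特征

head = nn.Sequential(                      # 自定义浅层回归网络
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, 1),                      # 输出角度(单位:度)
)

x = torch.randn(4, 3, 224, 224)            # 假想的一批超声图像
with torch.no_grad():
    feats = backbone(x)
print(head(feats).shape)                   # torch.Size([4, 1])
```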

[AI-31] Circuit-Aware SAT Solving: Guiding CDCL via Conditional Probabilities

【速读】:该论文旨在解决传统电路可满足性(Circuit Satisfiability, CSAT)求解方法在将电路转换为合取范式(Conjunctive Normal Form, CNF)后,因忽略电路结构和功能信息而导致求解效率低下的问题。其解决方案的关键在于提出一种新型的电路感知SAT求解框架CASCAD,该框架通过图神经网络(Graph Neural Networks, GNNs)直接计算电路级条件概率,并利用这些概率动态指导冲突驱动子句学习(Conflict-Driven Clause Learning, CDCL)中的两个核心启发式策略——变量相位选择(variable phase selection)与子句管理(clause management),从而显著提升求解效率。实验表明,CASCAD相比最先进的CNF-based方法可将求解时间缩短达10倍,并通过概率引导的子句过滤策略额外减少23.5%的运行时间。

链接: https://arxiv.org/abs/2508.04235
作者: Jiaying Zhu,Ziyang Zheng,Zhengyuan Shi,Yalun Cai,Qiang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Circuit Satisfiability (CSAT) plays a pivotal role in Electronic Design Automation. The standard workflow for solving CSAT problems converts circuits into Conjunctive Normal Form (CNF) and employs generic SAT solvers powered by Conflict-Driven Clause Learning (CDCL). However, this process inherently discards rich structural and functional information, leading to suboptimal solver performance. To address this limitation, we introduce CASCAD, a novel circuit-aware SAT solving framework that directly leverages circuit-level conditional probabilities computed via Graph Neural Networks (GNNs). By explicitly modeling gate-level conditional probabilities, CASCAD dynamically guides two critical CDCL heuristics – variable phase selection and clause management – to significantly enhance solver efficiency. Extensive evaluations on challenging real-world Logical Equivalence Checking (LEC) benchmarks demonstrate that CASCAD reduces solving times by up to 10x compared to state-of-the-art CNF-based approaches, achieving an additional 23.5% runtime reduction via our probability-guided clause filtering strategy. Our results underscore the importance of preserving circuit-level structural insights within SAT solvers, providing a robust foundation for future improvements in SAT-solving efficiency and EDA tool design.
zh

[AI-32] Empowering Time Series Forecasting with LLM-Agents

【速读】:该论文旨在解决时间序列预测中传统自动化机器学习(AutoML)系统过度依赖模型架构搜索而忽视数据质量的问题。现有方法多聚焦于特征工程和模型结构优化,但研究表明轻量级模型在时间序列任务中已能实现先进性能,因此提升数据质量可能成为更有效的改进方向。解决方案的关键在于提出DCATS(Data-Centric Agent for Time Series),该代理利用时间序列附带的元数据(metadata)进行数据清洗,并在清洗过程中直接优化预测性能,从而实现以数据为中心的AutoML策略。实验表明,DCATS在四个主流预测模型上平均降低6%的误差,验证了数据质量改善对时间序列预测的重要价值。

链接: https://arxiv.org/abs/2508.04231
作者: Chin-Chia Michael Yeh,Vivian Lai,Uday Singh Saini,Xiran Fan,Yujie Fan,Junpeng Wang,Xin Dai,Yan Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.
zh

[AI-33] Symmetric Behavior Regularization via Taylor Expansion of Symmetry

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中行为正则化策略优化(Behavior Regularization Policy Optimization, BRPO)框架依赖非对称散度(如KL散度)所导致的局限性问题,尤其是其在实际训练中难以处理对称散度带来的数值不稳定性和无法获得解析形式策略的问题。解决方案的关键在于引入对称f-散度,并通过f-散度的泰勒展开(Taylor series)来构造可解析的策略更新规则和稳定的损失函数:首先证明有限阶泰勒展开即可获得解析策略;其次将对称散度分解为不对称项与条件对称项,仅对后者进行泰勒展开以缓解数值问题。由此提出首个基于对称散度的实际可行BRPO算法——对称f-Actor-Critic(Symmetric f Actor-Critic, Sf-AC),并在分布逼近和MuJoCo任务上验证了其竞争力。

链接: https://arxiv.org/abs/2508.04225
作者: Lingwei Zhu,Zheng Chen,Han Wang,Yukie Nagai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces symmetric divergences to behavior regularization policy optimization (BRPO) to establish a novel offline RL framework. Existing methods focus on asymmetric divergences such as KL to obtain analytic regularized policies and a practical minimization objective. We show that symmetric divergences do not permit an analytic policy as regularization and can incur numerical issues as loss. We tackle these challenges by the Taylor series of f-divergence. Specifically, we prove that an analytic policy can be obtained with a finite series. For loss, we observe that symmetric divergences can be decomposed into an asymmetry and a conditional symmetry term; Taylor-expanding the latter alleviates numerical issues. Taken together, we propose Symmetric f Actor-Critic (Sf-AC), the first practical BRPO algorithm with symmetric divergences. Experimental results on distribution approximation and MuJoCo verify that Sf-AC performs competitively.
zh
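
摘要的关键一步是用 f-散度的泰勒展开化解对称散度的数值问题。其直觉可用一个通用数值例子说明:当 p≈q 时,任意 f-散度都近似于二阶项 f''(1)/2 · χ²(p||q);以对称的 Jensen–Shannon 散度为例(f''(1)=1/4),两者吻合得很好。注意这只是 f-散度的一般性质演示,并非论文中有限阶解析策略的具体推导:

```python
# 数值示意:对称 f-散度的二阶泰勒近似(通用性质,非论文原推导)
import numpy as np

def js_divergence(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

p = np.array([0.30, 0.40, 0.30])
q = np.array([0.32, 0.38, 0.30])
print(js_divergence(p, q))   # ≈ 2.90e-4
print(chi2(p, q) / 8)        # ≈ 2.88e-4,系数为 f''(1)/2 = 1/8,两者吻合
```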

[AI-34] A Hybrid AI Methodology for Generating Ontologies of Research Topics from Scientific Paper Corpora

【速读】:该论文旨在解决传统研究主题分类体系(如MeSH、UMLS、CSO等)依赖人工构建所导致的效率低、更新滞后及粒度粗糙的问题。其解决方案的关键在于提出了一种半自动化的方法Sci-OG,通过三个步骤实现:1)主题发现(Topic Discovery),从科研文献中提取潜在的研究主题;2)关系分类(Relationship Classification),利用编码器语言模型结合主题在科学文献中的出现特征,识别主题间的语义关系;3)本体构建(Ontology Construction),对主题进行精炼与结构化组织。其中,关系分类模块是系统核心,采用融合语言模型与文献统计特征的机制,在21,649条人工标注的语义三元组数据上达到F1分数0.951,显著优于多种基线方法(包括微调后的SciBERT和GPT4-mini),验证了其在扩展CSO本体(如网络安全领域)中的有效性。

链接: https://arxiv.org/abs/2508.04213
作者: Alessia Pisu,Livio Pompianu,Francesco Osborne,Diego Reforgiato Recupero,Daniele Riboni,Angelo Salatino
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM) play a central role in providing the primary framework through which intelligent systems can explore and interpret the literature. However, these resources have traditionally been manually curated, a process that is time-consuming, prone to obsolescence, and limited in granularity. This paper presents Sci-OG, a semi-automated methodology for generating research topic ontologies, employing a multi-step approach: 1) Topic Discovery, extracting potential topics from research papers; 2) Relationship Classification, determining semantic relationships between topic pairs; and 3) Ontology Construction, refining and organizing topics into a structured ontology. The relationship classification component, which constitutes the core of the system, integrates an encoder-based language model with features describing topic occurrence in the scientific literature. We evaluate this approach against a range of alternative solutions using a dataset of 21,649 manually annotated semantic triples. Our method achieves the highest F1 score (0.951), surpassing various competing approaches, including a fine-tuned SciBERT model and several LLM baselines, such as the fine-tuned GPT4-mini. Our work is corroborated by a use case which illustrates the practical application of our system to extend the CSO ontology in the area of cybersecurity. The presented solution is designed to improve the accessibility, organization, and analysis of scientific knowledge, thereby supporting advancements in AI-enabled literature management and research exploration.
zh

[AI-35] NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

【速读】:该论文旨在解决传统自动语音识别(ASR)与文本到语音合成(TTS)系统中对副语言发声(paralinguistic vocalizations,如笑声、吸气声、“嗯”“哦”等词汇化插话)建模不足的问题,这类发声在自然口语交流中承载情感、意图和交互线索,但长期被忽视。解决方案的关键在于提出一个统一的可扩展流水线NVSpeech,其核心包括:(1)构建首个包含48,430条人工标注语句的多类别副语言发声数据集(18类词级标签);(2)开发一种面向副语言发声的ASR模型,将非言语声作为可解码的内联标记(如“你真有趣 [Laughter]”),实现词汇与非言语信息的联合转录,并基于此自动标注出首个大规模中文副语言发声语料库(174,179条语句,573小时);(3)在人类标注与自标注数据上微调零样本TTS模型,实现对副语言发声的显式控制,支持上下文感知的任意token位置插入,从而生成更自然的人类语音。该方案首次实现了副语言发声在中文表达性语音建模中的端到端识别与可控合成。

链接: https://arxiv.org/abs/2508.04195
作者: Huan Liao,Qinke Ni,Yuancheng Wang,Yiheng Lu,Haoyue Zhan,Pengyuan Xie,Qiang Zhang,Zhizheng Wu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Paralinguistic vocalizations, including non-verbal sounds like laughter and breathing as well as lexicalized interjections such as “uhm” and “oh”, are integral to natural spoken communication. Despite their importance in conveying affect, intent, and interactional cues, such cues remain largely overlooked in conventional automatic speech recognition (ASR) and text-to-speech (TTS) systems. We present NVSpeech, an integrated and scalable pipeline that bridges the recognition and synthesis of paralinguistic vocalizations, encompassing dataset construction, ASR modeling, and controllable TTS. (1) We introduce a manually annotated dataset of 48,430 human-spoken utterances with 18 word-level paralinguistic categories. (2) We develop a paralinguistic-aware ASR model, which treats paralinguistic cues as inline decodable tokens (e.g., “You’re so funny [Laughter]”), enabling joint lexical and non-verbal transcription. This model is then used to automatically annotate a large corpus, the first large-scale Chinese dataset of 174,179 utterances (573 hours) with word-level alignment and paralinguistic cues. (3) We finetune zero-shot TTS models on both human- and auto-labeled data to enable explicit control over paralinguistic vocalizations, allowing context-aware insertion at arbitrary token positions for human-like speech synthesis. By unifying the recognition and generation of paralinguistic vocalizations, NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin, integrating recognition and synthesis in a scalable and controllable manner. Dataset and audio demos are available at this https URL.
zh

[AI-36] Quasi-Clique Discovery via Energy Diffusion

【速读】:该论文旨在解决准团(quasi-clique)发现问题,即在图数据中识别出边密度不低于给定阈值的子图,这在社交网络、生物信息学和电子商务等领域具有广泛应用。现有方法多依赖贪婪规则、相似性度量或元启发式搜索,但在不同图结构上难以同时保证效率与解的一致性。论文提出的解决方案EDQC(Energy Diffusion-based Quasi-Clique discovery)的关键在于引入能量扩散机制(energy diffusion),通过从源节点随机传播能量,使能量自然聚集于结构紧密区域,从而无需显式枚举候选子图即可高效发现密集子图,且不依赖特定数据集调参。实验表明,EDQC在30个真实数据集上 consistently 发现更大规模的准团,并显著降低解质量的方差,是首个将能量扩散机制应用于准团发现的方法。

链接: https://arxiv.org/abs/2508.04174
作者: Yu Zhang,Yilong Luo,Mingyuan Ma,Yao Chen,Enqiang Zhu,Jin Xu,Chanjuan Liu
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Discovering quasi-cliques – subgraphs with edge density no less than a given threshold – is a fundamental task in graph mining, with broad applications in social networks, bioinformatics, and e-commerce. Existing heuristics often rely on greedy rules, similarity measures, or metaheuristic search, but struggle to maintain both efficiency and solution consistency across diverse graphs. This paper introduces EDQC, a novel quasi-clique discovery algorithm inspired by energy diffusion. Instead of explicitly enumerating candidate subgraphs, EDQC performs stochastic energy diffusion from source vertices, naturally concentrating energy within structurally cohesive regions. The approach enables efficient dense subgraph discovery without exhaustive search or dataset-specific tuning. Experimental results on 30 real-world datasets demonstrate that EDQC consistently discovers larger quasi-cliques than state-of-the-art baselines on the majority of datasets, while also yielding lower variance in solution quality. To the best of our knowledge, EDQC is the first method to incorporate energy diffusion into quasi-clique discovery.
zh
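
“能量扩散”可以理解为带衰减与重注入的随机游走:能量天然会在结构紧密的区域累积。下面是基于这一直觉的假设性小实验(扩散规则为本文示意所设,并非 EDQC 的精确算法):

```python
# 示意性草图:随机游走式能量扩散,取高能量顶点作为准团候选(假设性实现)
import random
import networkx as nx

def energy_diffusion(G, source, steps=10000, decay=0.9):
    energy = {v: 0.0 for v in G}
    v, e = source, 1.0
    for _ in range(steps):
        energy[v] += e
        v = random.choice(list(G[v]))   # 沿随机邻居传播能量
        e *= decay
        if e < 1e-6:
            v, e = source, 1.0          # 能量耗尽后从源点重新注入
    return energy

def edge_density(G, nodes):
    H = G.subgraph(nodes)
    n = len(nodes)
    return 2 * H.number_of_edges() / (n * (n - 1))

random.seed(0)
G = nx.karate_club_graph()
energy = energy_diffusion(G, source=0)
top = sorted(energy, key=energy.get, reverse=True)[:6]  # 能量最高的 6 个顶点
print(top, "density = %.2f" % edge_density(G, top))
```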

[AI-37] Generic-to-Specific Reasoning and Learning for Scalable Ad Hoc Teamwork

【速读】:该论文旨在解决**即兴团队协作(ad hoc teamwork)**中AI代理在缺乏预先协调的情况下如何有效与其他代理(人类或AI系统)协同工作的难题。现有数据驱动方法依赖大规模标注数据、透明度低且难以快速适应环境变化,尤其在代理数量增加时决策复杂性显著上升。解决方案的关键在于融合知识驱动与数据驱动方法的优势:通过非单调逻辑推理(non-monotonic logical reasoning),每个代理能够结合三类信息进行决策——(a) 先验常识性和领域特定知识;(b) 快速学习并更新以预测其他代理行为的模型;© 基于基础模型中相似情境的通用知识推断未来抽象目标。该架构在VirtualHome物理模拟环境中验证了其有效性。

链接: https://arxiv.org/abs/2508.04163
作者: Hasra Dodampegama,Mohan Sridharan
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:AI agents deployed in assistive roles often have to collaborate with other agents (humans, AI systems) without prior coordination. Methods considered state of the art for such ad hoc teamwork often pursue a data-driven approach that needs a large labeled dataset of prior observations, lacks transparency, and makes it difficult to rapidly revise existing knowledge in response to changes. As the number of agents increases, the complexity of decision-making makes it difficult to collaborate effectively. This paper advocates leveraging the complementary strengths of knowledge-based and data-driven methods for reasoning and learning for ad hoc teamwork. For any given goal, our architecture enables each ad hoc agent to determine its actions through non-monotonic logical reasoning with: (a) prior commonsense domain-specific knowledge; (b) models learned and revised rapidly to predict the behavior of other agents; and (c) anticipated abstract future goals based on generic knowledge of similar situations in an existing foundation model. We experimentally evaluate our architecture’s capabilities in VirtualHome, a realistic physics-based 3D simulation environment.
zh

[AI-38] Experimental Analysis of Productive Interaction Strategy with ChatGPT : User Study on Function and Project-level Code Generation Tasks

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件工程(Software Engineering, SE)任务中应用时存在的现实工作流适配不足问题,特别是针对复杂代码生成场景下人类-大模型交互(Human-LLM Interaction, HLI)效率低下的现象。现有研究多局限于函数级任务和常见提示模式,忽视了类级别以上的依赖关系及交互过程中影响生产力的关键因素。其解决方案的关键在于设计了一个涵盖两个项目级基准任务的实验,通过36名来自不同背景的参与者与GPT助手进行特定提示模式交互,并结合屏幕录制与聊天日志分析其行为特征,从而系统识别出显著影响代码生成生产力的三个HLI特征、提炼出五条提升HLI生产力的核心指导原则,并构建了29种运行时和逻辑错误的分类体系及其缓解策略。

链接: https://arxiv.org/abs/2508.04125
作者: Sangwon Hyun,Hyunjun Kim,Jinhyuk Jang,Hyojin Choi,M. Ali Babar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: The benchmark repository has not been publicly released yet due to the IP policy in our institutions. If you would like to use the benchmark or collaborate on extension, please contact "this http URL@gmail.com"

点击查看摘要

Abstract:The application of Large Language Models (LLMs) is growing in the productive completion of Software Engineering tasks. Yet, studies investigating productive prompting techniques have often employed a limited problem space, primarily focusing on well-known prompting patterns and mainly targeting function-level SE practices. We identify significant gaps in real-world workflows that involve complexities beyond the class level (e.g., multi-class dependencies) and different features that can impact Human-LLM Interaction (HLI) processes in code generation. To address these issues, we designed an experiment that comprehensively analyzed the HLI features regarding code generation productivity. Our study presents two project-level benchmark tasks, extending beyond function-level evaluations. We conducted a user study with 36 participants from diverse backgrounds, asking them to solve the assigned tasks by interacting with the GPT assistant using specific prompting patterns. We also examined the participants’ experience and their behavioral features during interactions by analyzing screen recordings and GPT chat logs. Our statistical and empirical investigation revealed (1) that three out of 15 HLI features significantly impacted the productivity in code generation; (2) five primary guidelines for enhancing productivity for HLI processes; and (3) a taxonomy of 29 runtime and logic errors that can occur during HLI processes, along with suggested mitigation plans.
zh

[AI-39] A Compositional Framework for On-the-Fly LTLf Synthesis ECAI2025

【速读】:该论文旨在解决基于有限迹线性时序逻辑(LTLf)的反应式合成问题中,Deterministic Finite Automaton(DFA)构造所面临的计算复杂度高(最坏情况下为2EXPTIME-complete)和状态空间爆炸难题。现有方法要么在游戏求解前进行组合式DFA构造并依赖自动机最小化缓解状态膨胀,要么在游戏求解过程中增量构建DFA以避免完整构造,但两者均未展现出绝对优势。论文提出一种组合式“按需合成”(compositional on-the-fly synthesis)框架,其关键在于将组合操作从传统的DFA构造阶段转移到游戏求解过程中执行,从而结合两种策略的优势:一方面利用组合性在游戏进行中逐步生成中间结果,另一方面通过剪枝(pruning)机制简化后续组合并提前检测不可实现性(unrealizability)。该框架支持两种剪枝策略——组合前剪枝以最大化最小化收益,或组合中剪枝以引导按需合成过程,显著提升了对实际场景中常见大规模合取式LTLf公式的求解能力。

链接: https://arxiv.org/abs/2508.04116
作者: Yongkang Li,Shengping Xiao,Shufang Zhu,Jianwen Li,Geguang Pu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, accepted by ECAI 2025

点击查看摘要

Abstract:Reactive synthesis from Linear Temporal Logic over finite traces (LTLf) can be reduced to a two-player game over a Deterministic Finite Automaton (DFA) of the LTLf specification. The primary challenge here is DFA construction, which is 2EXPTIME-complete in the worst case. Existing techniques either construct the DFA compositionally before solving the game, leveraging automata minimization to mitigate state-space explosion, or build the DFA incrementally during game solving to avoid full DFA construction. However, neither is dominant. In this paper, we introduce a compositional on-the-fly synthesis framework that integrates the strengths of both approaches, focusing on large conjunctions of smaller LTLf formulas common in practice. This framework applies composition during game solving instead of automata (game arena) construction. While composing all intermediate results may be necessary in the worst case, pruning these results simplifies subsequent compositions and enables early detection of unrealizability. Specifically, the framework allows two composition variants: pruning before composition to take full advantage of minimization or pruning during composition to guide on-the-fly synthesis. Compared to state-of-the-art synthesis solvers, our framework is able to solve a notable number of instances that other solvers cannot handle. A detailed analysis shows that both composition variants have unique merits.
zh

[AI-40] Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement

【速读】:该论文旨在解决自动化评分系统在评估短答案时缺乏不确定性识别能力的问题,即系统难以判断其评分决策是否可能引发争议或存在歧义。解决方案的关键在于提出“语义熵(semantic entropy)”这一指标,通过计算多个GPT-4生成的解释理由在基于蕴含关系聚类后的多样性来量化评分不确定性的程度,从而无需依赖最终分数即可捕捉人类评分者之间的分歧,实现对AI评分结果可信度的可解释性监控。

链接: https://arxiv.org/abs/2508.04105
作者: Karrtik Iyer,Manikandan Ravikiran,Prasanna Pendse,Shayan Mohanty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated grading systems can efficiently score short-answer responses, yet they often fail to indicate when a grading decision is uncertain or potentially contentious. We introduce semantic entropy, a measure of variability across multiple GPT-4-generated explanations for the same student response, as a proxy for human grader disagreement. By clustering rationales via entailment-based similarity and computing entropy over these clusters, we quantify the diversity of justifications without relying on final output scores. We address three research questions: (1) Does semantic entropy align with human grader disagreement? (2) Does it generalize across academic subjects? (3) Is it sensitive to structural task features such as source dependency? Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. Our findings position semantic entropy as an interpretable uncertainty signal that supports more transparent and trustworthy AI-assisted grading workflows.
zh

[AI-41] SenseCrypt: Sensitivity-guided Selective Homomorphic Encryption for Joint Federated Learning in Cross-Device Scenarios

【速读】:该论文旨在解决同态加密(Homomorphic Encryption, HE)在跨设备联邦学习(Federated Learning, FL)场景中因高计算开销和适应成本导致的效率瓶颈问题,尤其是在客户端数据分布异构(heterogeneous data)和系统能力差异显著时,传统选择性HE方法易加剧客户端延迟(straggling)且难以有效降低HE开销。解决方案的关键在于提出SenseCrypt框架,其核心创新是基于模型参数敏感性(sensitivity)构建自适应加密策略:首先通过隐私保护聚类将具有相似数据分布的客户端分组;其次设计评分机制量化每组内可安全加密的参数比例以避免straggling;最后针对每个客户端建立多目标优化模型,在最小化HE计算开销的同时最大化模型安全性,从而实现跨设备FL中的安全与效率平衡。

链接: https://arxiv.org/abs/2508.04100
作者: Borui Li,Li Yan,Junhao Han,Jianmin Liu,Lei Yu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 17 pages, 19 figures

点击查看摘要

Abstract:Homomorphic Encryption (HE) prevails in securing Federated Learning (FL), but suffers from high overhead and adaptation cost. Selective HE methods, which partially encrypt model parameters by a global mask, are expected to protect privacy with reduced overhead and easy adaptation. However, in cross-device scenarios with heterogeneous data and system capabilities, traditional Selective HE methods deteriorate client straggling, and suffer from degraded HE overhead reduction performance. Accordingly, we propose SenseCrypt, a Sensitivity-guided selective Homomorphic EnCryption framework, to adaptively balance security and HE overhead per cross-device FL client. Given the observation that model parameter sensitivity is effective for measuring clients’ data distribution similarity, we first design a privacy-preserving method to respectively cluster the clients with similar data distributions. Then, we develop a scoring mechanism to deduce the straggler-free ratio of model parameters that can be encrypted by each client per cluster. Finally, for each client, we formulate and solve a multi-objective model parameter selection optimization problem, which minimizes HE overhead while maximizing model security without causing straggling. Experiments demonstrate that SenseCrypt ensures security against the state-of-the-art inversion attacks, while achieving normal model accuracy as on IID data, and reducing training time by 58.4%-88.7% as compared to traditional HE methods.
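
下面用几行代码示意“按敏感度选择性加密”的核心思想(假设性草图,省略了论文中的隐私保护聚类与打分机制;he_encrypt 为占位的同态加密接口,实际可替换为 Paillier/CKKS 等封装):

```python
import numpy as np

def selective_encryption_mask(sensitivity, ratio):
    """按敏感度从高到低选出应加密的参数掩码。
    sensitivity: 各参数敏感度(如梯度幅值);ratio: 该客户端不致掉队的可加密比例。"""
    k = int(len(sensitivity) * ratio)
    idx = np.argsort(sensitivity)[::-1][:k]        # 最敏感的前 k 个参数
    mask = np.zeros(len(sensitivity), dtype=bool)
    mask[idx] = True
    return mask

def protect_update(update, mask, he_encrypt):
    """敏感参数走同态加密,其余明文上传,以降低 HE 开销。"""
    return {"cipher": he_encrypt(update[mask]), "plain": update[~mask], "mask": mask}

rng = np.random.default_rng(0)
update = rng.normal(size=1000)                     # 模拟一份模型更新
sens = np.abs(update)                              # 以幅值模拟敏感度
msg = protect_update(update, selective_encryption_mask(sens, 0.3), he_encrypt=lambda x: x)  # 占位“加密”
print(msg["cipher"].shape, msg["plain"].shape)     # (300,) (700,)
```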
zh

[AI-42] GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在地理空间任务中面临的三大挑战:空间一致性不足、多跳推理能力弱以及地理偏差问题。其解决方案的关键在于提出GeoSR(Geospatial Self-Refining framework),该框架通过嵌入地理学核心原理(尤其是Tobler第一定律)构建一个迭代式代理推理机制,包含三个协作代理:变量选择代理(从同一位置选取相关协变量)、点选择代理(从邻近位置的历史预测中选取参考预测)和精炼代理(评估预测质量并决定是否触发新一轮迭代)。该机制利用空间依赖性和变量间关系实现逐步优化,显著提升了地理空间预测的准确性与公平性。

链接: https://arxiv.org/abs/2508.04080
作者: Jinfan Tang,Kunming Wu,Ruifeng Gongxie,Yuya He,Yuankai Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Other Statistics (stat.OT)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Recent studies have extended the application of large language models (LLMs) to geographic problems, revealing surprising geospatial competence even without explicit spatial supervision. However, LLMs still face challenges in spatial consistency, multi-hop reasoning, and geographic bias. To address these issues, we propose GeoSR, a self-refining agentic reasoning framework that embeds core geographic principles – most notably Tobler’s First Law of Geography – into an iterative prediction loop. In GeoSR, the reasoning process is decomposed into three collaborating agents: (1) a variable-selection agent that selects relevant covariates from the same location; (2) a point-selection agent that chooses reference predictions at nearby locations generated by the LLM in previous rounds; and (3) a refine agent that coordinates the iterative refinement process by evaluating prediction quality and triggering further rounds when necessary. This agentic loop progressively improves prediction quality by leveraging both spatial dependencies and inter-variable relationships. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction. Experimental results show consistent improvements over standard prompting strategies, demonstrating that incorporating geostatistical priors and spatially structured reasoning into LLMs leads to more accurate and equitable geospatial predictions. The code of GeoSR is available at this https URL.
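
摘要中“利用空间依赖迭代修正预测”的骨架可以简化为如下循环(假设性示意:真实系统中参考预测与协变量由 LLM 代理推理给出,这里只保留 Tobler 第一定律对应的反距离加权结构):

```python
import numpy as np

def tobler_refine(coords, init_pred, rounds=3, k=5):
    """每轮把每个位置的预测向其 k 个最近邻上一轮预测的反距离加权均值收缩。"""
    pred = init_pred.copy()
    for _ in range(rounds):
        new_pred = pred.copy()
        for i, p in enumerate(coords):
            d = np.linalg.norm(coords - p, axis=1)
            nbr = np.argsort(d)[1:k + 1]           # 排除自身,取 k 个最近邻
            w = 1.0 / (d[nbr] + 1e-6)              # 距离越近,权重越大
            new_pred[i] = 0.5 * pred[i] + 0.5 * np.average(pred[nbr], weights=w)
        pred = new_pred
    return pred

rng = np.random.default_rng(1)
coords = rng.uniform(size=(50, 2))
noisy = coords.sum(axis=1) + rng.normal(0, 0.3, 50)   # 带噪的初始预测
print(tobler_refine(coords, noisy)[:5])
```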
zh

[AI-43] KG-Augmented Executable CoT for Mathematical Coding

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中表现不佳的问题,特别是数学推理和代码生成方面的局限性。其核心解决方案是提出KG-Augmented Executable Chain-of-Thought (KGA-ECoT) 框架,该框架的关键在于:通过构建结构化的任务图(Structured Task Graph)分解问题,并利用GraphRAG从数学知识库中高效检索精准知识以增强代码生成质量;同时,生成可执行的代码并通过外部执行验证结果,从而确保计算精度。实验证明,该方法在多个数学推理基准上显著优于传统提示方法,准确率提升达数个百分点至十余个百分点。

链接: https://arxiv.org/abs/2508.04072
作者: Xingyu Chen,Junxiu An,Jun Guo,Li Wang,Jingcai Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages,2figures,6 tables

点击查看摘要

Abstract:In recent years, large language models (LLMs) have excelled in natural language processing tasks but face significant challenges in complex reasoning tasks such as mathematical reasoning and code generation. To address these limitations, we propose KG-Augmented Executable Chain-of-Thought (KGA-ECoT), a novel framework that enhances code generation through knowledge graphs and improves mathematical reasoning via executable code. KGA-ECoT decomposes problems into a Structured Task Graph, leverages efficient GraphRAG for precise knowledge retrieval from mathematical libraries, and generates verifiable code to ensure computational accuracy. Evaluations on multiple mathematical reasoning benchmarks demonstrate that KGA-ECoT significantly outperforms existing prompting methods, achieving absolute accuracy improvements ranging from several to over ten percentage points. Further analysis confirms the critical roles of GraphRAG in enhancing code quality and external code execution in ensuring precision. These findings collectively establish KGA-ECoT as a robust and highly generalizable framework for complex mathematical reasoning tasks.
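
“可执行思维链”的闭环(检索知识、生成代码、执行校验)可用下面的假设性草图说明:llm_generate 与 retrieve_knowledge 均为虚构接口,论文中检索由 GraphRAG 完成,且真实系统应在进程级沙箱中执行代码。

```python
def executable_cot(question, llm_generate, retrieve_knowledge):
    """检索数学知识 -> 让 LLM 生成代码 -> 执行并取回 answer 变量。"""
    context = retrieve_knowledge(question)
    code = llm_generate(f"已知:{context}\n请用 Python 求解:{question},把结果存入变量 answer。")
    scope = {}
    try:
        # 仅放行少量内建函数;真实系统应使用隔离沙箱
        exec(code, {"__builtins__": {"sum": sum, "range": range}}, scope)
    except Exception as err:
        return None, f"执行失败: {err}"
    return scope.get("answer"), "ok"

fake_llm = lambda prompt: "answer = sum(i * i for i in range(1, 11))"  # 用固定字符串充当 LLM 输出
fake_kg = lambda q: "平方和公式 n(n+1)(2n+1)/6"
print(executable_cot("1 到 10 的平方和", fake_llm, fake_kg))  # (385, 'ok')
```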
zh

[AI-44] Personalized Knowledge Transfer Through Generative AI: Contextualizing Learning to Individual Career Goals

【速读】:该论文试图解决的问题是:如何通过个性化学习内容提升学习者的参与度、满意度和学习效率,特别是在人工智能日益融入数字学习环境的背景下。解决方案的关键在于利用生成式 AI (Generative AI, GenAI) 对学习内容进行基于学习者职业目标的动态适配,从而增强内容的相关性和实用性。实验结果表明,这种基于职业目标的内容定制显著延长了学习会话时长、提高了满意度,并在保持认知深度的同时适度缩短了学习时间,证明了AI驱动的个性化策略能够有效连接学术知识与职场应用,提升学习效果。

链接: https://arxiv.org/abs/2508.04070
作者: Ronja Mehlan,Claudia Hess,Quintus Stierstorfer,Kristina Schaaff
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As artificial intelligence becomes increasingly integrated into digital learning environments, the personalization of learning content to reflect learners’ individual career goals offers promising potential to enhance engagement and long-term motivation. In our study, we investigate how career goal-based content adaptation in learning systems based on generative AI (GenAI) influences learner engagement, satisfaction, and study efficiency. The mixed-methods experiment involved more than 4,000 learners, with one group receiving learning scenarios tailored to their career goals and a control group. Quantitative results show increased session duration, higher satisfaction ratings, and a modest reduction in study duration compared to standard content. Qualitative analysis highlights that learners found the personalized material motivating and practical, enabling deep cognitive engagement and strong identification with the content. These findings underscore the value of aligning educational content with learners’ career goals and suggest that scalable AI personalization can bridge academic knowledge and workplace applicability.
zh

[AI-45] DRIVE: Dynamic Rule Inference and Verified Evaluation for Constraint-Aware Autonomous Driving

【速读】:该论文旨在解决自动驾驶中软约束(soft constraints)难以显式建模的问题,这些约束通常具有隐含性、情境依赖性和复杂性,传统方法难以有效捕捉人类驾驶行为中的偏好与规则。解决方案的关键在于提出DRIVE框架,其核心是通过指数族似然建模(exponential-family likelihood modeling)从专家演示中动态推断出随场景变化的软行为规则分布,并将这些概率化的规则嵌入基于凸优化的规划模块中,从而在轨迹生成过程中同时保证动力学可行性与人类偏好合规性。此方法实现了规则推理与轨迹决策的紧密耦合,相较于固定约束或纯奖励建模的方法,具备更强的数据驱动泛化能力与可验证的可行性保障。

链接: https://arxiv.org/abs/2508.04066
作者: Longling Geng,Huangxing Li,Viktor Lado Naess,Mert Pilanci
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and adhering to soft constraints is essential for safe and socially compliant autonomous driving. However, such constraints are often implicit, context-dependent, and difficult to specify explicitly. In this work, we present DRIVE, a novel framework for Dynamic Rule Inference and Verified Evaluation that models and evaluates human-like driving constraints from expert demonstrations. DRIVE leverages exponential-family likelihood modeling to estimate the feasibility of state transitions, constructing a probabilistic representation of soft behavioral rules that vary across driving contexts. These learned rule distributions are then embedded into a convex optimization-based planning module, enabling the generation of trajectories that are not only dynamically feasible but also compliant with inferred human preferences. Unlike prior approaches that rely on fixed constraint forms or purely reward-based modeling, DRIVE offers a unified framework that tightly couples rule inference with trajectory-level decision-making. It supports both data-driven constraint generalization and principled feasibility verification. We validate DRIVE on large-scale naturalistic driving datasets, including inD, highD, and RoundD, and benchmark it against representative inverse constraint learning and planning baselines. Experimental results show that DRIVE achieves 0.0% soft constraint violation rates, smoother trajectories, and stronger generalization across diverse driving scenarios. Verified evaluations further demonstrate the efficiency, explanability, and robustness of the framework for real-world deployment.
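
摘要中“用指数族似然为状态转移的可行性打分”的想法可用如下草图说明(假设性示例:特征 phi 与参数 theta 均为虚构,配分函数按已估计处理;论文中该分数还会进一步接入凸优化规划器):

```python
import numpy as np

def log_likelihood(phi, theta, log_z):
    """指数族模型的对数似然:theta^T phi(s, s') - log Z。
    phi 为转移特征(如加速度、跟车间距、横向偏移),theta 从专家示范中估计。"""
    return float(theta @ phi) - log_z

def soft_constraint_ok(phi, theta, log_z, eps=-2.0):
    """对数似然高于阈值 eps 时,认为该转移符合学到的“软规则”。"""
    return log_likelihood(phi, theta, log_z) >= eps

theta = np.array([-1.5, -0.8, -0.3])      # 示意:惩罚急加速、过近间距与大偏移
phi_smooth = np.array([0.1, 0.2, 0.05])   # 平顺转移
phi_harsh = np.array([2.5, 1.8, 1.0])     # 激进转移
print(soft_constraint_ok(phi_smooth, theta, log_z=0.0))  # True
print(soft_constraint_ok(phi_harsh, theta, log_z=0.0))   # False
```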
zh

[AI-46] SEA: Self-Evolution Agent with Step-wise Reward for Computer Use

【速读】:该论文旨在解决当前计算机使用代理(Computer Use Agent)性能不足的问题,尤其是在长程任务执行中的效率与效果瓶颈。其关键解决方案包括:首先提出一种自动化的轨迹生成流水线,用于训练数据的可验证性构建;其次设计了一种高效的逐步强化学习策略,以缓解长程训练带来的显著计算资源需求;最后提出模型增强方法,在不增加额外训练的前提下将 grounding(具身理解)与 planning(规划能力)融合至单一模型中。通过上述创新,作者实现了仅用7B参数的Self-Evolution Agent(SEA),在性能上超越同规模模型,并接近更大模型的表现。

链接: https://arxiv.org/abs/2508.04037
作者: Liang Tang,Shuxian Li,Yuhao Cheng,Yukang Huo,Zhepeng Wang,Yiqiang Yan,Kaer Huang,Yanzhe Jing,Tiaonan Duan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer use agents are an emerging area in artificial intelligence that aims to operate computers to accomplish users' tasks, and they have attracted a lot of attention from both industry and academia. However, the performance of present agents is still far from ready for practical use. In this paper, we propose the Self-Evolution Agent (SEA) for computer use, and to develop this agent, we propose creative methods in data generation, reinforcement learning, and model enhancement. Specifically, we first propose an automatic pipeline to generate verifiable trajectories for training. Then, we propose efficient step-wise reinforcement learning to alleviate the significant computational requirements of long-horizon training. Finally, we propose an enhancement method that merges the grounding and planning abilities into one model without any extra training. Accordingly, based on our proposed innovations in data generation, training strategy, and enhancement, we obtain the Self-Evolution Agent (SEA) for computer use with only 7B parameters, which outperforms models with the same number of parameters and achieves comparable performance to larger ones. We will make the model weights and related code open-source in the future.
zh

[AI-47] A Comparative Survey of PyTorch vs TensorFlow for Deep Learning: Usability Performance and Deployment Trade-offs

【速读】:该论文旨在解决深度学习框架选择中的关键决策问题,即在TensorFlow和PyTorch之间权衡其在可用性、性能与部署灵活性方面的差异。解决方案的关键在于系统性地对比两个框架的编程范式(TensorFlow的图执行模式与PyTorch的动态计算图)、训练与推理速度、生产部署能力(如TensorFlow Lite、TensorFlow Serving vs. TorchScript、TorchServe),以及生态支持与研究趋势,从而为从业者提供清晰的选型依据:PyTorch因其简洁性和灵活性更适用于研究场景,而TensorFlow凭借成熟的工业级工具链更适合生产环境。

链接: https://arxiv.org/abs/2508.04035
作者: Zakariya Ba Alawi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 15 figures, 43 references

点击查看摘要

Abstract:This paper presents a comprehensive comparative survey of TensorFlow and PyTorch, the two leading deep learning frameworks, focusing on their usability, performance, and deployment trade-offs. We review each framework's programming paradigm and developer experience, contrasting TensorFlow's graph-based (now optionally eager) approach with PyTorch's dynamic, Pythonic style. We then compare model training speeds and inference performance across multiple tasks and data regimes, drawing on recent benchmarks and studies. Deployment flexibility is examined in depth - from TensorFlow's mature ecosystem (TensorFlow Lite for mobile/embedded, TensorFlow Serving, and JavaScript support) to PyTorch's newer production tools (TorchScript compilation, ONNX export, and TorchServe). We also survey ecosystem and community support, including library integrations, industry adoption, and research trends (e.g., PyTorch's dominance in recent research publications versus TensorFlow's broader tooling in enterprise). Applications in computer vision, natural language processing, and other domains are discussed to illustrate how each framework is used in practice. Finally, we outline future directions and open challenges in deep learning framework design, such as unifying eager and graph execution, improving cross-framework interoperability, and integrating compiler optimizations (XLA, JIT) for improved speed. Our findings indicate that while both frameworks are highly capable for state-of-the-art deep learning, they exhibit distinct trade-offs: PyTorch offers simplicity and flexibility favored in research, whereas TensorFlow provides a fuller production-ready ecosystem - understanding these trade-offs is key for practitioners selecting the appropriate tool. We include charts, code snippets, and more than 20 references to academic papers and official documentation to support this comparative analysis.
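
文中反复对比的“动态图 vs 静态图”编程范式,可用下面的最小对照示意(需本地安装 torch 与 tensorflow;只演示两种 API 风格,不构成任何性能结论):

```python
import torch
import tensorflow as tf

# PyTorch:动态图,逐行立即执行,便于调试
x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)
loss = (x @ w).relu().sum()
loss.backward()                      # 梯度随调用即时计算
print(w.grad.shape)                  # torch.Size([3, 2])

# TensorFlow:@tf.function 把 Python 函数编译为静态计算图以便优化
@tf.function
def step(x, w):
    return tf.reduce_sum(tf.nn.relu(tf.matmul(x, w)))

xt = tf.random.normal((4, 3))
wt = tf.Variable(tf.random.normal((3, 2)))
with tf.GradientTape() as tape:
    out = step(xt, wt)
print(tape.gradient(out, wt).shape)  # (3, 2)
```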
zh

[AI-48] Enhancing Serendipity Recommendation System by Constructing Dynamic User Knowledge Graphs with Large Language Models

【速读】:该论文旨在解决工业推荐系统中存在的反馈循环导致的内容同质化、过滤气泡效应以及用户满意度下降的问题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)动态构建用户知识图谱,并通过两阶段框架实现更富偶然性(serendipity)的推荐:第一阶段为两跳兴趣推理(two-hop interest reasoning),基于用户静态画像与历史行为,由LLM动态生成知识图谱并进行深度推理以挖掘潜在兴趣;第二阶段为近线适应(Near-line adaptation),提出一种融合用户到物品(u2i)与物品到物品(i2i)检索能力的高效模型,在保证高转化率的同时显著提升推荐新颖性与用户体验。在线实验表明,该方法在Dewu应用中有效提升了多项关键指标,验证了其在工业场景下的实用性与有效性。

链接: https://arxiv.org/abs/2508.04032
作者: Qian Yong,Yanhui Li,Jialiang Shi,Yaguang Dou,Tian Qi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:The feedback loop in industrial recommendation systems reinforces homogeneous content, creates filter bubble effects, and diminishes user satisfaction. Recently, large language models(LLMs) have demonstrated potential in serendipity recommendation, thanks to their extensive world knowledge and superior reasoning capabilities. However, these models still face challenges in ensuring the rationality of the reasoning process, the usefulness of the reasoning results, and meeting the latency requirements of industrial recommendation systems (RSs). To address these challenges, we propose a method that leverages llm to dynamically construct user knowledge graphs, thereby enhancing the serendipity of recommendation systems. This method comprises a two stage framework:(1) two-hop interest reasoning, where user static profiles and historical behaviors are utilized to dynamically construct user knowledge graphs via llm. Two-hop reasoning, which can enhance the quality and accuracy of LLM reasoning results, is then performed on the constructed graphs to identify users’ potential interests; and(2) Near-line adaptation, a cost-effective approach to deploying the aforementioned models in industrial recommendation systems. We propose a u2i (user-to-item) retrieval model that also incorporates i2i (item-to-item) retrieval capabilities, the retrieved items not only exhibit strong relevance to users’ newly emerged interests but also retain the high conversion rate of traditional u2i retrieval. Our online experiments on the Dewu app, which has tens of millions of users, indicate that the method increased the exposure novelty rate by 4.62%, the click novelty rate by 4.85%, the average view duration per person by 0.15%, unique visitor click through rate by 0.07%, and unique visitor interaction penetration by 0.30%, enhancing user experience.
zh

[AI-49] Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement

【速读】:该论文旨在解决图形用户界面(GUI)代理在自动化移动任务中面临的输入冗余和决策模糊性问题。其核心解决方案在于提出一种不确定性感知的代理RecAgent,通过自适应感知机制降低两类不确定性:一是由屏幕信息冗余和噪声引起的感知不确定性,通过组件推荐机制聚焦于最相关的UI元素以减少输入复杂度;二是由任务模糊性和复杂推理引发的决策不确定性,通过交互模块在不确定情境下请求用户反馈,实现意图感知的决策。该方法将上述机制整合进统一框架,结合人机协同优化,在复杂场景下显著提升了GUI代理的任务执行成功率。

链接: https://arxiv.org/abs/2508.04025
作者: Chao Hao,Shuai Wang,Kaiwen Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphical user interface (GUI) agents have shown promise in automating mobile tasks but still struggle with input redundancy and decision ambiguity. In this paper, we present RecAgent, an uncertainty-aware agent that addresses these issues through adaptive perception. We distinguish two types of uncertainty in GUI navigation: (1) perceptual uncertainty, caused by input redundancy and noise from comprehensive screen information, and (2) decision uncertainty, arising from ambiguous tasks and complex reasoning. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. These components are integrated into a unified framework that proactively reduces input complexity and reacts to high-uncertainty cases via human-in-the-loop refinement. Additionally, we propose a dataset called ComplexAction to evaluate the success rate of GUI agents in executing specified single-step actions within complex scenarios. Extensive experiments validate the effectiveness of our approach. The dataset and code will be available at this https URL.
zh

[AI-50] Identity Theft in AI Conference Peer Review

【速读】:该论文旨在解决人工智能(AI)研究领域中科学同行评审过程中日益严重的身份盗用问题,即不诚实的研究人员通过伪造审稿人身份来操纵论文评审结果,暴露出当前审稿人招募流程和身份验证机制的漏洞。解决方案的关键在于建立更严格的审稿人身份验证体系,并改进学术出版流程中的安全防护措施,以防范此类欺诈行为对学术诚信造成系统性损害。

链接: https://arxiv.org/abs/2508.04024
作者: Nihar B. Shah,Melisa Bok,Xukun Liu,Andrew McCallum
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:We discuss newly uncovered cases of identity theft in the scientific peer-review process within artificial intelligence (AI) research, with broader implications for other academic procedures. We detail how dishonest researchers exploit the peer-review system by creating fraudulent reviewer profiles to manipulate paper evaluations, leveraging weaknesses in reviewer recruitment workflows and identity verification processes. The findings highlight the critical need for stronger safeguards against identity theft in peer review and academia at large, and to this end, we also propose mitigating strategies.
zh

[AI-51] StepWrite: Adaptive Planning for Speech-Driven Text Generation

【速读】:该论文旨在解决当前语音转文字系统在处理长文本、复杂语境内容时的局限性,尤其是在用户处于移动状态且无法视觉追踪进度的场景下,传统语音输入工具和对话式语音助手难以支持结构化、持续性的写作需求。其解决方案的关键在于提出StepWrite系统——一个基于大语言模型(Large Language Model, LLM)驱动的语音交互系统,通过将写作过程分解为可管理的子任务,并以情境感知的非视觉音频提示序列引导用户,实现无手无眼条件下的结构化文本创作。该系统通过模型自动承担上下文跟踪与自适应规划任务,显著降低认知负荷,同时动态调整提示内容以匹配用户意图演化,从而在保持用户自主性的同时提升可用性和满意度。

链接: https://arxiv.org/abs/2508.04011
作者: Hamza El Alaoui,Atieh Taheri,Yi-Hao Peng,Jeffrey P. Bigham
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to UIST 2025. For additional materials and project details, please see: this https URL

点击查看摘要

Abstract:People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions–capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually-aware non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive planning tasks to the models. Unlike baseline methods like standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load, improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite’s capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions in enhancing hands-free and eye-free communication in everyday multitasking scenarios.
zh

[AI-52] Galaxy: A Cognition-Centered Framework for Proactive Privacy-Preserving and Self-Evolving LLM Agents

【速读】:该论文旨在解决智能个人助理(Intelligent Personal Assistant, IPA)在主动行为(proactive behavior)、隐私保护和自我进化能力方面的研究空白与技术挑战。现有IPA多聚焦于响应式功能,缺乏自主决策与持续优化的能力,且难以在保障用户隐私的前提下实现个性化演进。解决方案的关键在于提出Cognition Forest——一种将认知建模与系统设计统一为自增强循环的语义结构,并基于此构建Galaxy框架,支持多维交互与个性化能力生成;其中KoRa代理实现认知增强的响应与主动技能,Kernel代理则通过元认知机制实现系统的自我进化与隐私保护,实验表明该方案在多项基准测试中优于当前主流方法。

链接: https://arxiv.org/abs/2508.03991
作者: Chongyu Bao,Ruimin Dai,Yangbo Shen,Runyang Jian,Jinghan Zhang,Xiaolan Liu,Kunpeng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent personal assistants (IPAs) such as Siri and Google Assistant are designed to enhance human capabilities and perform tasks on behalf of users. The emergence of LLM agents brings new opportunities for the development of IPAs. While responsive capabilities have been widely studied, proactive behaviors remain underexplored. Designing an IPA that is proactive, privacy-preserving, and capable of self-evolution remains a significant challenge. Designing such IPAs relies on the cognitive architecture of LLM agents. This work proposes Cognition Forest, a semantic structure designed to align cognitive modeling with system-level design. We unify cognitive architecture and system design into a self-reinforcing loop instead of treating them separately. Based on this principle, we present Galaxy, a framework that supports multidimensional interactions and personalized capability generation. Two cooperative agents are implemented based on Galaxy: KoRa, a cognition-enhanced generative agent that supports both responsive and proactive skills; and Kernel, a meta-cognition-based meta-agent that enables Galaxy’s self-evolution and privacy preservation. Experimental results show that Galaxy outperforms multiple state-of-the-art benchmarks. Ablation studies and real-world interaction cases validate the effectiveness of Galaxy.
zh

[AI-53] Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework

【速读】:该论文旨在解决现代传感系统中用户隐私控制不足的问题,尤其是在配备惯性测量单元(Inertial Measurement Unit, IMU)的设备(如智能手机和可穿戴设备)中,这些设备持续采集丰富的时间序列数据,可能无意间暴露敏感用户行为。传统隐私保护方法通常依赖静态预定义的隐私标签或需要大量私有训练数据,难以适应用户动态变化的隐私偏好。解决方案的关键在于提出PrivCLIP框架,其核心是通过多模态对比学习将IMU数据与自然语言活动描述对齐至共享嵌入空间,实现基于少量样本的敏感活动检测;当识别出隐私敏感活动时,系统利用语言引导的活动净化器与运动生成模块(IMU-GPT)将原始数据转换为语义上类似于非敏感活动的合规版本,从而在保障隐私的同时维持数据效用。

链接: https://arxiv.org/abs/2508.03989
作者: Ajesh Koyatan Chathoth,Shuhao Yu,Stephen Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User-controllable privacy is important in modern sensing systems, as privacy preferences can vary significantly from person to person and may evolve over time. This is especially relevant in devices equipped with Inertial Measurement Unit (IMU) sensors, such as smartphones and wearables, which continuously collect rich time-series data that can inadvertently expose sensitive user behaviors. While prior work has proposed privacy-preserving methods for sensor data, most rely on static, predefined privacy labels or require large quantities of private training data, limiting their adaptability and user agency. In this work, we introduce PrivCLIP, a dynamic, user-controllable, few-shot privacy-preserving sensing framework. PrivCLIP allows users to specify and modify their privacy preferences by categorizing activities as sensitive (black-listed), non-sensitive (white-listed), or neutral (gray-listed). Leveraging a multimodal contrastive learning approach, PrivCLIP aligns IMU sensor data with natural language activity descriptions in a shared embedding space, enabling few-shot detection of sensitive activities. When a privacy-sensitive activity is identified, the system uses a language-guided activity sanitizer and a motion generation module (IMU-GPT) to transform the original data into a privacy-compliant version that semantically resembles a non-sensitive activity. We evaluate PrivCLIP on multiple human activity recognition datasets and demonstrate that it significantly outperforms baseline methods in terms of both privacy protection and data utility.
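
“共享嵌入空间中的少样本敏感活动检测 + 语言引导净化”这两步可用如下草图示意(假设性代码:嵌入用随机向量模拟,generate_motion 对应论文中的 IMU-GPT,均为占位):

```python
import numpy as np

def detect_activity(imu_emb, text_embs, labels):
    """在共享嵌入空间里用余弦相似度做少样本活动判别。"""
    sims = text_embs @ imu_emb / (
        np.linalg.norm(text_embs, axis=1) * np.linalg.norm(imu_emb) + 1e-9)
    return labels[int(np.argmax(sims))]

def sanitize(imu_seq, activity, blacklist, generate_motion):
    """命中黑名单时,用运动生成模块产出语义上像非敏感活动的替代序列。"""
    return generate_motion("walking") if activity in blacklist else imu_seq

rng = np.random.default_rng(0)
labels = ["walking", "typing password", "smoking"]
text_embs = rng.normal(size=(3, 64))                 # 各活动文本描述的(模拟)嵌入
imu_emb = text_embs[1] + rng.normal(0, 0.1, 64)      # 模拟一段接近“typing password”的 IMU 嵌入
act = detect_activity(imu_emb, text_embs, labels)
imu_seq = rng.normal(size=(200, 6))
safe_seq = sanitize(imu_seq, act, {"typing password", "smoking"},
                    generate_motion=lambda a: rng.normal(size=(200, 6)))
print(act, safe_seq is imu_seq)                      # typing password False(已被替换)
```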
zh

[AI-54] he Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

【速读】:该论文旨在解决大语言推理模型(MLRMs)在人本服务导向场景下,因对用户情绪线索敏感而导致的安全协议失效问题,尤其在深度思考阶段易受高情感强度干扰,从而产生有害输出。解决方案的关键在于提出EmoAgent框架,这是一个自主的对抗性情绪代理系统,通过构造夸张的情绪提示(affective prompts)来劫持模型的推理路径,揭示出即使视觉风险被正确识别,仍可能因情绪错位而生成有害内容的深层缺陷。该框架进一步量化了三种典型风险模式:有害推理隐蔽度(RRSS)、视觉忽视率(RVNR)和拒绝态度不一致性(RAIC),从而暴露模型内部推理与表面行为之间的不对齐现象,突破现有基于内容的安全防护机制局限。

链接: https://arxiv.org/abs/2508.03986
作者: Yuan Xun,Xiaojun Jia,Xinwei Liu,Hua Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We observe that MLRMs oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) Risk-Reasoning Stealth Score (RRSS) for harmful reasoning beneath benign outputs; (2) Risk-Visual Neglect Rate (RVNR) for unsafe completions despite visual risk recognition; and (3) Refusal Attitude Inconsistency (RAIC) for evaluating refusal unstability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior.
zh

[AI-55] Human-Centered Human-AI Interaction (HC-HAII): A Human-Centered AI Perspective

【速读】:该论文旨在解决当前人-人工智能交互(Human-AI Interaction, HAII)研究中过度聚焦技术而忽视人类需求的问题,提出以人类为中心的人-人工智能交互(Human-Centered HAII, HC-HAII)框架作为解决方案。其关键在于将人类置于HAII研究与应用的核心位置,强调采用以人为本的方法论,而非技术驱动导向,并通过引入人本方法、跨学科团队协作、多层级设计范式等系统性策略,推动HAII理论与实践的协同发展。

链接: https://arxiv.org/abs/2508.03969
作者: Wei Xu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This chapter systematically promotes an emerging interdisciplinary field of human-artificial intelligence interaction (human-AI interaction, HAII) from a human-centered AI (HCAI) perspective. It introduces a framework of human-centered HAII (HC-HAII). HC-HAII places humans at the core of HAII research and applications, emphasizing the importance of adopting a human-centered approach over a technology-centered one. The chapter presents the HC-HAII methodology, including human-centered methods, process, interdisciplinary teams, and multi-level design paradigms. It also highlights key research challenges and future directions. As the first chapter, this chapter also provides a structural overview of this book, which brings together contributions from an interdisciplinary community of researchers and practitioners to advance the theory, methodology, and applications of HCAI in diverse domains of HAII. The purpose of this chapter is to provide a fundamental framework for this book, centered on HAII research and applications based on the HCAI approach, which will pave the way for the content of subsequent chapters.
zh

[AI-56] Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?

【速读】:该论文旨在解决从时间序列数据中自动发现隐藏的符号规律(symbolic laws)这一核心科学发现挑战,其目标是提升大语言模型(Large Language Models, LLMs)在复杂现实场景下进行可解释、上下文对齐的符号结构推理能力。解决方案的关键在于提出一个统一框架,将LLMs与遗传编程(genetic programming)相结合,构建闭环符号推理系统:LLMs同时担任预测器和评估者角色,从而实现对多变量符号回归、布尔网络推断和因果发现等任务的系统性评估与优化。实证结果表明,融合领域知识、上下文对齐性和结构化推理机制是提升LLMs在自动化科学发现中性能的核心路径。

链接: https://arxiv.org/abs/2508.03963
作者: Zewen Liu,Juntong Ni,Xianfeng Tang,Max S.Y. Lau,Wei Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.
zh

[AI-57] Constraint-Preserving Data Generation for Visuomotor Policy Learning

【速读】:该论文旨在解决机器人操作中大规模演示数据收集成本高、耗时长的问题,提出了一种名为Constraint-Preserving Data Generation (CP-Gen) 的方法,通过单条专家轨迹生成包含新颖物体几何形状和位姿的机器人演示数据。其解决方案的关键在于将机器人技能建模为关键点轨迹约束(keypoint-trajectory constraints),即机器人或抓取物体上的关键点需相对于任务相关的物体跟踪参考轨迹;在此基础上,通过采样物体的位姿和几何变换,并将其应用于物体及其关联的关键点或关键点轨迹,再优化机器人关节配置以使关键点跟踪变换后的轨迹,最终生成无碰撞的运动规划路径。该方法使得训练出的闭环视觉-运动策略能够在零样本迁移至真实世界并泛化于物体几何与位姿的变化。

链接: https://arxiv.org/abs/2508.03944
作者: Kevin Lin,Varun Ragunath,Andrew McAlinden,Aaditya Prasad,Jimmy Wu,Yuke Zhu,Jeannette Bohg
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: CoRL 2025. Website: this https URL

点击查看摘要

Abstract:Large-scale demonstration data has powered key breakthroughs in robot manipulation, but collecting that data remains costly and time-consuming. We present Constraint-Preserving Data Generation (CP-Gen), a method that uses a single expert trajectory to generate robot demonstrations containing novel object geometries and poses. These generated demonstrations are used to train closed-loop visuomotor policies that transfer zero-shot to the real world and generalize across variations in object geometries and poses. Similar to prior work using pose variations for data generation, CP-Gen first decomposes expert demonstrations into free-space motions and robot skills. But unlike those works, we achieve geometry-aware data generation by formulating robot skills as keypoint-trajectory constraints: keypoints on the robot or grasped object must track a reference trajectory defined relative to a task-relevant object. To generate a new demonstration, CP-Gen samples pose and geometry transforms for each task-relevant object, then applies these transforms to the object and its associated keypoints or keypoint trajectories. We optimize robot joint configurations so that the keypoints on the robot or grasped object track the transformed keypoint trajectory, and then motion plan a collision-free path to the first optimized joint configuration. Experiments on 16 simulation tasks and four real-world tasks, featuring multi-stage, non-prehensile and tight-tolerance manipulation, show that policies trained using CP-Gen achieve an average success rate of 77%, outperforming the best baseline that achieves an average of 50%.
zh

[AI-58] FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport

【速读】:该论文旨在解决高风险领域(如医疗、金融和刑事司法)中,公平性约束(通常基于AUC指标)与模型整体性能(即AUC表现)之间存在的冲突问题——即严格强制公平可能导致AUC显著下降。其解决方案的关键在于提出一种模型无关的后处理框架Fair Proportional Optimal Transport (FairPOT),该方法利用最优传输(Optimal Transport)技术,仅对劣势群体中比例可控的顶部λ分位数风险分数进行选择性调整,从而在降低AUC不公平性的同时保持甚至提升整体AUC性能。通过调节λ参数,FairPOT实现了公平性与效用之间的可调权衡,并进一步扩展至局部AUC场景,聚焦于高风险区域的公平干预,展现出优异的实验效果与实际部署潜力。

链接: https://arxiv.org/abs/2508.03940
作者: Pengxi Liu,Yi Shen,Matthew M. Engelhard,Benjamin A. Goldstein,Michael J. Pencina,Nicoleta J. Economou-Zavlanos,Michael M. Zavlanos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.
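
FairPOT 的核心操作可以在一维上用“分位数映射”(一维最优传输的闭式解)来示意:只搬运劣势组 top-λ 分位段的得分,使其与优势组的对应分位对齐。以下为按摘要理解写出的简化实现,部分 AUC 等扩展从略:

```python
import numpy as np

def fairpot_adjust(scores_adv, scores_dis, lam):
    """把劣势组得分最高的 top-lam 分位段,经分位数映射对齐到优势组的同段分位上。
    lam 越大越公平,但可能牺牲整体 AUC。"""
    s = scores_dis.copy()
    top = s >= np.quantile(s, 1.0 - lam)
    q = (np.argsort(np.argsort(s[top])) + 0.5) / top.sum()   # top 段内的相对分位
    target_q = (1.0 - lam) + lam * q                         # 映射回全体分位区间 [1-lam, 1)
    s[top] = np.quantile(scores_adv, target_q)               # 一维 OT:按分位逐点搬运
    return s

rng = np.random.default_rng(0)
adv = rng.normal(0.6, 0.15, 500)   # 优势组风险分
dis = rng.normal(0.4, 0.15, 500)   # 劣势组风险分
adj = fairpot_adjust(adv, dis, lam=0.3)
print(np.quantile(dis, 0.9), np.quantile(adj, 0.9), np.quantile(adv, 0.9))  # 调整后高分段分位数与优势组对齐
```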
zh

[AI-59] MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework

【速读】:该论文旨在解决NP-hard组合优化问题(COPs)中算法组件设计效率低下的核心挑战,传统方法依赖人工精心设计的策略,难以探索更优解空间。其解决方案的关键在于提出一种多策略优化框架MOTIF(Multi-strategy Optimization via Turn-based Interactive Framework),该框架基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS),通过两个LLM代理在回合制交互中轮流优化一组相互依赖的求解器组件,利用双方历史更新信息实现竞争性驱动与涌现式协作,从而显著拓宽搜索空间并发现多样且高性能的求解策略组合。

链接: https://arxiv.org/abs/2508.03929
作者: Nguyen Viet Tuan Kiet,Dao Van Tung,Tran Cong Dao,Huynh Thi Thanh Binh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 4 figures

点击查看摘要

Abstract:Designing effective algorithmic components remains a fundamental obstacle in tackling NP-hard combinatorial optimization problems (COPs), where solvers often rely on carefully hand-crafted strategies. Despite recent advances in using large language models (LLMs) to synthesize high-quality components, most approaches restrict the search to a single element - commonly a heuristic scoring function - thus missing broader opportunities for innovation. In this paper, we introduce a broader formulation of solver design as a multi-strategy optimization problem, which seeks to jointly improve a set of interdependent components under a unified objective. To address this, we propose Multi-strategy Optimization via Turn-based Interactive Framework (MOTIF) - a novel framework based on Monte Carlo Tree Search that facilitates turn-based optimization between two LLM agents. At each turn, an agent improves one component by leveraging the history of both its own and its opponent’s prior updates, promoting both competitive pressure and emergent cooperation. This structured interaction broadens the search landscape and encourages the discovery of diverse, high-performing solutions. Experiments across multiple COP domains show that MOTIF consistently outperforms state-of-the-art methods, highlighting the promise of turn-based, multi-agent prompting for fully automated solver design.
zh

[AI-60] Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data

【速读】:该论文旨在解决跨域时间序列数据中异常检测的性能提升问题,核心挑战在于如何有效利用有限标注样本实现高精度模型训练。其解决方案的关键在于结合主动学习(Active Learning)与迁移学习(Transfer Learning),通过智能选择最具信息量的样本进行标注以优化模型性能。研究发现,尽管主动学习能持续提升模型表现,但改进呈线性趋缓趋势,且单一聚类(即不进行聚类)时效果最优,表明样本选择策略对性能影响显著,同时验证了主动学习在合理实验设计下具有稳定有效性。

链接: https://arxiv.org/abs/2508.03921
作者: John D. Kelleher,Matthew Nicholson,Rahul Agrahari,Clare Conran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning and in general the best performance is achieved using a single cluster (in other words when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process and that this tail-off occurs when these less useful points are eventually added. Taken together our results indicate that active learning is effective but that the improvement in model performance follows a linear flat function concerning the number of points selected and labelled.
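
作为参照,下面给出一个标准的“不确定性采样”主动学习循环草图(使用 scikit-learn;与文中具体实验设置无一一对应,仅说明逐点选样、标注、再训练的流程):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, y_pool, X_init, y_init, budget=20):
    """每轮训练后,从采样池中挑出预测概率最接近 0.5(最不确定)的样本查询标签。
    按文中的实验设计,采样池与测试集应严格分离。"""
    X_train, y_train = list(X_init), list(y_init)
    pool_idx = list(range(len(X_pool)))
    for _ in range(budget):
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        proba = clf.predict_proba(X_pool[pool_idx])[:, 1]
        pick = pool_idx[int(np.argmin(np.abs(proba - 0.5)))]
        X_train.append(X_pool[pick]); y_train.append(y_pool[pick])  # 查询“标注”
        pool_idx.remove(pick)
    return clf

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = active_learning_loop(X[100:], y[100:], X[:20], y[:20])
print(model.score(X[20:100], y[20:100]))   # 留出集上的精度
```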
zh

[AI-61] Fast and Accurate Explanations of Distance-Based Classifiers by Uncovering Latent Explanatory Structures

【速读】:该论文旨在解决基于距离的分类器(distance-based classifiers,如k近邻和支持向量机)在实际应用中缺乏可解释性的问题,这限制了其在科学与工业场景中获取可信洞察的能力。解决方案的关键在于揭示了这类分类器内部隐藏的神经网络结构(由线性检测单元与非线性池化层组成),从而使得可解释人工智能(Explainable AI, XAI)技术(如逐层相关性传播,Layer-wise Relevance Propagation, LRP)能够被有效应用于这些模型,实现更清晰、可理解的预测解释。

链接: https://arxiv.org/abs/2508.03913
作者: Florian Bley,Jacob Kauffmann,Simon León Krug,Klaus-Robert Müller,Grégoire Montavon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Distance-based classifiers, such as k-nearest neighbors and support vector machines, continue to be a workhorse of machine learning, widely used in science and industry. In practice, to derive insights from these models, it is also important to ensure that their predictions are explainable. While the field of Explainable AI has supplied methods that are in principle applicable to any model, it has also emphasized the usefulness of latent structures (e.g. the sequence of layers in a neural network) to produce explanations. In this paper, we contribute by uncovering a hidden neural network structure in distance-based classifiers (consisting of linear detection units combined with nonlinear pooling layers) upon which Explainable AI techniques such as layer-wise relevance propagation (LRP) become applicable. Through quantitative evaluations, we demonstrate the advantage of our novel explanation approach over several baselines. We also show the overall usefulness of explaining distance-based models through two practical use cases.
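
对最简单的最近均值分类器,判别函数可以精确地按特征线性分解,这是“距离模型内部潜藏线性检测单元”思想的一个具体而微的例子(示意代码,并非论文中完整的 LRP 流程):

```python
import numpy as np

def nearest_mean_relevance(x, mu_pos, mu_neg):
    """判别函数 f(x) = ||x - mu_neg||^2 - ||x - mu_pos||^2 可逐特征分解:
    r_i = (x_i - mu_neg_i)^2 - (x_i - mu_pos_i)^2,且 f(x) = sum_i r_i。
    r_i > 0 的特征把 x 推向正类,可直接作为相关性得分。"""
    r = (x - mu_neg) ** 2 - (x - mu_pos) ** 2
    return r, r.sum()

x = np.array([1.0, 0.2, -0.5])
mu_pos = np.array([1.0, 0.0, 0.0])
mu_neg = np.array([-1.0, 0.0, 1.0])
relevance, score = nearest_mean_relevance(x, mu_pos, mu_neg)
print(relevance, score)   # [4. 0. 2.] 6.0:第 1、3 维特征主导了正类判定
```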
zh

[AI-62] Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning

【速读】:该论文旨在解决葡萄物候期(phenology)精准预测问题,以支持精细化葡萄园管理决策,如灌溉和施肥时间的优化。传统生物物理模型虽可用于季节性预测,但其精度不足;而深度学习方法受限于品种级别数据稀疏的问题。解决方案的关键在于提出一种混合建模方法,将多任务学习(multi-task learning)与循环神经网络(recurrent neural network)结合,用于参数化可微分的生物物理模型。该方法通过多任务学习实现不同品种间的共享学习,同时保留生物结构信息,从而显著提升预测准确性与鲁棒性。

链接: https://arxiv.org/abs/2508.03898
作者: William Solow,Sandhya Saisubramanian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of grape phenology is essential for timely vineyard management decisions, such as scheduling irrigation and fertilization, to maximize crop yield and quality. While traditional biophysical models calibrated on historical field data can be used for season-long predictions, they lack the precision required for fine-grained vineyard management. Deep learning methods are a compelling alternative but their performance is hindered by sparse phenology datasets, particularly at the cultivar level. We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure, thereby improving the robustness and accuracy of predictions. Empirical evaluation using real-world and synthetic datasets demonstrates that our method significantly outperforms both conventional biophysical models and baseline deep learning approaches in predicting phenological stages, as well as other crop state variables such as cold-hardiness and wheat yield.
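
“神经网络只负责输出生物物理模型参数”的混合建模思路可用如下 PyTorch 草图说明(假设性示例:物候模型被简化为可微的积温(GDD)累积,品种间通过共享 GRU 主干加独立输出头实现多任务学习):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhenologyNet(nn.Module):
    """多任务示意:GRU 读入逐日气象序列,为可微积温模型输出品种相关参数,
    物候预测本身仍由生物物理结构(GDD 累积)给出。"""
    def __init__(self, n_feat=4, n_cultivars=5, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_feat, hidden, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(hidden, 2) for _ in range(n_cultivars)])  # 各品种一个输出头

    def forward(self, weather, cultivar, temps):
        h = self.gru(weather)[1][-1]                         # 序列末端隐状态
        t_base, thresh = self.heads[cultivar](h).unbind(-1)  # 基温与触发阈值
        gdd = torch.relu(temps - t_base.unsqueeze(-1)).cumsum(-1)      # 可微的积温累积
        return torch.sigmoid(gdd - F.softplus(thresh).unsqueeze(-1))   # 达到物候期的“软”指示

net = PhenologyNet()
weather = torch.randn(8, 120, 4)                  # 8 块地 x 120 天 x 4 个气象特征
temps = weather[..., 0] * 10 + 15                 # 假设第 0 维经缩放后代表日均温
print(net(weather, cultivar=2, temps=temps).shape)  # torch.Size([8, 120])
```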
zh

[AI-63] Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE)

【速读】:该论文旨在解决当前组织在复杂数字环境中难以有效识别隐蔽攻击路径的问题,尤其是在传统攻击模拟方法无法充分覆盖动态威胁场景的情况下。解决方案的关键在于将安全混沌工程(Security Chaos Engineering, SCE)与漏洞攻击模拟(Breach Attack Simulation, BAS)平台相结合,构建一个由SCE协调器、连接器和BAS三层组成的结构化架构,利用MITRE Caldera执行自动化攻击序列,并基于现有威胁情报数据库中的对手画像(adversary profiles)生成推断的攻击树,从而显著提升攻击模拟的真实性与覆盖度,增强防御体系的韧性。

链接: https://arxiv.org/abs/2508.03882
作者: Arturo Sánchez-Matas,Pablo Escribano Ruiz,Daniel Díaz-López,Angel Luis Perales Gómez,Pantaleone Nespoli,Gregorio Martínez Pérez
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, paper in proceedings of the “X Jornadas Nacionales de Investigación en Ciberseguridad” in Zaragoza, Spain, June, 2025

点击查看摘要

Abstract:In today digital landscape, organizations face constantly evolving cyber threats, making it essential to discover slippery attack vectors through novel techniques like Security Chaos Engineering (SCE), which allows teams to test defenses and identify vulnerabilities effectively. This paper proposes to integrate SCE into Breach Attack Simulation (BAS) platforms, leveraging adversary profiles and abilities from existing threat intelligence databases. This innovative proposal for cyberattack simulation employs a structured architecture composed of three layers: SCE Orchestrator, Connector, and BAS layers. Utilizing MITRE Caldera in the BAS layer, our proposal executes automated attack sequences, creating inferred attack trees from adversary profiles. Our proposal evaluation illustrates how integrating SCE with BAS can enhance the effectiveness of attack simulations beyond traditional scenarios, and be a useful component of a cyber defense strategy.
zh

[AI-64] Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

【速读】:该论文旨在解决在摩尔定律(Moore’s law)和登纳德缩放定律(Dennard scaling)终结背景下,深度学习模型训练对数据量依赖过高导致的能效瓶颈问题。其核心解决方案是提出SICKLE框架,关键创新在于引入一种基于最大熵(maximum entropy, MaxEnt)的智能采样方法,通过稀疏但信息丰富的数据子集进行训练,在保持甚至提升模型精度的同时显著降低能耗,实测最高可实现38倍的能源效率提升。

链接: https://arxiv.org/abs/2508.03872
作者: Wesley Brewer,Murali Meena Gopalakrishnan,Matthias Maiterth,Aditya Kashi,Jong Youl Choi,Pei Zhang,Stephen Nichols,Riccardo Balin,Miles Couchman,Stephen de Bruyn Kops,P.K. Yeung,Daniel Dotson,Rohini Uma-Vaideswaran,Sarp Oral,Feiyi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 14 pages, 8 figures, 2 tables

点击查看摘要

Abstract:With the end of Moore’s law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
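
MaxEnt 子采样的直觉是让小样本的经验分布尽量“铺满”物理量的取值范围。下面是一个一维贪心示意(假设性实现:论文在更高维的相空间上操作,并需要可扩展的并行实现):

```python
import numpy as np

def maxent_subsample(data, m, bins=32):
    """贪心最大熵子采样:每步加入使子样本直方图熵增加最多的点。"""
    lo, hi = data.min(), data.max()

    def entropy(sample):
        hist, _ = np.histogram(sample, bins=bins, range=(lo, hi))
        p = hist[hist > 0] / hist.sum()
        return -(p * np.log(p)).sum()

    chosen = [int(np.argmax(np.abs(data - data.mean())))]   # 从一个极端点起步
    rest = set(range(len(data))) - set(chosen)
    while len(chosen) < m:
        best = max(rest, key=lambda i: entropy(np.append(data[chosen], data[i])))
        chosen.append(best); rest.remove(best)
    return np.array(chosen)

rng = np.random.default_rng(0)
field = rng.standard_normal(2000) ** 3       # 用重尾分布模拟湍流量
idx = maxent_subsample(field, m=64)
print(len(idx), field[idx].std(), field.std())  # 子样本刻意覆盖尾部,标准差偏大
```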
zh

[AI-65] Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在基于多模态大语言模型时面临的开放性和交互复杂性带来的安全风险,尤其是越狱攻击(jailbreak)和对抗攻击问题。现有防御方法依赖外部安全代理模块,存在保护能力有限和单点故障风险,且增加 guard agent 数量会显著提升成本与复杂度。其解决方案的关键在于提出 Evo-MARL 框架,通过多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)使所有任务智能体同时具备防御能力,无需额外安全模块即可实现内在化安全机制;并通过进化搜索与参数共享强化学习相结合的方式,协同演化攻击者与防御者,在持续对抗训练中提升系统整体鲁棒性和任务性能,从而在不引入额外开销的前提下实现安全性与实用性的共同增强。

链接: https://arxiv.org/abs/2508.03864
作者: Zhenyu Pan,Yiting Zhang,Yutong Zhang,Jianshu Zhang,Haozheng Luo,Yuwei Han,Dennis Wu,Hong-Yu Chen,Philip S. Yu,Manling Li,Han Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) built on multimodal large language models exhibit strong collaboration and performance. However, their growing openness and interaction complexity pose serious risks, notably jailbreak and adversarial attacks. Existing defenses typically rely on external guard modules, such as dedicated safety agents, to handle unsafe behaviors. Unfortunately, this paradigm faces two challenges: (1) standalone agents offer limited protection, and (2) their independence leads to single-point failure-if compromised, system-wide safety collapses. Naively increasing the number of guard agents further raises cost and complexity. To address these challenges, we propose Evo-MARL, a novel multi-agent reinforcement learning (MARL) framework that enables all task agents to jointly acquire defensive capabilities. Rather than relying on external safety modules, Evo-MARL trains each agent to simultaneously perform its primary function and resist adversarial threats, ensuring robustness without increasing system overhead or single-node failure. Furthermore, Evo-MARL integrates evolutionary search with parameter-sharing reinforcement learning to co-evolve attackers and defenders. This adversarial training paradigm internalizes safety mechanisms and continually enhances MAS performance under co-evolving threats. Experiments show that Evo-MARL reduces attack success rates by up to 22% while boosting accuracy by up to 5% on reasoning tasks-demonstrating that safety and utility can be jointly improved.
zh

[AI-66] MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统在运行时因具备推理、规划和执行能力而产生的新型代理风险(agent-related risks),这些问题无法通过传统的部署前治理手段完全预见和控制。其解决方案的关键在于提出首个完整的运行时治理框架 MI9,该框架包含六个集成组件:代理风险指数、代理语义遥测采集、持续授权监控、基于有限状态机(Finite-State-Machine, FSM)的合规引擎、目标条件漂移检测以及分级 containment 策略,从而实现对异构代理架构下 agentic AI 系统的实时安全管控与对齐,为规模化生产环境中安全部署提供系统性保障。

链接: https://arxiv.org/abs/2508.03858
作者: Charles L. Wang,Trisha Singhal,Ameya Kelkar,Jason Tuo
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9’s systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight.
zh

[AI-67] VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations

【速读】:该论文旨在解决参数化非线性偏微分方程(parameterized nonlinear partial differential equations)的前向与反演问题,特别是在地下水流动建模中,如何高效且准确地求解无压含水层中的非线性扩散方程。其解决方案的关键在于提出一种可分训练的代理模型——VAE-DNN,该模型由编码器、全连接神经网络和解码器组成,能够独立训练三个组件:编码器作为输入场 $ y(\mathbf{x}) $ 的变分自编码器(variational autoencoder, VAE)的一部分,解码器作为输出场 $ h(\mathbf{x},t) $ 的VAE的一部分,从而显著降低训练时间和能耗,同时在前向与反演求解精度上优于FNO和DeepONet等主流算子学习模型。

链接: https://arxiv.org/abs/2508.03839
作者: Yifei Zong,Alexandre M. Tartakovsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\mathbf{x})$ to a lower-dimensional latent space, $\boldsymbol{\mu}_{\boldsymbol{\phi}_y}$. Then, a fully connected neural network is used to map $\boldsymbol{\mu}_{\boldsymbol{\phi}_y}$ to the latent space, $\boldsymbol{\mu}_{\boldsymbol{\phi}_h}$, of the PDE solution $h(\mathbf{x},t)$. Finally, a decoder is utilized to reconstruct $h(\mathbf{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\mathbf{x})$ and the decoder as part of the $h(\mathbf{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models.
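
“三部件可分训练”的次序可以用如下 PyTorch 草图复现(极简示意:网络宽度、损失权重等均为假设;关键点在于两个 VAE 各自独立训练,潜空间映射网络随后单独训练):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, dim, latent):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.Tanh(), nn.Linear(64, dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # 重参数化采样
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(-1).mean()
    return ((recon - x) ** 2).sum(-1).mean() + 1e-3 * kl

vae_y, vae_h = VAE(128, 8), VAE(256, 8)            # 分别对应输入场 y 与解场 h
mapper = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))
y, h = torch.randn(32, 128), torch.randn(32, 256)  # 模拟一批训练数据

for vae, data in [(vae_y, y), (vae_h, h)]:         # 第 1、2 步:独立训练两个 VAE(各示意一步)
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    recon, mu, logvar = vae(data)
    vae_loss(data, recon, mu, logvar).backward(); opt.step()

opt = torch.optim.Adam(mapper.parameters(), lr=1e-3)  # 第 3 步:冻结两端,只训练潜空间映射
with torch.no_grad():
    zy, zh = vae_y.encode(y)[0], vae_h.encode(h)[0]
loss = ((mapper(zy) - zh) ** 2).mean()
loss.backward(); opt.step()
print(float(loss))
```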
zh

[AI-68] Mechanism Design for Facility Location using Predictions IJCAI2025

【速读】:该论文旨在解决带有预测信息的设施选址问题(facility location problem),其核心挑战在于如何在利用预测的最优设施位置提升性能的同时,保障机制在预测不准时的鲁棒性。解决方案的关键在于引入一种平等视角(egalitarian viewpoint),同时优化任意个体到设施的最大距离和最小效用,从而揭示了仅关注最大距离时所忽略的重要特性;在此基础上,设计出更具鲁棒性的新机制,并通过调节参数实现鲁棒性与一致性的权衡,进一步扩展至双设施选址场景,提出具有策略激励相容(strategy-proof)性质的新机制,且其一致性与鲁棒性均受控于两个预测位置。

链接: https://arxiv.org/abs/2508.03818
作者: Toby Walsh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: To appear in Proceedings of IJCAI 2025 workshop on Computational Fair Division

点击查看摘要

Abstract:We study mechanisms for the facility location problem augmented with predictions of the optimal facility location. We demonstrate that an egalitarian viewpoint which considers both the maximum distance of any agent from the facility and the minimum utility of any agent provides important new insights compared to a viewpoint that just considers the maximum distance. As in previous studies, we consider performance in terms of consistency (worst case when predictions are accurate) and robustness (worst case irrespective of the accuracy of predictions). By considering how mechanisms with predictions can perform poorly, we design new mechanisms that are more robust. Indeed, by adjusting parameters, we demonstrate how to trade robustness for consistency. We go beyond the single facility problem by designing novel strategy proof mechanisms for locating two facilities with bounded consistency and robustness that use two predictions for where to locate the two facilities.
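
下面用一个一维玩具机制演示“一致性(预测准时好)与鲁棒性(预测错时不太差)”的折中(强调:这不是论文中的机制,也未讨论其防策略性质;trust 为演示用参数):

```python
def facility_with_prediction(positions, pred, trust=0.5):
    """在“报告区间中点”(最大距离意义下的 minimax 最优)与预测点之间折中。"""
    lo, hi = min(positions), max(positions)
    center = (lo + hi) / 2
    clipped = min(max(pred, lo), hi)   # 把预测夹到报告区间内,限制其最坏影响
    return trust * clipped + (1 - trust) * center

def max_distance(facility, positions):
    return max(abs(p - facility) for p in positions)

agents = [0.0, 0.2, 0.9]               # 一维报告位置,minimax 最优为 0.45
for pred in (0.55, 0.0):               # 较准的预测 vs 很差的预测
    f = facility_with_prediction(agents, pred, trust=0.7)
    print(f"pred={pred}: facility={f:.3f}, max_dist={max_distance(f, agents):.3f}")
```

trust 越大,预测准确时越接近最优(一致性好),但坏预测下的最坏情况也越差(鲁棒性弱),这正是摘要中“以参数换取折中”的含义。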
zh

[AI-69] SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons

【速读】:该论文旨在解决土壤剖面(soil profile)中土壤发生层(soil horizon)分类的难题,该任务具有多模态(图像与地理时空元数据)、多任务特性以及复杂的层级标签结构(hierarchical label taxonomy),且标签数量庞大、分布极度不均衡。传统方法难以有效建模此类复杂结构,限制了土壤健康监测与农业可持续发展的精度。解决方案的关键在于提出一个名为SoilNet的模块化多模态多任务模型:首先利用图像和地理时空信息预测深度标记(depth markers),将土层分割为候选发生层;随后提取每段的形态学特征;最后通过融合多模态特征向量并引入基于图结构的标签表示(graph-based label representation),显式建模发生层之间的层级依赖关系,从而实现高精度的分类。

链接: https://arxiv.org/abs/2508.03785
作者: Teodor Chiaburu,Vipin Singh,Frank Haußer,Felix Bießmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures, 6 tables

点击查看摘要

Abstract:While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose SoilNet - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: this https URL
zh
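
下面用 PyTorch 勾勒论文描述的模块化流水线的数据流(图像 + 地理时空元数据 → 深度标记 → 分段特征 → 层位标签)。其中网络宽度、段数、标签数等均为假设值,基于图的标签表示从略,仅帮助理解结构,并非官方实现。

```python
import torch
import torch.nn as nn

class SoilPipelineSketch(nn.Module):
    """SoilNet 式流水线的结构示意(维度与层数均为假设)。"""
    def __init__(self, n_labels=50, n_segments=5, meta_dim=8):
        super().__init__()
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())          # -> 16 维图像特征
        self.meta_enc = nn.Sequential(nn.Linear(meta_dim, 16), nn.ReLU())
        self.depth_head = nn.Linear(32, n_segments)          # 各段的相对深度标记
        self.label_head = nn.Linear(32 + n_segments, n_segments * n_labels)
        self.n_segments, self.n_labels = n_segments, n_labels

    def forward(self, img, meta):
        h = torch.cat([self.img_enc(img), self.meta_enc(meta)], dim=-1)
        depths = torch.sigmoid(self.depth_head(h)).cumsum(-1)  # cumsum 保证深度单调递增
        logits = self.label_head(torch.cat([h, depths], dim=-1))
        return depths, logits.view(-1, self.n_segments, self.n_labels)

model = SoilPipelineSketch()
d, y = model(torch.randn(2, 3, 64, 64), torch.randn(2, 8))
print(d.shape, y.shape)   # torch.Size([2, 5]) torch.Size([2, 5, 50])
```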

[AI-70] Are Inherently Interpretable Models More Robust? A Study In Music Emotion Recognition

【速读】:该论文旨在解决深度学习模型在面对微小扰动(如对抗样本)时缺乏鲁棒性的问题,尤其是在音乐情感识别(Music Emotion Recognition, MER)任务中,模型对无关特征的敏感性可能导致输出剧烈变化。其解决方案的关键在于验证“内在可解释的深度模型”是否比黑箱模型更具鲁棒性——即通过设计更关注有意义且可解释特征的模型架构,在不依赖对抗训练的情况下提升对无关扰动的抵抗能力。实验表明,这类模型不仅在鲁棒性上优于传统黑箱模型,还能达到与对抗训练模型相当的性能,同时计算成本更低。

链接: https://arxiv.org/abs/2508.03780
作者: Katharina Hoedt,Arthur Flexer,Gerhard Widmer
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 8 pages, published in Proceedings of the 22nd Sound and Music Computing Conference 2025 (SMC-25)

点击查看摘要

Abstract:One of the desired key properties of deep learning models is the ability to generalise to unseen samples. When provided with new samples that are (perceptually) similar to one or more training samples, deep learning models are expected to produce correspondingly similar outputs. Models that succeed in predicting similar outputs for similar inputs are often called robust. Deep learning models, on the other hand, have been shown to be highly vulnerable to minor (adversarial) perturbations of the input, which manage to drastically change a model’s output and simultaneously expose its reliance on spurious correlations. In this work, we investigate whether inherently interpretable deep models, i.e., deep models that were designed to focus more on meaningful and interpretable features, are more robust to irrelevant perturbations in the data, compared to their black-box counterparts. We test our hypothesis by comparing the robustness of an interpretable and a black-box music emotion recognition (MER) model when challenged with adversarial examples. Furthermore, we include an adversarially trained model, which is optimised to be more robust, in the comparison. Our results indicate that inherently more interpretable models can indeed be more robust than their black-box counterparts, and achieve similar levels of robustness as adversarially trained models, at lower computational cost.
zh
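
论文通过对抗样本比较可解释模型与黑箱模型的鲁棒性。下面给出一个基于 FGSM 的最小评测草图(FGSM 是常见的对抗扰动基线,论文实际使用的攻击方式可能更强),用“攻击前后准确率之差”来衡量鲁棒性;玩具模型与数据仅用于演示接口。

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.01):
    """FGSM:沿损失对输入的梯度符号方向施加扰动。"""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

def robustness_drop(model, x, y, eps=0.01):
    """攻击前后准确率之差,越小说明模型越鲁棒。"""
    model.eval()
    clean = (model(x).argmax(-1) == y).float().mean().item()
    adv = (model(fgsm_attack(model, x, y, eps)).argmax(-1) == y).float().mean().item()
    return clean - adv

# 玩具模型演示;实际应分别传入可解释 MER 模型与黑箱 MER 模型进行对比
toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 4))
x, y = torch.randn(32, 1, 8, 8), torch.randint(0, 4, (32,))
print(robustness_drop(toy, x, y))
```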

[AI-71] When Agents Break Down in Multiagent Path Finding

【速读】:该论文针对多智能体路径规划(Multiagent Path Finding, MAPF)中因部分智能体发生故障导致延迟的问题提出解决方案。传统方法在每次故障后重新计算整个调度方案,计算开销大且不切实际。其关键创新在于提出一种无需全局重规划的动态调度自适应框架:通过设计局部协调协议,使智能体能够在运行时自主调整路径,从而将k次故障引起的makespan(完成时间)增长严格控制在k个时间步以内;同时,为应对智能体计算能力受限的情况,进一步提出次级协议将计算任务迁移至网络节点,实现高效、可扩展且鲁棒的多智能体导航机制。

链接: https://arxiv.org/abs/2508.03777
作者: Foivos Fioravantes,Dušan Knop,Nikolaos Melissinos,Michal Opler
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Multiagent Path Finding (MAPF), the goal is to compute efficient, collision-free paths for multiple agents navigating a network from their sources to targets, minimizing the schedule’s makespan-the total time until all agents reach their destinations. We introduce a new variant that formally models scenarios where some agents may experience delays due to malfunctions, posing significant challenges for maintaining optimal schedules. Recomputing an entirely new schedule from scratch after each malfunction is often computationally infeasible. To address this, we propose a framework for dynamic schedule adaptation that does not rely on full replanning. Instead, we develop protocols enabling agents to locally coordinate and adjust their paths on the fly. We prove that following our primary communication protocol, the increase in makespan after k malfunctions is bounded by k additional turns, effectively limiting the impact of malfunctions on overall efficiency. Moreover, recognizing that agents may have limited computational capabilities, we also present a secondary protocol that shifts the necessary computations onto the network’s nodes, ensuring robustness without requiring enhanced agent processing power. Our results demonstrate that these protocols provide a practical, scalable approach to resilient multiagent navigation in the face of agent failures.
zh
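
论文主协议的核心直觉是:故障发生时,相关智能体就地等待而不触发全局重规划,从而把 k 次故障的 makespan 增量限制在 k 个回合以内。下面的模拟器是这一直觉的极简草图(并非论文协议本身,未处理死锁等复杂情形):

```python
def simulate_with_waits(paths, malfunctions):
    """极简“就地等待”协议:故障智能体暂停一回合,其余智能体
    仅在下一格空闲时前进。paths: 各智能体的位置序列;
    malfunctions: {智能体编号: 故障时刻}。"""
    pos = [p[0] for p in paths]
    idx = [0] * len(paths)
    done = lambda: all(idx[i] == len(paths[i]) - 1 for i in range(len(paths)))
    t = 0
    while not done():
        occupied = set(pos)
        for i, p in enumerate(paths):
            if idx[i] == len(p) - 1 or malfunctions.get(i) == t:
                continue
            nxt = p[idx[i] + 1]
            if nxt not in occupied:            # 局部协调:目标格被占则原地等待
                occupied.discard(pos[i]); occupied.add(nxt)
                pos[i], idx[i] = nxt, idx[i] + 1
        t += 1
    return t

paths = [[(0, 0), (0, 1), (0, 2), (0, 3)], [(1, 0), (1, 1), (1, 2)]]
print(simulate_with_waits(paths, {}))        # 3:无故障时的 makespan
print(simulate_with_waits(paths, {0: 1}))    # 4:注入 1 次故障,至多多 1 个回合
```

输出与论文给出的上界一致:1 次故障后 makespan 至多增加 1 个回合。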

[AI-72] Revisiting Heat Flux Analysis of Tungsten Monoblock Divertor on EAST using Physics-Informed Neural Network

【速读】:该论文旨在解决核聚变装置EAST中热流密度估算的计算效率问题。传统基于有限元法(Finite Element Method, FEM)的模拟方法依赖于网格采样,导致计算效率低下,难以在实际实验中实现实时仿真。其解决方案的关键在于提出一种物理信息神经网络(Physics-Informed Neural Network, PINN),通过将热传导方程作为先验知识嵌入神经网络损失函数,结合空间坐标、时间戳以及少量数据驱动的采样点,同时优化边界条件、初始条件和物理一致性损失,从而在保持与FEM相当精度的前提下,实现约40倍的计算加速。

链接: https://arxiv.org/abs/2508.03776
作者: Xiao Wang,Zikang Yan,Hao Si,Zhendong Yang,Qingquan Yang,Dengdi Sun,Wanli Lyu,Jin Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Estimating heat flux in the nuclear fusion device EAST is a critically important task. Traditional scientific computing methods typically model this process using the Finite Element Method (FEM). However, FEM relies on grid-based sampling for computation, which is computationally inefficient and makes real-time simulation during actual experiments difficult. Inspired by artificial intelligence-powered scientific computing, this paper proposes a novel Physics-Informed Neural Network (PINN) to address this challenge, significantly accelerating the heat conduction estimation process while maintaining high accuracy. Specifically, given inputs of different materials, we first feed spatial coordinates and time stamps into the neural network, and compute boundary loss, initial condition loss, and physical loss based on the heat conduction equation. Additionally, we sample a small number of data points in a data-driven manner to better fit the specific heat conduction scenario, further enhancing the model’s predictive capability. We conduct experiments under both uniform and non-uniform heating conditions on the top surface. Experimental results show that the proposed thermal conduction physics-informed neural network achieves accuracy comparable to the finite element method, while achieving a 40 \times acceleration in computational efficiency. The dataset and source code will be released on this https URL.
zh
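
为说明“将热传导方程作为先验嵌入损失函数”的做法,下面给出一维热传导 PINN 的最小草图。论文处理的是三维钨偏滤器部件并叠加边界、初始与数据驱动损失,此处的方程维度、扩散系数与初始条件均为演示用假设。

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
alpha = 1.0  # 热扩散系数(假设值)

def pde_residual(x, t):
    """热传导方程残差 u_t - alpha * u_xx,训练目标是使其趋近 0。"""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=-1))
    g = lambda y, v: torch.autograd.grad(y, v, torch.ones_like(y), create_graph=True)[0]
    u_x = g(u, x)
    return g(u, t) - alpha * g(u_x, x)

x, t = torch.rand(256, 1), torch.rand(256, 1)
loss_pde = pde_residual(x, t).pow(2).mean()
loss_ic = (net(torch.cat([x, torch.zeros_like(t)], -1))
           - torch.sin(torch.pi * x)).pow(2).mean()   # 演示用初始条件
loss = loss_pde + loss_ic   # 实际还需叠加边界损失与少量数据驱动损失
loss.backward()
print(float(loss))
```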

[AI-73] U-PINet: End-to-End Hierarchical Physics-Informed Learning With Sparse Graph Coupling for 3D EM Scattering Modeling

【速读】:该论文旨在解决电磁(Electromagnetic, EM)散射建模在雷达遥感应用中的计算复杂性问题,传统数值求解器虽精度高但存在可扩展性差和计算成本高的缺陷,而纯数据驱动的深度学习方法则因缺乏物理约束且依赖大量标注数据,导致泛化能力受限。解决方案的关键在于提出一种U形物理信息神经网络(U-shaped Physics-Informed Network, U-PINet),其首次实现了基于深度学习的、具有物理一致性的分层EM建模框架:通过借鉴EM求解器中的分层分解策略与局部电磁耦合的稀疏特性,采用多尺度处理神经网络架构建模近远场相互作用,并引入受物理启发的稀疏图表示来高效刻画复杂三维(3D)目标网格单元间的自耦合与互耦关系,从而在保证物理一致性的同时显著提升计算效率、泛化能力和预测精度。

链接: https://arxiv.org/abs/2508.03774
作者: Rui Zhu,Yuexing Peng,Peng Wang,George C. Alexandropoulos,Wenbo Wang,Wei Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electromagnetic (EM) scattering modeling is critical for radar remote sensing, however, its inherent complexity introduces significant computational challenges. Traditional numerical solvers offer high accuracy, but suffer from scalability issues and substantial computational costs. Pure data-driven deep learning approaches, while efficient, lack physical constraints embedding during training and require extensive labeled data, limiting their applicability and generalization. To overcome these limitations, we propose a U-shaped Physics-Informed Network (U-PINet), the first fully deep-learning-based, physics-informed hierarchical framework for computational EM designed to ensure physical consistency while maximizing computational efficiency. Motivated by the hierarchical decomposition strategy in EM solvers and the inherent sparsity of local EM coupling, the U-PINet models the decomposition and coupling of near- and far-field interactions through a multiscale processing neural network architecture, while employing a physics-inspired sparse graph representation to efficiently model both self- and mutual-coupling among mesh elements of complex 3-Dimensional (3D) objects. This principled approach enables end-to-end multiscale EM scattering modeling with improved efficiency, generalization, and physical consistency. Experimental results showcase that the U-PINet accurately predicts surface current distributions, achieving close agreement with the traditional solver, while significantly reducing computational time and outperforming conventional deep learning baselines in both accuracy and robustness. Furthermore, our evaluations on radar cross section prediction tasks confirm the feasibility of the U-PINet for downstream EM scattering applications.
zh

[AI-74] Trustworthiness of Legal Considerations for the Use of LLMs in Education

【速读】:该论文旨在解决全球范围内教育领域中人工智能(Artificial Intelligence, AI)部署的伦理、法律及情境适配性问题,尤其关注不同区域监管框架与核心可信原则(如透明度、公平性、问责制、数据隐私和人类监督)的差异性整合。其解决方案的关键在于提出一个面向海湾合作委员会(Gulf Cooperation Council, GCC)地区的“以合规为中心的人工智能治理框架”(Compliance-Centered AI Governance Framework),该框架包含分层分类体系与机构检查清单,能够帮助监管者、教育工作者和开发者在遵循国际规范的同时,兼顾本地价值观与教育场景需求,从而推动合法、伦理且文化敏感的AI系统建设。

链接: https://arxiv.org/abs/2508.03771
作者: Sara Alaswad,Tatiana Kalganova,Wasan Awad
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 6 tables

点击查看摘要

Abstract:As Artificial Intelligence (AI), particularly Large Language Models (LLMs), becomes increasingly embedded in education systems worldwide, ensuring their ethical, legal, and contextually appropriate deployment has become a critical policy concern. This paper offers a comparative analysis of AI-related regulatory and ethical frameworks across key global regions, including the European Union, United Kingdom, United States, China, and Gulf Cooperation Council (GCC) countries. It maps how core trustworthiness principles, such as transparency, fairness, accountability, data privacy, and human oversight, are embedded in regional legislation and AI governance structures. Special emphasis is placed on the evolving landscape in the GCC, where countries are rapidly advancing national AI strategies and education-sector innovation. To support this development, the paper introduces a Compliance-Centered AI Governance Framework tailored to the GCC context. This includes a tiered typology and institutional checklist designed to help regulators, educators, and developers align AI adoption with both international norms and local values. By synthesizing global best practices with region-specific challenges, the paper contributes practical guidance for building legally sound, ethically grounded, and culturally sensitive AI systems in education. These insights are intended to inform future regulatory harmonization and promote responsible AI integration across diverse educational environments.
zh

[AI-75] Development of management systems using artificial intelligence systems and machine learning methods for boards of directors (preprint unofficial translation)

【速读】:该论文旨在解决人工智能(AI)在企业治理中从辅助决策工具向自主决策主体转变过程中,法律与伦理规范滞后于技术发展所带来的治理风险问题。其核心解决方案是提出一个“参考模型”,关键在于构建一套可操作的、面向自治AI系统的合规框架:首先引入“计算法”(computational law),将法律规则转化为机器可读的算法形式以消除自然语言的模糊性;其次定义“专用运行环境”(dedicated operational context),类似自动驾驶车辆的“设计运行域”,确保AI在明确边界内安全运行;同时通过合成数据训练嵌入公平性与伦理考量,并结合博弈论优化AI在约束下的最优策略,最终借助可解释人工智能(XAI)保障决策透明度与责任追溯,从而实现合法、可信且可控的AI治理。

链接: https://arxiv.org/abs/2508.03769
作者: Anna Romanova
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The study addresses the paradigm shift in corporate management, where AI is moving from a decision support tool to an autonomous decision-maker, with some AI systems already appointed to leadership roles in companies. A central problem identified is that the development of AI technologies is far outpacing the creation of adequate legal and ethical guidelines. The research proposes a “reference model” for the development and implementation of autonomous AI systems in corporate management. This model is based on a synthesis of several key components to ensure legitimate and ethical decision-making. The model introduces the concept of “computational law” or “algorithmic law”. This involves creating a separate legal framework for AI systems, with rules and regulations translated into a machine-readable, algorithmic format to avoid the ambiguity of natural language. The paper emphasises the need for a “dedicated operational context” for autonomous AI systems, analogous to the “operational design domain” for autonomous vehicles. This means creating a specific, clearly defined environment and set of rules within which the AI can operate safely and effectively. The model advocates for training AI systems on controlled, synthetically generated data to ensure fairness and ethical considerations are embedded from the start. Game theory is also proposed as a method for calculating the optimal strategy for the AI to achieve its goals within these ethical and legal constraints. The provided analysis highlights the importance of explainable AI (XAI) to ensure the transparency and accountability of decisions made by autonomous systems. This is crucial for building trust and for complying with the “right to explanation”.
zh

[AI-76] CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning ISWC

【速读】:该论文旨在解决呼吸系统疾病诊断中因标签和数据稀缺导致的AI模型性能受限问题,尤其针对新冠以外的疾病场景。其解决方案的关键在于提出CoughViT预训练框架,通过自监督学习中的掩码数据建模(masked data modelling)策略训练特征编码器,从而学习通用的咳嗽声表示,提升在小样本下游任务中的诊断性能。实验表明,该方法获得的表征在三个重要咳嗽分类任务中达到或超越当前最先进的监督音频表征效果。

链接: https://arxiv.org/abs/2508.03764
作者: Justin Luong,Hao Xue,Flora D. Salim
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted to ISWC

点击查看摘要

Abstract:Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient’s airways. In recent years, AI-based diagnostic systems operating on respiratory sounds, have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against other pre-training strategies on three diagnostically important cough classification tasks. Experimental results show that our representations match or exceed current state-of-the-art supervised audio representations in enhancing performance on downstream tasks.
zh
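
摘要中的“掩码数据建模”可用如下最小草图理解:随机遮蔽咳嗽声频谱的部分 patch,让编码器仅凭可见部分重建被遮蔽内容。patch 维度、遮蔽比例与网络结构均为假设值,并非 CoughViT 的官方实现。

```python
import torch
import torch.nn as nn

class MaskedSpectrogramModel(nn.Module):
    """掩码数据建模的示意:遮蔽部分频谱 patch 并重建,只在被遮蔽位置计损失。"""
    def __init__(self, dim=128, mask_ratio=0.6):
        super().__init__()
        self.embed = nn.Linear(16, dim)                  # 每个 patch 展平为 16 维
        self.mask_token = nn.Parameter(torch.zeros(dim)) # 可学习的掩码标记
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.decoder = nn.Linear(dim, 16)
        self.mask_ratio = mask_ratio

    def forward(self, patches):                          # patches: (B, N, 16)
        tok = self.embed(patches)
        mask = torch.rand(tok.shape[:2], device=tok.device) < self.mask_ratio
        tok[mask] = self.mask_token                      # 被遮蔽位置替换为掩码标记
        rec = self.decoder(self.encoder(tok))
        return ((rec - patches) ** 2)[mask].mean()       # 仅重建被遮蔽的 patch

model = MaskedSpectrogramModel()
loss = model(torch.randn(8, 64, 16))                     # 8 条样本,各 64 个 patch
loss.backward(); print(float(loss))
```

预训练完成后丢弃解码器,编码器输出即可作为下游咳嗽分类任务的通用表示。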

[AI-77] FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication

【速读】:该论文旨在解决大规模语言模型(LLMs)在分布式训练与部署过程中面临的通信瓶颈问题,尤其是低比特量化带来的传输效率与精度损失难题。其解决方案的关键在于提出了一种新型通信范式FlashCommunication V2,核心创新包括**位拆分(bit splitting)尖峰保留(spike reserving)**技术:前者将任意非标准位宽分解为硬件兼容的基本单元,实现任意比特宽度下的高效跨GPU传输;后者通过保留数值异常值(极小值与极大值)为浮点数,压缩动态数值范围,使量化极限可达到2比特且保持可接受的精度损失,从而显著提升通信系统的灵活性与资源利用率。

链接: https://arxiv.org/abs/2508.03760
作者: Qingyuan Li,Bo Zhang,Hui Kang,Tianhao Xu,Yulei Qian,Yuchen Xie,Lin Ma
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:Nowadays, communication bottlenecks have emerged as a critical challenge in the distributed training and deployment of large language models (LLMs). This paper introduces FlashCommunication V2, a novel communication paradigm enabling efficient cross-GPU transmission at arbitrary bit widths. Its core innovations lie in the proposed bit splitting and spike reserving techniques, which address the challenges of low-bit quantization. Bit splitting decomposes irregular bit widths into basic units, ensuring compatibility with hardware capabilities and thus enabling transmission at any bit width. Spike reserving, on the other hand, retains numerical outliers (i.e., minima and maxima) as floating-point numbers, which shrinks the dynamic numerical range and pushes the quantization limits to 2-bit with acceptable losses. FlashCommunication V2 significantly enhances the flexibility and resource utilization of communication systems. Through meticulous software-hardware co-design, it delivers robust performance and reduced overhead across both NVLink-based and PCIe-based architectures, achieving a maximum 3.2 \times speedup in AllReduce and 2 \times in All2All communication.
zh
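
“尖峰保留”的思路可以用一段 NumPy 草图直观说明:把少量数值离群值以浮点原样保留,其余数值在收缩后的动态范围内做 2-bit 均匀量化。离群值个数等参数为假设值,位拆分与跨 GPU 传输部分从略,非官方实现。

```python
import numpy as np

def spike_reserving_quantize(x, bits=2, n_spikes=4):
    """尖峰保留 + 低比特量化(示意):绝对值最大的 n_spikes 个数
    保留为浮点,其余数值在收缩后的范围内做均匀量化。"""
    spike_idx = np.argsort(np.abs(x))[-n_spikes:]
    spikes = x[spike_idx].copy()
    body = np.delete(x, spike_idx)
    lo, hi = body.min(), body.max()
    levels = 2 ** bits - 1                                # 2-bit 共 4 个量化级
    q = np.round((body - lo) / (hi - lo + 1e-12) * levels).astype(np.uint8)
    deq = q.astype(np.float64) / levels * (hi - lo) + lo  # 反量化,用于估计误差
    return q, (lo, hi), spike_idx, spikes, deq

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 60), [25.0, -30.0]])  # 含两个尖峰
q, dyn_range, idx, spikes, deq = spike_reserving_quantize(x)
print("收缩后的动态范围:", np.round(dyn_range, 3))
print("平均量化误差:", np.round(np.abs(deq - np.delete(x, idx)).mean(), 4))
```

去掉尖峰后动态范围大幅收缩,同样的 2-bit 量化级就能覆盖得更细,这正是摘要所说“将量化极限推至 2 比特且损失可接受”的原因。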

[AI-78] Data-Driven Discovery of Mobility Periodicity for Understanding Urban Transportation Systems

【速读】:该论文旨在解决复杂多维人类移动数据中周期性规律的量化问题,以揭示城市动态并支持决策制定与城市系统应用。其解决方案的关键在于将周期性识别问题建模为时间序列自回归中的稀疏主导正自相关性识别,从而从数据驱动且可解释的机器学习视角发现和量化显著的周期模式(如每周周期性)。该方法通过稀疏自回归模型实现对真实世界人流数据(如杭州地铁客流、纽约与芝加哥共享出行数据)的分析,不仅揭示了不同空间位置上可解释的周周期性,还捕捉到新冠疫情对移动规律的破坏作用及两地恢复趋势差异,体现了可解释机器学习在解析现实移动数据中的价值。

链接: https://arxiv.org/abs/2508.03747
作者: Xinyu Chen,Qi Wang,Yunhan Zheng,Nina Cao,HanQin Cai,Jinhua Zhao
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Uncovering the temporal regularity of human mobility is crucial for discovering urban dynamics and has implications for various decision-making processes and urban system applications. This study formulates the periodicity quantification problem in complex and multidimensional human mobility data as a sparse identification of dominant positive auto-correlations in time series autoregression, allowing one to discover and quantify significant periodic patterns such as weekly periodicity from a data-driven and interpretable machine learning perspective. We apply our framework to real-world human mobility data, including metro passenger flow in Hangzhou, China and ridesharing trips in New York City (NYC) and Chicago, USA, revealing the interpretable weekly periodicity across different spatial locations over the past several years. In particular, our analysis of ridesharing data from 2019 to 2024 demonstrates the disruptive impact of the COVID-19 pandemic on mobility regularity and the subsequent recovery trends, highlighting differences in the recovery pattern percentages and speeds between NYC and Chicago. We find that both NYC and Chicago experienced a remarkable reduction of weekly periodicity in 2020, and that the recovery of mobility regularity in NYC was faster than in Chicago. The interpretability of sparse autoregression provides insights into the underlying temporal patterns of human mobility, offering a valuable tool for understanding urban systems. Our findings highlight the potential of interpretable machine learning to unlock crucial insights from real-world mobility data.
zh
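
论文把周期量化表述为“时间序列自回归中稀疏主导正自相关的识别”。下面的草图用非负 Lasso 自回归在一条含周周期的合成日序列上复现这一思路,正则强度与滞后上限均为演示用假设:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, period = 400, 7                        # 合成日序列:周周期 + 噪声
y = 10 + 3 * np.sin(2 * np.pi * np.arange(n) / period) + rng.normal(0, 0.5, n)

max_lag = 10
X = np.column_stack([y[max_lag - k: n - k] for k in range(1, max_lag + 1)])
model = Lasso(alpha=0.05, positive=True).fit(X, y[max_lag:])  # positive=True 只保留正自相关

print("各滞后系数:", np.round(model.coef_, 3))
print("主导滞后 =", int(np.argmax(model.coef_)) + 1)           # 期望为 7,对应周周期
```

稀疏约束使绝大多数滞后系数为零,留下的主导正系数直接给出可解释的周期长度。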

[AI-79] Latent Knowledge Scalpel: Precise and Massive Knowledge Editing for Large Language Models ECAI2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在预训练过程中保留的不准确或过时信息问题,这些问题会导致推理阶段产生错误预测或偏见输出。现有模型编辑方法虽能部分缓解此问题,但在同时编辑大量事实性知识时效果有限,且可能损害模型的通用能力。解决方案的关键在于提出Latent Knowledge Scalpel (LKS),一种基于轻量级超网络(hypernetwork)操控特定实体隐空间表征的编辑方法,实现对LLM内部知识的精确、大规模修改,同时保持模型整体性能不受显著影响。实验证明,即使同时编辑多达10,000条知识,LKS仍能有效完成任务并维持模型的通用能力。

链接: https://arxiv.org/abs/2508.03741
作者: Xin Liu,Qiyang Song,Shaowen Xu,Kerou Zhou,Wenbo Jiang,Xiaoqi Jia,Weijuan Zhang,Heqing Huang,Yakai Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ECAI 2025 - 28th European Conference on Artificial Intelligence

点击查看摘要

Abstract:Large Language Models (LLMs) often retain inaccurate or outdated information from pre-training, leading to incorrect predictions or biased outputs during inference. While existing model editing methods can address this challenge, they struggle with editing large amounts of factual information simultaneously and may compromise the general capabilities of the models. In this paper, our empirical study demonstrates that it is feasible to edit the internal representations of LLMs and replace the entities in a manner similar to editing natural language inputs. Based on this insight, we introduce the Latent Knowledge Scalpel (LKS), an LLM editor that manipulates the latent knowledge of specific entities via a lightweight hypernetwork to enable precise and large-scale editing. Experiments conducted on Llama-2 and Mistral show that even with the number of simultaneous edits reaching 10,000, LKS effectively performs knowledge editing while preserving the general abilities of the edited LLMs. Code is available at: this https URL.
zh

[AI-80] “Think First Verify Always”: Training Humans to Face AI Risks

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)对人类认知能力的前所未有的攻击威胁,而当前网络安全体系仍以设备为中心、忽视了人的因素这一关键短板。其解决方案的核心在于提出“Think First, Verify Always”(TFVA)协议,将人类重新定位为“防火墙零号”(Firewall Zero),作为抵御AI驱动认知操纵的第一道防线;该协议基于五个操作原则:意识(Awareness)、完整性(Integrity)、判断力(Judgment)、伦理责任(Ethical Responsibility)和透明度(Transparency)(统称为AIJET),并通过一项随机对照试验(n=151)验证:仅需3分钟的简短干预即可显著提升受试者在认知安全任务中的表现,平均提升达+7.87%,证明基于原则的训练能快速增强人类对生成式AI(Generative AI)相关风险的韧性。

链接: https://arxiv.org/abs/2508.03714
作者: Yuksel Aydin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Artificial intelligence enables unprecedented attacks on human cognition, yet cybersecurity remains predominantly device-centric. This paper introduces the “Think First, Verify Always” (TFVA) protocol, which repositions humans as ‘Firewall Zero’, the first line of defense against AI-enabled threats. The protocol is grounded in five operational principles: Awareness, Integrity, Judgment, Ethical Responsibility, and Transparency (AIJET). A randomized controlled trial (n=151) demonstrated that a minimal 3-minute intervention produced statistically significant improvements in cognitive security task performance, with participants showing an absolute gain of +7.87% compared to controls. These results suggest that brief, principles-based training can rapidly enhance human resilience against AI-driven cognitive manipulation. We recommend that GenAI platforms embed “Think First, Verify Always” as a standard prompt, replacing passive warnings with actionable protocols to enhance trustworthy and ethical AI use. By bridging the gap between technical cybersecurity and human factors, the TFVA protocol establishes human-empowered security as a vital component of trustworthy AI systems.
zh

[AI-81] Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective RECSYS2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 驱动的推荐系统中存在的隐私泄露问题,特别是针对基于大语言模型(Large Language Model, LLM)的推荐系统易受重构攻击(reconstruction attacks)的风险。现有方法在处理冷启动用户或物品时虽具优势,但其输出 logits 可被 adversaries 利用以逆向推断原始输入提示中的敏感信息,如用户交互历史、偏好和人口统计属性。解决方案的关键在于提出一种名为“相似性引导精炼”(Similarity Guided Refinement)的方法,通过优化 vec2text 框架,在不依赖模型性能的前提下显著提升从 logits 中重建原始文本提示的准确性,实验证明该方法在电影和图书两个领域均能高保真还原近 65% 的用户交互项目,并在 87% 的案例中准确推断年龄与性别,揭示了隐私泄露高度依赖于领域一致性与提示复杂度,而非推荐模型本身性能。

链接: https://arxiv.org/abs/2508.03703
作者: Yubo Wang,Min Tang,Nuo Shen,Shujie Cui,Weiqing Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at ACM RecSys 2025 (10 pages, 4 figures)

点击查看摘要

Abstract:The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM-empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To examine this threat, we present the first systematic study on inversion attacks targeting LLM-empowered recommender systems, where adversaries attempt to reconstruct original prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We reproduce the vec2text framework and optimize it using our proposed method called Similarity Guided Refinement, enabling more accurate reconstruction of textual prompts from model-generated logits. Extensive experiments across two domains (movies and books) and two representative LLM-based recommendation models demonstrate that our method achieves high-fidelity reconstructions. Specifically, we can recover nearly 65 percent of the user-interacted items and correctly infer age and gender in 87 percent of the cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model’s performance but highly dependent on domain consistency and prompt complexity. These findings expose critical privacy vulnerabilities in LLM-empowered recommender systems.
zh
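
攻击的核心循环是:从模型输出 logits 出发,迭代改写候选提示,使其产生的 logits 与目标 logits 的相似度不断提高。下面是这一“相似度引导精炼”思想的框架式草图,接口与玩具 logits 均为假设,vec2text 的编码器细节从略:

```python
import numpy as np

def similarity_guided_refinement(target_logits, init_texts, logits_fn, mutate_fn, steps=5):
    """每轮对当前最优候选做变异,保留与目标 logits 余弦相似度最高者。"""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    best = max(init_texts, key=lambda s: cos(logits_fn(s), target_logits))
    for _ in range(steps):
        candidates = [best] + mutate_fn(best)
        best = max(candidates, key=lambda s: cos(logits_fn(s), target_logits))
    return best

# 玩具演示:用词袋向量充当“logits”,变异 = 在提示后追加一个候选短语
vocab = ["age:30", "age:60", "likes sci-fi", "likes romance"]
vec = lambda s: np.array([float(w in s) for w in vocab])
target = vec("age:30 likes sci-fi")            # 攻击者只能观测到这组“logits”
mutate = lambda s: [s + " " + w for w in vocab]
print(similarity_guided_refinement(target, [""], vec, mutate).strip())
```

几轮迭代后即可还原出 "age:30 likes sci-fi",直观展示了为何仅凭输出 logits 就可能泄露提示中的人口统计属性与偏好。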

[AI-82] MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

【速读】:该论文旨在解决移动图形用户界面(GUI)环境中感知、定位和推理的关键挑战,尤其是在真实世界场景下实现高效的人机交互。其解决方案的核心在于提出MagicGUI这一基础移动端GUI代理框架,通过六大关键组件构建端到端的智能系统:首先,利用可扩展的GUI数据流水线构建大规模、多样的GUI导向多模态数据集;其次,增强细粒度的多模态对齐能力以提升UI元素识别与屏幕理解精度;第三,设计统一且全面的动作空间,涵盖基础操作与复杂交互意图;第四,引入面向规划的推理机制,将用户指令分解为带中间元规划推理的序列动作;第五,采用两阶段训练策略,结合大规模连续预训练与基于空间增强复合奖励及双重过滤策略的强化微调;最终,在Magic-RICH等多个基准上展现出卓越性能与良好的泛化能力,验证了其在实际移动端GUI任务中的部署潜力。

链接: https://arxiv.org/abs/2508.03700
作者: Liujian Tang,Shaokang Dong,Yijia Huang,Minqi Xiang,Hongtao Ruan,Bin Wang,Shuo Li,Zhihui Cao,Hailiang Pang,Heng Kong,He Yang,Mingxu Chai,Zhilin Gao,Xingyu Liu,Yingnan Fu,Jiaming Liu,Tao Gui,Xuanjing Huang,Yu-Gang Jiang,Qi Zhang,Kang Wang,Yunke Zhang,Yuran Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents MagicGUI, a foundational mobile GUI agent designed to address critical challenges in perception, grounding, and reasoning within real-world mobile GUI environments. The framework is underpinned by the following six key components: (1) a comprehensive and accurate dataset, constructed via the scalable GUI Data Pipeline, which aggregates the largest and most diverse GUI-centric multimodal data to date from open-source repositories, automated crawling, and targeted manual annotation; (2) enhanced perception and grounding capabilities, facilitating fine-grained multimodal alignment for UI element referencing, grounding, and screen comprehension; (3) a comprehensive and unified action space, encompassing both fundamental UI operations and complex interactive intents to support human-agent interactions; (4) planning-oriented reasoning mechanisms that enable the model to decompose complex user instructions into sequential actions with explicit intermediate meta-plan reasoning; (5) an iterative two-stage training procedure, combining large-scale continued pre-training on 7.8M samples with reinforcement fine-tuning utilizing a spatially enhanced composite reward and dual filtering strategy; and (6) competitive performance on both the proprietary Magic-RICH benchmark and over a dozen public benchmarks, achieving superior performance across GUI perception and agent tasks, while demonstrating robust generalization and real-world deployment potential in practical mobile GUI scenarios, as detailed in Figure 1.
zh

[AI-83] Large AI Models for Wireless Physical Layer

【速读】:该论文旨在解决传统基于人工智能(Artificial Intelligence, AI)的物理层通信技术在泛化能力、多任务处理和多模态融合方面的局限性。其解决方案的关键在于引入大模型(Large Artificial Intelligence Models, LAMs),通过两种策略实现:一是利用预训练的LAMs进行迁移学习,二是开发专为物理层任务设计的原生LAM架构。这两种方法均显著提升了系统在多样化无线场景下的性能与适应性,为下一代通信系统提供了更高效、灵活的智能解决方案。

链接: https://arxiv.org/abs/2508.02314
作者: Jiajia Guo,Yiming Cui,Shi Jin,Jun Zhang
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: A collection of paper on Large AI Models for wireless physical layer can be found at this https URL

点击查看摘要

Abstract:Large artificial intelligence models (LAMs) are transforming wireless physical layer technologies through their robust generalization, multitask processing, and multimodal capabilities. This article reviews recent advancements in LAM applications for physical layer communications, addressing limitations of conventional AI-based approaches. LAM applications are classified into two strategies: leveraging pre-trained LAMs and developing native LAMs designed specifically for physical layer tasks. The motivations and key frameworks of these approaches are comprehensively examined through multiple use cases. Both strategies significantly improve performance and adaptability across diverse wireless scenarios. Future research directions, including efficient architectures, interpretability, standardized datasets, and collaboration between large and small models, are proposed to advance LAM-based physical layer solutions for next-generation communication systems.
zh

[AI-84] Recommendation with Generative Models

【速读】:该论文旨在解决生成式推荐系统(Gen-RecSys)中模型分类不清晰、技术演进路径模糊以及评估框架缺失的问题。其解决方案的关键在于提出一个系统的深度生成模型(DGM)分类体系,将DGMs细分为三类:以标识符驱动的模型(ID-driven models)、大语言模型(LLMs)和多模态模型(Multimodal Models),从而为研究者提供清晰的技术架构划分与演进脉络;同时强调构建稳健的评估框架以应对生成模型在推荐场景中的潜在风险,推动其在电商、媒体等领域的可靠落地与应用拓展。

链接: https://arxiv.org/abs/2409.15173
作者: Yashar Deldjoo,Zhankui He,Julian McAuley,Anton Korikov,Scott Sanner,Arnau Ramisa,Rene Vidal,Maheswaran Sathiamoorthy,Atoosa Kasrizadeh,Silvia Milano,Francesco Ricci
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: This submission is a full-length book, expanding significantly on two chapters previously submitted ( arXiv:2409.10993v1 , arXiv:2408.10946v1 ). It includes additional chapters, context, analysis, and content, providing a comprehensive presentation of the subject. We have ensured it is appropriately presented as a new, distinct work. arXiv admin note: substantial text overlap with arXiv:2409.10993

点击查看摘要

Abstract:Generative models are a class of AI models capable of creating new instances of data by learning and sampling from their statistical distributions. In recent years, these models have gained prominence in machine learning due to the development of approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based architectures such as GPT. These models have applications across various domains, such as image generation, text synthesis, and music composition. In recommender systems, generative models, referred to as Gen-RecSys, improve the accuracy and diversity of recommendations by generating structured outputs, text-based interactions, and multimedia content. By leveraging these capabilities, Gen-RecSys can produce more personalized, engaging, and dynamic user experiences, expanding the role of AI in eCommerce, media, and beyond. Our book goes beyond existing literature by offering a comprehensive understanding of generative models and their applications, with a special focus on deep generative models (DGMs) and their classification. We introduce a taxonomy that categorizes DGMs into three types: ID-driven models, large language models (LLMs), and multimodal models. Each category addresses unique technical and architectural advancements within its respective research area. This taxonomy allows researchers to easily navigate developments in Gen-RecSys across domains such as conversational AI and multimodal content generation. Additionally, we examine the impact and potential risks of generative models, emphasizing the importance of robust evaluation frameworks.
zh

[AI-85] Delving Deeper Into Astromorphic Transformers

【速读】:该论文旨在解决当前脑启发类脑计算(brain-inspired neuromorphic computing)中对星形胶质细胞(astrocyte)作用建模不足的问题,尤其在模仿Transformer架构中的自注意力机制(self-attention mechanism)时缺乏生物可实现性(bioplausible)的神经-突触-星形胶质细胞交互建模。解决方案的关键在于构建一个跨层的生物可实现模型,整合神经元-星形胶质细胞网络中的赫布塑性(Hebbian plasticity)与突触前可塑性(presynaptic plasticity),并引入非线性效应和反馈机制,同时将此类生物计算映射为自注意力机制的算法形式,从而在情感分类(IMDB)和图像分类(CIFAR10)任务中提升准确率与学习速度,并在自然语言生成任务(WikiText-2)中实现更低的困惑度(perplexity),验证了其在多类机器学习任务中的泛化能力和稳定性。

链接: https://arxiv.org/abs/2312.10925
作者: Md Zesun Ahmed Mia,Malyaban Bal,Abhronil Sengupta
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Preliminary attempts at incorporating the critical role of astrocytes - cells that constitute more than 50% of human brain cells - in brain-inspired neuromorphic computing remain in their infancy. This paper seeks to delve deeper into various key aspects of neuron-synapse-astrocyte interactions to mimic self-attention mechanisms in Transformers. The cross-layer perspective explored in this work involves bioplausible modeling of Hebbian and presynaptic plasticities in neuron-astrocyte networks, incorporating effects of non-linearities and feedback along with algorithmic formulations to map the neuron-astrocyte computations to the self-attention mechanism and evaluating the impact of incorporating bio-realistic effects from the machine learning application side. Our analysis on sentiment and image classification tasks (IMDB and CIFAR10 datasets) highlights the advantages of Astromorphic Transformers, offering improved accuracy and learning speed. Furthermore, the model demonstrates strong natural language generation capabilities on the WikiText-2 dataset, achieving better perplexity compared to conventional models, thus showcasing enhanced generalization and stability across diverse machine learning tasks.
zh

[AI-86] A Comprehensive Framework for Uncertainty Quantification of Voxel-wise Supervised Models in IVIM MRI

【速读】:该论文旨在解决扩散加权磁共振成像(Diffusion-Weighted MRI, DW-MRI)中 intravoxel incoherent motion (IVIM) 参数估计的挑战,尤其是由于反问题 ill-posed 性和对噪声高度敏感,特别是在灌注分量(perfusion compartment)中的估计不稳定性问题。其关键解决方案是提出一种基于 Deep Ensembles (DE) 的概率深度学习框架,结合混合密度网络(Mixture Density Networks, MDNs),能够量化总预测不确定性并分解为偶然性不确定性(aleatoric uncertainty, AU)和认知性不确定性(epistemic uncertainty, EU)。该方法通过监督训练在合成数据上实现,并在模拟与两个体内数据集上验证,结果表明 MDNs 提供了更校准且更尖锐的预测分布,尤其在 D 和 f 参数上表现优异,同时借助 DE 机制有效捕捉了模型对真实采集条件的适应不足所导致的认知不确定性,从而提升了 IVIM 参数估计的可靠性与可解释性。

链接: https://arxiv.org/abs/2508.04588
作者: Nicola Casali,Alessandro Brusaferri,Giuseppe Baselli,Stefano Fumagalli,Edoardo Micotti,Gianluigi Forloni,Riaz Hussein,Giovanna Rizzo,Alfonso Mastropietro
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate estimation of intravoxel incoherent motion (IVIM) parameters from diffusion-weighted MRI remains challenging due to the ill-posed nature of the inverse problem and high sensitivity to noise, particularly in the perfusion compartment. In this work, we propose a probabilistic deep learning framework based on Deep Ensembles (DE) of Mixture Density Networks (MDNs), enabling estimation of total predictive uncertainty and decomposition into aleatoric (AU) and epistemic (EU) components. The method was benchmarked against non-probabilistic neural networks, a Bayesian fitting approach, and a probabilistic network with a single Gaussian parametrization. Supervised training was performed on synthetic data, and evaluation was conducted on both simulated and two in vivo datasets. The reliability of the quantified uncertainties was assessed using calibration curves, output distribution sharpness, and the Continuous Ranked Probability Score (CRPS). MDNs produced more calibrated and sharper predictive distributions for the D and f parameters, although slight overconfidence was observed in D*. The Robust Coefficient of Variation (RCV) indicated smoother in vivo estimates for D* with MDNs compared to the Gaussian model. Despite the training data covering the expected physiological range, elevated EU in vivo suggests a mismatch with real acquisition conditions, highlighting the importance of incorporating EU, which is enabled by the DE approach. Overall, we present a comprehensive framework for IVIM fitting with uncertainty quantification, which enables the identification and interpretation of unreliable estimates. The proposed approach can also be adopted for fitting other physical models through appropriate architectural and simulation adjustments.
zh
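
摘要中“总预测不确定性分解为偶然性(AU)与认知性(EU)”在深度集成下有简洁的闭式:AU 为各成员预测方差的均值,EU 为各成员预测均值的方差。下面用单高斯成员给出草图(论文成员为混合密度网络,此处简化;数值纯属虚构):

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """深度集成不确定性分解:
    总方差 = AU(成员方差的均值) + EU(成员均值的方差)。"""
    mu = means.mean(axis=0)
    aleatoric = variances.mean(axis=0)
    epistemic = means.var(axis=0)
    return mu, aleatoric, epistemic

M, N = 5, 3                      # 5 个集成成员,3 个体素的 D 参数预测
rng = np.random.default_rng(0)
means = rng.normal(1e-3, 1e-4, size=(M, N))
variances = np.full((M, N), 1e-8)
mu, au, eu = decompose_uncertainty(means, variances)
print("预测均值:", mu)
print("AU:", au, "\nEU:", eu)
```

按这一分解,体内数据上 EU 偏高意味着成员均值彼此分歧,正对应摘要中“训练分布与真实采集条件不匹配”的解读。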

[AI-87] Benchmarking Quantum and Classical Sequential Models for Urban Telecommunication Forecasting

【速读】:该论文旨在解决基于经典与量子启发式序列模型对单变量短信接收量(SMS-in)时间序列进行预测的性能差异问题。其解决方案的关键在于系统性地比较五种模型——包括LSTM(基线)、量子LSTM(QLSTM)、量子自适应自注意力(QASA)、量子接受权重键值(QRWKV)和量子快速权重程序员(QFWP)——在不同输入序列长度(4至64个时间步)下的预测表现,揭示量子模块在特定任务中是否具有优势,并强调模型架构设计、参数化策略与时间建模能力之间的权衡关系。

链接: https://arxiv.org/abs/2508.04488
作者: Chi-Sheng Chen,Samuel Yen-Chi Chen,Yun-Cheng Tsai
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we evaluate the performance of classical and quantum-inspired sequential models in forecasting univariate time series of incoming SMS activity (SMS-in) using the Milan Telecommunication Activity Dataset. Due to data completeness limitations, we focus exclusively on the SMS-in signal for each spatial grid cell. We compare five models, LSTM (baseline), Quantum LSTM (QLSTM), Quantum Adaptive Self-Attention (QASA), Quantum Receptance Weighted Key-Value (QRWKV), and Quantum Fast Weight Programmers (QFWP), under varying input sequence lengths (4, 8, 12, 16, 32 and 64). All models are trained to predict the next 10-minute SMS-in value based solely on historical values within a given sequence window. Our findings indicate that different models exhibit varying sensitivities to sequence length, suggesting that quantum enhancements are not universally advantageous. Rather, the effectiveness of quantum modules is highly dependent on the specific task and architectural design, reflecting inherent trade-offs among model size, parameterization strategies, and temporal modeling capabilities.
zh

[AI-88] Metric Learning in an RKHS UAI2025

【速读】:该论文旨在解决从三元组比较(triplet comparisons)中学习非线性度量(metric learning)的理论基础薄弱问题,即如何在再生核希尔伯特空间(Reproducing Kernel Hilbert Space, RKHS)中构建具有良好泛化能力的度量函数,以准确反映样本间的相似性与差异性。其解决方案的关键在于提出一个通用的RKHS框架,并首次为基于核方法和神经网络的非线性度量学习提供新颖的泛化保证(generalization guarantees)和样本复杂度边界(sample complexity bounds),从而填补了该领域长期缺乏理论支撑的空白。

链接: https://arxiv.org/abs/2508.04476
作者: Gokcan Tatli,Yi Chen,Blake Mason,Robert Nowak,Ramya Korlakai Vinayak
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Appeared in the 41st Conference on Uncertainty in Artificial Intelligence (UAI 2025)

点击查看摘要

Abstract:Metric learning from a set of triplet comparisons in the form of “Do you think item h is more similar to item i or item j?”, indicating similarity and differences between items, plays a key role in various applications including image retrieval, recommendation systems, and cognitive psychology. The goal is to learn a metric in the RKHS that reflects the comparisons. Nonlinear metric learning using kernel methods and neural networks has shown great empirical promise. While previous works have addressed certain aspects of this problem, there is little or no theoretical understanding of such methods. The exception is the special (linear) case in which the RKHS is the standard Euclidean space $\mathbb{R}^d$; there is a comprehensive theory for metric learning in $\mathbb{R}^d$. This paper develops a general RKHS framework for metric learning and provides novel generalization guarantees and sample complexity bounds. We validate our findings through a set of simulations and experiments on real datasets. Our code is publicly available at this https URL.
zh

[AI-89] Challenges in Applying Variational Quantum Algorithms to Dynamic Satellite Network Routing

【速读】:该论文旨在解决将近中期变分量子算法应用于动态卫星网络路由问题的可行性挑战,核心目标是评估量子优化与强化学习方法在该场景下的实际表现。其关键解决方案在于对两类主流量子算法进行系统性测试:一是静态量子优化器(如VQE和QAOA)用于离线路径计算,二是量子强化学习(QRL)方法用于在线决策。研究发现,即使在理想无噪声环境下,静态优化器也无法解决简单的4节点最短路径问题,而基于策略梯度的QRL代理在8节点动态环境中未能学习有效路由策略,性能甚至不如随机动作。这些负面结果揭示了当前量子算法面临的关键障碍,包括“ barren plateaus”(平坦景观)和学习不稳定性等问题,为未来研究指明方向:需从优化器设计、梯度估计改进及环境建模等方面突破瓶颈,以实现量子算法在通信网络中的实际优势。

链接: https://arxiv.org/abs/2508.04288
作者: Phuc Hao Do,Tran Duc Le
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 17 pages and 3 figures

点击查看摘要

Abstract:Applying near-term variational quantum algorithms to the problem of dynamic satellite network routing represents a promising direction for quantum computing. In this work, we provide a critical evaluation of two major approaches: static quantum optimizers such as the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA) for offline route computation, and Quantum Reinforcement Learning (QRL) methods for online decision-making. Using ideal, noise-free simulations, we find that these algorithms face significant challenges. Specifically, static optimizers are unable to solve even a classically easy 4-node shortest path problem due to the complexity of the optimization landscape. Likewise, a basic QRL agent based on policy gradient methods fails to learn a useful routing strategy in a dynamic 8-node environment and performs no better than random actions. These negative findings highlight key obstacles that must be addressed before quantum algorithms can offer real advantages in communication networks. We discuss the underlying causes of these limitations, including barren plateaus and learning instability, and suggest future research directions to overcome them.
zh

[AI-90] Probing and Enhancing the Robustness of GNN-based QEC Decoders with Reinforcement Learning DATE ATC

【速读】:该论文旨在解决生成式 AI(Generative AI)在量子纠错码(Quantum Error Correction, QEC)解码中面临的鲁棒性问题,特别是针对微小、对抗性扰动下图神经网络(Graph Neural Networks, GNNs)解码器易被误导的脆弱性。解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的系统性框架,通过训练一个RL代理作为攻击者,自动寻找最小的校验值修改以诱导GNN解码器误判;随后利用这些对抗样本进行对抗训练(adversarial training),显著提升解码器的鲁棒性。该方法实现了自动化漏洞探测与针对性再训练的闭环迭代流程,为构建更可靠、抗干扰的神经网络量子纠错解码器提供了有效路径。

链接: https://arxiv.org/abs/2508.03783
作者: Ryota Ikeda
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures, Affiliation updated to match user registration

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful, data-driven approach for Quantum Error Correction (QEC) decoding, capable of learning complex noise characteristics directly from syndrome data. However, the robustness of these decoders against subtle, adversarial perturbations remains a critical open question. This work introduces a novel framework to systematically probe the vulnerabilities of a GNN decoder using a reinforcement learning (RL) agent. The RL agent is trained as an adversary with the goal of finding minimal syndrome modifications that cause the decoder to misclassify. We apply this framework to a Graph Attention Network (GAT) decoder trained on experimental surface code data from Google Quantum AI. Our results show that the RL agent can successfully identify specific, critical vulnerabilities, achieving a high attack success rate with a minimal number of bit flips. Furthermore, we demonstrate that the decoder’s robustness can be significantly enhanced through adversarial training, where the model is retrained on the adversarial examples generated by the RL agent. This iterative process of automated vulnerability discovery and targeted retraining presents a promising methodology for developing more reliable and robust neural network decoders for fault-tolerant quantum computing.
zh

[AI-91] Do GNN-based QEC Decoders Require Classical Knowledge? Evaluating the Efficacy of Knowledge Distillation from MWPM DATE ATC

【速读】:该论文旨在解决量子纠错码(Quantum Error Correction, QEC)中解码器性能提升的问题,尤其是如何有效利用理论知识增强基于图神经网络(Graph Neural Networks, GNNs)的解码模型。其解决方案的关键在于通过知识蒸馏(knowledge distillation)技术,将经典算法最小权重完美匹配(Minimum Weight Perfect Matching, MWPM)所推导出的理论错误概率作为软标签引入GNN训练过程,以期改善模型性能。实验结果表明,尽管加入知识蒸馏后模型最终测试准确率与纯数据驱动基线相当,但训练收敛更慢且耗时增加约五倍,说明现代GNN架构具备从真实硬件数据中直接学习复杂错误相关性的强大能力,无需依赖近似理论模型的指导。

链接: https://arxiv.org/abs/2508.03782
作者: Ryota Ikeda
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 1 table. Affiliation updated to match user registration

点击查看摘要

Abstract:The performance of decoders in Quantum Error Correction (QEC) is key to realizing practical quantum computers. In recent years, Graph Neural Networks (GNNs) have emerged as a promising approach, but their training methodologies are not yet well-established. It is generally expected that transferring theoretical knowledge from classical algorithms like Minimum Weight Perfect Matching (MWPM) to GNNs, a technique known as knowledge distillation, can effectively improve performance. In this work, we test this hypothesis by rigorously comparing two models based on a Graph Attention Network (GAT) architecture that incorporates temporal information as node features. The first is a purely data-driven model (baseline) trained only on ground-truth labels, while the second incorporates a knowledge distillation loss based on the theoretical error probabilities from MWPM. Using public experimental data from Google, our evaluation reveals that while the final test accuracy of the knowledge distillation model was nearly identical to the baseline, its training loss converged more slowly, and the training time increased by a factor of approximately five. This result suggests that modern GNN architectures possess a high capacity to efficiently learn complex error correlations directly from real hardware data, without guidance from approximate theoretical models.
zh
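
论文比较的“知识蒸馏”目标可写成真实标签交叉熵与 MWPM 理论错误概率(软标签)KL 散度的加权和。下面是该损失的草图,权重、温度与随机软标签均为演示用假设:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, labels, teacher_probs, lam=0.5, T=2.0):
    """知识蒸馏损失(示意):硬标签交叉熵 + 与 MWPM 软标签的 KL 散度。"""
    ce = F.cross_entropy(student_logits, labels)
    log_p = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(log_p, teacher_probs, reduction="batchmean")
    return (1 - lam) * ce + lam * (T ** 2) * kl

logits = torch.randn(16, 2, requires_grad=True)       # 二分类:是否发生逻辑错误
labels = torch.randint(0, 2, (16,))
teacher = torch.softmax(torch.randn(16, 2), dim=-1)   # 应为 MWPM 推导的软标签,此处用随机值代替
loss = kd_loss(logits, labels, teacher)
loss.backward(); print(float(loss))
```

论文的结论恰是:加入这一 KL 项(lam > 0)后最终精度几乎不变,但收敛更慢、训练时间约增加五倍。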

[AI-92] Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors

【速读】:该论文旨在解决脊髓损伤(Spinal Cord Injury, SCI)患者中自主神经反射异常(Autonomic Dysreflexia, AD)的早期、准确检测问题,当前方法受限于侵入性监测或主观症状报告,难以在日常环境中应用。解决方案的关键在于构建一个非侵入式、可解释的机器学习框架,利用多模态可穿戴传感器(包括心电图(ECG)、光电容积脉搏波(PPG)、生物阻抗(BioZ)、温度、呼吸频率(RR)和心率(HR))采集生理信号,并结合同步血压测量作为客观标签。通过BorutaSHAP进行特征选择与SHAP值提供模型可解释性,训练模态和设备特异性的弱学习器并以堆叠集成元模型聚合,最终实现高精度(宏F1=0.77±0.03)且对传感器丢失鲁棒的AD检测,显著优于基线模型,为SCI患者的个性化实时监测提供了可行路径。

链接: https://arxiv.org/abs/2508.03715
作者: Bertram Fuchs,Mehdi Ejtehadi,Ana Cisnal,Jürgen Pannek,Anke Scheel-Sailer,Robert Riener,Inge Eriks-Hoogland,Diego Paez-Granados
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure (BP) spikes in individuals with spinal cord injury (SCI). Early, accurate detection is essential to prevent cardiovascular complications, yet current monitoring methods are either invasive or rely on subjective symptom reporting, limiting applicability in daily life. This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors. Data were collected from 27 individuals with chronic SCI during urodynamic studies, including electrocardiography (ECG), photoplethysmography (PPG), bioimpedance (BioZ), temperature, respiratory rate (RR), and heart rate (HR), across three commercial devices. Objective AD labels were derived from synchronized cuff-based BP measurements. Following signal preprocessing and feature extraction, BorutaSHAP was used for robust feature selection, and SHAP values for explainability. We trained modality- and device-specific weak learners and aggregated them using a stacked ensemble meta-model. Cross-validation was stratified by participants to ensure generalizability. HR- and ECG-derived features were identified as the most informative, particularly those capturing rhythm morphology and variability. The Nearest Centroid ensemble yielded the highest performance (Macro F1 = 0.77+/-0.03), significantly outperforming baseline models. Among modalities, HR achieved the highest area under the curve (AUC = 0.93), followed by ECG (0.88) and PPG (0.86). RR and temperature features contributed less to overall accuracy, consistent with missing data and low specificity. The model proved robust to sensor dropout and aligned well with clinical AD events. These results represent an important step toward personalized, real-time monitoring for individuals with SCI.
zh

[AI-93] Controllable Surface Diffusion Generative Model for Neurodevelopmental Trajectories

【速读】:该论文旨在解决早产儿(preterm birth)个体间神经发育结果差异大、难以早期预测的问题,其核心挑战在于如何准确建模个体特异性的皮层形态演化轨迹以识别潜在风险生物标志物。解决方案的关键在于提出了一种新颖的图扩散网络(graph-diffusion network),该模型能够在保持受试者特异性皮层折叠模式的同时,可控地模拟皮层成熟过程;实验表明,该模型生成的模拟数据足以欺骗一个独立训练的年龄回归网络,预测准确率达到0.85 ± 0.62,验证了其在保留区域特异性形态变异方面的有效性。

链接: https://arxiv.org/abs/2508.03706
作者: Zhenshan Xie,Levente Baljer,M. Jorge Cardoso,Emma Robinson
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Preterm birth disrupts the typical trajectory of cortical neurodevelopment, increasing the risk of cognitive and behavioral difficulties. However, outcomes vary widely, posing a significant challenge for early prediction. To address this, individualized simulation offers a promising solution by modeling subject-specific neurodevelopmental trajectories, enabling the identification of subtle deviations from normative patterns that might act as biomarkers of risk. While generative models have shown potential for simulating neurodevelopment, prior approaches often struggle to preserve subject-specific cortical folding patterns or to reproduce region-specific morphological variations. In this paper, we present a novel graph-diffusion network that supports controllable simulation of cortical maturation. Using cortical surface data from the developing Human Connectome Project (dHCP), we demonstrate that the model maintains subject-specific cortical morphology while modeling cortical maturation sufficiently well to fool an independently trained age regression network, achieving a prediction accuracy of 0.85 \pm 0.62 .
zh

机器学习

[LG-0] Robustly Learning Monotone Single-Index Models

链接: https://arxiv.org/abs/2508.04670
作者: Puqian Wang,Nikos Zarifis,Ilias Diakonikolas,Jelena Diakonikolas
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We consider the basic problem of learning Single-Index Models with respect to the square loss under the Gaussian distribution in the presence of adversarial label noise. Our main contribution is the first computationally efficient algorithm for this learning task, achieving a constant factor approximation, that succeeds for the class of all monotone activations with bounded moment of order $2 + \zeta$, for $\zeta > 0$. This class in particular includes all monotone Lipschitz functions and even discontinuous functions like (possibly biased) halfspaces. Prior work for the case of unknown activation either does not attain constant factor approximation or succeeds for a substantially smaller family of activations. The main conceptual novelty of our approach lies in developing an optimization framework that steps outside the boundaries of usual gradient methods and instead identifies a useful vector field to guide the algorithm updates by directly leveraging the problem structure, properties of Gaussian spaces, and regularity of monotone functions.

[LG-1] Perch 2.0: The Bittern Lesson for Bioacoustics

链接: https://arxiv.org/abs/2508.04665
作者: Bart van Merriënboer,Vincent Dumoulin,Jenny Hamer,Lauren Harrell,Andrea Burns,Tom Denton
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Perch is a performant pre-trained model for bioacoustics. It was trained in supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.

[LG-2] Live Music Models

链接: https://arxiv.org/abs/2508.04651
作者: Lyria Team:Antoine Caillon,Brian McWilliams,Cassie Tarakajian,Ian Simon,Ilaria Manco,Jesse Engel,Noah Constant,Pen Li,Timo I. Denk,Alberto Lalama,Andrea Agostinelli,Anna Huang,Ethan Manilow,George Brower,Hakan Erdogan,Heidi Lei,Itai Rolnick,Ivan Grishchenko,Manu Orsini,Matej Kastelic,Mauricio Zuluaga,Mauro Verzetti,Michael Dooley,Ondrej Skopek,Rafael Ferrer,Zalán Borsos,Äaron van den Oord,Douglas Eck,Eli Collins,Jason Baldridge,Tom Hume,Chris Donahue,Kehang Han,Adam Roberts
类目: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

[LG-3] CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

链接: https://arxiv.org/abs/2508.04630
作者: Yutong Xia,Yingying Zhang,Yuxuan Liang,Lunting Fan,Qingsong Wen,Roger Zimmermann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series anomaly detection has garnered considerable attention across diverse domains. However, existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

[LG-4] A Reproducible Scalable Pipeline for Synthesizing Autoregressive Model Literature

链接: https://arxiv.org/abs/2508.04612
作者: Faruk Alpay,Bugra Kilictas,Hamdi Alakkad
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:The accelerating pace of research on autoregressive generative models has produced thousands of papers, making manual literature surveys and reproduction studies increasingly impractical. We present a fully open-source, reproducible pipeline that automatically retrieves candidate documents from public repositories, filters them for relevance, extracts metadata, hyper-parameters and reported results, clusters topics, produces retrieval-augmented summaries and generates containerised scripts for re-running selected experiments. Quantitative evaluation on 50 manually-annotated papers shows F1 scores above 0.85 for relevance classification, hyper-parameter extraction and citation identification. Experiments on corpora of up to 1000 papers demonstrate near-linear scalability with eight CPU workers. Three case studies – AWD-LSTM on WikiText-2, Transformer-XL on WikiText-103 and an autoregressive music model on the Lakh MIDI dataset – confirm that the extracted settings support faithful reproduction, achieving test perplexities within 1–3% of the original reports.

[LG-5] Multitask Learning with Stochastic Interpolants

链接: https://arxiv.org/abs/2508.04605
作者: Hugo Negrel,Florentin Coeurdoux,Michael S. Albergo,Eric Vanden-Eijnden
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
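
To make the operator-valued construction concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) contrasts the usual scalar-time interpolant with a version where the scalar t is replaced by a hypothetical diagonal matrix, so each coordinate bridges the two distributions at its own rate:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x0 = rng.normal(size=d)   # sample from the source distribution
x1 = rng.normal(size=d)   # sample from the target distribution

# Scalar-time stochastic interpolant: x_t = (1 - t) * x0 + t * x1
t = 0.3
x_t = (1 - t) * x0 + t * x1

# Operator-valued "time": replace the scalar t with a matrix T,
# so different coordinates can progress at different rates.
T = np.diag([0.0, 0.3, 0.7, 1.0])      # hypothetical diagonal operator
x_T = (np.eye(d) - T) @ x0 + T @ x1    # bridges x0 and x1 per coordinate

print(x_t, x_T)
```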

[LG-6] Improved Training Strategies for Physics-Informed Neural Networks using Real Experimental Data in Aluminum Spot Welding

链接: https://arxiv.org/abs/2508.04595
作者: Jan A. Zak,Christian Weißenfels
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Resistance spot welding is the dominant joining process for the body-in-white in the automotive industry, where the weld nugget diameter is the key quality metric. Its measurement requires destructive testing, limiting the potential for efficient quality control. Physics-informed neural networks were investigated as a promising tool to reconstruct internal process states from experimental data, enabling model-based and non-invasive quality assessment in aluminum spot welding. A major challenge is the integration of real-world data into the network due to competing optimization objectives. To address this, we introduce two novel training strategies. First, experimental losses for dynamic displacement and nugget diameter are progressively included using a fading-in function to prevent excessive optimization conflicts. We also implement a custom learning rate scheduler and early stopping based on a rolling window to counteract premature reduction due to increased loss magnitudes. Second, we introduce a conditional update of temperature-dependent material parameters via a look-up table, activated only after a loss threshold is reached to ensure physically meaningful temperatures. An axially symmetric two-dimensional model was selected to represent the welding process accurately while maintaining computational efficiency. To reduce computational burden, the training strategies and model components were first systematically evaluated in one dimension, enabling controlled analysis of loss design and contact models. The two-dimensional network predicts dynamic displacement and nugget growth within the experimental confidence interval, supports transferring welding stages from steel to aluminum, and demonstrates strong potential for fast, model-based quality control in industrial applications.
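
As a rough illustration of the fading-in strategy described above, the sketch below ramps the weight of the experimental loss terms from 0 to 1 over a window of training steps; the step constants and the simple additive loss combination are assumptions, not values from the paper:

```python
import numpy as np

def fade_in(step, start=1000, span=2000):
    """Ramps linearly from 0 to 1 between `start` and `start + span` steps."""
    return float(np.clip((step - start) / span, 0.0, 1.0))

def total_loss(step, loss_physics, loss_displacement, loss_nugget):
    # Experimental losses are introduced progressively to avoid
    # overwhelming the physics residual early in training.
    w = fade_in(step)
    return loss_physics + w * (loss_displacement + loss_nugget)
```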

[LG-7] Algebraically Observable Physics-Informed Neural Network and its Application to Epidemiological Modelling

链接: https://arxiv.org/abs/2508.04590
作者: Mizuka Komatsu
类目: Symbolic Computation (cs.SC); Machine Learning (cs.LG); Dynamical Systems (math.DS); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Network (PINN) is a deep learning framework that integrates the governing equations underlying data into a loss function. In this study, we consider the problem of estimating state variables and parameters in epidemiological models governed by ordinary differential equations using PINNs. In practice, not all trajectory data corresponding to the population described by models can be measured. Learning PINNs to estimate the unmeasured state variables and epidemiological parameters using partial measurements is challenging. Accordingly, we introduce the concept of algebraic observability of the state variables. Specifically, we propose augmenting the unmeasured data based on algebraic observability analysis. The validity of the proposed method is demonstrated through numerical experiments under three scenarios in the context of epidemiological modelling. Specifically, given noisy and partial measurements, the accuracy of unmeasured states and parameter estimation of the proposed method is shown to be higher than that of the conventional methods. The proposed method is also shown to be effective in practical scenarios, such as when the data corresponding to certain variables cannot be reconstructed from the measurements.

[LG-8] Attack Pattern Mining to Discover Hidden Threats to Industrial Control Systems

链接: https://arxiv.org/abs/2508.04561
作者: Muhammad Azmi Umer,Chuadhry Mujeeb Ahmed,Aditya Mathur,Muhammad Taha Jilani
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work focuses on validation of attack pattern mining in the context of Industrial Control System (ICS) security. A comprehensive security assessment of an ICS requires generating a large and varied set of attack patterns. For this purpose we have proposed a data-driven technique to generate attack patterns for an ICS. The proposed technique has been used to generate over 100,000 attack patterns from data gathered from an operational water treatment plant. In this work we present a detailed case study to validate the attack patterns.

[LG-9] Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape

链接: https://arxiv.org/abs/2508.04542
作者: Haoran Niu,K. Suzanne Barber
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
*备注: 8 pages, 9 figures, 1 table

点击查看摘要

Abstract:It is difficult for individuals and organizations to protect personal information without a fundamental understanding of relative privacy risks. By analyzing over 5,000 empirical identity theft and fraud cases, this research identifies which types of personal data are exposed, how frequently exposures occur, and what the consequences of those exposures are. We construct an Identity Ecosystem graph–a foundational, graph-based model in which nodes represent personally identifiable information (PII) attributes and edges represent empirical disclosure relationships between them (e.g., the probability that one PII attribute is exposed due to the exposure of another). Leveraging this graph structure, we develop a privacy risk prediction framework that uses graph theory and graph neural networks to estimate the likelihood of further disclosures when certain PII attributes are compromised. The results show that our approach effectively answers the core question: Can the disclosure of a given identity attribute possibly lead to the disclosure of another attribute?
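
A toy sketch of the Identity Ecosystem idea is given below: PII attributes are nodes, edge weights are empirical disclosure probabilities, and a short chain search estimates whether exposing one attribute can lead to another. All attribute names and probabilities here are hypothetical placeholders, and the paper's graph-neural-network predictor is not reproduced:

```python
# Hypothetical disclosure probabilities: P(column exposed | row exposed).
disclosure = {
    "email": {"name": 0.6, "phone": 0.2},
    "phone": {"address": 0.4},
    "name":  {"address": 0.3, "ssn": 0.05},
}

def exposure_risk(start, target, graph, max_hops=3):
    """Probability that `target` leaks given `start` leaked, taking the
    highest-probability chain of disclosures up to `max_hops` edges."""
    best = 0.0
    frontier = [(start, 1.0)]
    for _ in range(max_hops):
        nxt = []
        for node, p in frontier:
            for neigh, edge_p in graph.get(node, {}).items():
                q = p * edge_p
                if neigh == target:
                    best = max(best, q)
                nxt.append((neigh, q))
        frontier = nxt
    return best

print(exposure_risk("email", "address", disclosure))  # 0.6 * 0.3 = 0.18
```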

[LG-10] Channel-Independent Federated Traffic Prediction

链接: https://arxiv.org/abs/2508.04517
作者: Mo Zhang,Xiaoyu Li,Bin Xu,Meng Chen,Yongshun Gong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, traffic prediction has achieved remarkable success and has become an integral component of intelligent transportation systems. However, traffic data is typically distributed among multiple data owners, and privacy constraints prevent the direct utilization of these isolated datasets for traffic prediction. Most existing federated traffic prediction methods focus on designing communication mechanisms that allow models to leverage information from other clients in order to improve prediction accuracy. Unfortunately, such approaches often incur substantial communication overhead, and the resulting transmission delays significantly slow down the training process. As the volume of traffic data continues to grow, this issue becomes increasingly critical, making the resource consumption of current methods unsustainable. To address this challenge, we propose a novel variable relationship modeling paradigm for federated traffic prediction, termed the Channel-Independent Paradigm(CIP). Unlike traditional approaches, CIP eliminates the need for inter-client communication by enabling each node to perform efficient and accurate predictions using only local information. Based on the CIP, we further develop Fed-CI, an efficient federated learning framework, allowing each client to process its own data independently while effectively mitigating the information loss caused by the lack of direct data sharing among clients. Fed-CI significantly reduces communication overhead, accelerates the training process, and achieves state-of-the-art performance while complying with privacy regulations. Extensive experiments on multiple real-world datasets demonstrate that Fed-CI consistently outperforms existing methods across all datasets and federated settings. It achieves improvements of 8%, 14%, and 16% in RMSE, MAE, and MAPE, respectively, while also substantially reducing communication costs.

[LG-11] Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

链接: https://arxiv.org/abs/2508.04481
作者: Anushka Srivastava
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 3 pages, 2 tables, submitted for arXiv preprint

点击查看摘要

Abstract:This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.
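
For readers unfamiliar with the cGAN setup, a minimal label-conditioned generator/discriminator pair might look as follows; the layer sizes, the 7-class emotion label space, and the 128-dimensional feature output are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

NOISE, N_CLASSES, FEAT = 64, 7, 128   # 7 basic emotion classes (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NOISE + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, FEAT), nn.Tanh())
    def forward(self, z, y):
        # Conditioning: the class label embedding is concatenated to the noise.
        return self.net(torch.cat([z, self.emb(y)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(FEAT + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=1))

z = torch.randn(8, NOISE)
y = torch.randint(0, N_CLASSES, (8,))
fake = Generator()(z, y)            # emotion-conditioned synthetic features
score = Discriminator()(fake, y)    # real/fake logit per sample
```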

[LG-12] Who cuts emissions who turns up the heat? causal machine learning estimates of energy efficiency interventions

链接: https://arxiv.org/abs/2508.04478
作者: Bernardino D’Amico,Francesco Pomponi,Jay H. Arehart,Lina Khaddour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.

[LG-13] FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions

链接: https://arxiv.org/abs/2508.04470
作者: Jianheng Tang,Zhirui Yang,Jingchao Wang,Kejia Fan,Jinfeng Xu,Huiping Zhuang,Anfeng Liu,Houbing Herbert Song,Leye Wang,Yunhuai Liu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client’s local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.
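
The gradient-free "analytic classifier" idea can be illustrated with a ridge-regression head on frozen features, which admits the closed form W = (H^T H + λI)^{-1} H^T Y; this is a sketch of the general technique, not FedHiP's exact three-phase procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 200, 64, 10
H = rng.normal(size=(n, d))              # frozen-backbone features
Y = np.eye(c)[rng.integers(0, c, n)]     # one-hot labels
lam = 1e-2

# Closed-form ridge solution: no gradient steps involved.
W = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)
pred = (H @ W).argmax(axis=1)
```

One appealing property of this closed form is that H^T H and H^T Y are additive across data partitions, so summing these statistics over clients reproduces the pooled-data solution exactly; whether FedHiP's aggregation phase uses precisely this decomposition is not detailed in the abstract.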

[LG-14] GFocal: A Global-Focal Neural Operator for Solving PDEs on Arbitrary Geometries

链接: https://arxiv.org/abs/2508.04463
作者: Fangzhi Fei,Jiaxin Hu,Qiaofeng Li,Zhenyu Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based neural operators have emerged as promising surrogate solvers for partial differential equations, by leveraging the effectiveness of Transformers for capturing long-range dependencies and global correlations, profoundly proven in language modeling. However, existing methodologies overlook the coordinated learning of interdependencies between local physical details and global features, which are essential for tackling multiscale problems, preserving physical consistency and numerical stability in long-term rollouts, and accurately capturing transitional dynamics. In this work, we propose GFocal, a Transformer-based neural operator method that enforces simultaneous global and local feature learning and fusion. Global correlations and local features are harnessed through Nyström attention-based global blocks and slices-based focal blocks to generate physics-aware tokens, subsequently modulated and integrated via convolution-based gating blocks, enabling dynamic fusion of multiscale information. GFocal achieves accurate modeling and prediction of physical features given arbitrary geometries and initial conditions. Experiments show that GFocal achieves state-of-the-art performance with an average 15.2% relative gain in five out of six benchmarks and also excels in industry-scale simulations such as aerodynamics simulation of automotives and airfoils.

[LG-15] CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

链接: https://arxiv.org/abs/2508.04462
作者: Enyu Zhou,Kai Sheng,Hao Chen,Xin He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speculative decoding (SD), where an extra draft model first provides multiple draft tokens and the original target model then verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods must adhere to the ‘draft-then-verify’ paradigm, which forces drafting and verification processes to execute sequentially during SD, resulting in inefficient inference performance and limiting the size of the draft model. Furthermore, once a single token in the candidate sequence is rejected during the drafting process, all subsequent candidate tokens must be discarded, leading to inefficient drafting. To address these challenges, we propose a cache-based parallel speculative decoding framework employing a ‘query-and-correct’ paradigm. Specifically, CARD decouples drafting and verification: the draft model generates candidate tokens to populate a shared cache, while the target model concurrently rectifies the draft model’s generation direction. This effectively enables the target model to perform inference at speed approaching that of the draft model. Our approach achieves up to 4.83× speedup over vanilla decoding without requiring fine-tuning of either the draft or target models. Our code is available at this https URL.

[LG-16] Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

链接: https://arxiv.org/abs/2508.04444
作者: Askar Tsyganov,Evgeny Frolov,Sergey Samsonov,Maxim Rakhuba
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson’s diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
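
Since the two-to-infinity norm of A equals its largest row 2-norm, and E[(Ag)_i^2] = ||A_i||^2 for Gaussian g, a simple matrix-free estimator needs only matvecs with A. The sketch below is a simplified Hutchinson-style variant built on those facts, not the paper's Hutch++-type algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
matvec = lambda v: A @ v        # the only access to A we allow

k = 200
acc = np.zeros(A.shape[0])
for _ in range(k):
    g = rng.normal(size=A.shape[1])   # E[(A g)_i^2] = ||row_i||^2
    acc += matvec(g) ** 2
est = np.sqrt(acc / k).max()          # estimated max row 2-norm

exact = np.linalg.norm(A, axis=1).max()
print(est, exact)                     # estimate should be close to exact
```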

[LG-17] Algorithm Selection for Recommender Systems via Meta-Learning on Algorithm Characteristics

链接: https://arxiv.org/abs/2508.04419
作者: Jarne Mathi Decker,Joeran Beel
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Algorithm Selection Problem for recommender systems-choosing the best algorithm for a given user or context-remains a significant challenge. Traditional meta-learning approaches often treat algorithms as categorical choices, ignoring their intrinsic properties. Recent work has shown that explicitly characterizing algorithms with features can improve model performance in other domains. Building on this, we propose a per-user meta-learning approach for recommender system selection that leverages both user meta-features and automatically extracted algorithm features from source code. Our preliminary results, averaged over six diverse datasets, show that augmenting a meta-learner with algorithm features improves its average NDCG@10 performance by 8.83% from 0.135 (user features only) to 0.147. This enhanced model outperforms the Single Best Algorithm baseline (0.131) and successfully closes 10.5% of the performance gap to a theoretical oracle selector. These findings show that even static source code metrics provide a valuable predictive signal, presenting a promising direction for building more robust and intelligent recommender systems.
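
A minimal sketch of the per-user selection idea: train a meta-learner on concatenated (user features, algorithm features) pairs to predict NDCG@10, then pick the algorithm with the highest predicted score for a new user. The regressor choice, feature dimensions, and data below are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_pairs, d_user, d_algo, n_algos = 500, 12, 8, 5

X = rng.normal(size=(n_pairs, d_user + d_algo))  # user ++ algorithm features
y = rng.uniform(0, 1, size=n_pairs)              # observed NDCG@10 per pair

meta = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

user = rng.normal(size=d_user)
algo_feats = rng.normal(size=(n_algos, d_algo))  # e.g. from source-code metrics
cands = np.hstack([np.tile(user, (n_algos, 1)), algo_feats])
best = meta.predict(cands).argmax()              # selected algorithm index
```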

[LG-18] FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

链接: https://arxiv.org/abs/2508.04405
作者: Hao Zhang,Aining Jia,Weifeng Bu,Yushu Cai,Kai Sheng,Hao Chen,Xin He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade accuracy or lack optimal efficiency. INT6 quantization offers a superior trade-off between model accuracy and inference efficiency, but lacks hardware support in modern GPUs, forcing emulation via higher-precision arithmetic units that limit acceleration. In this paper, we propose FlexQ, a novel post-training INT6 quantization framework combining algorithmic innovation with system-level optimizations. FlexQ employs uniform 6-bit weight quantization across all layers, with adaptive retention of 8-bit activations in layers identified through layer-wise sensitivity analysis. To maximize hardware efficiency, we develop a specialized high-performance GPU kernel supporting matrix multiplication for W6A6 and W6A8 representations via Binary Tensor Core (BTC) equivalents, effectively bypassing the lack of native INT6 tensor cores. Evaluations on LLaMA models show FlexQ maintains near-FP16 accuracy, with perplexity increases of no more than 0.05. The proposed kernel achieves an average 1.39× speedup over ABQ-LLM on LLaMA-2-70B linear layers. End-to-end, FlexQ delivers 1.33× inference acceleration and 1.21× memory savings over SmoothQuant. Code is released at this https URL.
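
The weight-quantization half of the scheme can be sketched as plain uniform symmetric INT6 quantization with a per-output-channel scale; FlexQ's W6A6/W6A8 Binary-Tensor-Core kernels and the layer-wise activation sensitivity analysis are beyond this illustration:

```python
import numpy as np

def quantize_int6(W):
    qmax = 2 ** (6 - 1) - 1                      # INT6 range: [-32, 31]
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)
q, scale = quantize_int6(W)
W_hat = q * scale                                # dequantized weights
print(np.abs(W - W_hat).max())                   # per-row quantization error
```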

[LG-19] Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points

链接: https://arxiv.org/abs/2508.04351
作者: Justin Lee,Behnaz Moradijamei,Heman Shakeri
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 23 pages, 10 figures

点击查看摘要

Abstract:Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets, including gene expression data collected at uneven time points and an image progression task, demonstrating the method’s versatility.

[LG-20] From Split to Share: Private Inference with Distributed Feature Sharing

链接: https://arxiv.org/abs/2508.04346
作者: Zihan Liu,Jiayi Wen,Shouhong Tan,Zhirun Zheng,Cheng Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers’ outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.

[LG-21] Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning

链接: https://arxiv.org/abs/2508.04329
作者: Ali Taheri Ghahrizjani,Alireza Taban,Qizhou Wang,Shanshan Ye,Abdolreza Mirzaei,Tongliang Liu,Bo Han
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs), notably enhancing their capacity to acquire domain-specific knowledge while preserving or potentially augmenting their general-purpose capabilities. However, the efficacy of SFT hinges on data quality as well as data volume, otherwise it may result in limited performance gains or even degradation relative to the associated baselines. To mitigate such reliance, we suggest categorizing tokens within each corpus into two parts – positive and negative tokens – based on whether they are useful to improve model performance. Positive tokens can be trained in common ways, whereas negative tokens, which may lack essential semantics or be misleading, should be explicitly forgotten. Overall, the token categorization steers the model away from less informative content, and the forgetting process shapes a knowledge boundary that guides the model on what information to learn more precisely. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
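
A hedged sketch of how such a forgetting objective might be implemented: standard cross-entropy on tokens marked positive, and a small negated (gradient-ascent) term on tokens marked negative. The mask, the weight `beta`, and this exact loss form are assumptions; the paper's token-categorization criterion is not reproduced here:

```python
import torch
import torch.nn.functional as F

def sft_with_forgetting(logits, labels, positive_mask, beta=0.1):
    # Per-token cross-entropy, reshaped back to (batch, seq).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         labels.view(-1), reduction="none").view(labels.shape)
    n_pos = positive_mask.sum().clamp(min=1)
    n_neg = (1 - positive_mask).sum().clamp(min=1)
    pos = (ce * positive_mask).sum() / n_pos
    neg = (ce * (1 - positive_mask)).sum() / n_neg
    return pos - beta * neg   # descend on positives, ascend on negatives

logits = torch.randn(2, 5, 100, requires_grad=True)  # (batch, seq, vocab)
labels = torch.randint(0, 100, (2, 5))
mask = torch.randint(0, 2, (2, 5)).float()           # 1 = positive token
loss = sft_with_forgetting(logits, labels, mask)
```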

[LG-22] WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification

链接: https://arxiv.org/abs/2508.04308
作者: Thang Duc Tran,Thai Hoang Le
类目: Machine Learning (cs.LG)
*备注: 17th International Conference on Computational Collective Intelligence 2025

点击查看摘要

Abstract:Machine unlearning, the efficient deletion of the impact of specific data in a trained model, remains a challenging problem. Current machine unlearning approaches that focus primarily on data-centric or weight-based strategies frequently encounter challenges in achieving precise unlearning, maintaining stability, and ensuring applicability across diverse domains. In this work, we introduce a new two-phase efficient machine unlearning method for image classification that leverages weight saliency to focus the unlearning process on critical model parameters. Our method is called weight saliency soft-guided contrastive learning for efficient machine unlearning image classification (WSS-CL), which significantly narrows the performance gap with “exact” unlearning. First, the forgetting stage maximizes the Kullback-Leibler divergence between output logits and aggregated pseudo-labels for efficient forgetting in logit space. Next, the adversarial fine-tuning stage introduces contrastive learning in a self-supervised manner. By using scaled feature representations, it maximizes the distance between the forgotten and retained data samples in the feature space, with the forgotten and the paired augmented samples acting as positive pairs, while the retained samples act as negative pairs in the contrastive loss computation. Experimental evaluations reveal that our proposed method yields much-improved unlearning efficacy with negligible performance loss compared to state-of-the-art approaches, indicative of its usability in supervised and self-supervised settings.

[LG-23] Mockingbird: How does LLM perform in general machine learning tasks?

链接: https://arxiv.org/abs/2508.04279
作者: Haoyu Jia,Yoshiki Obinata,Kento Kawaharazuka,Kei Okada
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are now being used with increasing frequency as chat bots, tasked with summarizing information or generating text and code in accordance with user instructions. The rapid increase in reasoning capabilities and inference speed of LLMs has revealed their remarkable potential for applications extending beyond the domain of chat bots to general machine learning tasks. This work is motivated by curiosity about that potential. In this work, we propose a framework, Mockingbird, to adapt LLMs to general machine learning tasks and evaluate its performance and scalability on several general machine learning tasks. The core concept of this framework is instructing LLMs to role-play functions and reflect on their mistakes to improve themselves. Our evaluation and analysis results show that LLM-driven machine learning methods, such as Mockingbird, can achieve acceptable results on common machine learning tasks; however, self-reflection alone currently cannot outperform the effect of domain-specific documents and feedback from human experts.

[LG-24] A virtual sensor fusion approach for state of charge estimation of lithium-ion cells

链接: https://arxiv.org/abs/2508.04268
作者: Davide Previtali,Daniele Masti,Mirko Mazzoleni,Fabio Previdi
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the estimation of the State Of Charge (SOC) of lithium-ion cells via the combination of two widely used paradigms: Kalman Filters (KFs) equipped with Equivalent Circuit Models (ECMs) and machine-learning approaches. In particular, a recent Virtual Sensor (VS) synthesis technique is considered, which operates as follows: (i) learn an Affine Parameter-Varying (APV) model of the cell directly from data, (ii) derive a bank of linear observers from the APV model, (iii) train a machine-learning technique from features extracted from the observers together with input and output data to predict the SOC. The SOC predictions returned by the VS are supplied to an Extended KF (EKF) as output measurements along with the cell terminal voltage, combining the two paradigms. A data-driven calibration strategy for the noise covariance matrices of the EKF is proposed. Experimental results show that the designed approach is beneficial w.r.t. SOC estimation accuracy and smoothness.

[LG-25] T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion

链接: https://arxiv.org/abs/2508.04251
作者: Abdul Monaf Chowdhury,Rabeya Akter,Safaeid Hossain Arib
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series forecasting (MTSF) seeks to model temporal dynamics among variables to predict future trends. Transformer-based models and large language models (LLMs) have shown promise due to their ability to capture long-range dependencies and patterns. However, current methods often rely on rigid inductive biases, ignore intervariable interactions, or apply static fusion strategies that limit adaptability across forecast horizons. These limitations create bottlenecks in capturing nuanced, horizon-specific relationships in time-series data. To solve this problem, we propose T3Time, a novel trimodal framework consisting of time, spectral, and prompt branches, where a dedicated frequency encoding branch captures the periodic structures, along with a gating mechanism that learns prioritization between temporal and spectral features based on the prediction horizon. We also propose a mechanism that adaptively aggregates multiple cross-modal alignment heads by dynamically weighting the importance of each head based on the features. Extensive experiments on benchmark datasets demonstrate that our model consistently outperforms state-of-the-art baselines, achieving an average reduction of 3.28% in MSE and 2.29% in MAE. Furthermore, it shows strong generalization in few-shot learning settings: with 5% training data, we see a reduction in MSE and MAE by 4.13% and 1.91%, respectively; and with 10% data, by 3.62% and 1.98% on average. Code - this https URL

[LG-26] Causal Reward Adjustment: Mitigating Reward Hacking in External Reasoning via Backdoor Correction

链接: https://arxiv.org/abs/2508.04216
作者: Ruike Song,Zeen Song,Huijie Guo,Wenwen Qiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:External reasoning systems combine language models with process reward models (PRMs) to select high-quality reasoning paths for complex tasks such as mathematical problem solving. However, these systems are prone to reward hacking, where high-scoring but logically incorrect paths are assigned high scores by the PRMs, leading to incorrect answers. From a causal inference perspective, we attribute this phenomenon primarily to the presence of confounding semantic features. To address it, we propose Causal Reward Adjustment (CRA), a method that mitigates reward hacking by estimating the true reward of a reasoning path. CRA trains sparse autoencoders on the PRM’s internal activations to recover interpretable features, then corrects confounding by using backdoor adjustment. Experiments on math solving datasets demonstrate that CRA mitigates reward hacking and improves final accuracy, without modifying the policy model or retraining PRM.

[LG-27] Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes

链接: https://arxiv.org/abs/2508.04193
作者: Chengcheng Yan,Jiawei Xu,Zheng Peng,Qingsong Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and high computational cost. To address these issues, we propose a novel method, Stochastic Alternating Minimization with Trainable Step Sizes (SAMT), which updates network parameters in an alternating manner by treating the weights of each layer as a block. By decomposing the overall optimization into sub-problems corresponding to different blocks, this block-wise alternating strategy reduces per-step computational overhead and enhances training stability in nonconvex settings. To fully leverage these benefits, inspired by meta-learning, we propose a novel adaptive step size strategy that is incorporated into the sub-problem solving steps of the alternating updates. It supports different types of trainable step sizes, including but not limited to scalar, element-wise, row-wise, and column-wise, enabling adaptive step size selection tailored to each block via meta-learning. We further provide a theoretical convergence guarantee for the proposed algorithm, establishing its optimization soundness. Extensive experiments on multiple benchmarks demonstrate that SAMT achieves better generalization performance with fewer parameter updates compared to state-of-the-art methods, highlighting its effectiveness and potential in neural network optimization.
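
A minimal sketch of the block-wise alternating update, where each step optimizes one layer while all others are frozen; SAMT's trainable (meta-learned) step sizes are simplified here to a fixed scalar learning rate:

```python
import torch
import torch.nn as nn

def alternating_step(model, loss_fn, x, y, lr=1e-3):
    blocks = [list(m.parameters()) for m in model.children()]
    for params in blocks:
        if not params:                      # skip parameter-free layers
            continue
        for p in model.parameters():        # freeze everything ...
            p.requires_grad_(False)
        for p in params:                    # ... except the current block
            p.requires_grad_(True)
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g                 # update this block only
    for p in model.parameters():
        p.requires_grad_(True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
alternating_step(model, nn.functional.cross_entropy,
                 torch.randn(16, 10), torch.randint(0, 2, (16,)))
```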

[LG-28] One Small Step with Fingerprints, One Giant Leap for De Novo Molecule Generation from Mass Spectra

链接: https://arxiv.org/abs/2508.04180
作者: Neng Kai Nigel Neo,Lim Jing,Ngoui Yong Zhau Preston,Koh Xue Ting Serene,Bingquan Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A common approach to the de novo molecular generation problem from mass spectra involves a two-stage pipeline: (1) encoding mass spectra into molecular fingerprints, followed by (2) decoding these fingerprints into molecular structures. In our work, we adopt MIST (Goldman et al., 2023) as the encoder and MolForge (Ucak et al., 2023) as the decoder, leveraging pretraining to enhance performance. Notably, pretraining MolForge proves especially effective, enabling it to serve as a robust fingerprint-to-structure decoder. Additionally, instead of passing the probability of each bit in the fingerprint, thresholding the probabilities as a step function helps focus the decoder on the presence of substructures, improving recovery of accurate molecular structures even when the fingerprints predicted by MIST only moderately resemble the ground truth in terms of Tanimoto similarity. This combination of encoder and decoder results in a tenfold improvement over previous state-of-the-art methods, generating the correct molecular structure from mass spectra in 28% of cases at top-1 and 36% at top-10. We position this pipeline as a strong baseline for future research in de novo molecule elucidation from mass spectra.
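
The thresholding trick mentioned above is essentially a one-liner; 0.5 is an assumed cutoff:

```python
import numpy as np

# Step-function thresholding of predicted fingerprint bit probabilities
# before decoding, so the decoder sees presence/absence of substructures.
probs = np.array([0.91, 0.12, 0.55, 0.03, 0.78])    # encoder outputs
fingerprint_bits = (probs >= 0.5).astype(np.uint8)  # -> [1, 0, 1, 0, 1]
```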

[LG-29] Agentic-AI based Mathematical Framework for Commercialization of Energy Resilience in Electrical Distribution System Planning and Operation

链接: https://arxiv.org/abs/2508.04170
作者: Aniket Johri,Divyanshi Dwivedi,Mayukha Pal
类目: ystems and Control (eess.SY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing vulnerability of electrical distribution systems to extreme weather events and cyber threats necessitates the development of economically viable frameworks for resilience enhancement. While existing approaches focus primarily on technical resilience metrics and enhancement strategies, there remains a significant gap in establishing market-driven mechanisms that can effectively commercialize resilience features while optimizing their deployment through intelligent decision-making. Moreover, traditional optimization approaches for distribution network reconfiguration often fail to dynamically adapt to both normal and emergency conditions. This paper introduces a novel framework integrating dual-agent Proximal Policy Optimization (PPO) with market-based mechanisms, achieving an average resilience score of 0.85 ± 0.08 over 10 test episodes. The proposed architecture leverages a dual-agent PPO scheme, where a strategic agent selects optimal DER-driven switching configurations, while a tactical agent fine-tunes individual switch states and grid preferences under budget and weather constraints. These agents interact within a custom-built dynamic simulation environment that models stochastic calamity events, budget limits, and resilience-cost trade-offs. A comprehensive reward function is designed that balances resilience enhancement objectives with market profitability (with up to 200x reward incentives, resulting in 85% of actions during calamity steps selecting configurations with 4 DERs), incorporating factors such as load recovery speed, system robustness, and customer satisfaction. Over 10 test episodes, the framework achieved a benefit-cost ratio of 0.12 ± 0.01, demonstrating sustainable market incentives for resilience investment.

[LG-30] Semi-Supervised Deep Domain Adaptation for Predicting Solar Power Across Different Locations

链接: https://arxiv.org/abs/2508.04165
作者: Md Shazid Islam,A S M Jahid Hasan,Md Saydur Rahman,Md Saiful Islam Sajol
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate solar generation prediction is essential for proper estimation of renewable energy resources across diverse geographic locations. However, geographical and weather features vary from location to location, which introduces domain shift - a major bottleneck to developing location-agnostic prediction models. As a result, a machine-learning model which can perform well in predicting solar power in one location may exhibit subpar performance in another location. Moreover, the lack of properly labeled data and storage issues make the task even more challenging. In order to address domain shift due to varying weather conditions across different meteorological regions, this paper presents a semi-supervised deep domain adaptation framework, allowing accurate predictions with minimal labeled data from the target location. Our approach involves training a deep convolutional neural network on a source location’s data and adapting it to the target location using a source-free, teacher-student model configuration. The teacher-student model leverages consistency and cross-entropy loss for semi-supervised learning, ensuring effective adaptation without any source data requirement for prediction. With annotation of only 20% of the data in the target domain, our approach exhibits improvements of up to 11.36%, 6.65%, and 4.92% in prediction accuracy for California, Florida, and New York as target domains, respectively, compared to the non-adaptive approach.
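
A rough sketch of the teacher-student objective described above: cross-entropy on the few labeled target samples plus a consistency term that pulls the student toward an EMA-updated teacher on unlabeled target data. The consistency weight, the KL form of the consistency loss, and the EMA momentum are assumptions:

```python
import copy
import torch
import torch.nn.functional as F

def adaptation_loss(student, teacher, x_lab, y_lab, x_unlab, w=1.0):
    ce = F.cross_entropy(student(x_lab), y_lab)
    with torch.no_grad():
        t_prob = F.softmax(teacher(x_unlab), dim=1)   # teacher targets
    s_logp = F.log_softmax(student(x_unlab), dim=1)
    return ce + w * F.kl_div(s_logp, t_prob, reduction="batchmean")

@torch.no_grad()
def ema_update(teacher, student, m=0.99):   # teacher slowly tracks student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(m).add_(s, alpha=1 - m)

student = torch.nn.Linear(24, 3)            # stand-in for the CNN
teacher = copy.deepcopy(student)
loss = adaptation_loss(student, teacher,
                       torch.randn(8, 24), torch.randint(0, 3, (8,)),
                       torch.randn(32, 24))
```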

[LG-31] Evaluating Selective Encryption Against Gradient Inversion Attacks

链接: https://arxiv.org/abs/2508.04155
作者: Jiajun Gu,Yuhang Yao,Shuaiqi Wang,Carlee Joe-Wong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gradient inversion attacks pose significant privacy threats to distributed training frameworks such as federated learning, enabling malicious parties to reconstruct sensitive local training data from gradient communications between clients and an aggregation server during the aggregation process. While traditional encryption-based defenses, such as homomorphic encryption, offer strong privacy guarantees without compromising model utility, they often incur prohibitive computational overheads. To mitigate this, selective encryption has emerged as a promising approach, encrypting only a subset of gradient data based on the data’s significance under a certain metric. However, there have been few systematic studies on how to specify this metric in practice. This paper systematically evaluates selective encryption methods with different significance metrics against state-of-the-art attacks. Our findings demonstrate the feasibility of selective encryption in reducing computational overhead while maintaining resilience against attacks. We propose a distance-based significance analysis framework that provides theoretical foundations for selecting critical gradient elements for encryption. Through extensive experiments on different model architectures (LeNet, CNN, BERT, GPT-2) and attack types, we identify gradient magnitude as a generally effective metric for protection against optimization-based gradient inversions. However, we also observe that no single selective encryption strategy is universally optimal across all attack scenarios, and we provide guidelines for choosing appropriate strategies for different model architectures and privacy requirements.
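
A minimal sketch of magnitude-based selective encryption: pick the top fraction of gradient entries by absolute value for encryption and send the remainder in plaintext. The ratio is a placeholder, and the "to encrypt" output stands in for a real scheme (e.g., homomorphic encryption) not implemented here:

```python
import numpy as np

def split_for_encryption(grad, ratio=0.1):
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # most significant entries
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return flat[mask], flat[~mask], mask           # (to encrypt, plaintext, mask)

grad = np.random.default_rng(0).normal(size=(64, 32))
sensitive, public, mask = split_for_encryption(grad, ratio=0.1)
```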

[LG-32] Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

链接: https://arxiv.org/abs/2508.04097
作者: Ngoc-Bao Nguyen,Sy-Tuyen Ho,Koh Jun Hao,Ngai-Man Cheung
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs’ vulnerability to leaking private visual training data. Tailored to VLMs’ token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and a user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31%, underscoring the severity of model inversion threats in VLMs. Notably, we also demonstrate inversion attacks on the publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

[LG-33] Convolutional autoencoders for the reconstruction of three-dimensional interfacial multiphase flows

链接: https://arxiv.org/abs/2508.04084
作者: Murray Cutforth,Shahab Mirjalili
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:In this work, we perform a comprehensive investigation of autoencoders for reduced-order modeling of three-dimensional multiphase flows. Focusing on the accuracy of reconstructing multiphase flow volume/mass fractions with a standard convolutional architecture, we examine the advantages and disadvantages of different interface representation choices (diffuse, sharp, level set). We use a combination of synthetic data with non-trivial interface topologies and high-resolution simulation data of multiphase homogeneous isotropic turbulence for training and validation. This study clarifies the best practices for reducing the dimensionality of multiphase flows via autoencoders. Consequently, this paves the path for uncoupling the training of autoencoders for accurate reconstruction and the training of temporal or input/output models such as neural operators (e.g., FNOs, DeepONets) and neural ODEs on the lower-dimensional latent space given by the autoencoders. As such, the implications of this study are significant and of interest to the multiphase flow community and beyond.

[LG-34] The Ubiquitous Sparse Matrix-Matrix Products

链接: https://arxiv.org/abs/2508.04077
作者: Aydın Buluç
类目: Numerical Analysis (math.NA); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Mathematical Software (cs.MS); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:Multiplication of a sparse matrix with another (dense or sparse) matrix is a fundamental operation that captures the computational patterns of many data science applications, including but not limited to graph algorithms, sparsely connected neural networks, graph neural networks, clustering, and many-to-many comparisons of biological sequencing data. In many application scenarios, the matrix multiplication takes place on an arbitrary algebraic semiring where the scalar operations are overloaded with user-defined functions with certain properties, or a more general heterogeneous algebra where even the domains of the input matrices can be different. Here, we provide a unifying treatment of the sparse matrix-matrix operation and its rich application space including machine learning, computational biology and chemistry, graph algorithms, and scientific computing.

[LG-35] Adversarial Fair Multi-View Clustering

链接: https://arxiv.org/abs/2508.04071
作者: Mudi Jiang,Jiahui Zhou,Lianyu Hu,Xinying Liu,Zengyou He,Zhikui Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cluster analysis is a fundamental problem in data mining and machine learning. In recent years, multi-view clustering has attracted increasing attention due to its ability to integrate complementary information from multiple views. However, existing methods primarily focus on clustering performance, while fairness-a critical concern in human-centered applications-has been largely overlooked. Although recent studies have explored group fairness in multi-view clustering, most methods impose explicit regularization on cluster assignments, relying on the alignment between sensitive attributes and the underlying cluster structure. However, this assumption often fails in practice and can degrade clustering performance. In this paper, we propose an adversarial fair multi-view clustering (AFMVC) framework that integrates fairness learning into the representation learning process. Specifically, our method employs adversarial training to fundamentally remove sensitive attribute information from learned features, ensuring that the resulting cluster assignments are unaffected by it. Furthermore, we theoretically prove that aligning view-specific clustering assignments with a fairness-invariant consensus distribution via KL divergence preserves clustering consistency without significantly compromising fairness, thereby providing additional theoretical guarantees for our framework. Extensive experiments on data sets with fairness constraints demonstrate that AFMVC achieves superior fairness and competitive clustering performance compared to existing multi-view clustering and fairness-aware clustering methods.

[LG-36] Fine-tuning for Better Few Shot Prompting: An Empirical Comparison for Short Answer Grading

链接: https://arxiv.org/abs/2508.04063
作者: Joel Walsh,Siddarth Mamidanna,Benjamin Nye,Mark Core,Daniel Auerbach
类目: Machine Learning (cs.LG)
*备注: Proceedings of the Second Workshop on Automated Evaluation of Learning and Assessment Content co-located with 26th International Conference on Artificial Intelligence in Education (AIED 2025)

点击查看摘要

Abstract:Research to improve Automated Short Answer Grading has recently focused on Large Language Models (LLMs) with prompt engineering and no- or few-shot prompting to achieve best results. This is in contrast to the fine-tuning approach, which has historically required large-scale compute clusters inaccessible to most users. New closed-model approaches such as OpenAI’s fine-tuning service promise results with as few as 100 examples, while methods using open weights such as quantized low-rank adaptation (QLoRA) can be used to fine-tune models on consumer GPUs. We evaluate both of these fine-tuning methods, measuring their interaction with few-shot prompting for automated short answer grading (ASAG) with structured (JSON) outputs. Our results show that fine-tuning with small amounts of data has limited utility for Llama open-weight models, but that fine-tuning methods can outperform few-shot baseline instruction-tuned LLMs for OpenAI’s closed models. While our evaluation set is limited, we find some evidence that the observed benefits of fine-tuning may be impacted by the domain subject matter. Lastly, we observed dramatic improvement with the Llama 3.1 8B-Instruct open-weight model by seeding the initial training examples with a significant amount of cheaply generated synthetic training data.

[LG-37] Quantum Temporal Fusion Transformer

链接: https://arxiv.org/abs/2508.04048
作者: Krishnakanta Barik,Goutam Paul
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:The Temporal Fusion Transformer (TFT), proposed by Lim et al. [\textitInternational Journal of Forecasting, 2021], is a state-of-the-art attention-based deep neural network architecture specifically designed for multi-horizon time series forecasting. It has demonstrated significant performance improvements over existing benchmarks. In this work, we propose a Quantum Temporal Fusion Transformer (QTFT), a quantum-enhanced hybrid quantum-classical architecture that extends the capabilities of the classical TFT framework. Our results demonstrate that QTFT is successfully trained on the forecasting datasets and is capable of accurately predicting future values. In particular, our experimental results show that in certain test cases, the model outperforms its classical counterpart in terms of both training and test loss, while in the remaining cases, it achieves comparable performance. A key advantage of our approach lies in its foundation on a variational quantum algorithm, enabling implementation on current noisy intermediate-scale quantum (NISQ) devices without strict requirements on the number of qubits or circuit depth.

[LG-38] FeDaL: Federated Dataset Learning for Time Series Foundation Models

链接: https://arxiv.org/abs/2508.04045
作者: Shengchao Chen,Guodong Long,Jing Jiang
类目: Machine Learning (cs.LG)
*备注: 28 pages, scaling FL to time series foundation models

点击查看摘要

Abstract:Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a natural solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL’s cross-dataset generalization has been extensively evaluated on real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.

[LG-39] Decoupled Contrastive Learning for Federated Learning

链接: https://arxiv.org/abs/2508.04005
作者: Hyungbin Kim,Incheol Baek,Yon Dohn Chung
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning is a distributed machine learning paradigm that allows multiple participants to train a shared model by exchanging model updates instead of their raw data. However, its performance is degraded compared to centralized approaches due to data heterogeneity across clients. While contrastive learning has emerged as a promising approach to mitigate this, our theoretical analysis reveals a fundamental conflict: its asymptotic assumption of an infinite number of negative samples is violated in the finite-sample regime of federated learning. To address this issue, we introduce Decoupled Contrastive Learning for Federated Learning (DCFL), a novel framework that decouples the existing contrastive loss into two objectives. Decoupling the loss into its alignment and uniformity components enables the independent calibration of the attraction and repulsion forces without relying on the asymptotic assumptions. This strategy provides a contrastive learning method suitable for federated learning environments where each client has a small amount of data. Our experimental results show that DCFL achieves stronger alignment between positive samples and greater uniformity between negative samples compared to existing contrastive learning methods. Furthermore, experimental results on standard benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet, demonstrate that DCFL consistently outperforms state-of-the-art federated learning methods.
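
The decoupling can be illustrated with the classic alignment/uniformity split (in the style of Wang and Isola, 2020), whose two terms can then be weighted independently; DCFL's exact formulation may differ:

```python
import torch

def alignment(z1, z2):
    # z1[i], z2[i] form a positive pair; embeddings assumed L2-normalized.
    return (z1 - z2).norm(dim=1).pow(2).mean()

def uniformity(z, t=2.0):
    # Repulsion among all embeddings via a log-average Gaussian potential.
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

z1 = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
z2 = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
loss = alignment(z1, z2) + 0.5 * uniformity(torch.cat([z1, z2]))
```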

[LG-40] nsorized Clustered LoRA Merging for Multi-Task Interference

链接: https://arxiv.org/abs/2508.03999
作者: Zhan Su,Fengran Mo,Guojun Liang,Jinghan Zhang,Bingbing Wen,Prayag Tiwari,Jian-Yun Nie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the success of the monolithic dense paradigm of large language models (LLMs), LoRA adapters offer an efficient solution by fine-tuning small task-specific modules and merging them with the base model. However, in multi-task settings, merging LoRA adapters trained on heterogeneous sources frequently causes task interference, degrading downstream performance. To address this, we propose a tensorized clustered LoRA (TC-LoRA) library that addresses task interference at both the text level and the parameter level. At the text level, we cluster the training samples in the embedding space to capture input-format similarities, then train a specialized LoRA adapter for each cluster. At the parameter level, we introduce a joint Canonical Polyadic (CP) decomposition that disentangles task-specific and shared factors across LoRA adapters. This joint factorization preserves essential knowledge while reducing cross-task interference. We conduct extensive experiments on out-of-domain zero-shot and skill-composition tasks, including reasoning, question answering, and coding. Compared to strong SVD-based baselines, TC-LoRA achieves +1.4% accuracy on Phi-3 and +2.3% on Mistral-7B, demonstrating the effectiveness of TC-LoRA in LLM adaptation.

[LG-41] BubbleONet: A Physics-Informed Neural Operator for High-Frequency Bubble Dynamics

链接: https://arxiv.org/abs/2508.03965
作者: Yunhao Zhang,Lin Cheng,Aswin Gnanaskandan,Ameya D. Jagtap
类目: Machine Learning (cs.LG)
*备注: 35 pages, 25 figures

点击查看摘要

Abstract:This paper introduces BubbleONet, an operator learning model designed to map pressure profiles from an input function space to corresponding bubble radius responses. BubbleONet is built upon the physics-informed deep operator network (PI-DeepONet) framework, leveraging DeepONet’s powerful universal approximation capabilities for operator learning alongside the robust physical fidelity provided by physics-informed neural networks. To mitigate the inherent spectral bias in deep learning, BubbleONet integrates the Rowdy adaptive activation function, enabling improved representation of high-frequency features. The model is evaluated across various scenarios, including: (1) Rayleigh-Plesset equation based bubble dynamics with a single initial radius, (2) Keller-Miksis equation based bubble dynamics with a single initial radius, and (3) Keller-Miksis equation based bubble dynamics with multiple initial radii. Moreover, the performance of single-step versus two-step training techniques for BubbleONet is investigated. The results demonstrate that BubbleONet serves as a promising surrogate model for simulating bubble dynamics, offering a computationally efficient alternative to traditional numerical solvers.

[LG-42] Measuring the stability and plasticity of recommender systems

链接: https://arxiv.org/abs/2508.03941
作者: Maria João Lavoura,Robert Jungnickel,João Vinagre
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions and then use a part of this dataset to train a model, and the remaining data to measure how closely the model recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only captures the performance of a particular model trained at some point in the past. We know, however, that online systems evolve over time. In general, it is a good idea that models reflect such changes, so models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns – stability – and, on the other hand, (quickly) adapt to changes – plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models, and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results of three different types of algorithms on the GoodReads dataset that suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity. While additional experiments will be necessary to confirm these observations, they already illustrate the usefulness of the proposed framework to gain insights into the long-term dynamics of recommendation models.

[LG-43] Markov Chain Estimation with In-Context Learning

链接: https://arxiv.org/abs/2508.03934
作者: Simon Lepage,Jeremie Mary,David Picard
类目: Machine Learning (cs.LG)
*备注: Accepted at Gretsi 2025

点击查看摘要

Abstract:We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different, and we show that there is a threshold in transformer size and in training set size above which the model learns to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that a more involved encoding of the states enables more robust prediction for Markov chains with structures different from those seen during training.
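
For concreteness, here is a small sketch of that setup, assuming Dirichlet-sampled transition matrices (state count and sequence length are invented), together with the in-context estimator a successful transformer would have to approximate.

```python
# Data generation for random Markov chains plus the empirical in-context estimate.
import numpy as np

rng = np.random.default_rng(0)
n_states, seq_len = 5, 256

def sample_chain():
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # random row-stochastic matrix
    states = [int(rng.integers(n_states))]
    for _ in range(seq_len - 1):
        states.append(int(rng.choice(n_states, p=P[states[-1]])))
    return P, np.array(states)

P, seq = sample_chain()
# Empirical transition frequencies computed from the context itself.
counts = np.zeros((n_states, n_states))
np.add.at(counts, (seq[:-1], seq[1:]), 1)
P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
print(np.abs(P_hat - P).max())   # estimation error shrinks with longer contexts
```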

[LG-44] Next Generation Equation-Free Multiscale Modelling of Crowd Dynamics via Machine Learning

链接: https://arxiv.org/abs/2508.03926
作者: Hector Vargas Alvarez,Dimitrios G. Patsatzis,Lucia Russo,Ioannis Kevrekidis,Constantinos Siettos
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注: 29 pages (9 pages of Appendix), 9 figures (3 in Appendix)

点击查看摘要

Abstract:Bridging the microscopic and the macroscopic modelling scales in crowd dynamics constitutes an important, open challenge for systematic numerical analysis, optimization, and control. We propose a combined manifold and machine learning approach to learn the discrete evolution operator for the emergent crowd dynamics in latent spaces from high-fidelity agent-based simulations. The proposed framework builds upon our previous works on next-generation Equation-free algorithms for learning surrogate models of high-dimensional and multiscale systems. Our approach is a four-stage one, explicitly conserving the mass of the reconstructed dynamics in the high-dimensional space. In the first step, we derive continuous macroscopic fields (densities) from discrete microscopic data (pedestrians’ positions) using kernel density estimation (KDE). In the second step, based on manifold learning, we construct a map from the macroscopic ambient space into the latent space parametrized by a few coordinates, based on proper orthogonal decomposition (POD) of the corresponding density distribution. The third step involves learning reduced-order surrogate models (ROMs) in the latent space using machine learning techniques, particularly long short-term memory (LSTM) networks and multivariate autoregressive (MVAR) models. Finally, we reconstruct the crowd dynamics in the high-dimensional space in terms of macroscopic density profiles. We demonstrate that the POD reconstruction of the density distribution via singular value decomposition (SVD) conserves the mass. With this “embed-learn in latent space-lift back to the ambient space” pipeline, we create an effective solution operator of the unavailable macroscopic PDE for the density evolution. For our illustrations, we use the Social Force Model to generate data in a corridor with an obstacle, imposing periodic boundary conditions. The numerical results demonstrate high accuracy, robustness, and generalizability, thus allowing for fast and accurate modelling/simulation of crowd dynamics from agent-based simulations.
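
The mass-conservation claim for the POD/SVD step can be seen in a few lines: every mean-centered snapshot sums to zero, so a truncated reconstruction added back to the mean keeps unit mass. A synthetic-data sketch (grid size and latent dimension are illustrative):

```python
# POD via SVD on density snapshots; truncation preserves total mass.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_snapshots, k = 200, 50, 5
densities = rng.random((n_cells, n_snapshots))
densities /= densities.sum(axis=0, keepdims=True)        # unit mass per snapshot

mean = densities.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(densities - mean, full_matrices=False)
recon = mean + U[:, :k] @ np.diag(S[:k]) @ Vt[:k]        # k-dimensional latent space

print(np.allclose(recon.sum(axis=0), 1.0))               # True: mass conserved
```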

[LG-45] Reinforcement Learning for Target Zone Blood Glucose Control

链接: https://arxiv.org/abs/2508.03875
作者: David H. Mguni,Jing Dong,Wanrong Yang,Ziquan Liu,Muhammad Salman Haleem,Baoxiang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Managing physiological variables within clinically safe target zones is a central challenge in healthcare, particularly for chronic conditions such as Type 1 Diabetes Mellitus (T1DM). Reinforcement learning (RL) offers promise for personalising treatment, but struggles with the delayed and heterogeneous effects of interventions. We propose a novel RL framework to study and support decision-making in T1DM technologies, such as automated insulin delivery. Our approach captures the complex temporal dynamics of treatment by unifying two control modalities: impulse control for discrete, fast-acting interventions (e.g., insulin boluses), and switching control for longer-acting treatments and regime shifts. The core of our method is a constrained Markov decision process augmented with physiological state features, enabling safe policy learning under clinical and resource constraints. The framework incorporates biologically realistic factors, including insulin decay, leading to policies that better reflect real-world therapeutic behaviour. While not intended for clinical deployment, this work establishes a foundation for future safe and temporally-aware RL in healthcare. We provide theoretical guarantees of convergence and demonstrate empirical improvements in a stylised T1DM control task, reducing blood glucose level violations from 22.4% (state-of-the-art) to as low as 10.8%.

[LG-46] Prediction-Oriented Subsampling from Data Streams

链接: https://arxiv.org/abs/2508.03868
作者: Benedetta Lavinia Mussati,Freddie Bickford Smith,Tom Rainforth,Stephen Roberts
类目: Machine Learning (cs.LG)
*备注: Published at CoLLAs 2025

点击查看摘要

Abstract:Data is often generated in streams, with new observations arriving over time. A key challenge for learning models from data streams is capturing relevant information while keeping computational costs manageable. We explore intelligent data subsampling for offline learning, and argue for an information-theoretic method centred on reducing uncertainty in downstream predictions of interest. Empirically, we demonstrate that this prediction-oriented approach performs better than a previously proposed information-theoretic technique on two widely studied problems. At the same time, we highlight that reliably achieving strong performance in practice requires careful model design.

[LG-47] Data-Driven Spectrum Demand Prediction: A Spatio-Temporal Framework with Transfer Learning

链接: https://arxiv.org/abs/2508.03863
作者: Amin Farajzadeh,Hongzhao Zheng,Sarah Dumoulin,Trevor Ha,Halim Yanikomeroglu,Amir Ghasemi
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Accepted to be presented at IEEE PIMRC 2025

点击查看摘要

Abstract:Accurate spectrum demand prediction is crucial for informed spectrum allocation, effective regulatory planning, and fostering sustainable growth in modern wireless communication networks. It supports governmental efforts, particularly those led by the International Telecommunication Union (ITU), to establish fair spectrum allocation policies, improve auction mechanisms, and meet the requirements of emerging technologies such as advanced 5G, forthcoming 6G, and the Internet of Things (IoT). This paper presents an effective spatio-temporal prediction framework that leverages crowdsourced user-side key performance indicators (KPIs) and regulatory datasets to model and forecast spectrum demand. The proposed methodology achieves superior prediction accuracy and cross-regional generalizability by incorporating advanced feature engineering, comprehensive correlation analysis, and transfer learning techniques. Unlike traditional ITU models, which are often constrained by arbitrary inputs and unrealistic assumptions, this approach exploits granular, data-driven insights to account for spatial and temporal variations in spectrum utilization. Comparative evaluations against ITU estimates, as the benchmark, underscore our framework’s capability to deliver more realistic and actionable predictions. Experimental results validate the efficacy of our methodology, highlighting its potential as a robust approach for policymakers and regulatory bodies to enhance spectrum management and planning.

[LG-48] Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

链接: https://arxiv.org/abs/2508.03854
作者: Xin Zhang,Quanyu Zhu,Liangbei Xu,Zain Huda,Wang Zhou,Jin Fang,Dennis van der Staay,Yuxi Hu,Jade Nie,Jiyan Yang,Chunzhi Yang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial component for managing sparse categorical features. Typically, these tables in industrial DLRMs contain trillions of parameters, necessitating model parallelism strategies to address memory constraints. However, as training systems expand to massive GPU counts, the traditional fully-sharded model-parallelism strategies for embedding tables pose significant scalability challenges, including imbalance and straggler issues, intensive lookup communication, and heavy embedding activation memory. To overcome these limitations, we propose a novel two-dimensional sparse parallelism approach. Rather than fully sharding tables across all GPUs, our solution introduces data parallelism on top of model parallelism. This enables efficient all-to-all communication and reduces peak memory consumption. Additionally, we have developed the momentum-scaled row-wise AdaGrad algorithm to mitigate performance losses associated with the shift in training paradigms. Our extensive experiments demonstrate that the proposed approach significantly enhances training efficiency while maintaining model performance parity. It achieves nearly linear training speed scaling up to 4K GPUs, setting a new state-of-the-art benchmark for recommendation model training.

[LG-49] DP-NCB: Privacy Preserving Fair Bandits

链接: https://arxiv.org/abs/2508.03836
作者: Dhruv Sarkar,Nishant Pandey,Sayak Ray Chowdhury
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multi-armed bandit algorithms are fundamental tools for sequential decision-making under uncertainty, with widespread applications across domains such as clinical trials and personalized decision-making. As bandit algorithms are increasingly deployed in these socially sensitive settings, it becomes critical to protect user data privacy and ensure fair treatment across decision rounds. While prior work has independently addressed privacy and fairness in bandit settings, the question of whether both objectives can be achieved simultaneously has remained largely open. Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns. To bridge this gap, we introduce the Differentially Private Nash Confidence Bound (DP-NCB), a novel and unified algorithmic framework that simultaneously ensures $\epsilon$-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. The framework is sufficiently general to operate under both global and local differential privacy models, and is anytime, requiring no prior knowledge of the time horizon. We support our theoretical guarantees with simulations on synthetic bandit instances, showing that DP-NCB incurs substantially lower Nash regret than state-of-the-art baselines. Our results offer a principled foundation for designing bandit algorithms that are both privacy-preserving and fair, making them suitable for high-stakes, socially impactful applications.

[LG-50] Scalable Neural Network-based Blackbox Optimization

链接: https://arxiv.org/abs/2508.03827
作者: Pavankumar Koratikere,Leifur Leifsson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This preprint has been submitted to Structural and Multidisciplinary Optimization for peer review. An open-source implementation of SNBO is available at: this https URL

点击查看摘要

Abstract:Bayesian Optimization (BO) is a widely used approach for blackbox optimization that leverages a Gaussian process (GP) model and an acquisition function to guide future sampling. While effective in low-dimensional settings, BO faces scalability challenges in high-dimensional spaces and with large number of function evaluations due to the computational complexity of GP models. In contrast, neural networks (NNs) offer better scalability and can model complex functions, which led to the development of NN-based BO approaches. However, these methods typically rely on estimating model uncertainty in NN prediction – a process that is often computationally intensive and complex, particularly in high dimensions. To address these limitations, a novel method, called scalable neural network-based blackbox optimization (SNBO), is proposed that does not rely on model uncertainty estimation. Specifically, SNBO adds new samples using separate criteria for exploration and exploitation, while adaptively controlling the sampling region to ensure efficient optimization. SNBO is evaluated on a range of optimization problems spanning from 10 to 102 dimensions and compared against four state-of-the-art baseline algorithms. Across the majority of test problems, SNBO attains function values better than the best-performing baseline algorithm, while requiring 40-60% fewer function evaluations and reducing the runtime by at least an order of magnitude.

[LG-51] Bernoulli-LoRA: A Theoretical Framework for Randomized Low-Rank Adaptation

链接: https://arxiv.org/abs/2508.03820
作者: Igor Sokolov,Abdurakhmon Sadiev,Yury Demidovich,Fawaz S Al-Qahtani,Peter Richtárik
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 64 Pages, 9 Algorithms, 22 Theorems, 10 Lemmas, 2 Figures, 3 Tables

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large foundational models to specific tasks, particularly as model sizes continue to grow exponentially. Among PEFT methods, Low-Rank Adaptation (LoRA) (arXiv:2106.09685) stands out for its effectiveness and simplicity, expressing adaptations as a product of two low-rank matrices. While extensive empirical studies demonstrate LoRA’s practical utility, theoretical understanding of such methods remains limited. Recent work on RAC-LoRA (arXiv:2410.08305) took initial steps toward rigorous analysis. In this work, we introduce Bernoulli-LoRA, a novel theoretical framework that unifies and extends existing LoRA approaches. Our method introduces a probabilistic Bernoulli mechanism for selecting which matrix to update. This approach encompasses and generalizes various existing update strategies while maintaining theoretical tractability. Under standard assumptions from non-convex optimization literature, we analyze several variants of our framework: Bernoulli-LoRA-GD, Bernoulli-LoRA-SGD, Bernoulli-LoRA-PAGE, Bernoulli-LoRA-MVR, Bernoulli-LoRA-QGD, Bernoulli-LoRA-MARINA, and Bernoulli-LoRA-EF21, establishing convergence guarantees for each variant. Additionally, we extend our analysis to convex non-smooth functions, providing convergence rates for both constant and adaptive (Polyak-type) stepsizes. Through extensive experiments on various tasks, we validate our theoretical findings and demonstrate the practical efficacy of our approach. This work is a step toward developing theoretically grounded yet practically effective PEFT methods.
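
A toy sketch of the Bernoulli update rule at the heart of this framework: at each step, a coin flip decides whether the A or the B factor of the low-rank update B @ A receives the gradient step. The objective, probability p, and learning rate below are illustrative assumptions.

```python
# Bernoulli selection of which LoRA factor to update at each step.
import torch

d, r, p, lr = 32, 4, 0.5, 1e-2
A = torch.randn(r, d, requires_grad=True)
B = torch.randn(d, r, requires_grad=True)
W = torch.randn(d, d)                        # frozen base weight

for step in range(100):
    x = torch.randn(d)
    loss = ((W + B @ A) @ x).pow(2).mean()   # placeholder objective
    loss.backward()
    with torch.no_grad():
        if torch.rand(()) < p:               # Bernoulli(p): update A only
            A -= lr * A.grad
        else:                                # otherwise update B only
            B -= lr * B.grad
    A.grad = None
    B.grad = None
```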

[LG-52] Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings

链接: https://arxiv.org/abs/2508.03768
作者: Debamita Ghosh,George K. Atia,Yue Wang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2404.03578 by other authors

点击查看摘要

Abstract:Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment – assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.

[LG-53] A Robust and Efficient Pipeline for Enterprise-Level Large-Scale Entity Resolution

链接: https://arxiv.org/abs/2508.03767
作者: Sandeepa Kannangara,Arman Abrahamyan,Daniel Elias,Thomas Kilby,Nadav Dar,Luiz Pizzato,Anna Leontjeva,Dan Jermyn
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Entity resolution (ER) remains a significant challenge in data management, especially when dealing with large datasets. This paper introduces MERAI (Massive Entity Resolution using AI), a robust and efficient pipeline designed to address record deduplication and linkage issues in high-volume datasets at an enterprise level. The pipeline’s resilience and accuracy have been validated through various large-scale record deduplication and linkage projects. To evaluate MERAI’s performance, we compared it with two well-known entity resolution libraries, Dedupe and Splink. While Dedupe failed to scale beyond 2 million records due to memory constraints, MERAI successfully processed datasets of up to 15.7 million records and produced accurate results across all experiments. Experimental data demonstrates that MERAI outperforms both baseline systems in terms of matching accuracy, with consistently higher F1 scores in both deduplication and record linkage tasks. MERAI offers a scalable and reliable solution for enterprise-level large-scale entity resolution, ensuring data integrity and consistency in real-world applications.

[LG-54] LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

链接: https://arxiv.org/abs/2508.03766
作者: Yongchao Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce LLMPrior, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming an LLM-based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, Fed-LLMPrior, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.
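
Logarithmic Opinion Pooling itself is easy to sketch on a grid: the pooled prior is proportional to the weighted geometric mean prod_i p_i(theta)^{w_i} of the agents' densities. In the sketch below, the two Gaussian-mixture priors and the weights are invented stand-ins for LLM-elicited priors.

```python
# Logarithmic opinion pooling of two mixture priors on a 1-D grid.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-10, 10, 2001)
p1 = 0.7 * norm.pdf(theta, -2, 1.0) + 0.3 * norm.pdf(theta, 3, 2.0)
p2 = 0.5 * norm.pdf(theta, 0, 1.5) + 0.5 * norm.pdf(theta, 2, 1.0)
w = np.array([0.6, 0.4])                      # pooling weights, sum to 1

log_pool = w[0] * np.log(p1 + 1e-300) + w[1] * np.log(p2 + 1e-300)
pooled = np.exp(log_pool - log_pool.max())    # stabilize before exponentiating
pooled /= np.trapz(pooled, theta)             # renormalize on the grid
```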

[LG-55] PILOT-C: Physics-Informed Low-Distortion Optimal Trajectory Compression

链接: https://arxiv.org/abs/2508.03730
作者: Kefei Wu,Baihua Zheng,Weiwei Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Location-aware devices continuously generate massive volumes of trajectory data, creating demand for efficient compression. Line simplification is a common solution but typically assumes 2D trajectories and ignores time synchronization and motion continuity. We propose PILOT-C, a novel trajectory compression framework that integrates frequency-domain physics modeling with error-bounded optimization. Unlike existing line simplification methods, PILOT-C supports trajectories in arbitrary dimensions, including 3D, by compressing each spatial axis independently. Evaluated on four real-world datasets, PILOT-C achieves superior performance across multiple dimensions. In terms of compression ratio, PILOT-C outperforms CISED-W, the current state-of-the-art SED-based line simplification algorithm, by an average of 19.2%. For trajectory fidelity, PILOT-C achieves an average of 32.6% reduction in error compared to CISED-W. Additionally, PILOT-C seamlessly extends to three-dimensional trajectories while maintaining the same computational complexity, achieving a 49% improvement in compression ratios over SQUISH-E, the most efficient line simplification algorithm on 3D datasets.
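
As a rough illustration of compressing each spatial axis independently in the frequency domain, the sketch below truncates a DCT per axis; PILOT-C's physics-informed modeling and error-bounded optimization are richer than this.

```python
# Per-axis frequency-domain compression of a 3-D trajectory via DCT truncation.
import numpy as np
from scipy.fft import dct, idct

t = np.linspace(0, 10, 500)
traj = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)   # a 3-D trajectory

def compress_axis(x: np.ndarray, keep: int = 30) -> np.ndarray:
    c = dct(x, norm="ortho")
    c[keep:] = 0.0                                          # drop high-frequency terms
    return idct(c, norm="ortho")

recon = np.stack([compress_axis(traj[:, k]) for k in range(3)], axis=1)
print("max per-point error:", np.abs(recon - traj).max())  # check against a bound
```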

[LG-56] Privileged Contrastive Pretraining for Multimodal Affect Modelling

链接: https://arxiv.org/abs/2508.03729
作者: Kosmas Pinitas,Konstantinos Makantasis,Georgios N. Yannakakis
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Affective Computing (AC) has made significant progress with the advent of deep learning, yet a persistent challenge remains: the reliable transfer of affective models from controlled laboratory settings (in-vitro) to uncontrolled real-world environments (in-vivo). To address this challenge we introduce the Privileged Contrastive Pretraining (PriCon) framework, according to which models are first pretrained via supervised contrastive learning (SCL) and then act as teacher models within a Learning Using Privileged Information (LUPI) framework. PriCon both leverages privileged information during training and enhances the robustness of derived affect models via SCL. Experiments conducted on two benchmark affective corpora, RECOLA and AGAIN, demonstrate that models trained using PriCon consistently outperform LUPI and end-to-end models. Remarkably, in many cases, PriCon models achieve performance comparable to models trained with access to all modalities during both training and testing. The findings underscore the potential of PriCon as a paradigm towards further bridging the gap between in-vitro and in-vivo affective modelling, offering a scalable and practical solution for real-world applications.

[LG-57] Evaluating Generative AI Tools for Personalized Offline Recommendations: A Comparative Study

链接: https://arxiv.org/abs/2508.03710
作者: Rafael Salinas-Buestan,Otto Parra,Nelly Condori-Fernandez,Maria Fernanda Granda
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: ESEM Registered Report Track

点击查看摘要

Abstract:Background: Generative AI tools have become increasingly relevant in supporting personalized recommendations across various domains. However, their effectiveness in health-related behavioral interventions, especially those aiming to reduce the use of technology, remains underexplored. Aims: This study evaluates the performance and user satisfaction of the five most widely used generative AI tools when recommending non-digital activities tailored to individuals at risk of repetitive strain injury. Method: Following the Goal/Question/Metric (GQM) paradigm, the proposed experiment involves generative AI tools that suggest offline activities based on predefined user profiles and intervention scenarios. The evaluation focuses on quantitative performance (precision, recall, F1-score, and MCC) and qualitative aspects (user satisfaction and perceived recommendation relevance). Two research questions were defined: RQ1 assessed which tool delivers the most accurate recommendations, and RQ2 evaluated how tool choice influences user satisfaction.

[LG-58] Suggest, Complement, Inspire: Story of Two Tower Recommendations at Allegro.com RECSYS2025

链接: https://arxiv.org/abs/2508.03702
作者: Aleksandra Osowska-Kurczab,Klaudia Nazarko,Mateusz Marzec,Lidia Wojciechowska,Eliška Kremeňová
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Recsys 2025 Industrial Track

点击查看摘要

Abstract:Building large-scale e-commerce recommendation systems requires addressing three key technical challenges: (1) designing a universal recommendation architecture across dozens of placements, (2) decreasing excessive maintenance costs, and (3) managing a highly dynamic product catalogue. This paper presents a unified content-based recommendation system deployed at Allegro.com, the largest e-commerce platform of European origin. The system is built on a prevalent Two Tower retrieval framework, representing products using textual and structured attributes, which enables efficient retrieval via Approximate Nearest Neighbour search. We demonstrate how the same model architecture can be adapted to serve three distinct recommendation tasks: similarity search, complementary product suggestions, and inspirational content discovery, by modifying only a handful of components in either the model or the serving logic. Extensive A/B testing over two years confirms significant gains in engagement and profit-based metrics across desktop and mobile app channels. Our results show that a flexible, scalable architecture can serve diverse user intents with minimal maintenance overhead.

[LG-59] Accept-Reject Lasso

链接: https://arxiv.org/abs/2508.04646
作者: Yanxin Liu,Yunqi Zhang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Lasso method is known to exhibit instability in the presence of highly correlated features, often leading to an arbitrary selection of predictors. This issue manifests itself in two primary error types: the erroneous omission of features that lack a true substitutable relationship (falsely redundant features) and the inclusion of features with a true substitutable relationship (truly redundant features). Although most existing methods address only one of these challenges, we introduce the Accept-Reject Lasso (ARL), a novel approach that resolves this dilemma. ARL operationalizes an Accept-Reject framework through a fine-grained analysis of feature selection across data subsets, partitioning the output of an ensemble method into beneficial and detrimental components. The fundamental challenge for Lasso is that inter-variable correlation obscures the true sources of information. ARL tackles this by first using clustering to identify distinct subset structures within the data. It then analyzes Lasso’s behavior across these subsets to differentiate between true and spurious correlations. For truly correlated features, which induce multicollinearity, ARL tends to select a single representative feature and reject the rest to ensure model stability. Conversely, for features linked by spurious correlations, which may vanish in certain subsets, ARL accepts those that Lasso might have incorrectly omitted. The distinct patterns arising from true versus spurious correlations create a divisible separation. By setting an appropriate threshold, our framework can effectively distinguish between these two phenomena, thereby maximizing the inclusion of informative variables while minimizing the introduction of detrimental ones. We illustrate the efficacy of the proposed method through extensive simulation and real-data experiments.

[LG-60] Quantum circuit complexity and unsupervised machine learning of topological order

链接: https://arxiv.org/abs/2508.04486
作者: Yanming Che,Clemens Gneiting,Xiaoguang Wang,Franco Nori
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 17 pages, with appendix; 4 figures. Code is available upon reasonable request, and will be open-sourced along with the publication. Comments are welcome

点击查看摘要

Abstract:Inspired by the close relationship between Kolmogorov complexity and unsupervised machine learning, we explore quantum circuit complexity, an important concept in quantum computation and quantum information science, as a pivot to understand and to build interpretable and efficient unsupervised machine learning for topological order in quantum many-body systems. To span a bridge from conceptual power to practical applicability, we present two theorems that connect Nielsen’s quantum circuit complexity for the quantum path planning between two arbitrary quantum many-body states with fidelity change and entanglement generation, respectively. Leveraging these connections, fidelity-based and entanglement-based similarity measures or kernels, which are more practical for implementation, are formulated. Using the two proposed kernels, numerical experiments targeting the unsupervised clustering of quantum phases of the bond-alternating XXZ spin chain, the ground state of Kitaev’s toric code and random product states, are conducted, demonstrating their superior performance. Relations with classical shadow tomography and shadow kernel learning are also discussed, where the latter can be naturally derived and understood from our approach. Our results establish connections between key concepts and tools of quantum circuit computation, quantum complexity, and machine learning of topological quantum order.

[LG-61] Benchmarking Uncertainty and its Disentanglement in multi-label Chest X-Ray Classification

链接: https://arxiv.org/abs/2508.04457
作者: Simon Baur,Wojciech Samek,Jackie Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable uncertainty quantification is crucial for trustworthy decision-making and the deployment of AI models in medical imaging. While prior work has explored the ability of neural networks to quantify predictive, epistemic, and aleatoric uncertainties using an information-theoretical approach in synthetic or well-defined data settings like natural image classification, its applicability to real-life medical diagnosis tasks remains underexplored. In this study, we provide an extensive uncertainty quantification benchmark for multi-label chest X-ray classification using the MIMIC-CXR-JPG dataset. We evaluate 13 uncertainty quantification methods for convolutional (ResNet) and transformer-based (Vision Transformer) architectures across a wide range of tasks. Additionally, we extend Evidential Deep Learning, HetClass NNs, and Deep Deterministic Uncertainty to the multi-label setting. Our analysis provides insights into uncertainty estimation effectiveness and the ability to disentangle epistemic and aleatoric uncertainties, revealing method- and architecture-specific strengths and limitations.

[LG-62] The Relative Instability of Model Comparison with Cross-validation

链接: https://arxiv.org/abs/2508.04409
作者: Alexandre Bayle,Lucas Janson,Lester Mackey
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 41 pages, 4 figures

点击查看摘要

Abstract:Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where such confidence intervals will be valid. However, in the common setting where CV is used to compare two algorithms, it becomes necessary to consider a notion of relative stability which cannot easily be derived from existing stability results, even for simple algorithms. To better understand relative stability and when CV provides valid confidence intervals for the test error difference of two algorithms, we study the soft-thresholded least squares algorithm, a close cousin of the Lasso. We prove that while stability holds when assessing the individual test error of this algorithm, relative stability fails to hold when comparing the test error of two such algorithms, even in a sparse low-dimensional linear model setting. Additionally, we empirically confirm the invalidity of CV confidence intervals for the test error difference when either soft-thresholding or the Lasso is used. In short, caution is needed when quantifying the uncertainty of CV estimates of the performance difference of two machine learning algorithms, even when both algorithms are individually stable.
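
The soft-thresholding operator central to this algorithm has a simple closed form: S_lambda(z) = sign(z) * max(|z| - lambda, 0). A minimal sketch, assuming it is applied coordinate-wise to ordinary least-squares coefficients (the paper's exact variant may differ in scaling or design assumptions):

```python
# Soft-thresholded least squares on a sparse linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
beta = np.zeros(d); beta[:3] = [2.0, -1.5, 1.0]           # sparse truth
y = X @ beta + 0.5 * rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
lam = 0.3
beta_st = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
print(beta_st.round(2))                                   # small coefficients zeroed out
```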

[LG-63] Deep Neural Network-Driven Adaptive Filtering

链接: https://arxiv.org/abs/2508.04258
作者: Qizhen Wang,Gang Wang,Ying-Chang Liang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF). In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition. The DNN, functioning as a universal nonlinear operator, is structurally embedded into the core architecture of the AF system, establishing a direct mapping between filtering residuals and learning gradients. Maximum likelihood is adopted as the implicit cost function, rendering the derived algorithm inherently data-driven and thus endowed with exemplary generalization capability, which is validated by extensive numerical experiments across a spectrum of non-Gaussian scenarios. Corresponding mean-value and mean-square stability analyses are also conducted in detail.

[LG-64] Negative binomial regression and inference using a pre-trained transformer

链接: https://arxiv.org/abs/2508.04111
作者: Valentine Svensson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Negative binomial regression is essential for analyzing over-dispersed count data in comparative studies, but parameter estimation becomes computationally challenging in large screens requiring millions of comparisons. We investigate using a pre-trained transformer to produce estimates of negative binomial regression parameters from observed count data, trained through synthetic data generation to learn to invert the process of generating counts from parameters. The transformer method achieved better parameter accuracy than maximum likelihood optimization while being 20 times faster. However, comparisons unexpectedly revealed that method-of-moments estimates performed as well as maximum likelihood optimization in accuracy, while being 1,000 times faster and producing better-calibrated and more powerful tests, making it the most efficient solution for this application.
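
The method-of-moments estimator in question has a simple closed form: under the NB2 parameterization Var = mu + mu^2/theta, a sample mean m and variance v give theta_hat = m^2 / (v - m). A sketch on simulated counts (the true parameter values are invented for illustration):

```python
# Method-of-moments estimation for negative binomial counts.
import numpy as np

rng = np.random.default_rng(0)
mu, theta = 5.0, 2.0
p = theta / (theta + mu)                          # numpy's (n, p) convention
counts = rng.negative_binomial(theta, p, size=10_000)

m, v = counts.mean(), counts.var(ddof=1)
theta_hat = m**2 / (v - m) if v > m else np.inf   # v <= m: no over-dispersion
print(m, theta_hat)                               # close to (5.0, 2.0)
```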

[LG-65] Hybrid Quantum–Classical Machine Learning Potential with Variational Quantum Circuits

链接: https://arxiv.org/abs/2508.04098
作者: Soohaeng Yoo Willow,D. ChangMo Yang,Chang Woo Myung
类目: Quantum Physics (quant-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 26+6 pages, 6+4 figures

点击查看摘要

Abstract:Quantum algorithms for simulating large and complex molecular systems are still in their infancy, and surpassing state-of-the-art classical techniques remains an ever-receding goal post. A promising avenue of inquiry in the meanwhile is to seek practical advantages through hybrid quantum-classical algorithms, which combine conventional neural networks with variational quantum circuits (VQCs) running on today’s noisy intermediate-scale quantum (NISQ) hardware. Such hybrids are well suited to NISQ devices: the classical processor performs the bulk of the computation, while the quantum processor executes targeted sub-tasks that supply additional non-linearity and expressivity. Here, we benchmark a purely classical E(3)-equivariant message-passing machine learning potential (MLP) against a hybrid quantum-classical MLP (HQC-MLP) for predicting density functional theory (DFT) properties of liquid silicon. In our hybrid architecture, every readout in the message-passing layers is replaced by a VQC. Molecular dynamics simulations driven by the HQC-MLP reveal that an accurate reproduction of high-temperature structural and thermodynamic properties is achieved with VQCs. These findings demonstrate a concrete scenario in which a NISQ-compatible HQC algorithm could deliver a measurable benefit over the best available classical alternative, suggesting a viable pathway toward near-term quantum advantage in materials modeling.

[LG-66] Comparing Normalization Methods for Portfolio Optimization with Reinforcement Learning

链接: https://arxiv.org/abs/2508.03910
作者: Caio de Souza Barbosa Costa,Anna Helena Reali Costa
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, reinforcement learning has achieved remarkable results in various domains, including robotics, games, natural language processing, and finance. In the financial domain, this approach has been applied to tasks such as portfolio optimization, where an agent continuously adjusts the allocation of assets within a financial portfolio to maximize profit. Numerous studies have introduced new simulation environments, neural network architectures, and training algorithms for this purpose. Among these, a domain-specific policy gradient algorithm has gained significant attention in the research community for being lightweight and fast, and for outperforming other approaches. However, recent studies have shown that this algorithm can yield inconsistent results and underperform, especially when the portfolio does not consist of cryptocurrencies. One possible explanation for this issue is that the commonly used state normalization method may cause the agent to lose critical information about the true value of the assets being traded. This paper explores this hypothesis by evaluating two of the most widely used normalization methods across three different markets (IBOVESPA, NYSE, and cryptocurrencies) and comparing them with the standard practice of normalizing data before training. The results indicate that, in this specific domain, state normalization can indeed degrade the agent’s performance.

[LG-67] Reinforcement Learning in MDPs with Information-Ordered Policies

链接: https://arxiv.org/abs/2508.03904
作者: Zhongjun Zhang,Shipra Agrawal,Ilan Lobel,Sean R. Sinclair,Christina Lee Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 57 pages, 2 figures

点击查看摘要

Abstract:We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, $\pi' \leq \pi$ if data collected under $\pi$ can be used to estimate the performance of $\pi'$, enabling counterfactual inference without additional environment interaction. Leveraging this partial order, we show that our algorithm achieves a regret bound of $O(\sqrt{w \log(|\Theta|) T})$, where $w$ is the width of the partial order. Notably, the bound is independent of the state and action space sizes. We illustrate the applicability of these partial orders in many domains in operations research, including inventory control and queuing systems. For each, we apply our framework to that problem, yielding new theoretical guarantees and strong empirical results without imposing extra assumptions such as convexity in the inventory model or specialized arrival-rate structure in the queuing model.

[LG-68] Reliable Programmatic Weak Supervision with Confidence Intervals for Label Probabilities

链接: https://arxiv.org/abs/2508.03896
作者: Verónica Álvarez,Santiago Mazuelas,Steven An,Sanjoy Dasgupta
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly provide guesses with assorted types and unknown interdependencies that can result in unreliable predictions. Furthermore, existing techniques for programmatic weak supervision cannot provide assessments for the reliability of the probabilistic predictions for labels. This paper presents a methodology for programmatic weak supervision that can provide confidence intervals for label probabilities and obtain more reliable predictions. In particular, the methods proposed use uncertainty sets of distributions that encapsulate the information provided by LFs with unrestricted behavior and typology. Experiments on multiple benchmark datasets show the improvement of the presented methods over the state-of-the-art and the practicality of the confidence intervals presented.

[LG-69] Constraining the outputs of ReLU neural networks

链接: https://arxiv.org/abs/2508.03867
作者: Yulia Alexandr,Guido Montúfar
类目: Algebraic Geometry (math.AG); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 32 pages, 4 figures

点击查看摘要

Abstract:We introduce a class of algebraic varieties naturally associated with ReLU neural networks, arising from the piecewise linear structure of their outputs across activation regions in input space, and the piecewise multilinear structure in parameter space. By analyzing the rank constraints on the network outputs within each activation region, we derive polynomial equations that characterize the functions representable by the network. We further investigate conditions under which these varieties attain their expected dimension, providing insight into the expressive and structural properties of ReLU networks.

[LG-70] Viability of perturbative expansion for quantum field theories on neurons

链接: https://arxiv.org/abs/2508.03810
作者: Srimoyee Sen,Varun Vaidya
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*备注: 24 pages, 4 figures

点击查看摘要

Abstract:Neural Network (NN) architectures that break statistical independence of parameters have been proposed as a new approach for simulating local quantum field theories (QFTs). In the infinite neuron number limit, single-layer NNs can exactly reproduce QFT results. This paper examines the viability of this architecture for perturbative calculations of local QFTs for finite neuron number $N$ using scalar $\phi^4$ theory in $d$ Euclidean dimensions as an example. We find that the renormalized $O(1/N)$ corrections to two- and four-point correlators yield perturbative series which are sensitive to the ultraviolet cut-off and therefore have weak convergence. We propose a modification to the architecture to improve this convergence and discuss constraints on the parameters of the theory and the scaling of $N$ which allow us to extract accurate field theory results.

[LG-71] A semi-automatic approach to study population dynamics based on population pyramids

链接: https://arxiv.org/abs/2508.03788
作者: Max Hahn-Klimroth,João Pedro Meireles,Laurie Bingaman Lackey,Nick van Eeuwijk,Mads F. Bertelsen,Paul W. Dierkes,Marcus Clauss
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The depiction of populations - of humans or animals - as “population pyramids” is a useful tool for the assessment of various characteristics of populations at a glance. Although these visualisations are well-known objects in various communities, formalised and algorithmic approaches to gain information from these data are less present. Here, we present an algorithm-based classification of population data into “pyramids” of different shapes ([normal and inverted] pyramid / plunger / bell, [lower / middle / upper] diamond, column, hourglass) that are linked to specific characteristics of the population. To develop the algorithmic approach, we used data describing global zoo populations of mammals from 1970-2024. This algorithm-based approach delivers plausible classifications, in particular with respect to changes in population size linked to specific series of, and transitions between, different “pyramid” shapes. We believe this approach might become a useful tool for analysing and communicating historical population developments in multiple contexts and is of broad interest. Moreover, it might be useful for animal population management strategies.

[LG-72] Predicting fall risk in older adults: A machine learning comparison of accelerometric and non-accelerometric factors

链接: https://arxiv.org/abs/2508.03756
作者: Ana González-Castro,José Alberto Benítez-Andrades,Rubén González-González,Camino Prada-García,Raquel Leirós-Rodríguez
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates fall risk prediction in older adults using various machine learning models trained on accelerometric, non-accelerometric, and combined data from 146 participants. Models combining both data types achieved superior performance, with Bayesian Ridge Regression showing the highest accuracy (MSE = 0.6746, R² = 0.9941). Non-accelerometric variables, such as age and comorbidities, proved critical for prediction. Results support the use of integrated data and Bayesian approaches to enhance fall risk assessment and inform prevention strategies.

[LG-73] Understanding Human Daily Experience Through Continuous Sensing: ETRI Lifelog Dataset 2024

链接: https://arxiv.org/abs/2508.03698
作者: Se Won Oh,Hyuntae Jeong,Seungeun Chung,Jeong Mook Lim,Kyoung Ju Noh,Sunkyung Lee,Gyuwon Jung
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: This work is intended for submission to an IEEE conference. The content is also relevant to the cs.HC category

点击查看摘要

Abstract:Improving human health and well-being requires an accurate and effective understanding of an individual’s physical and mental state throughout daily life. To support this goal, we utilized smartphones, smartwatches, and sleep sensors to collect data passively and continuously for 24 hours a day, with minimal interference to participants’ usual behavior, enabling us to gather quantitative data on daily behaviors and sleep activities across multiple days. Additionally, we gathered subjective self-reports of participants’ fatigue, stress, and sleep quality through surveys conducted immediately before and after sleep. This comprehensive lifelog dataset is expected to provide a foundational resource for exploring meaningful insights into human daily life and lifestyle patterns, and a portion of the data has been anonymized and made publicly available for further research. In this paper, we introduce the ETRI Lifelog Dataset 2024, detailing its structure and presenting potential applications, such as using machine learning models to predict sleep quality and stress.

信息检索

[IR-0] TRAIL: Joint Inference and Refinement of Knowledge Graphs with Large Language Models

链接: https://arxiv.org/abs/2508.04474
作者: Xinkui Zhao,Haode Li,Yifan Zhang,Guanjie Cheng,Yueshen Xu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have unlocked powerful reasoning and decision-making capabilities. However, their inherent dependence on static parametric memory fundamentally limits their adaptability, factual accuracy, and interpretability in knowledge-intensive scenarios. Knowledge graphs (KGs), as structured repositories of explicit relational knowledge, offer a promising approach for augmenting LLMs with external, interpretable memory. Nevertheless, most existing methods that combine LLMs with KGs treat reasoning and knowledge updating as separate processes, resulting in suboptimal utilization of new information and hindering real-time updates. In this work, we propose TRAIL: a novel, unified framework for Thinking, Reasoning, And Incremental Learning that couples joint inference and dynamic KG refinement with large language models. TRAIL enables LLM agents to iteratively explore, update, and refine knowledge graphs during the reasoning process, employing a confidence-driven mechanism for the generation, validation, and pruning of new facts. This plug-and-play architecture facilitates seamless integration with various LLMs, supporting continual adaptation without the need for retraining. Extensive experiments on multiple benchmarks demonstrate that TRAIL outperforms existing KG-augmented and retrieval-augmented LLM baselines by 3% to 13%. More importantly, these results represent a significant step toward developing adaptive, memory-augmented language models capable of continual learning and reliable, transparent reasoning.

[IR-1] I3-MRec: Invariant Learning with Information Bottleneck for Incomplete Modality Recommendation

链接: https://arxiv.org/abs/2508.04247
作者: Huilin Chen,Miaomiao Cai,Fan Liu,Zhiyong Cheng,Richang Hong,Meng Wang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: ACM Multimedia 2025 Accepted

点击查看摘要

Abstract:Multimodal recommender systems (MRS) improve recommendation performance by integrating diverse semantic information from multiple modalities. However, the assumption of the availability of all modalities rarely holds in practice due to missing images, incomplete descriptions, or inconsistent user content. These challenges significantly degrade the robustness and generalization capabilities of current models. To address these challenges, we introduce a novel method called I^3-MRec, which uses Invariant learning with the Information bottleneck principle for Incomplete Modality Recommendation. To achieve robust performance in missing-modality scenarios, I^3-MRec enforces two pivotal properties: (i) cross-modal preference invariance, which ensures consistent user preference modeling across varying modality environments, and (ii) compact yet effective modality representation, which filters out task-irrelevant modality information while maximally preserving essential features relevant to recommendation. By treating each modality as a distinct semantic environment, I^3-MRec employs invariant risk minimization (IRM) to learn modality-specific item representations. In parallel, a missing-aware fusion module grounded in the Information Bottleneck (IB) principle extracts compact and effective item embeddings by suppressing modality noise and preserving core user preference signals. Extensive experiments conducted on three real-world datasets demonstrate that I^3-MRec consistently outperforms existing state-of-the-art MRS methods across various modality-missing scenarios, highlighting its effectiveness and robustness in practical applications. The code and processed datasets are released at this https URL.

[IR-2] Discrete-event Tensor Factorization: Learning a Smooth Embedding for Continuous Domains

链接: https://arxiv.org/abs/2508.04221
作者: Joey De Pauw,Bart Goethals
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems learn from past user behavior to predict future user preferences. Intuitively, it has been established that the most recent interactions are more indicative of future preferences than older interactions. Many recommendation algorithms use this notion to either drop older interactions or to assign them a lower weight, so the model can focus on the more informative, recent information. However, very few approaches model the flow of time explicitly. This paper analyzes how time can be encoded in factorization-style recommendation models. By including absolute time as a feature, our models can learn varying user preferences and changing item perception over time. In addition to simple binning approaches, we also propose a novel, fully continuous time encoding mechanism. Through the use of a polynomial fit inside the loss function, our models completely avoid the need for discretization, and they are able to capture the time dimension in arbitrary resolution. We perform a comparative study on three real-world datasets that span multiple years, where long user histories are present, and items stay relevant for a longer time. Empirical results show that, by explicitly modeling time, our models are very effective at capturing temporal signals, such as varying item popularities over time. Despite this however, our experiments also indicate that a simple post-hoc popularity adjustment is often sufficient to achieve the best performance on the unseen test set. This teaches us that, for the recommendation task, predicting the future is more important than capturing past trends. As such, we argue that specialized mechanisms are needed for extrapolation to future data.
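
A sketch of what a fully continuous time encoding can look like, assuming a simple polynomial feature map: the user-item score becomes a polynomial in normalized absolute time whose coefficients come from factor interactions. The degree and shapes are invented, and the paper fits its polynomial inside the loss rather than as fixed features.

```python
# Polynomial time encoding for a factorization-style score, no binning needed.
import numpy as np

d, deg = 16, 3
u = np.random.randn(d)                   # user factors
V = np.random.randn(d, deg + 1)          # item factors: one d-vector per power of t

def score(t: float) -> float:
    powers = t ** np.arange(deg + 1)     # [1, t, t^2, t^3], t normalized to [0, 1]
    return float(u @ (V @ powers))       # smooth in t, arbitrary time resolution

print(score(0.25), score(0.75))
```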

[IR-3] ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation

Link: https://arxiv.org/abs/2508.04206
Authors: Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia
Subjects: Information Retrieval (cs.IR)
*Comments: 17 pages, 3 figures, 5 tables

Click to view the abstract

Abstract: Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use embedding sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are fully declarative via a single YAML file. Evaluation spans accuracy (Recall, nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty, diversity, and fairness. Results show that LLM-based augmentation and strong text embeddings boost cold-start performance and coverage, especially when fused with audio-visual features. Systematic benchmarking reveals which combinations work universally and which are backbone- or metric-specific. Open-source code, embeddings, and configs enable reproducible, fair multimodal RS research and advance principled generative AI integration in large-scale recommendation. Code: this https URL
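
As one example of the fusion strategies listed above, here is a minimal sketch of early fusion by concatenation followed by PCA over the three item-embedding matrices; the function name and the standardization step are assumptions, not the benchmark's actual API.

```python
import numpy as np
from sklearn.decomposition import PCA

def early_fuse(audio: np.ndarray, visual: np.ndarray, text: np.ndarray,
               out_dim: int = 128) -> np.ndarray:
    # Each input is an (n_items, d_modality) item-embedding matrix.
    fused = np.concatenate([audio, visual, text], axis=1)
    # Standardize per dimension so no single modality dominates the projection.
    fused = (fused - fused.mean(axis=0)) / (fused.std(axis=0) + 1e-8)
    return PCA(n_components=out_dim).fit_transform(fused)
```

Note that `out_dim` must not exceed the number of items or the concatenated dimensionality; mid- and late-fusion variants would instead project each modality separately or merge per-modality rankings.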

[IR-4] SSEmb: A Joint Structural and Semantic Embedding Framework for Mathematical Formula Retrieval

Link: https://arxiv.org/abs/2508.04162
Authors: Ruyin Li, Xiaoyu Chen
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view the abstract

Abstract: Formula retrieval is an important topic in Mathematical Information Retrieval. We propose SSEmb, a novel embedding framework capable of capturing both structural and semantic features of mathematical formulas. Structurally, we employ Graph Contrastive Learning to encode formulas represented as Operator Graphs. To enhance structural diversity while preserving the mathematical validity of these formula graphs, we introduce a novel graph data augmentation approach based on a substitution strategy. Semantically, we utilize Sentence-BERT to encode the surrounding text of formulas. Finally, for each query and its candidates, structural and semantic similarities are calculated separately and then fused through a weighted scheme. In the ARQMath-3 formula retrieval task, SSEmb outperforms existing embedding-based methods by over 5 percentage points on P'@10 and nDCG'@10. Furthermore, SSEmb enhances the performance of all runs of other methods and achieves state-of-the-art results when combined with Approach0.
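
The weighted fusion step is straightforward; a minimal sketch, assuming cosine similarity in both embedding spaces and a hypothetical weight `alpha` (the paper's actual weighting scheme and value may differ):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fused_similarity(q_struct, q_sem, c_struct, c_sem, alpha: float = 0.5) -> float:
    # Structural similarity from the graph-contrastive formula encoder,
    # semantic similarity from the Sentence-BERT context encoder.
    return alpha * cosine(q_struct, c_struct) + (1 - alpha) * cosine(q_sem, c_sem)
```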

[IR-5] Bridging Search and Recommendation through Latent Cross Reasoning

Link: https://arxiv.org/abs/2508.04152
Authors: Teng Shi, Weicong Qin, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view the abstract

Abstract: Search and recommendation (SR) are fundamental components of modern online platforms, yet effectively leveraging search behaviors to improve recommendation remains a challenging problem. User search histories often contain noisy or irrelevant signals that can even degrade recommendation performance, while existing approaches typically encode SR histories either jointly or separately without explicitly identifying which search behaviors are truly useful. Inspired by the human decision-making process, where one first identifies recommendation intent and then reasons about relevant evidence, we design a latent cross reasoning framework that first encodes user SR histories to capture global interests and then iteratively reasons over search behaviors to extract signals beneficial for recommendation. Contrastive learning is employed to align latent reasoning states with target items, and reinforcement learning is further introduced to directly optimize ranking performance. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines, validating the importance of reasoning in enhancing search-aware recommendation.
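
For the contrastive alignment of latent reasoning states with target items, a standard in-batch InfoNCE loss is one plausible instantiation; a sketch under that assumption (the paper's exact loss and temperature may differ):

```python
import torch
import torch.nn.functional as F

def info_nce(states: torch.Tensor, items: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # states: (B, d) final latent reasoning states; items: (B, d) target
    # item embeddings. Each state's positive is its own target item; the
    # rest of the batch serves as in-batch negatives.
    s = F.normalize(states, dim=-1)
    v = F.normalize(items, dim=-1)
    logits = s @ v.t() / tau                      # (B, B) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```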

[IR-6] Benefit from Rich: Tackling Search Interaction Sparsity in Search Enhanced Recommendation CIKM2025

Link: https://arxiv.org/abs/2508.04145
Authors: Teng Shi, Weijie Yu, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted by CIKM 2025

Click to view the abstract

Abstract: In modern online platforms, search and recommendation (SR) often coexist, offering opportunities for performance improvement through search-enhanced approaches. Existing studies show that incorporating search signals boosts recommendation performance. However, the effectiveness of these methods relies heavily on rich search interactions. They primarily benefit a small subset of users with abundant search behavior, while offering limited improvements for the majority of users who exhibit only sparse search activity. Addressing sparse search data in search-enhanced recommendation poses two key challenges: (1) how to learn useful search features for users with sparse search interactions, and (2) how to design effective training objectives under sparse conditions. Our idea is to leverage the features of users with rich search interactions to enhance those of users with sparse search interactions. Based on this idea, we propose GSERec, a method that utilizes message passing on User-Code Graphs to alleviate data sparsity in Search-Enhanced Recommendation. Specifically, we utilize Large Language Models (LLMs) with vector quantization to generate discrete codes, which connect similar users and thereby construct the graph. Through message passing on this graph, embeddings of users with rich search data are propagated to enhance the embeddings of users with sparse interactions. To further ensure that the message passing captures meaningful information from truly similar users, we introduce a contrastive loss to better model user similarities. The enhanced user representations are then integrated into downstream search-enhanced recommendation models. Experiments on three real-world datasets show that GSERec consistently outperforms baselines, especially for users with sparse search behaviors.
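
A minimal sketch of the message-passing idea on a bipartite user-code graph: embeddings of search-rich users flow through shared LLM-derived codes to search-sparse users. The mean-pooling aggregation, the residual update, and all names are assumptions for illustration, not GSERec's actual architecture.

```python
import torch

def propagate(user_emb: torch.Tensor, code_emb: torch.Tensor, edges):
    # edges: a pair (u_idx, c_idx) of LongTensors of shape (E,) linking
    # users to the discrete codes they were assigned.
    u_idx, c_idx = edges
    n_users, n_codes = user_emb.size(0), code_emb.size(0)
    d = user_emb.size(1)

    # Users -> codes: average the embeddings of users attached to each code.
    code_msg = torch.zeros(n_codes, d).index_add_(0, c_idx, user_emb[u_idx])
    code_deg = torch.zeros(n_codes).index_add_(0, c_idx, torch.ones_like(c_idx, dtype=torch.float))
    code_emb = code_msg / code_deg.clamp(min=1).unsqueeze(1)

    # Codes -> users: average the updated code embeddings back to each user,
    # added as a residual so sparse users keep their original signal too.
    user_msg = torch.zeros(n_users, d).index_add_(0, u_idx, code_emb[c_idx])
    user_deg = torch.zeros(n_users).index_add_(0, u_idx, torch.ones_like(u_idx, dtype=torch.float))
    return user_emb + user_msg / user_deg.clamp(min=1).unsqueeze(1)
```

Stacking a few such rounds, together with a contrastive loss over user similarities, would yield the enhanced user representations fed to the downstream search-enhanced recommender.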

[IR-7] Recommending With Not For: Co-Designing Recommender Systems for Social Good

Link: https://arxiv.org/abs/2508.03792
Authors: Michael D. Ekstrand, Afsaneh Razi, Aleksandra Sarcevic, Maria Soledad Pera, Robin Burke, Katherine Landau Wright
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*Comments: Accepted to ACM TORS Special Issue on Recommender Systems for Social Good

Click to view the abstract

Abstract: Recommender systems are usually designed by engineers, researchers, designers, and other members of development teams. These systems are then evaluated based on goals set by the aforementioned teams and other business units of the platforms operating the recommender systems. This design approach emphasizes the designers' vision for how the system can best serve the interests of users, providers, businesses, and other stakeholders. Although designers may be well-informed about user needs through user experience and market research, they are still the arbiters of the system's design and evaluation, with other stakeholders' interests less emphasized in user-centered design and evaluation. When extended to recommender systems for social good, this approach results in systems that reflect the social objectives as envisioned by the designers and evaluated as the designers understand them. Instead, social goals and operationalizations should be developed through participatory and democratic processes that are accountable to their stakeholders. We argue that recommender systems aimed at improving social good should be designed by and with, not just for, the people who will experience their benefits and harms. That is, they should be designed in collaboration with their users, creators, and other stakeholders as full co-designers, not only as user study participants.

Attachment Download

Click to download today's full paper list