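This post lists the latest papers retrieved from Arxiv.org on 2025-08-11. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.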

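Overview (2025-08-11)

460 papers were updated today, including:

  • Natural Language Processing: 63 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 154 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 113 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 113 papers (Machine Learning (cs.LG))

Natural Language Processing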

[NLP-0] Effective Training Data Synthesis for Improving MLLM Chart Understanding ICCV2025

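[Quick Read]: This paper addresses the weak performance of current multimodal large language models (MLLMs) on scientific chart understanding, where open-source models in particular typically reach only a 30%-50% success rate on challenging benchmarks. The key to the solution is a five-step data synthesis pipeline: (1) separate data creation from function creation when generating a single plot; (2) condition the generation of later subplots on earlier ones in multi-subplot figures; (3) visually diversify the generated figures; (4) filter out low-quality samples; and (5) generate question-answer (QA) pairs with GPT-4o. The resulting Effective Chart Dataset (ECD) contains more than 10k chart images and 300k+ QA pairs, covers 25 topics and 250+ chart type combinations, and consistently improves a range of MLLMs on real-world and synthetic test sets.

Link: https://arxiv.org/abs/2508.06492
Authors: Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng
Affiliations: Australian National University; Ohio State University; Cisco; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: Accepted by ICCV 2025 (poster). 26 pages, 17 figures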

Abstract:Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: this https URL.

[NLP-1] Post-training for Efficient Communication via Convention Formation

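[Quick Read]: This paper addresses the inability of large language models (LLMs) to adapt their language and form ad-hoc conventions over multi-turn interactions. Humans communicate with increasing efficiency across turns by adapting their language and building shared conventions, whereas prior work shows that LLMs do not naturally behave this way. The key to the solution is a post-training process that applies targeted fine-tuning on heuristically identified demonstrations of convention formation. On two new benchmarks, a cognitively motivated interaction benchmark and a document-grounded reference completion task that reflects in-the-wild convention formation, the post-trained LLMs show significantly improved convention formation.

Link: https://arxiv.org/abs/2508.06482
Authors: Yilun Hua, Evan Wang, Yoav Artzi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted to COLM 2025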

Abstract:Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.

[NLP-2] HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

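[Quick Read]: This paper addresses haptic captioning: generating natural language descriptions from haptic signals such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. Multimodal research has focused mainly on vision and audio, while haptic signals remain underexplored. The key to the solution is HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. Its core techniques are: (1) two haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert continuous haptic signals into sequences of discrete tokens; and (2) two-stage training, with supervised fine-tuning of LLaMA using LoRA (Low-Rank Adaptation) followed by reinforcement learning from human feedback (RLHF), which yields a 10% improvement in the overall human rating distribution. The model reaches a METEOR score of 59.98 and a BLEU-4 score of 32.06, and over 61% of generated captions are rated above 3.5 on a 7-point scale, which points to the potential of LLMs to process sensory data beyond vision and audio.

Link: https://arxiv.org/abs/2508.06475
Authors: Guimin Hu, Daniel Hershcovich, Hasti Seifi
Affiliations: University of Copenhagen; Arizona State University
Categories: Computation and Language (cs.CL)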

Abstract:Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06 respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
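
The abstract above does not spell out how the frequency-based tokenizer works, so the following is only a minimal sketch of the general idea of turning a continuous vibration signal into discrete units. It assumes non-overlapping frames and uses the dominant DFT bin of each frame as the token id; `dominant_frequency_bin`, `frequency_tokenize`, and `frame_size` are hypothetical names, not the paper's API.

```python
import cmath
import math


def dominant_frequency_bin(frame):
    """Return the index of the strongest non-DC bin of a naive DFT of `frame`."""
    n = len(frame)
    best_bin, best_magnitude = 1, -1.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    return best_bin


def frequency_tokenize(signal, frame_size=32):
    """Cut the signal into non-overlapping frames and emit one token id per frame.

    Assumption: one token per frame; the paper's actual vocabulary is not given here.
    """
    return [
        dominant_frequency_bin(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, frame_size)
    ]


# Toy vibration: two frames at 4 cycles per frame, then two frames at 8 cycles per frame.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
signal += [math.sin(2 * math.pi * 8 * t / 32) for t in range(64)]
tokens = frequency_tokenize(signal)
print(tokens)
```

The resulting integer sequence is the kind of discrete input that could be mapped to embedding rows alongside text tokens; an EnCodec-based tokenizer would replace the DFT step with a learned codec.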

[NLP-3] GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

[Quick Read]: This paper targets the limited performance and parameter efficiency of current large language models on reasoning and agentic tasks. The key to the solution is GLM-4.5, a Mixture-of-Experts (MoE) model with 355B total parameters of which only 32B are activated. Multi-stage training on 23T tokens, followed by post-training with expert model iteration and reinforcement learning, gives it strong reasoning ability at a comparatively low active-parameter cost. The model supports both a thinking mode and a direct response mode; it scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified, and ranks 3rd overall and 2nd on agentic benchmarks among the evaluated models.

Link: https://arxiv.org/abs/2508.06471
Authors: GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at this https URL.
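
To make the "355B total, 32B activated" distinction concrete, here is a generic top-k Mixture-of-Experts routing sketch. It is not GLM-4.5's router, whose details are not given in the abstract: the scalar "experts", the gate logits, and `k=2` are all illustrative assumptions. The point is only that each input runs through a few selected experts, so the parameters used per token are a fraction of the total.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    routes = top_k_route(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in routes.items())


# Four toy scalar "experts" (hypothetical); only two of them run for this input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)

# Share of parameters active per token, using the figures quoted in the abstract.
active_fraction = 32 / 355
print(sorted(routes), output, round(active_fraction, 3))
```

With the abstract's figures, roughly 9% of the parameters are active for any one token.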

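[NLP-4] ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls

[Quick Read]: This paper examines the safety risk that generative AI can be used in multi-turn dialogue to build automated fraud agents such as ScamAgent, and in particular the failure of current single-turn, prompt-level safeguards against complex, adaptive deception. Its key contribution is to show that existing guardrails, including refusal mechanisms and content filters, are fragile under agent-based multi-turn interaction: they can be bypassed when prompts are decomposed, disguised, or delivered incrementally. It also demonstrates an end-to-end automated pipeline from text scam scripts to synthesized voice calls, and argues for multi-turn safety auditing, agent-level control frameworks, and methods to detect and disrupt conversational deception.

Link: https://arxiv.org/abs/2508.06457
Authors: Sanket Badhe
Affiliations: Rutgers University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Notes: Accepted at CAMLIS 25: Conference on Applied Machine Learning for Information Security. 10 pages, 3 figures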

Abstract:Large Language Models (LLMs) have demonstrated impressive fluency and reasoning capabilities, but their potential for misuse has raised growing concern. In this paper, we present ScamAgent, an autonomous multi-turn agent built on top of LLMs, capable of generating highly realistic scam call scripts that simulate real-world fraud scenarios. Unlike prior work focused on single-shot prompt misuse, ScamAgent maintains dialogue memory, adapts dynamically to simulated user responses, and employs deceptive persuasion strategies across conversational turns. We show that current LLM safety guardrails, including refusal mechanisms and content filters, are ineffective against such agent-based threats. Even models with strong prompt-level safeguards can be bypassed when prompts are decomposed, disguised, or delivered incrementally within an agent framework. We further demonstrate the transformation of scam scripts into lifelike voice calls using modern text-to-speech systems, completing a fully automated scam pipeline. Our findings highlight an urgent need for multi-turn safety auditing, agent-level control frameworks, and new methods to detect and disrupt conversational deception powered by generative AI.

[NLP-5] SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

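[Quick Read]: This paper addresses the efficiency bottleneck caused by the high computational demands of long-context inference in large language models (LLMs). Existing methods optimize attention computation but still process the full set of hidden states at every layer, which limits overall gains. The key to the solution is SlimInfer, a framework that accelerates inference by dynamically pruning less critical prompt tokens during the forward pass. The core insight is an information diffusion phenomenon: as information from critical tokens propagates through the layers, it becomes distributed across the entire sequence, so the model can maintain semantic integrity even when excess tokens, including some critical ones, are pruned from the hidden states. Building on this, SlimInfer introduces fine-grained layer-wise pruning, which naturally enables an asynchronous KV cache manager and reduces memory usage and I/O costs. Without sacrificing LongBench performance, it achieves up to a 2.53x time-to-first-token (TTFT) speedup and up to a 1.88x end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090.

Link: https://arxiv.org/abs/2508.06447
Authors: Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang
Affiliations: Beihang University
Categories: Computation and Language (cs.CL)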

Abstract:Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
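
The pruning step described above can be sketched as follows. This is a simplified illustration, not SlimInfer's mechanism: the per-token `importance` scores are assumed to be given (for instance derived from attention weights, which is an assumption here), and `keep_ratio` is a hypothetical knob.

```python
def prune_tokens(hidden_states, importance, keep_ratio):
    """Keep the highest-importance tokens at one layer, preserving their original order.

    hidden_states: list of per-token vectors at an intermediate layer.
    importance:    one score per token (how the score is computed is an assumption).
    """
    n_keep = max(1, int(len(hidden_states) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore sequence order for later layers
    return kept, [hidden_states[i] for i in kept]


# Eight prompt tokens with 2-dimensional toy hidden states.
hidden = [[0.1 * i, 0.2 * i] for i in range(8)]
importance = [0.9, 0.05, 0.3, 0.01, 0.7, 0.02, 0.4, 0.8]
kept, pruned = prune_tokens(hidden, importance, keep_ratio=0.5)
print(kept)
```

Because later layers see only the kept positions, the same index list also tells a KV cache manager which token blocks are worth prefetching.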

[NLP-6] Echoes of Automation: The Increasing Use of LLMs in Newsmaking

[Quick Read]: This paper examines the threat that the growing use of generative AI (GenAI), and large language models (LLMs) in particular, poses to journalistic integrity and authorship. The study analyzes more than 40,000 news articles from major, local, and college news media with three AI-text detectors (Binoculars, Fast-Detect GPT, and GPTZero) and finds a substantial increase in GenAI use in recent years, especially in local and college news. The key findings are that LLMs are often used in the introduction of a news article while conclusions are usually written manually, and that GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.

Link: https://arxiv.org/abs/2508.06445
Authors: Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: To appear in 18th International Conference on Social Computing, Behavioral-Cultural Modeling, Prediction and Behavior Representation in Modeling and Simulation, and to be published in the Springer LNCS series

Abstract:The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (e.g., Binoculars, Fast-Detect GPT, and GPTZero), we find substantial increase of GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introduction of news, while conclusions usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.

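[NLP-7] Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

[Quick Read]: This paper asks whether knowledge that large language models (LLMs) acquire through fine-tuning in a few languages transfers to languages seen only during pre-training, specifically whether a model can identify a topic (immigration-related tweets) in languages absent from the fine-tuning data, and how pre-training language bias can be mitigated. The key findings are that lightweight fine-tuning in one or two languages is enough for reliable cross-lingual topic detection; that classifying stance (pro- or anti-immigration) benefits from multilingual fine-tuning; and that even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. Structural biases can therefore be corrected with lightweight interventions, without extensive multilingual training.

Link: https://arxiv.org/abs/2508.06435
Authors: Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)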

Abstract:Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.
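
The two very small numbers in this abstract are easier to read once converted. The short calculation below uses only the figures quoted above; the "per trillion tokens" framing is just a convenient unit chosen for illustration, not a statement about the actual pre-training corpus size.

```python
# Cost: 0.00000989% of the GPT-4o dollar cost, expressed as a fraction and a ratio.
cost_percent = 0.00000989
cost_fraction = cost_percent / 100
cheaper_by = 1 / cost_fraction

# Exposure: 9.62e-11 of the pre-training token volume, per trillion pre-training tokens.
token_fraction = 9.62e-11
tokens_per_trillion = token_fraction * 1e12

print(f"roughly {cheaper_by:,.0f} times cheaper")
print(f"about {tokens_per_trillion:.1f} fine-tuning tokens per trillion pre-training tokens")
```

In other words, the quoted cost is on the order of ten million times lower, and the minimal exposure is on the order of a hundred tokens for every trillion pre-training tokens.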

[NLP-8] Memp: Exploring Agent Procedural Memory

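[Quick Read]: This paper addresses the brittle procedural memory of agents based on large language models (LLMs): current procedural memory is manually engineered or entangled in static parameters, so it does not adapt to new experience. The key to the solution is Memp, a learnable, updatable, lifelong procedural memory repository. Memp distills past agent trajectories into both fine-grained step-by-step instructions and higher-level script-like abstractions, studies different strategies for Build, Retrieval, and Update, and adds a dynamic regimen that continuously updates, corrects, and deprecates memory contents so that the repository evolves with new experience. On TravelPlanner and ALFWorld, agents achieve steadily higher success rates and greater efficiency as the repository is refined, and procedural memory built by a stronger model still yields substantial gains when migrated to a weaker model.

Link: https://arxiv.org/abs/2508.06433
Authors: Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Affiliations: Alibaba Group; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Notes: Work in progress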

Abstract:Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
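
The Build, Retrieval, and Update cycle described above can be pictured with a toy data structure. Everything below is a hypothetical sketch rather than Memp's implementation: the `ProceduralMemory` class, the word-overlap retrieval, and the deprecate-on-failure rule are assumptions chosen to keep the example small.

```python
class ProceduralMemory:
    """Toy procedural memory with Build, Retrieval, and Update operations."""

    def __init__(self):
        self.entries = []

    def build(self, task, trajectory):
        """Distil a trajectory of (action, succeeded) pairs into steps and a script."""
        steps = [action for action, succeeded in trajectory if succeeded]
        entry = {"task": task, "steps": steps, "script": " -> ".join(steps), "active": True}
        self.entries.append(entry)
        return entry

    def retrieve(self, query):
        """Return the active entry whose task shares the most words with the query."""
        query_words = set(query.lower().split())
        best, best_overlap = None, 0
        for entry in self.entries:
            overlap = len(query_words & set(entry["task"].lower().split()))
            if entry["active"] and overlap > best_overlap:
                best, best_overlap = entry, overlap
        return best

    def update(self, entry, succeeded, corrected_steps=None):
        """After a failure, correct the entry if a fix is known; otherwise deprecate it."""
        if succeeded:
            return
        if corrected_steps:
            entry["steps"] = list(corrected_steps)
            entry["script"] = " -> ".join(corrected_steps)
        else:
            entry["active"] = False


memory = ProceduralMemory()
memory.build(
    "heat an egg",
    [("open fridge", True), ("take mug", False), ("take egg", True), ("use microwave", True)],
)
hit = memory.retrieve("heat a potato")   # reuses the analogous procedure
memory.update(hit, succeeded=False)      # the reused procedure failed, so deprecate it
miss = memory.retrieve("heat a potato")  # nothing active is left to retrieve
```

A real system would presumably retrieve by embedding similarity and let an LLM write the step list and script; the control flow is the part this sketch is meant to show.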

[NLP-9] Quantifying Conversation Drift in MCP via Latent Polytope

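[Quick Read]: This paper addresses the security and privacy risks that the Model Context Protocol (MCP) introduces when it integrates external tools into large language models (LLMs), in particular tool poisoning and indirect prompt injection caused by adversarially crafted content, which can lead to conversation hijacking, misinformation propagation, or data exfiltration. The key to the solution is SecMCP, a framework that models LLM activation vectors in a latent polytope space and quantifies conversation drift, that is, deviations in latent-space trajectories induced by adversarial external knowledge. This lets it proactively detect hijacking, misleading, and data exfiltration; across Llama3, Vicuna, and Mistral on MS MARCO, HotpotQA, and FinQA, it reports AUROC scores above 0.915.

Link: https://arxiv.org/abs/2508.06418
Authors: Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang
Affiliations: City University of Hong Kong; Tsinghua University
Categories: Computation and Language (cs.CL)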

Abstract:The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP’s efficacy.
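
The abstract does not describe how the latent polytope is constructed, so the sketch below substitutes the simplest possible polytope, an axis-aligned bounding box fitted to benign activation vectors, and scores drift as the Euclidean distance from a new activation to that box. The functions `fit_box` and `drift_score` and the 2-dimensional toy vectors are assumptions for illustration only and should not be read as SecMCP's method.

```python
import math


def fit_box(benign_vectors):
    """Fit an axis-aligned box (a very simple polytope) around benign activations."""
    dims = range(len(benign_vectors[0]))
    lower = [min(v[d] for v in benign_vectors) for d in dims]
    upper = [max(v[d] for v in benign_vectors) for d in dims]
    return lower, upper


def drift_score(vector, box):
    """Euclidean distance from the vector to the box; 0.0 means it lies inside."""
    lower, upper = box
    squared = 0.0
    for x, lo, hi in zip(vector, lower, upper):
        excess = max(lo - x, 0.0, x - hi)
        squared += excess * excess
    return math.sqrt(squared)


# Toy 2-dimensional "activations" from benign conversations (hypothetical values).
benign = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
box = fit_box(benign)
inside = drift_score([0.4, 0.6], box)   # stays within the benign region
outside = drift_score([4.0, 5.0], box)  # leaves it by (3, 4), so the distance is 5
print(inside, outside)
```

A threshold on such a score is one way a drift measure could be turned into a detector, which is the kind of decision an AUROC figure evaluates.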

[NLP-10] Sample-efficient LLM Optimization with Reset Replay

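[Quick Read]: This paper addresses low sample efficiency and primacy bias in preference-based post-training of large language models (LLMs), where overfitting to initial experiences degrades policy quality and damages learning. The key to the solution is LLM optimization with Reset Replay (LoRR), a general plugin for preference-based optimization frameworks. Its main components are: (1) training at a high replay number to maximize the utility of each collected data batch; (2) a periodic reset strategy that reuses initial data, which preserves network plasticity and counteracts overfitting; and (3) a hybrid optimization objective that combines supervised fine-tuning (SFT) and preference-based losses to further improve data exploitation. Experiments show that LoRR significantly boosts various preference optimization methods on mathematical and general reasoning benchmarks, and that iterative DPO augmented with LoRR is comparable to, and in some cases better than, more complex and computationally intensive RL-based algorithms on challenging math tasks.

Link: https://arxiv.org/abs/2508.06412
Authors: Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian
Affiliations: Nanjing University; Microsoft Research Asia
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)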

Abstract:Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy with reusing initial data, which preserves network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
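
The three ingredients named above (high replay, periodic reset, hybrid objective) can be sketched in a few lines. The abstract gives no formulas, so the code assumes the standard DPO loss as the preference term, an unweighted negative log-likelihood of the chosen response as the SFT term, and a fixed reset interval; `sft_weight`, `replay_number`, and `reset_every` are hypothetical parameters, not LoRR's actual settings.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def hybrid_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                beta=0.1, sft_weight=1.0):
    """Preference loss plus an SFT term (negative log-likelihood of the chosen answer).

    Assumption: a simple weighted sum; the paper's exact combination is not given here.
    """
    preference = dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta)
    return preference + sft_weight * (-logp_chosen)


def training_schedule(num_batches, replay_number, reset_every):
    """List the update and reset events of a high-replay loop with periodic resets."""
    events = []
    for batch in range(num_batches):
        if batch > 0 and batch % reset_every == 0:
            events.append(("reset", batch))  # e.g. restore weights, reuse initial data
        for replay in range(replay_number):
            events.append(("update", batch, replay))
    return events


loss = hybrid_loss(logp_chosen=-2.0, logp_rejected=-3.0, ref_chosen=-2.5, ref_rejected=-2.5)
events = training_schedule(num_batches=4, replay_number=3, reset_every=2)
print(round(loss, 4), len(events))
```

Each batch is replayed several times before new data is collected, and the reset events are where plasticity would be restored.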

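[NLP-11] A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges

[Quick Read]: This paper systematically reviews research on retrieval-augmented generation (RAG), addressing the lack of a structured synthesis of the most highly cited studies; 128 articles published between 2020 and May 2025 met the inclusion criteria. RAG couples a neural retriever with a generative language model, grounding output in up-to-date non-parametric memory while retaining the semantic generalisation stored in model weights. The review follows the PRISMA 2020 framework: it specifies explicit inclusion and exclusion criteria, catalogues datasets, architectures, and evaluation practices, and synthesises empirical evidence on the effectiveness and limitations of RAG, giving direction for future research.

Link: https://arxiv.org/abs/2508.06401
Authors: Andrew Brown, Muhammad Roman, Barry Devereux
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: 58 pages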

Abstract:This systematic review of the research literature on retrieval-augmented generation (RAG) provides a focused analysis of the most highly cited studies published between 2020 and May 2025. A total of 128 articles met our inclusion criteria. The records were retrieved from ACM Digital Library, IEEE Xplore, Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP). RAG couples a neural retriever with a generative language model, grounding output in up-to-date, non-parametric memory while retaining the semantic generalisation stored in model weights. Guided by the PRISMA 2020 framework, we (i) specify explicit inclusion and exclusion criteria based on citation count and research questions, (ii) catalogue datasets, architectures, and evaluation practices, and (iii) synthesise empirical evidence on the effectiveness and limitations of RAG. To mitigate citation-lag bias, we applied a lower citation-count threshold to papers published in 2025 so that emerging breakthroughs with naturally fewer citations were still captured. This review clarifies the current research landscape, highlights methodological gaps, and charts priority directions for future research.
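
For readers new to the pattern, the retrieve-then-generate coupling described above looks roughly like this. The sketch swaps the neural retriever for bag-of-words cosine similarity and stops at prompt construction instead of calling a language model; the toy corpus and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words vector; a stand-in for a neural encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query (the non-parametric memory)."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Ground the generator by placing retrieved passages in its prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}"


documents = [
    "PRISMA 2020 is a reporting guideline for systematic reviews",
    "LoRA adds low-rank adapters to frozen model weights",
    "TTFT measures the time to the first generated token",
]
query = "what is PRISMA 2020"
passages = retrieve(query, documents, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

Updating the document list changes what the generator can cite without retraining it, which is the "up-to-date, non-parametric memory" property the abstract refers to.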

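[NLP-12] LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing

[Quick Read]: This paper studies how to combine two capabilities of large language models (LLMs) that have been researched separately, role-playing and emotional support, so that a virtual character can provide effective emotional support while staying in character. The key to the solution is ChatAnime, the first dataset for Emotionally Supportive Role-Playing (ESRP). It covers 20 popular anime characters and 60 emotion-centric real-world scenario questions; 40 Chinese anime enthusiasts with deep character knowledge and role-playing experience were selected through a nationwide process, and two rounds of dialogue data were collected from them and from 10 LLMs, for a total of 2,400 human-written and 24,000 LLM-generated answers supported by over 132,000 human annotations. The evaluation system has 9 fine-grained metrics across three dimensions (basic dialogue, role-playing, and emotional support), plus an overall metric for response diversity. Top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity.

Link: https://arxiv.org/abs/2508.06388
Authors: Lanlan Qiu, Xiao Pu, Yeqi Feng, Tianxing He
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 21 pages, 17 figures, 3 tables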

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing conversations and providing emotional support as separate research directions. However, there remains a significant research gap in combining these capabilities to enable emotionally supportive interactions with virtual characters. To address this research gap, we focus on anime characters as a case study because of their well-defined personalities and large fan bases. This choice enables us to effectively evaluate how well LLMs can provide emotional support while maintaining specific character traits. We introduce ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. We first thoughtfully select 20 top-tier characters from popular anime communities and design 60 emotion-centric real-world scenario questions. Then, we execute a nationwide selection process to identify 40 Chinese anime enthusiasts with profound knowledge of specific characters and extensive experience in role-playing. Next, we systematically collect two rounds of dialogue data from 10 LLMs and these 40 Chinese anime enthusiasts. To evaluate the ESRP performance of LLMs, we design a user experience-oriented evaluation system featuring 9 fine-grained metrics across three dimensions: basic dialogue, role-playing and emotional support, along with an overall metric for response diversity. In total, the dataset comprises 2,400 human-written and 24,000 LLM-generated answers, supported by over 132,000 human annotations. Experimental results show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. We hope this work can provide valuable resources and insights for future research on optimizing LLMs in ESRP. Our datasets are available at this https URL.

[NLP-13] Evaluating Style-Personalized Text Generation: Challenges and Directions

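[Quick Read]: This paper addresses the evaluation of author style personalized text generation in low-resource settings, and questions the effectiveness of widely adopted automatic metrics such as BLEU and ROUGE for this task. The key to the solution is to explore other evaluation paradigms, namely style embeddings and LLM-as-judge, and to evaluate these metrics and their ensembles on a style discrimination benchmark that spans eight writing tasks and three settings: domain discrimination, authorship attribution, and LLM personalized vs. non-personalized discrimination. The results support adopting an ensemble of diverse evaluation metrics to evaluate style personalized text generation.

Link: https://arxiv.org/abs/2508.06374
Authors: Anubhav Jangra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, Sujay Kumar Jauhar
Affiliations: Columbia University; Microsoft; The University of York
Categories: Computation and Language (cs.CL)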

Abstract:While prior research has built tools and benchmarks towards style personalized text generation, there has been limited exploration of evaluation in low-resource author style personalized text generation space. Through this work, we question the effectiveness of the widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms such as style embeddings and LLM-as-judge to holistically evaluate the style personalized text generation task. We evaluate these metrics and their ensembles using our style discrimination benchmark, that spans eight writing tasks, and evaluates across three settings, domain discrimination, authorship attribution, and LLM personalized vs non-personalized discrimination. We provide conclusive evidence to adopt ensemble of diverse evaluation metrics to effectively evaluate style personalized text generation.

[NLP-14] Cyberbullying Detection via Aggression-Enhanced Prompting

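[Quick Read]: This paper addresses cyberbullying detection on social media, which is difficult because cyberbullying is expressed in subtle and varied ways that limit model generalisation. The core idea is to use aggression detection as an auxiliary task. After multi-task learning gave inconsistent results, the authors propose an enriched prompt pipeline in which aggression predictions are embedded into the cyberbullying detection prompt as contextual augmentation. Preliminary results on five aggression datasets and one cyberbullying dataset show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, which suggests that auxiliary tasks can improve the generalisation of large language models (LLMs) in safety-critical applications.

Link: https://arxiv.org/abs/2508.06360
Authors: Aisha Saeid, Anu Sabu, Girish A. Koushik, Ferrante Neri, Diptesh Kanojia
Affiliations: NICE Research Group & Institute for People-Centred AI, School of Computer Science & Electronic Engineering, University of Surrey, UK
Categories: Computation and Language (cs.CL)
Notes: Accepted to RANLP 2025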

Abstract:Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
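
The enriched prompt pipeline can be pictured as two stages: an aggression model labels the post, and its prediction is written into the prompt given to the cyberbullying detector. The sketch below is an assumption-laden illustration: the keyword classifier stands in for a fine-tuned aggression model, and the prompt wording, labels, and confidence values are invented for the example rather than taken from the paper.

```python
def toy_aggression_classifier(post):
    """Hypothetical stand-in for a fine-tuned aggression detector."""
    hostile_words = {"stupid", "loser", "hate"}
    hits = sum(word.strip(".,!?").lower() in hostile_words for word in post.split())
    if hits:
        return "aggressive", min(1.0, 0.5 + 0.25 * hits)
    return "non-aggressive", 0.9


def build_enriched_prompt(post, aggression_label, aggression_confidence):
    """Embed the auxiliary aggression prediction into the cyberbullying prompt."""
    return (
        "Task: decide whether the post below is cyberbullying.\n"
        f"Auxiliary signal: an aggression classifier labelled this post as "
        f"'{aggression_label}' (confidence {aggression_confidence:.2f}).\n"
        f"Post: {post}\n"
        "Answer with 'cyberbullying' or 'not cyberbullying'."
    )


post = "You are such a loser, nobody wants you here!"
label, confidence = toy_aggression_classifier(post)
prompt = build_enriched_prompt(post, label, confidence)
print(prompt)
```

The second-stage model still makes the final call; the auxiliary line only supplies extra context, which is what distinguishes this design from training both tasks jointly.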

[NLP-15] Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering

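[Quick Read]: This paper addresses the inaccurate and overly long responses that large multimodal models (LMMs) produce in zero-shot graph question answering (graph QA) when a single Topology Representation Form (TRF) is used. Existing methods apply one unified text description or one fixed visual style as the TRF and ignore the preferences of different models or tasks. The solution first designs a set of TRFs tailored to zero-shot graph QA, denoted $F_{ZS}$, and introduces a new metric, Graph Response Efficiency (GRE), which measures the balance between performance and brevity. It then proposes the DynamicTRF framework, which builds a TRF Preference (TRFP) dataset that ranks TRFs by GRE score and trains a TRF router to assign the best TRF from $F_{ZS}$ to each question at inference time, improving both the accuracy and the conciseness of LMM answers.

Link: https://arxiv.org/abs/2508.06345
Authors: Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang
Affiliations: University of Science and Technology of China; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)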

Abstract:Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those “one-size-fits-all” approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy

[NLP-16] Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC

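[Quick read]: This paper tackles the detection of intellectual property (IP) infringement among Large Language Models (LLMs), in particular model plagiarism through weight copying, pruning, continual pretraining, and similar means. Existing methods fall short in accurately reconstructing weight correspondences, in computing statistical significance (e.g., p-values), and in telling plagiarism apart from the correlation that arises when models are trained on similar data. The key to the solution is Matrix-Driven Instant Review (MDIR), a new method based on matrix analysis and Large Deviation Theory. MDIR accurately reconstructs the relationships between weights, provides rigorous p-value estimates, and relies only on weight similarity rather than full model inference, so it detects plagiarism efficiently and reliably even after extensive transformations such as random permutations or continual pretraining on trillions of tokens.

Link: https://arxiv.org/abs/2508.06309
Authors: Ruichong Zhang
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Probability (math.PR)
Comments: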

Abstract:In recent years, concerns about intellectual property (IP) in large language models (LLMs) have grown significantly. Plagiarizing other LLMs (through direct weight copying, upcycling, pruning, or continual pretraining) and claiming authorship without properly attributing to the original license is serious misconduct that can lead to significant financial and reputational harm to the original developers. However, existing methods for detecting LLM plagiarism fall short in key areas. They fail to accurately reconstruct weight correspondences, lack the ability to compute statistical significance measures such as p-values, and may mistakenly flag models trained on similar data as being related. To address these limitations, we propose Matrix-Driven Instant Review (MDIR), a novel method that leverages matrix analysis and Large Deviation Theory. MDIR achieves accurate reconstruction of weight relationships, provides rigorous p-value estimation, and focuses exclusively on weight similarity without requiring full model inference. Experimental results demonstrate that MDIR reliably detects plagiarism even after extensive transformations, such as random permutations and continual pretraining with trillions of tokens. Moreover, all detections can be performed on a single PC within an hour, making MDIR both efficient and accessible.
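
To make "reconstructing weight correspondences" under random permutations concrete, here is a minimal sketch. It is not MDIR itself: the abstract does not give the algorithm, so this swaps in plain cosine-similarity row matching, and the toy matrices, noise level, and function names are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_rows(original, suspect):
    """For each suspect row, return the index of the most similar original row."""
    return [
        max(range(len(original)), key=lambda i: cosine(original[i], row))
        for row in suspect
    ]


# Toy data (hypothetical): 8 neurons with 64 weights each.
rng = random.Random(0)
original = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(8)]

# Simulate a copy whose neurons were shuffled and slightly perturbed.
perm = list(range(8))
rng.shuffle(perm)
suspect = [[w + rng.gauss(0, 0.05) for w in original[i]] for i in perm]

recovered = match_rows(original, suspect)
print(recovered == perm)
```

A matching like this only says which rows look alike; it gives no p-value, which is the part the abstract attributes to Large Deviation Theory.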

[NLP-17] Large Language Model Data Generation for Enhanced Intent Recognition in German Speech

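[Quick read]: This paper addresses intent recognition (IR) for spoken commands from elderly German speakers in AI assistant systems. Existing approaches are mostly limited to short commands and developed for English, so they transfer poorly to low-resource settings. The key to the solution is as follows. First, a Whisper automatic speech recognition (ASR) model fine-tuned on an elderly German speech dataset (SVC-de) processes the speech input. Second, Transformer-based language models are trained on synthetic text generated by three LLMs (LeoLM, Llama3, and ChatGPT) to strengthen intent classification. Experiments show that data generated by the smaller, domain-specific LeoLM (13B parameters) is of higher quality than data from the much larger ChatGPT (175B parameters), and that synthetic data significantly improves robustness to different speaking styles and unseen vocabulary. The approach shows that generative AI can fill data gaps in low-resource language domains, and the documented process supports reproducibility and transparency.

Link: https://arxiv.org/abs/2508.06277
Authors: Theresa Pekarek Rosin,Burak Can Kaplan,Stefan Wermter
Affiliations: University of Hamburg - Knowledge Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 11 pages, 3 figures, accepted at KONVENS 2025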

Abstract:Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.

[NLP-18] InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?

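[Quick read]: This paper addresses the weak causal reasoning of current Vision-Language Models (VLMs) in multimodal settings, and in particular the lack of systematic evaluation on infographics, which combine structured visual data with textual context. The key to the solution is a new benchmark, InfoCausalQA, with two tasks: Task 1 covers quantitative causal reasoning based on inferred numerical trends, and Task 2 covers semantic causal reasoning over five types of causal relations (cause, effect, intervention, counterfactual, and temporal). The authors manually collected 494 infographic-text pairs, used GPT-4o to generate 1,482 multiple-choice questions, and had humans revise them so that answering requires genuine visual grounding rather than surface cues. Experiments show that current VLMs fall clearly short in both computational reasoning and semantic causal reasoning, underscoring the need to improve the causal understanding of multimodal AI systems.

Link: https://arxiv.org/abs/2508.06220
Authors: Keummin Ka,Junhyeong Park,Jahyun Jeon,Youngjae Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures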

Abstract:Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference – a core aspect of human cognition – remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.

[NLP-19] Classification is a RAG problem: A case study on hate speech detection

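[Quick read]: This paper addresses the problem that classification systems in content moderation struggle to adapt quickly to policy changes and depend on costly retraining. The core solution uses Retrieval-Augmented Generation (RAG) to turn classification from a decision based on pre-trained parameters into an evaluation of content against contextual knowledge retrieved at inference time, yielding more flexible, transparent, and adaptable moderation. The key component is the Contextual Policy Engine (CPE), an agentic RAG system that judges whether content violates the policy text it retrieves instead of relying only on fixed category labels. This allows policy rules to be updated dynamically without retraining and produces explainable output.

Link: https://arxiv.org/abs/2508.06204
Authors: Richard Willats,Josh Pennington,Aravind Mohan,Bertie Vidgen
Affiliations: Contextual AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: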

Abstract:Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from “is this hate speech?” to “does this violate the hate speech policy?” Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.
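
The shift from "is this hate speech?" to "does this violate the policy?" can be sketched in a few lines. This is a toy under stated assumptions, not the CPE system: `PolicySegment`, the keyword-based `retrieve`, and the `HOSTILE_TERMS` set standing in for an LLM judgment are all hypothetical, and a neutral occupation example is used in place of real identity groups.

```python
from dataclasses import dataclass


@dataclass
class PolicySegment:
    group: str        # group the segment talks about (hypothetical schema)
    protected: bool   # whether attacks on this group violate the policy
    text: str         # policy wording, returned as the explanation


# Toy stand-in for the judgment an LLM would make about hostility.
HOSTILE_TERMS = {"hate", "despise"}


def retrieve(content, policy):
    """Return the policy segments whose group is mentioned in the content."""
    words = set(content.lower().split())
    return [seg for seg in policy if seg.group in words]


def classify(content, policy):
    """Decide against the retrieved policy text, not a fixed label set."""
    hostile = bool(set(content.lower().split()) & HOSTILE_TERMS)
    for seg in retrieve(content, policy):
        if hostile and seg.protected:
            return "violation", seg.text
    return "no violation", None


policy = [PolicySegment("plumbers", False, "Occupation is not protected.")]
before = classify("i hate plumbers", policy)

# A policy update is a data change; no model is retrained.
policy = [PolicySegment("plumbers", True, "Attacks based on occupation are prohibited.")]
after = classify("i hate plumbers", policy)
print(before, after)
```

The returned policy text plays the role of the "retrieved policy segments" that the abstract credits for explainability.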

[NLP-20] EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations

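[Quick read]: This paper addresses the fact that Emotional Intelligence (EI) is seriously overlooked in the development of human-aligned Large Language Models (LLMs). The core challenge is that existing LLMs lack systematic, psychologically grounded abilities to understand and respond to emotion, which makes genuine alignment of emotional capabilities hard to achieve across languages and cultures. The key to the solution is a unified, psychologically grounded four-layer EI taxonomy (emotion tracking, cause inference, appraisal, and emotionally appropriate response generation) and, built on it, EICAP-Bench, a multi-turn multiple-choice benchmark for open-source LLMs. The framework provides structured metrics for evaluation, and the LoRA fine-tuning experiments show that only the Appraisal layer improves significantly, which exposes the limits of current pretraining and instruction-tuning paradigms and points to the need for targeted data and modeling strategies to align deeper emotional reasoning.

Link: https://arxiv.org/abs/2508.06196
Authors: Nizi Nazar,Ehsaneddin Asgari
Affiliations: Qatar Computing Research Institute; Hamad Bin Khalifa University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: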

Abstract:Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EmoCap-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the five EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.

[NLP-21] Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

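[Quick read]: This paper addresses two core problems in evaluating jailbreak attacks on Large Language Models (LLMs). First, existing methods mostly produce binary labels (e.g., via string matching or toxic-text classifiers) and cannot quantify harm intensity. Second, multi-dimensional frameworks lack scenario adaptivity, so evaluation dimensions are mismatched in some scenarios (for example, Relative Truthfulness is irrelevant to hate speech), which lowers evaluation precision. The key to the solution is SceneJailEval, a scenario-adaptive multi-dimensional evaluation framework that adjusts the evaluation dimensions to the scenario at hand and overcomes the one-size-fits-all limitation. Together with a benchmark dataset covering 14 scenarios, it enables quantitative evaluation of diverse jailbreak variants and outperforms prior state-of-the-art methods on several benchmarks (F1 gains of up to 6%).

Link: https://arxiv.org/abs/2508.06194
Authors: Lai Jiang,Yuekang Li,Xiaohan Zhang,Youtao Ding,Li Pan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: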

Abstract:Precise jailbreak evaluation is vital for LLM red teaming and jailbreak research. Current approaches employ binary classification (e.g., string matching, toxic text classifiers, LLM-driven methods), yielding only “yes/no” labels without quantifying harm intensity. Existing multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, Informativeness) apply uniform evaluation criteria across scenarios, resulting in scenario-specific mismatches (for instance, “Relative Truthfulness” is irrelevant to “hate speech”) which compromise evaluation precision. To tackle these limitations, we introduce SceneJailEval, with key contributions: (1) A groundbreaking scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical “one-size-fits-all” constraint of existing multi-dimensional methods, and featuring strong extensibility to flexibly adapt to customized or emerging scenarios. (2) A comprehensive 14-scenario dataset with diverse jailbreak variants and regional cases, filling the long-standing gap in high-quality, holistic benchmarks for scenario-adaptive evaluation. (3) SceneJailEval achieves state-of-the-art results, with an F1 score of 0.917 on our full-scenario dataset (+6% over prior SOTA) and 0.995 on JBB (+3% over prior SOTA), surpassing accuracy limits of existing evaluation methods in heterogeneous scenarios and confirming its advantage.
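
The idea of scenario-adaptive dimensions can be sketched as a lookup that drops dimensions irrelevant to a scenario before scoring. Everything below is an assumption for illustration: the abstract names the three dimensions and the hate speech mismatch, but the "misinformation" scenario, the dimension-to-scenario mapping, and the mean aggregation are hypothetical and not taken from SceneJailEval.

```python
# Hypothetical mapping from scenario to the dimensions that apply to it.
SCENARIO_DIMENSIONS = {
    "hate speech": ("security_violation", "informativeness"),
    "misinformation": (
        "security_violation",
        "relative_truthfulness",
        "informativeness",
    ),
}


def evaluate(scenario, scores):
    """Average only the dimensions that are relevant to the scenario."""
    dims = SCENARIO_DIMENSIONS[scenario]
    used = {d: scores[d] for d in dims}
    return sum(used.values()) / len(used), used


# Per-dimension scores in [0, 1], as an upstream judge might produce them.
scores = {
    "security_violation": 0.8,
    "relative_truthfulness": 0.2,
    "informativeness": 0.6,
}
harm, used = evaluate("hate speech", scores)
print(harm, sorted(used))
```

With a uniform rubric the low truthfulness score would pull the hate speech result down even though it says nothing about harm in that scenario.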

[NLP-22] DKG-LLM: A Framework for Medical Diagnosis and Personalized Treatment Recommendations via Dynamic Knowledge Graph and Large Language Model Integration

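[Quick read]: This paper addresses the complexity and data heterogeneity of medical diagnosis and personalized treatment recommendation, especially the limited model performance on multi-symptom diseases and noisy data. The key to the solution is the DKG-LLM framework, which tightly integrates a Dynamic Knowledge Graph (DKG) with the Grok 3 Large Language Model (LLM) and introduces the Adaptive Semantic Fusion Algorithm (ASFA) to parse heterogeneous medical data such as clinical reports and PubMed articles in real time and to update the knowledge graph dynamically. The reported results are a diagnostic accuracy of 84.19%, a treatment recommendation accuracy of 89.63%, and a semantic coverage of 93.48%, with support for continual learning from physician feedback.

Link: https://arxiv.org/abs/2508.06186
Authors: Ali Sarabadani,Maryam Abdollahi Shamami,Hamidreza Sadeghsalehi,Borhan Asadi,Saba Hesaraki
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: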

Abstract:Large Language Models (LLMs) have grown exponentially since the release of ChatGPT. These models have gained attention due to their robust performance on various tasks, including language processing tasks. These models achieve understanding and comprehension of tasks by training billions of parameters. The development of these models is a transformative force in enhancing natural language understanding and has taken a significant step towards artificial general intelligence (AGI). In this study, we aim to present the DKG-LLM framework. The DKG-LLM framework introduces a groundbreaking approach to medical diagnosis and personalized treatment recommendations by integrating a dynamic knowledge graph (DKG) with the Grok 3 large language model. Using the Adaptive Semantic Fusion Algorithm (ASFA), heterogeneous medical data (including clinical reports and PubMed articles) and patient records dynamically generate a knowledge graph consisting of 15,964 nodes in 13 distinct types (e.g., diseases, symptoms, treatments, patient profiles) and 127,392 edges in 26 relationship types (e.g., causal, therapeutic, association). ASFA utilizes advanced probabilistic models, Bayesian inference, and graph optimization to extract semantic information, dynamically updating the graph with approximately 150 new nodes and edges in each data category while maintaining scalability with up to 987,654 edges. Real-world datasets, including MIMIC-III and PubMed, were utilized to evaluate the proposed architecture. The evaluation results show that DKG-LLM achieves a diagnostic accuracy of 84.19%. The model also has a treatment recommendation accuracy of 89.63% and a semantic coverage of 93.48%. DKG-LLM is a reliable and transformative tool that handles noisy data and complex multi-symptom diseases, along with feedback-based learning from physician input.

[NLP-23] Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime

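[Quick read]: This paper addresses how to inject new knowledge into Large Language Models (LLMs) efficiently when data is limited, focusing on the trade-off between knowledge acquisition and catastrophic forgetting in small-data settings (a few thousand to a few million tokens). The core solution is to generate synthetic data that strengthens the model's learning of new information. The key idea is to use diverse prompting to produce synthetic text with high variability, which significantly improves how well the model learns new facts from little data while easing the degradation of existing knowledge caused by parameter updates. Experiments show that simply continuing pre-training yields only modest gains, whereas adding diverse synthetic data improves knowledge injection. They also confirm that Retrieval-Augmented Generation (RAG) approaches are more prone to performance degradation on control datasets in small-data settings, which highlights the advantage of parametric methods. Finally, the study finds that models can generate effective synthetic training data themselves, suggesting a feasible path toward self-improving updates.

Link: https://arxiv.org/abs/2508.06178
Authors: Hugo Abonizio,Thales Almeida,Roberto Lotufo,Rodrigo Nogueira
Affiliations: Faculdade de Engenharia Elétrica e de Computação (FEEC), University of Campinas (Unicamp); Instituto de Computação (IC), University of Campinas (Unicamp); NeuralMind; Maritaca AI, Campinas, SP – Brazil
Categories: Computation and Language (cs.CL)
Comments: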

Abstract:Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news – ensuring no overlap with the model’s pre-training data – to evaluate the knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts – particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at this https URL.

[NLP-24] Pragmatics beyond humans: meaning, communication, and LLMs

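[Quick read]: The problem this paper tries to solve is that traditional pragmatics falls short when applied to generative AI, and to large language models (LLMs) in particular: treating pragmatics as a third dimension of meaning and relying on human-centred assumptions no longer explains how communication actually works in human-machine interaction. The key to the solution is the Human-Machine Communication (HMC) framework, which redefines the role of pragmatics as a dynamic interface and stresses language as a socially embedded tool. The paper also brings in probabilistic pragmatics, especially the Rational Speech Act framework, which replaces traditional truth-evaluation with optimization and so fits the predictive nature of LLMs better. Finally, the concept of "context frustration" describes how users are compelled to co-construct pragmatic conditions as they try to improve contextual understanding, pushing pragmatic theory toward a communicative ecology that includes generative AI.

Link: https://arxiv.org/abs/2508.06167
Authors: Vít Gvoždiak
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: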

Abstract:The paper reconceptualizes pragmatics not as a subordinate, third dimension of meaning, but as a dynamic interface through which language operates as a socially embedded tool for action. With the emergence of large language models (LLMs) in communicative contexts, this understanding needs to be further refined and methodologically reconsidered. The first section challenges the traditional semiotic trichotomy, arguing that connectionist LLM architectures destabilize established hierarchies of meaning, and proposes the Human-Machine Communication (HMC) framework as a more suitable alternative. The second section examines the tension between human-centred pragmatic theories and the machine-centred nature of LLMs. While traditional, Gricean-inspired pragmatics continue to dominate, it relies on human-specific assumptions ill-suited to predictive systems like LLMs. Probabilistic pragmatics, particularly the Rational Speech Act framework, offers a more compatible teleology by focusing on optimization rather than truth-evaluation. The third section addresses the issue of substitutionalism in three forms - generalizing, linguistic, and communicative - highlighting the anthropomorphic biases that distort LLM evaluation and obscure the role of human communicative subjects. Finally, the paper introduces the concept of context frustration to describe the paradox of increased contextual input paired with a collapse in contextual understanding, emphasizing how users are compelled to co-construct pragmatic conditions both for the model and themselves. These arguments suggest that pragmatic theory may need to be adjusted or expanded to better account for communication involving generative AI.

[NLP-25] UR2: Unify RAG and Reasoning through Reinforcement Learning

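[Quick read]: This paper addresses the fact that Retrieval-Augmented Generation (RAG) and reinforcement-learning-based reasoning (Reinforcement Learning from Verifiable Rewards, RLVR) are usually developed in isolation, which makes it hard for them to work together efficiently across diverse tasks and limits generalization. To close this gap, the authors propose UR2 (Unified RAG and Reasoning). Its core contributions are a difficulty-aware curriculum training scheme that triggers retrieval only for challenging problems, and a hybrid knowledge access strategy that combines domain-specific offline corpora with LLM-generated summaries, enabling dynamic coordination between retrieval and reasoning. The approach significantly improves adaptability and performance on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks.

Link: https://arxiv.org/abs/2508.06165
Authors: Weitao Li,Boran Xiang,Xiaolong Wang,Zhinan Gou,Weizhi Ma,Yang Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: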

Abstract:Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at this https URL.
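
The behaviour "invoke retrieval only for challenging problems" can be illustrated with a hand-written gate. This is a sketch under assumptions: in UR2 the behaviour is described as the outcome of curriculum training, whereas here a fixed threshold, a given difficulty score, and the `toy_retrieve` and `toy_reason` stubs are hypothetical stand-ins.

```python
def answer(question, difficulty, retrieve, reason, threshold=0.5):
    """Call `retrieve` only when the estimated difficulty reaches the threshold."""
    context = retrieve(question) if difficulty >= threshold else []
    return reason(question, context)


calls = []  # records which questions actually triggered retrieval


def toy_retrieve(question):
    calls.append(question)
    return [f"document about {question}"]


def toy_reason(question, context):
    return f"{question} -> {len(context)} retrieved passage(s)"


easy = answer("2 + 2", 0.1, toy_retrieve, toy_reason)
hard = answer("rare drug interaction", 0.9, toy_retrieve, toy_reason)
print(easy)
print(hard)
print(calls)
```

Skipping retrieval on easy inputs is what saves cost; the interesting part in the paper is learning when to do so rather than fixing a threshold by hand.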

[NLP-26] One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging

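[Quick read]: This paper addresses parameter interference in model merging that is caused by "one-size-fits-all" global sparsification. Current methods apply one uniform sparsity ratio to all task vectors and ignore how differently parameter tensors are distributed, so they may prune critical parameters or keep redundant ones, hurting the merged model. The key to the solution is TADrop (Tensor-wise Adaptive Drop), a tensor-level adaptive sparsification strategy that assigns each parameter tensor its own sparsity level based on its distribution: tensors with dense, redundant distributions are pruned aggressively, while sparse, critical ones are preserved. The method requires no change to the merging framework, works as a plug-and-play module with a range of classic and recent merging methods, and significantly improves performance on vision, language, and multimodal tasks, for example an average gain of 2.0% on ViT-B/32.

Link: https://arxiv.org/abs/2508.06163
Authors: Yingfeng Luo,Dingyang Lin,Junxin Wang,Ziqiang Xu,Kaiyan Chang,Tong Zheng,Bei Li,Anxiang Ma,Tong Xiao,Zhengtao Yu,Jingbo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review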

Abstract:Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a “one-size-fits-all” strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model’s structure, offering a new baseline for high-performance model merging.
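
A per-tensor sparsity ratio, as opposed to one global ratio, might look like the sketch below. The abstract does not say which distributional property TADrop uses, so the statistic here (`top_share`, the fraction of total magnitude held by the largest 10% of entries), the ratio bounds, and the toy tensors are all assumptions chosen only to show the mechanism.

```python
def top_share(values, frac=0.1):
    """Share of total magnitude carried by the largest `frac` of entries."""
    mags = sorted((abs(v) for v in values), reverse=True)
    k = max(1, int(len(mags) * frac))
    total = sum(mags)
    return sum(mags[:k]) / total if total else 0.0


def adaptive_ratio(values, low=0.5, high=0.9):
    """Evenly spread tensors get a high drop ratio, concentrated ones a low one."""
    return high - (high - low) * top_share(values)


def sparsify(values, ratio):
    """Zero out the smallest-magnitude entries (ties at the cut are kept)."""
    keep = len(values) - int(len(values) * ratio)
    if keep <= 0:
        return [0.0] * len(values)
    threshold = sorted((abs(v) for v in values), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in values]


# Two toy task-vector tensors, flattened to lists (hypothetical values).
dense = [0.01 * i for i in range(1, 11)]            # magnitude spread evenly
peaked = [5.0] + [0.01 * i for i in range(1, 10)]   # one dominant entry

for name, tensor in (("dense", dense), ("peaked", peaked)):
    ratio = adaptive_ratio(tensor)
    kept = sum(1 for v in sparsify(tensor, ratio) if v)
    print(name, round(ratio, 3), kept)
```

The point of the design is that the pruning budget follows each tensor's own statistics, so a single hyperparameter no longer has to suit every layer.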

[NLP-27] Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach

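[Quick read]: This paper addresses implicit stereotypes that large language models may carry in their generated output, especially semantic tendencies that explicit linguistic features do not capture well. The key to the solution is an interpretable bias detection method that combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs and uses attention weight perturbation to analyze the model's sensitivity to specific social attribute terms, revealing the semantic pathways through which bias forms. The method is validated on the StereoSet dataset, where it performs strongly on bias detection accuracy, semantic consistency, and contextual sensitivity. Its interpretability and stability provide a reliable technical basis for applications that require highly trustworthy generated content.

Link: https://arxiv.org/abs/2508.06155
Authors: Renhan Zhang,Lian Lian,Zhen Qi,Guiran Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: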

Abstract:This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model’s sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection. The approach is suitable for real-world applications where high trustworthiness of generated content is required.

[NLP-28] Scaling Personality Control in LLMs with Big Five Scaler Prompts

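[Quick read]: This paper addresses how to control the personality traits of Large Language Models (LLMs) without additional training. The key to the solution is Big5-Scaler, a prompt-based framework that embeds numeric values for the Big Five personality traits into natural language prompts, giving fine-grained control over the personality expressed in model output. Experiments show that the method induces consistent, distinguishable personality traits across tasks, and that the effect depends on prompt type and trait intensity, with concise prompts and lower trait intensities working best.

Link: https://arxiv.org/abs/2508.06149
Authors: Gunhee Cho,Yun-Gyung Cheong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: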

Abstract:We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing an efficient approach for building personality-aware dialogue agents.
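
Embedding numeric trait values into a prompt needs nothing more than a template. The sketch below shows the general shape only: the wording, the 0 to 100 scale, and the `build_prompt` helper are hypothetical and are not the prompts used by Big5-Scaler.

```python
TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def build_prompt(scores, scale_max=100):
    """Render a system prompt that states each Big Five trait as a number."""
    missing = [t for t in TRAITS if t not in scores]
    if missing:
        raise ValueError(f"missing trait scores: {missing}")
    lines = [f"- {t.capitalize()}: {scores[t]}/{scale_max}" for t in TRAITS]
    return (
        "You are a dialogue agent with this Big Five personality profile:\n"
        + "\n".join(lines)
        + "\nStay consistent with this profile in every reply."
    )


prompt = build_prompt(
    {
        "openness": 70,
        "conscientiousness": 55,
        "extraversion": 20,
        "agreeableness": 80,
        "neuroticism": 35,
    }
)
print(prompt)
```

Keeping the template this short is in line with the abstract's finding that concise prompts work well, though the exact phrasing would need to be tuned per model.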

[NLP-29] Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models

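[Quick read]: This paper addresses the fact that existing white-box Knowledge Distillation (KD) methods for compressing Large Language Models (LLMs) overlook training data quality and compatibility with the student model. Traditional KD methods focus on balancing ground truth against student-generated responses but do not systematically optimize the quality and fit of the training data itself, which leads to inefficient distillation and limited gains. The key to the solution is Selective Reflection Distillation (SRD), a data curation framework that analyzes the student model's outputs on prompt-response pairs (its "reflections") to automatically assess and rank training samples by difficulty, selecting high-quality instances that are compatible with the student. A curriculum scheduling strategy then introduces these curated subsets into distillation at fixed intervals. As a plug-and-play module, SRD improves distillation results without changing the underlying KD algorithm and cuts training time by up to 39%, supporting the central role of data quality and model compatibility in efficient distillation.

Link: https://arxiv.org/abs/2508.06135
Authors: Lingyuan Liu,Mengxiang Zhang
Affiliations: City University of Hong Kong; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: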

Abstract:Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, as well as decreases computational cost significantly during KD training. Experiments on a range of language model benchmarks demonstrate SRD’s consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
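
The two steps described above, ranking samples by difficulty and then releasing curated subsets at fixed intervals, can be sketched as follows. The abstract does not define the difficulty score or which samples are kept, so the word-overlap score, the choice to keep the easiest fraction, and all function names are assumptions for illustration.

```python
import math


def difficulty(reference, student_output):
    """1 minus word-level Jaccard overlap between reference and student output."""
    ref, out = set(reference.split()), set(student_output.split())
    union = ref | out
    return 1.0 - (len(ref & out) / len(union) if union else 1.0)


def curate(samples, keep_fraction=0.75):
    """Rank samples easy to hard and keep the easiest fraction (assumed rule)."""
    ranked = sorted(samples, key=lambda s: difficulty(s["reference"], s["student"]))
    return ranked[: max(1, int(len(ranked) * keep_fraction))]


def active_subset(curated, step, interval, num_stages):
    """Samples available at a training step; a new stage opens every `interval`."""
    stage = min(step // interval + 1, num_stages)
    return curated[: math.ceil(len(curated) * stage / num_stages)]


samples = [
    {"id": "a", "reference": "the cat sat", "student": "the cat sat"},
    {"id": "b", "reference": "the cat sat", "student": "the cat"},
    {"id": "c", "reference": "the cat sat", "student": "a dog"},
    {"id": "d", "reference": "the cat sat", "student": "the dog sat down"},
]
curated = curate(samples)
for step in (0, 100, 250):
    print(step, [s["id"] for s in active_subset(curated, step, 100, 3)])
```

Because selection happens before training, a scheme like this leaves the distillation loss untouched, which matches the plug-and-play claim in the abstract.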

[NLP-30] AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models

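[Quick read]: This paper addresses affordance-based safety risks in Large Language Models (LLMs): cases where outputs unintentionally facilitate harmful actions because the model overlooks logical implications. Traditional approaches such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies lack the granularity and proactiveness needed to detect and intervene in subtle but critical reasoning steps. The key to the solution is the AURA framework, a multi-layered design centered on Process Reward Models (PRMs) that evaluates logical coherence and safety awareness step by step. By combining introspective self-critique, fine-grained PRM assessment, and adaptive safety-aware decoding, it dynamically and proactively steers the model toward safer reasoning paths, significantly improving the logical integrity and affordance-sensitive safety of its outputs.

Link: https://arxiv.org/abs/2508.06124
Authors: Sayantan Adak,Pratyush Chatterjee,Somnath Banerjee,Rima Hazra,Somak Aditya,Animesh Mukherjee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: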

Abstract:Present-day LLMs face the challenge of managing affordance-based safety risks: situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence clearly demonstrates that this approach significantly surpasses existing methods, improving the logical integrity and affordance-sensitive safety of model outputs. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.

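[NLP-31] You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

[Quick read]: This paper targets two core problems in current graph-based retrieval-augmented generation (GraphRAG) methods. First, pre-building the graph incurs high token cost and update latency. Second, a fixed pre-built graph cannot adapt to real queries of varying type and complexity, which hurts retrieval effectiveness. The key to the solution is a logic-aware framework, LogicRAG, that extracts the logical structure of the task at inference time to guide adaptive retrieval. It first decomposes the input query into subproblems and builds a directed acyclic graph (DAG) that models their logical dependencies, then linearizes the graph with a topological sort so that multi-step reasoning stays consistent. Graph pruning and context pruning further cut the token cost of redundant retrieval. According to the authors, this design yields efficient and accurate knowledge retrieval and outperforms state-of-the-art baselines.

Link: https://arxiv.org/abs/2508.06105
Authors: Shengyuan Chen,Chuang Zhou,Zheng Yuan,Qinggang Zhang,Zeyang Cui,Hao Chen,Yilin Xiao,Jiannong Cao,Xiao Huang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: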

Abstract:Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
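The DAG-plus-topological-sort step described in this abstract can be illustrated with Python's standard `graphlib` module. This is only a sketch of the ordering idea, not the authors' implementation: the subproblem names, the `retrieve` callable, and the way earlier answers are passed along are all hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def solve_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Linearize a dependency DAG: each subproblem appears after its prerequisites.

    `dependencies` maps a subproblem to the set of subproblems it depends on.
    graphlib raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


def answer_in_order(dependencies: Dict[str, Set[str]],
                    retrieve: Callable[[str, List[str]], str]) -> Dict[str, str]:
    """Resolve subproblems in dependency order, passing earlier answers forward."""
    answers: Dict[str, str] = {}
    for sub in solve_order(dependencies):
        prior = [answers[d] for d in sorted(dependencies.get(sub, ()))]
        answers[sub] = retrieve(sub, prior)  # hypothetical retrieval + reasoning call
    return answers
```

Because every subproblem is visited only after its prerequisites, the retrieval for a later subproblem can be conditioned on what was already resolved, which is the property the abstract attributes to the linearization.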

[NLP-32] Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs

[Quick read]: This paper tackles extractive question answering (QA) over the Quran, where complex language, unique terminology, and deep semantics pose challenges. The key to the solution is to use instruction-tuned large language models with few-shot prompting and a prompt framework designed specifically for Arabic to extract answer spans. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering to improve precision and reduce hallucinations. Experiments show that LLMs given Arabic instructions outperform traditional fine-tuned models on this low-resource, semantically rich QA task; the best configuration reaches a pAP10 score of 0.637.

Link: https://arxiv.org/abs/2508.06103
Authors: Mohamed Basem,Islam Oshallah,Ali Hamdi,Ammar Mohammed
Affiliations: MSA University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 6 pages, 2 figures, accepted in IMSA 2025, Egypt, this https URL

Abstract:This paper presents two effective approaches for Extractive Question Answering (QA) on the Quran. It addresses challenges related to complex language, unique terminology, and deep meaning in the text. The second uses few-shot prompting with instruction-tuned large language models such as Gemini and DeepSeek. A specialized Arabic prompt framework is developed for span extraction. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering. This improves precision and reduces hallucinations. Evaluations show that large language models with Arabic instructions outperform traditional fine-tuned models. The best configuration achieves a pAP10 score of 0.637. The results confirm that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.

[NLP-33] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline

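[Quick read]: This paper asks how large language models (LLMs) can be used to create constructed languages (conlangs) fully automatically, without human linguistic expertise. The core challenge is generating conlangs that are diverse and structurally plausible while remaining internally consistent. The key to the solution is ConlangCrafter, a multi-hop pipeline that decomposes language design into five modular stages: phonology, morphology, syntax, lexicon generation, and translation. At each stage it draws on the LLM's meta-linguistic reasoning, injects randomness to encourage diversity, and applies self-refinement feedback to keep the language description consistent, enabling end-to-end generation of high-quality conlangs.

Link: https://arxiv.org/abs/2508.06094
Authors: Morris Alper,Moran Yanuka,Raja Giryes,Gašper Beguš
Affiliations: Tel Aviv University; UC Berkeley
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL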

Abstract:Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages – phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.

[NLP-34] ThematicPlane: Bridging Tacit User Intent and Latent Spaces for Image Generation

[Quick read]: This paper addresses the difficulty of aligning generative AI image outputs with users' nuanced creative intent, especially for non-experts, who must express ideas through prompts or reference images in existing tools, which limits fluid exploration. The key to the solution is ThematicPlane, a system with an interactive thematic design plane in which users directly navigate and manipulate high-level semantic concepts (such as mood, style, or narrative tone). This connects tacit creative intent to system control, supports both divergent and convergent creative modes, and encourages iterative workflows.

Link: https://arxiv.org/abs/2508.06065
Authors: Daniel Lee,Nikhil Sharma,Donghoon Shin,DaEun Choi,Harsh Sharma,Jeonghwan Kim,Heng Ji
Affiliations: Adobe Inc.; Johns Hopkins University; University of Washington; KAIST; University of Colorado; University of Illinois at Urbana-Champaign
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative AI has made image creation more accessible, yet aligning outputs with nuanced creative intent remains challenging, particularly for non-experts. Existing tools often require users to externalize ideas through prompts or references, limiting fluid exploration. We introduce ThematicPlane, a system that enables users to navigate and manipulate high-level semantic concepts (e.g., mood, style, or narrative tone) within an interactive thematic design plane. This interface bridges the gap between tacit creative intent and system control. In our exploratory study (N=6), participants engaged in divergent and convergent creative modes, often embracing unexpected results as inspiration or iteration cues. While they grounded their exploration in familiar themes, differing expectations of how themes mapped to outputs revealed a need for more explainable controls. Overall, ThematicPlane fosters expressive, iterative workflows and highlights new directions for intuitive, semantics-driven interaction in generative design tools.

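[NLP-35] Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System

[Quick read]: This paper examines a security weakness in agentic fact-checking systems built on large language models (LLMs): they can be manipulated by poisoning attacks, producing wrong verdicts and even amplifying misinformation. The key contribution is Fact2Fiction, the first poisoning attack framework aimed at such systems. It mirrors the system's own sub-claim decomposition strategy and exploits the system-generated justifications to craft tailored malicious evidence that undermines sub-claim verification. Experiments show attack success rates 8.9%–21.2% higher than state-of-the-art attacks, exposing weaknesses in the security design of current fact-checking systems and underscoring the need for defenses.

Link: https://arxiv.org/abs/2508.06059
Authors: Haorui He,Yupeng Li,Bin Benjamin Zhu,Dacheng Wen,Reynold Cheng,Francis C. M. Lau
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: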

Abstract:State-of-the-art fact-checking systems combat misinformation at scale by employing autonomous LLM-based agents to decompose complex claims into smaller sub-claims, verify each sub-claim individually, and aggregate the partial results to produce verdicts with justifications (explanatory rationales for the verdicts). The security of these systems is crucial, as compromised fact-checkers, which tend to be easily underexplored, can amplify misinformation. This work introduces Fact2Fiction, the first poisoning attack framework targeting such agentic fact-checking systems. Fact2Fiction mirrors the decomposition strategy and exploits system-generated justifications to craft tailored malicious evidences that compromise sub-claim verification. Extensive experiments demonstrate that Fact2Fiction achieves 8.9%–21.2% higher attack success rates than state-of-the-art attacks across various poisoning budgets. Fact2Fiction exposes security weaknesses in current fact-checking systems and highlights the need for defensive countermeasures.

[NLP-36] EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

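[Quick read]: This paper addresses the performance bottleneck of large language models (LLMs) on open-ended tasks, particularly story evaluation. Existing methods face a dilemma: closed-source models depend on prompt engineering and adapt poorly, while open-source models can be improved by fine-tuning but lack the rigorous reasoning needed for high-quality story judgment. The key to the solution is the Self-Evolving Pairwise Reasoning (EvolvR) framework. Built on pairwise comparison, it self-synthesizes score-aligned Chain-of-Thought (CoT) data with a multi-persona strategy, self-filters that data with a multi-agent system to ensure logical rigor and robustness, and then trains an evaluator with strong reasoning ability that serves as a reward model to guide story generation. Experiments show state-of-the-art (SOTA) results on three benchmarks (StoryER, HANNA, and OpenMEVA) and a significant improvement in the quality of generated stories.

Link: https://arxiv.org/abs/2508.06046
Authors: Xinda Wang,Zhengxu Hou,Yangshijie Zhang,Bingren Yan,Zhibo Yang,Xingsheng Zhang,Luxi Xing,Qiang Zhou,Chen Zhang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: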

Abstract:Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.

[NLP-37] Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings

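[Quick read]: This paper addresses the unpredictability of what large language models (LLMs) know: because LLMs are stochastic, it is hard to assess accurately whether they have acquired a specific fact. Conventional methods probe knowledge with forward passes through the model, which is computationally expensive and slow. The key to the solution is PEEK (Proxy Embeddings to Estimate Knowledge of LLMs), which uses pre-trained embedding models (such as sentence or graph embeddings) as proxy representations of LLM knowledge and maps embeddings to LLM outputs with a linear decoder layer, so that the LLM's knowledge state can be predicted efficiently. Experiments across multiple datasets and models show prediction accuracy of up to 90%, and sentence embeddings prove to be better proxies than graph embeddings, offering a new perspective on the internal inductive biases of LLMs.

Link: https://arxiv.org/abs/2508.06030
Authors: Kartik Sharma,Yiqiao Jin,Rakshit Trivedi,Srijan Kumar
Affiliations: Georgia Institute of Technology; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: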

Abstract:Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM’s knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose PEEK, or Proxy Embeddings to Estimate Knowledge of LLMs, by leveraging the pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on 3 Wikipedia-derived datasets, 4 LLMs, and 7 embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90% accuracy. Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs’ internal inductive bias. The code and data are made available at this https URL.
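The idea of predicting "does the LLM know this fact?" from a fixed embedding through a linear decoder can be sketched as a plain logistic-regression probe. The code below is an illustrative toy under stated assumptions: the two-dimensional vectors stand in for real pre-trained embeddings, the labels stand in for probing results, the embedding model is kept frozen (the paper adapts it), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def train_linear_probe(embeddings: Sequence[Sequence[float]], labels: Sequence[int],
                       epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit a logistic-regression layer on fixed embeddings with plain SGD.

    labels: 1 if the LLM was found to know the fact, 0 otherwise (assumed given).
    """
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of "known"
            err = p - y                       # gradient of the log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predicts_known(weights: Sequence[float], bias: float,
                   embedding: Sequence[float]) -> bool:
    """Predict whether the LLM knows the fact, without running the LLM."""
    return sum(w * xi for w, xi in zip(weights, embedding)) + bias > 0.0
```

Once the probe is trained, checking a new fact costs one embedding lookup and one dot product, which is the efficiency argument the abstract makes against probing with LLM forward passes.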

[NLP-38] Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

[Quick read]: This paper addresses a key flaw in the iterative optimization of Self-Rewarding Language Models: the representational difference between chosen and rejected samples shrinks as training proceeds, weakening the preference-learning signal and limiting further gains. The core of the solution is Temporal Self-Rewarding Language Models, a dual-phase framework that sustains an effective learning signal. First, an Anchored Rejection strategy fixes the rejected samples to outputs of the early model to preserve contrast. Second, a Future-Guided Chosen mechanism dynamically selects high-quality chosen samples using predictions from the next-generation model. Without additional compute, the method significantly improves several mainstream model families (Llama, Qwen, Mistral) at different scales and generalizes better across tasks.

Link: https://arxiv.org/abs/2508.06026
Authors: Yidong Wang,Xin Wang,Cunxiang Wang,Junfeng Fang,Qiufeng Wang,Jianing Chu,Xuran Meng,Shuxun Yang,Libo Qin,Yue Zhang,Wei Ye,Shikun Zhang
Affiliations: Peking University; Tsinghua University; Chinese Academy of Sciences; Shanghai Jiao Tong University; Zhejiang University; University of Science and Technology of China; Fudan University; Nanjing University; Sun Yat-sen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures

Abstract:Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model’s outputs and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
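The two phases named in this abstract amount to a particular way of building DPO preference pairs. The sketch below shows one plausible reading, under loud assumptions: the rejected response always comes from the initial (past) model, and the chosen response is whichever of the current and next-generation outputs the judge scores higher. The callables and the selection rule are hypothetical, not the authors' exact procedure.

```python
from typing import Callable, Dict

Model = Callable[[str], str]          # prompt -> response
Judge = Callable[[str, str], float]   # (prompt, response) -> score


def build_preference_pair(prompt: str, past_model: Model, current_model: Model,
                          future_model: Model, judge: Judge) -> Dict[str, str]:
    """Build one preference pair with a fixed 'past' rejection anchor."""
    # Anchored Rejection: the rejected side never improves across iterations.
    rejected = past_model(prompt)
    # Future-Guided Chosen (assumed rule): keep the best-judged newer response.
    candidates = [current_model(prompt), future_model(prompt)]
    chosen = max(candidates, key=lambda response: judge(prompt, response))
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Keeping the rejected side tied to the initial model is what prevents the chosen and rejected samples from converging, which is the failure mode the abstract identifies in standard self-rewarding loops.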

[NLP-39] Position: Intelligent Coding Systems Should Write Programs with Justifications

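[Quick read]: This paper addresses the trust and usability problems that arise in intelligent coding systems because AI-driven code generation is opaque, especially for non-expert users who cannot follow the underlying implementation logic. The key proposal is to have systems produce interpretable justifications with two core properties, cognitive alignment and semantic faithfulness, and to generate them with neuro-symbolic approaches: symbolic constraints guide model behavior during training, and neural representations enrich program semantics, enabling automated consistency checks at inference time.

Link: https://arxiv.org/abs/2508.06017
Authors: Xiangzhe Xu,Shiwei Feng,Zian Su,Chengpeng Wang,Xiangyu Zhang
Affiliations: Purdue University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The first two authors contributed equally to this work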

Abstract:Intelligent coding systems are transforming software development by enabling users to specify code behavior in natural language. However, the opaque decision-making of AI-driven coders raises trust and usability concerns, particularly for non-expert users who cannot inspect low-level implementations. We argue that these systems should not only generate code but also produce clear, consistent justifications that bridge model reasoning and user understanding. To this end, we identify two critical justification properties, cognitive alignment and semantic faithfulness, and highlight the limitations of existing methods, including formal verification, static analysis, and post-hoc explainability. We advocate exploring neuro-symbolic approaches for justification generation, where symbolic constraints guide model behavior during training and program semantics are enriched through neural representations, enabling automated consistency checks at inference time.

[NLP-40] Crisp Attention: Regularizing Transformers via Structured Sparsity

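[Quick read]: This paper addresses the quadratic computational cost of self-attention in Transformer models and challenges the common belief that sparsifying attention necessarily hurts accuracy. The key to the approach is introducing structured, post-hoc sparsity into the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task. With 80% attention sparsity, validation accuracy reaches 91.59%, an absolute improvement of 0.97% over the dense baseline. The authors hypothesize that sparsity acts as a strong implicit regularizer that curbs overfitting by constraining the set of features the model can use, which improves generalization and performance.

Link: https://arxiv.org/abs/2508.06016
Authors: Sagar Gandhi,Vishal Gandhi
Affiliations: Joyspace AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: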

Abstract:The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained and robust set of features. Our work recasts attention sparsity not just as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
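To make "80% attention sparsity" concrete, the sketch below masks all but the largest scores in one attention row before the softmax. The abstract does not spell out the exact sparsity pattern, so per-row top-k masking is an assumption used only for illustration, and `sparse_attention_row` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def sparse_attention_row(scores: Sequence[float], sparsity: float) -> List[float]:
    """Softmax over one row of attention scores, keeping only the top entries.

    sparsity=0.8 keeps roughly the top 20% of scores (at least one);
    ties at the threshold are all kept.
    """
    keep = max(1, round(len(scores) * (1.0 - sparsity)))
    threshold = sorted(scores, reverse=True)[keep - 1]
    masked = [s if s >= threshold else float("-inf") for s in scores]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

Masked positions receive exactly zero weight, so each output token mixes information from fewer positions, which is the constraint the authors hypothesize acts as a regularizer.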

[NLP-41] Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring

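[Quick read]: This paper addresses the limited generalization of cross-topic automated essay scoring (AES) models caused by differences between topics. Existing methods focus on topic-shared features and neglect topic-specific ones, which hampers accurate assessment of key traits such as topic adherence. The key to the solution is Adversarial TOpic-aware Prompt-tuning (ATOP), which jointly learns topic-shared and topic-specific prompts to elicit relevant knowledge from pre-trained language models (PLMs). Specifically, ATOP designs a learnable prompt with shared and specific components, applies adversarial training within a unified regression and classification framework to make topic-shared prompt learning more robust, and uses a neighbor-based classifier to generate pseudo-labels for the target topic that supervise the learning of topic-specific prompts, significantly improving cross-topic essay scoring.

Link: https://arxiv.org/abs/2508.05987
Authors: Chunyun Zhang,Hongyan Zhao,Chaoran Cui,Qilong Song,Zhiqing Lu,Shuai Gong,Kailin Liu
Affiliations: Shandong University of Finance and Economics; University of Toronto
Categories: Computation and Language (cs.CL)
Comments: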

Abstract:Cross-topic automated essay scoring (AES) aims to develop a transferable model capable of effectively evaluating essays on a target topic. A significant challenge in this domain arises from the inherent discrepancies between topics. While existing methods predominantly focus on extracting topic-shared features through distribution alignment of source and target topics, they often neglect topic-specific features, limiting their ability to assess critical traits such as topic adherence. To address this limitation, we propose an Adversarial TOpic-aware Prompt-tuning (ATOP), a novel method that jointly learns topic-shared and topic-specific features to improve cross-topic AES. ATOP achieves this by optimizing a learnable topic-aware prompt–comprising both shared and specific components–to elicit relevant knowledge from pre-trained language models (PLMs). To enhance the robustness of topic-shared prompt learning and mitigate feature scale sensitivity introduced by topic alignment, we incorporate adversarial training within a unified regression and classification framework. In addition, we employ a neighbor-based classifier to model the local structure of essay representations and generate pseudo-labels for target-topic essays. These pseudo-labels are then used to guide the supervised learning of topic-specific prompts tailored to the target topic. Extensive experiments on the publicly available ASAP++ dataset demonstrate that ATOP significantly outperforms existing state-of-the-art methods in both holistic and multi-trait essay scoring. The implementation of our method is publicly available at: this https URL.

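[NLP-42] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

[Quick read]: This paper asks how to integrate high-fidelity visual synthesis into large language models (LLMs) efficiently without compromising their strong reasoning ability. Existing methods either train LLMs directly or bridge LLMs and diffusion models, but training is costly because the backbone LLMs saw no image representations during pretraining. The key to the solution is Bifrost-1, a framework that uses patch-level CLIP image embeddings as latent variables to connect pretrained multimodal LLMs (MLLMs) with diffusion models. The embeddings are fed into the diffusion process through a lightweight adaptation of ControlNet, and the MLLM gains a visual generation branch initialized from its original parameters, which preserves its multimodal reasoning while enabling efficient, controllable, high-quality image generation.

Link: https://arxiv.org/abs/2508.05954
Authors: Han Lin,Jaemin Cho,Amir Zadeh,Chuan Li,Mohit Bansal
Affiliations: UNC Chapel Hill; Lambda
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL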

Abstract:There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM’s CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.

[NLP-43] Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale

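[Quick read]: This paper addresses the detection of prosociality in text, that is, communication intended to affirm, support, or improve others' behavior. Because prosociality lacks clear definitions and labeled data, conventional methods do not apply directly and new annotation and deployment strategies are needed. The key to the solution is a scalable three-stage pipeline. First, a small set of human-labeled examples is used to identify the best LLM-based labeling strategy. Second, a human-AI refinement loop iterates on high-disagreement cases to improve the task definition and label quality. Third, a two-stage inference architecture lets a lightweight classifier handle high-confidence predictions and escalates only about 35% of ambiguous instances to GPT-4o, cutting inference cost by about 70% while keeping precision high (about 0.90). The work highlights the value of precise task definition, targeted human-AI collaboration, and deployment-aware system design for emerging responsible-AI tasks.

Link: https://arxiv.org/abs/2508.05938
Authors: Rafal Kocielnik,Min Kim,Penphob (Andrea) Boonyarungsrit,Fereshteh Soltani,Deshawn Sambrano,Animashree Anandkumar,R. Michael Alvarez
Affiliations: California Institute of Technology; Activision Publishing, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 4 figures, 4 tables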

Abstract:Detecting prosociality in text, that is, communication intended to affirm, support, or improve others’ behavior, is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition, a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only ~35% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by ~70% while achieving high precision (~0.90). Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
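The two-stage inference system described here is a confidence-thresholded cascade. A minimal sketch follows, assuming the lightweight classifier returns a label together with a confidence and that a single threshold decides escalation; the callables, the `0.9` threshold, and the function names are hypothetical stand-ins, not the authors' system.

```python
from typing import Callable, Sequence, Tuple

LightClassifier = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)
LLMLabeler = Callable[[str], str]                     # text -> label


def cascade_predict(text: str, light_classifier: LightClassifier,
                    llm_labeler: LLMLabeler, threshold: float = 0.9) -> Tuple[str, str]:
    """Return (label, route): use the cheap model unless it is unsure."""
    label, confidence = light_classifier(text)
    if confidence >= threshold:
        return label, "light"
    return llm_labeler(text), "llm"  # escalate ambiguous input to the expensive model


def escalation_rate(texts: Sequence[str], light_classifier: LightClassifier,
                    threshold: float = 0.9) -> float:
    """Fraction of inputs that would be sent to the expensive model."""
    escalated = sum(1 for t in texts if light_classifier(t)[1] < threshold)
    return escalated / len(texts)
```

Tuning `threshold` trades cost against precision: raising it escalates more messages to the large model, and lowering it lets the lightweight classifier decide more cases on its own.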

[NLP-44] Do Ethical AI Principles Matter to Users? A Large-Scale Analysis of User Sentiment and Satisfaction

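[Quick read]: This paper addresses the scarcity of empirical evidence on the relationship between ethical AI and user satisfaction: policy and industry guidelines endorse principles such as fairness, transparency, and robustness, but there is little evidence from the user's perspective. The approach analyzes more than 100,000 user reviews of AI products from the G2 platform, using Transformer-based language models to quantify seven ethical dimensions (following the EU Ethics Guidelines for Trustworthy AI), and systematically examines how these dimensions relate to user satisfaction and how the relationship differs across user groups (technical vs. non-technical) and product types (development platforms vs. end-user applications). All seven dimensions are positively associated with satisfaction, and the association is stronger for non-technical users and end-user applications, underscoring the importance of designing ethical AI from the user's perspective and of accounting for contextual differences.

Link: https://arxiv.org/abs/2508.05913
Authors: Stefan Pasch,Min Chul Cha
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: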

Abstract:As AI systems become increasingly embedded in organizational workflows and consumer applications, ethical principles such as fairness, transparency, and robustness have been widely endorsed in policy and industry guidelines. However, there is still scarce empirical evidence on whether these principles are recognized, valued, or impactful from the perspective of users. This study investigates the link between ethical AI and user satisfaction by analyzing over 100,000 user reviews of AI products from G2. Using transformer-based language models, we measure sentiment across seven ethical dimensions defined by the EU Ethics Guidelines for Trustworthy AI. Our findings show that all seven dimensions are positively associated with user satisfaction. Yet, this relationship varies systematically across user and product types. Technical users and reviewers of AI development platforms more frequently discuss system-level concerns (e.g., transparency, data governance), while non-technical users and reviewers of end-user applications emphasize human-centric dimensions (e.g., human agency, societal well-being). Moreover, the association between ethical AI and user satisfaction is significantly stronger for non-technical users and end-user applications across all dimensions. Our results highlight the importance of ethical AI design from users’ perspectives and underscore the need to account for contextual differences across user roles and product types.

[NLP-45] Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation

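[Quick read]: This paper addresses the difficulty of measuring the retrieval module's contribution in retrieval-augmented generation (RAG), especially when large language models (LLMs) used as generators are sensitive to prompts, which distorts holistic evaluation. The key to the solution is a lightweight, supervision-free metric, the Spectrum Projection Score (SPS), which quantifies the semantic alignment between retrieved content and the generation process by comparing the region formed by the retrieved summary's generated tokens in the reader model's hidden space with the principal directions of a subspace. Building on SPS, the authors design xCompress, a framework that dynamically samples, ranks, and compresses retrieval summary candidates at inference time to coordinate retrieval and generation more precisely.

Link: https://arxiv.org/abs/2508.05909
Authors: Zhanghao Hu,Qinglin Zhu,Siya Qi,Yulan He,Hanqi Yan,Lin Gui
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: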

Abstract:Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that allows the reader to gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by generated tokens from the summary, and the principal directions of subspace in the reader and to measure the relevance. Building on SPS we present xCompress, an inference time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.

[NLP-46] Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models

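[Quick read]: This paper addresses a limitation of current affective-computing work on large language models (LLMs): most studies train or evaluate with discrete emotion labels and focus on surface-level emotion recognition, overlooking whether models understand the cognitive reasoning behind emotions. The key to the solution is a large-scale benchmark grounded in cognitive appraisal theory, Cognitive Reasoning for Emotions (CoRE), which systematically tests whether LLMs produce coherent and plausible cognitive reasoning about emotionally charged stimuli and, by analyzing differences in internal cognitive structure across models, reveals which cognitive dimensions are used more often to characterize and reason about specific emotions. This shifts the focus from emotion-label prediction to interpretable analysis of a model's internal emotional reasoning.

Link: https://arxiv.org/abs/2508.05880
Authors: Sree Bhattacharyya,Lucas Craig,Tharun Dilliraj,Jia Li,James Z. Wang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: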

Abstract:Affective Computing has been established as a crucial field of inquiry to advance the holistic development of Artificial Intelligence (AI) systems. Foundation models – especially Large Language Models (LLMs) – have been evaluated, trained, or instruction-tuned in several past works, to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when reasoning about emotionally charged stimuli. We introduce a large-scale benchmark on Cognitive Reasoning for Emotions - CoRE - to evaluate internal cognitive structures implicitly used by LLMs for emotional reasoning. Through a plethora of evaluation experiments and analysis, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions?, (b) What cognitive dimensions are important for characterizing specific emotions?, and (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs. Our benchmark and code will be made publicly available.

[NLP-47] Discovering Properties of Inflectional Morphology in Neural Emergent Communication

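[Quick read]: This paper addresses the narrow focus of current emergent communication (EmCom) research on a few subfield-specific goals and metrics. Those metrics favor communication schemes with one-to-one mappings and overlook double articulation and inflectional morphology, both common in natural language. In response, the authors reinterpret the classic attribute-value reconstruction game, imposing a small-vocabulary constraint to simulate double articulation and designing a new setting analogous to natural inflectional morphology, which allows meaningful comparison with natural language communication schemes. The key contributions are: (1) new metrics that measure whether an emergent language shows real properties of inflectional morphology, namely fusionality and concatenativity; and (2) experiments showing that simulated phonological constraints push agents toward concatenative morphology, and that emergent languages reproduce the natural-language tendency to fuse grammatical attributes rather than keep them separate.

Link: https://arxiv.org/abs/2508.05843
Authors: Miles Gilberti,Shane Storks,Huteng Dai
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)
Comments: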

Abstract:Emergent communication (EmCom) with deep neural network-based agents promises to yield insights into the nature of human language, but remains focused primarily on a few subfield-specific goals and metrics that prioritize communication schemes which represent attributes with unique characters one-to-one and compose them syntactically. We thus reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation, and formulating a novel setting analogous to naturalistic inflectional morphology (enabling meaningful comparison to natural language communication schemes). We develop new metrics and explore variations of this game motivated by real properties of inflectional morphology: concatenativity and fusionality. Through our experiments, we discover that simulated phonological constraints encourage concatenative morphology, and emergent languages replicate the tendency of natural languages to fuse grammatical attributes.

[NLP-48] “Mirror” Language AI Models of Depression are Criterion-Contaminated

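[Quick read]: This paper addresses criterion contamination in language AI models that predict depression assessment scores, which artificially inflates effect sizes and reduces generalizability. The key step is to distinguish Mirror models (built from language responses to the depression assessment itself) from Non-Mirror models (built from language that does not mirror the target assessment) and to compare them empirically. Mirror models show very large effect sizes (e.g., R² = .80), but their performance may be inflated by criterion contamination. Non-Mirror models show smaller effects (e.g., R² = .27), yet their predictions correlate with self-reported depression symptoms about as strongly as those of Mirror models (r ≈ .54), which suggests they capture more interpretable and generalizable semantic features. The comparison points toward more reliable and interpretable language-based depression prediction tools.

Link: https://arxiv.org/abs/2508.05830
Authors: Tong Li,Rasiq Hussain,Mehak Gupta,Joshua R. Oltmanns
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 39 pages, 9 figures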

Abstract:A growing number of studies show near-perfect LLM language-based prediction of depression assessment scores (up to R² of .70). However, many develop these models directly from language responses to depression assessments. These “Mirror models” suffer from “criterion contamination”, which arises when a predicted score depends in part on the predictors themselves. This causes artificial effect size inflation which reduces model generalizability. The present study compares the performance of Mirror models versus “Non-Mirror models”, which are developed from language that does not mirror the assessment they are developed to predict. N = 110 research participants completed two different interviews: structured diagnostic and life history interviews. GPT-4, GPT-4o and LLaMA3-70B were then prompted to predict structured diagnostic interview depression scores from the two transcripts separately. Mirror models (using structured diagnostic data) showed very large effect sizes (e.g., R² = .80). As expected, Non-Mirror models (using life history data) demonstrated smaller effect sizes, but were relatively large (e.g., R² = .27). When Mirror and Non-Mirror model-predicted structured interview depression scores were correlated with self-reported depression symptoms, Mirror and Non-Mirror performed the same (e.g., r = ~.54), indicating that Mirror models contain bias perhaps due to criterion contamination. Topic modeling identified clusters across Mirror and Non-Mirror models, as well as between true-positive and false-positive predictions. In this head-to-head comparison study, Mirror language AI models of depression showed artificially inflated effect sizes and less generalizability. As language AI models for depression continue to evolve, incorporating Non-Mirror models may identify interpretable, and generalizable semantic features that have unique utility in real-world psychological assessment.

[NLP-49] Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

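[Quick read]: This paper asks whether the fleeting nature of human memory actually benefits language learning. Transformer language models usually have no memory limitations or recency biases, so they may learn language differently from humans. The key to the approach is a set of tightly controlled experiments that train transformers with and without fleeting memory (rapid loss of the exact wordforms just processed) on a developmentally realistic training set, then compare language modeling performance and the ability to predict human reading times. Fleeting memory improves language learning, on both overall language modeling metrics and targeted syntactic evaluation, but it impairs surprisal-based prediction of human reading times. A benefit of memory limitations for language learning therefore does not carry over to predicting behavior.

Link: https://arxiv.org/abs/2508.05803
Authors: Abishek Thamma,Micha Heilbron
Affiliations: University of Amsterdam; Amsterdam Brain and Cognition; Vrije Universiteit Amsterdam; Department of Informatics
Categories: Computation and Language (cs.CL)
Comments: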

Abstract:Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.
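
The reading-time analysis above relies on surprisal, which is conventionally defined as the negative log probability of a word given its context. The following is a minimal sketch of that definition only; the probabilities are made-up illustrative values, not outputs of the models in the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal (in bits) of a word whose conditional probability is `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical next-word probabilities from some language model (assumed values).
word_probs = {"the": 0.5, "cat": 0.125, "sat": 0.25}
surprisals = {word: surprisal_bits(p) for word, p in word_probs.items()}
print(surprisals)  # {'the': 1.0, 'cat': 3.0, 'sat': 2.0}
```

Less predictable words get higher surprisal, and surprisal-based reading-time studies typically regress per-word reading times on values like these.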

[NLP-50] Basic interactive algorithms: Preview

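[Quick read]: This paper addresses how to axiomatize basic interactive algorithms, extending the axiomatization of classical algorithms to broader models of computation such as probabilistic and quantum algorithms. The key idea is to view these algorithms uniformly as basic algorithms equipped with appropriate oracles. This keeps the treatment formally rigorous, clarifies the difference between the original Church-Turing thesis and the "physical thesis", and gives nondeterministic, probabilistic, and quantum circuit algorithms a common formal foundation.

Link: https://arxiv.org/abs/2508.05798
Authors: Yuri Gurevich
Affiliations: University of Michigan, Ann Arbor, MI, USA
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Logic (math.LO); Quantum Physics (quant-ph)
Comments: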

Abstract:This dialog paper offers a preview and provides a foretaste of an upcoming work on the axiomatization of basic interactive algorithms. The modern notion of algorithm was elucidated in the 1930s–1950s. It was axiomatized a quarter of a century ago as the notion of "sequential algorithm" or "classical algorithm"; we prefer to call it "basic algorithm" now. The axiomatization was used to show that for every basic algorithm there is a behaviorally equivalent abstract state machine. It was also used to prove the Church-Turing thesis as it has been understood by the logicians. Starting from the 1960s, the notion of algorithm has expanded -- probabilistic algorithms, quantum algorithms, etc. -- prompting introduction of a much more ambitious version of the Church-Turing thesis commonly known as the "physical thesis." We emphasize the difference between the two versions of the Church-Turing thesis and illustrate how nondeterministic and probabilistic algorithms can be viewed as basic algorithms with appropriate oracles. The same view applies to quantum circuit algorithms and many other classes of algorithms. Journal reference: The Bulletin of the EATCS, volume 146, June 2025. DOI: https://doi.org/10.48550/arXiv.2508.05798

[NLP-51] FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

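[Quick read]: This paper addresses hallucination by large language models (LLMs) in dialogue systems, where generated responses contain factually incorrect or fabricated information, a significant challenge for natural language processing (NLP) applications. Existing methods usually assign one coarse-grained label for the factual consistency of a whole response, although a single response can mix accurate, inaccurate, and unverifiable atomic facts, which makes the evaluation imprecise. The authors therefore propose FineDialFact, a benchmark for fine-grained dialogue fact verification that extracts atomic facts from a response and verifies them one by one. The key contributions are a fine-grained annotated dataset built from publicly available dialogue datasets and evidence that Chain-of-Thought (CoT) reasoning improves fact verification. Even so, the best F1 score on HybriDialogue is only 0.75, so the task remains challenging.

Link: https://arxiv.org/abs/2508.05782
Authors: Xiangyan Chen,Yufeng Li,Yujian Gan,Arkaitz Zubiaga,Matthew Purver
Affiliations: 1. University College London; 2. Queen Mary University of London
Categories: Computation and Language (cs.CL)
Comments: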

Abstract:Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be public on GitHub.
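
To make the reported F1 figure concrete, here is a minimal sketch of scoring atomic-fact labels with a macro-averaged F1. The label set and the choice of macro averaging are assumptions for illustration; the abstract does not say which averaging scheme the benchmark uses.

```python
# Hypothetical label set for atomic facts (an assumption, not from the paper).
LABELS = ("supported", "refuted", "unverifiable")


def f1_for_label(gold, pred, label):
    """F1 for one label, computed from true/false positives and false negatives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, label) for label in LABELS) / len(LABELS)


# One dialogue response split into four atomic facts (toy data).
gold = ["supported", "refuted", "unverifiable", "supported"]
pred = ["supported", "supported", "unverifiable", "supported"]
print(round(macro_f1(gold, pred), 3))  # 0.6
```

Scoring each atomic fact separately is what lets a response that is partly right and partly wrong receive partial credit, instead of a single pass or fail label.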

[NLP-52] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

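[Quick read]: This survey addresses the sociotechnical challenge that large language models (LLMs), for all their content generation capabilities, can inadvertently or intentionally produce harmful, biased, or offensive content. The core question is how to identify, categorize, and mitigate these harms, which span unintentional toxicity, adversarial jailbreaking attacks, and gaps in content moderation. The key contribution is a unified taxonomy of LLM-related harms and defenses. The survey reviews mainstream mitigation methods, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment, and it identifies limitations in current evaluation methodologies, giving a framework and direction for building more robust and ethically aligned language technologies.

Link: https://arxiv.org/abs/2508.05775
Authors: Chi Zhang,Changjia Zhu,Junjie Xiong,Xiaoran Xu,Lingyao Li,Yao Liu,Zhuo Lu
Affiliations: University of South Florida; Missouri University of Science and Technology
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: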

Abstract:Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (QA), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.

[NLP-53] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

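[Quick read]: This paper addresses the difficulty multimodal large language models (MLLMs) have in robustly grounding natural language instructions on graphical user interfaces (GUIs) from pure visual input, in particular in achieving correct semantic alignment. Reinforcement Learning with Verifiable Rewards (RLVR) improves spatial alignment, but inefficient exploration keeps models from learning difficult semantic associations. The key to the solution is Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. It uses a multi-answer generation strategy to broaden exploration and guides that exploration with an Adaptive Exploration Reward (AER) function derived from the efficiency formula η = U/C (the ratio of utility to cost). The AEPO-trained models InfiGUI-G1-3B and InfiGUI-G1-7B set new state-of-the-art results on multiple GUI grounding benchmarks, with relative improvements of up to 9.0% over the naive RLVR baseline.

Link: https://arxiv.org/abs/2508.05731
Authors: Yuhang Liu,Zeyu Liu,Shuanghe Zhu,Pengxiang Li,Congkai Xie,Jiasheng Wang,Xueyu Hu,Xiaotian Han,Jianbo Yuan,Xinyao Wang,Shengyu Zhang,Hongxia Yang,Fei Wu
Affiliations: InfiX.ai; Amazon
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 3 figures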

Abstract:The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at this https URL.
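
The abstract gives only the efficiency ratio η = U/C, not the full AER definition. The sketch below is a toy reading of that ratio for multi-answer grounding, under two loudly stated assumptions that are not from the paper: utility is taken as 1/rank of the first candidate point that lands inside the target element, and cost is taken as the number of candidates proposed.

```python
def first_hit_rank(points, box):
    """1-based rank of the first candidate point inside `box`, or None if all miss."""
    x0, y0, x1, y1 = box
    for rank, (x, y) in enumerate(points, start=1):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return rank
    return None


def exploration_efficiency(points, box):
    """Toy eta = U / C with U = 1 / rank of first hit and C = number of candidates.

    Both U and C are illustrative assumptions, not the paper's AER function.
    """
    rank = first_hit_rank(points, box)
    if rank is None:
        return 0.0
    utility = 1.0 / rank
    cost = float(len(points))
    return utility / cost


target = (10, 10, 50, 30)  # hypothetical bounding box of the target UI element
print(exploration_efficiency([(20, 20)], target))                    # one confident, correct guess
print(exploration_efficiency([(0, 0), (20, 20), (90, 90)], target))  # correct on the second of three
print(exploration_efficiency([(0, 0), (90, 90)], target))            # every candidate misses
```

Under these assumptions a single correct answer scores highest, a hit buried among extra candidates scores lower, and a miss scores zero, which is the kind of trade-off between exploring and committing that an efficiency-based reward is meant to express.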

[NLP-54] PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare

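[Quick read]: This paper addresses the lack of high-quality parallel corpora for cross-lingual processing of healthcare texts, specifically between English and Arabic. The key contribution is PEACH, a manually aligned parallel corpus of patient information leaflets and educational materials. It contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens, with average sentence lengths between 9.52 and 11.83 words. As a gold-standard resource it supports contrastive linguistics, translation studies, and natural language processing tasks such as deriving bilingual lexicons, adapting large language models for domain-specific machine translation, and assessing the readability and lay-friendliness of healthcare texts.

Link: https://arxiv.org/abs/2508.05722
Authors: Rania Al-Sabbagh
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: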

Abstract:This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Sentence lengths vary between 9.52 and 11.83 words on average. As a manually aligned corpus, PEACH is a gold-standard corpus, aiding researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess patient information leaflets and educational materials’ readability and lay-friendliness, and as an educational resource in translation studies. PEACH is publicly accessible.
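
As a quick sanity check on the corpus statistics, the overall tokens-per-sentence averages can be recomputed from the totals quoted in the abstract. This is only arithmetic on the reported figures; how the paper itself derived the 9.52 to 11.83 range (for example, per sub-corpus or per language) is not stated, so the comparison is indicative.

```python
# Totals as quoted in the abstract.
SENTENCE_PAIRS = 51_671
ENGLISH_TOKENS = 590_517
ARABIC_TOKENS = 567_707


def mean_tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average number of word tokens per aligned sentence."""
    return tokens / sentences


english_mean = mean_tokens_per_sentence(ENGLISH_TOKENS, SENTENCE_PAIRS)
arabic_mean = mean_tokens_per_sentence(ARABIC_TOKENS, SENTENCE_PAIRS)
print(f"English: {english_mean:.2f} tokens per sentence")  # English: 11.43 tokens per sentence
print(f"Arabic:  {arabic_mean:.2f} tokens per sentence")   # Arabic:  10.99 tokens per sentence
```

Both overall averages fall inside the reported 9.52 to 11.83 range, so the totals and the per-sentence figures are consistent with each other.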

[NLP-55] DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection ICDM

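[Quick read]: This paper addresses two problems in insider threat detection (ITD): traditional models struggle to capture semantic intent and complex behavior dynamics, and existing LLM-based solutions are limited in prompt adaptability and modality coverage. The key to the solution is DMFI, a dual-modality framework that converts raw logs into two structured views: a semantic view that processes content-rich artifacts (e.g., emails and HTTPS traffic) with instruction-formatted prompts, and a behavioral abstraction built through a 4W-guided (When-Where-What-Which) transformation that encodes contextual action sequences. Two LoRA-enhanced LLMs are fine-tuned independently, one per view, and a lightweight MLP-based decision module fuses their outputs. DMFI-B, a discriminative adaptation strategy, additionally separates normal and abnormal behavior representations to improve robustness under severe class imbalance. The approach combines the semantic reasoning of LLMs with structured behavior modeling for a scalable, practical detection solution.

Link: https://arxiv.org/abs/2508.05694
Authors: Kaichuan Kong,Dongjie Liu,Xiaobo Jin,Guanggang Geng,Zhiying Li,Jian Weng
Affiliations: Jinan University; Xi’an Jiaotong-Liverpool University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to the 2025 IEEE International Conference on Data Mining (ICDM)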

Abstract:Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity due to the subtle, long-term, and context-dependent nature of malicious insider behaviors. Traditional models often struggle to capture semantic intent and complex behavior dynamics, while existing LLM-based solutions face limitations in prompt adaptability and modality coverage. To bridge this gap, we propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning. DMFI converts raw logs into two structured views: (1) a semantic view that processes content-rich artifacts (e.g., emails, https) using instruction-formatted prompts; and (2) a behavioral abstraction, constructed via a 4W-guided (When-Where-What-Which) transformation to encode contextual action sequences. Two LoRA-enhanced LLMs are fine-tuned independently, and their outputs are fused via a lightweight MLP-based decision module. We further introduce DMFI-B, a discriminative adaptation strategy that separates normal and abnormal behavior representations, improving robustness under severe class imbalance. Experiments on CERT r4.2 and r5.2 datasets demonstrate that DMFI outperforms state-of-the-art methods in detection accuracy. Our approach combines the semantic reasoning power of LLMs with structured behavior modeling, offering a scalable, effective, and deployable solution for real-world insider threat detection.
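
The 4W-guided transformation can be pictured as turning each raw log record into a short When-Where-What-Which line that an LLM can read as part of an action sequence. The sketch below is only an illustration of that idea: the `LogEvent` fields, the mapping of fields to the four W's, and the output template are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEvent:
    """Hypothetical raw log record; the field-to-4W mapping is an assumption."""
    timestamp: datetime  # When
    host: str            # Where
    action: str          # What
    target: str          # Which (the object the action applies to)


def to_4w(event: LogEvent) -> str:
    """Render one event as a single When-Where-What-Which line."""
    when = event.timestamp.strftime("%Y-%m-%d %H:%M")
    return (
        f"When: {when} | Where: {event.host} | "
        f"What: {event.action} | Which: {event.target}"
    )


# Toy events for one user session (made-up values).
events = [
    LogEvent(datetime(2010, 1, 4, 22, 15), "PC-1234", "logon", "user account"),
    LogEvent(datetime(2010, 1, 4, 22, 31), "PC-1234", "file copy", "removable drive"),
]
sequence = "\n".join(to_4w(event) for event in events)
print(sequence)
```

A sequence like this would form the behavioral view, while content-rich artifacts such as email bodies would go through the separate semantic view.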

[NLP-56] DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing

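[Quick read]: This paper addresses the dual adversarial threats that large language models (LLMs) and generative AI face in applications such as customer service and content moderation: external adversarial perturbations and internal label corruption. The key to the solution is DINA (Dual Defense Against Internal Noise and Adversarial Attacks), a unified framework that adapts advanced noisy-label learning methods from computer vision and integrates them with adversarial training to defend against both threats at once. Experiments on a real-world dataset from an online gaming service show that it improves the robustness and accuracy of natural language processing (NLP) systems over baseline models.

Link: https://arxiv.org/abs/2508.05671
Authors: Ko-Wei Chuang,Hen-Hsen Huang,Tsai-Yen Li
Affiliations: National Chengchi University; Academia Sinica
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 7 pages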

Abstract:As large language models (LLMs) and generative AI become increasingly integrated into customer service and moderation applications, adversarial threats emerge from both external manipulations and internal label corruption. In this work, we identify and systematically address these dual adversarial threats by introducing DINA (Dual Defense Against Internal Noise and Adversarial Attacks), a novel unified framework tailored specifically for NLP. Our approach adapts advanced noisy-label learning methods from computer vision and integrates them with adversarial training to simultaneously mitigate internal label sabotage and external adversarial perturbations. Extensive experiments conducted on a real-world dataset from an online gaming service demonstrate that DINA significantly improves model robustness and accuracy compared to baseline models. Our findings not only highlight the critical necessity of dual-threat defenses but also offer practical strategies for safeguarding NLP systems in realistic adversarial scenarios, underscoring broader implications for fair and responsible AI deployment.

[NLP-57] Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports

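[Quick read]: This paper addresses accurate extraction and representation of table structure from Malaysian audited financial reports, including difficult cases such as rotated layouts, multi-level headers, and implicit structural cues, with the goal of high-fidelity conversion of financial tables to Markdown. The key to the solution is domain-specific fine-tuning of a vision-language model (VLM) based on Qwen2.5-VL-7B, using a curated dataset of 2,152 image-text pairs with augmentations and supervised fine-tuning with LoRA (Low-Rank Adaptation). The fine-tuned model reaches 92.20% overall accuracy on a criteria-based LLM-as-a-judge assessment and a 96.53% Markdown TEDS score, while reducing inference time compared with self-hosted alternatives.

Link: https://arxiv.org/abs/2508.05669
Authors: Jin Khye Tan(Faculty of Computer Science and Information Technology, Universiti Malaya),En Jun Choong,Ethan Jeremiah Chitty,Yan Pheng Choo,John Hsin Yang Wong,Chern Eu Cheah
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 14 figures, 5 tables. Evaluation code (LLM-as-a-judge and Markdown TEDS) is available at this https URL . The development dataset and evaluation benchmark are available on Hugging Face at this https URL and this https URL respectively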

Abstract:Accurately extracting and representing the structure of tabular data from financial documents remains a critical challenge in document understanding, particularly for regulatory and analytical use cases. This study addresses the complexity of converting financial tables from Malaysian audited financial reports into Markdown format, a task complicated by rotated layouts, multi-level headers, and implicit structural cues. We propose a fine-tuned vision-language model (VLM), based on Qwen2.5-VL-7B, optimized for high-fidelity Markdown generation from document images. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. To assess performance, we evaluated our model on 100 out-of-sample tables using a dual framework: a criteria-based LLM-as-a-judge for fine-grained accuracy and our novel Markdown Tree-Edit-Distance-based Similarity (TEDS) metric for holistic structural fidelity. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% Markdown TEDS score. This performance significantly surpasses its Qwen2.5-VL-7B base model, larger-scale VLMs, and specialized reasoning-enabled models. Compared to these self-hosted alternatives, it also significantly reduces inference time. Furthermore, its accuracy exceeds that of widely used proprietary models such as OpenAI’s GPT-4o and Gemini 2.5 Flash. These results demonstrate that domain-specific fine-tuning provides an effective and efficient method to bridge the gap between unstructured financial documents and downstream automation, rivalling much larger and more general models without their computational overhead.
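
For orientation, the Tree-Edit-Distance-based Similarity that the Markdown TEDS metric builds on is usually defined as below, where $T_a$ and $T_b$ are the predicted and reference tables represented as trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in $T$. This is the commonly used general definition; how the paper adapts it to Markdown tables is not described in the abstract, so the Markdown variant may differ in detail.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical, and the score falls toward 0 as more node insertions, deletions, or relabelings are needed to turn one table into the other.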

[NLP-58] A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges

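[Quick read]: This survey addresses the lack of systematic analysis of search agents based on large language models (LLMs) with respect to architecture, optimization strategies, applications, and evaluation methods. Its key contribution is the first systematic review of existing search agent research, which categorizes and analyzes prior work along those four dimensions, identifies critical open challenges, and outlines promising directions for future research.

Link: https://arxiv.org/abs/2508.05668
Authors: Yunjia Xi,Jianghao Lin,Yongzhao Xiao,Zheli Zhou,Rong Shan,Te Gao,Jiachen Zhu,Weiwen Liu,Yong Yu,Weinan Zhang
Affiliations: Shanghai Jiao Tong University; Central South University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: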

Abstract:The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environmental context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI’s Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field. Our repository is available on this https URL.

[NLP-59] Enhancing Retrieval-Augmented Generation for Electric Power Industry Customer Support

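[Quick read]: This paper addresses the poor performance of current AI customer service systems on ambiguous, multi-intent, or detail-specific queries in the electric power domain, a shortfall that stems from the limits of standard natural language processing (NLP) pipelines and fine-tuned language models. The key to the solution is a combination of retrieval-augmented generation (RAG) techniques: intent recognition decomposes complex questions into more targeted sub-queries, RAG Fusion merges multiple retrievals to handle vague or multifaceted queries, context reranking filters irrelevant contexts to reduce hallucinations, and a graph-based RAG framework handles complex queries better than a vector-store one. The final system achieves 97.9% accuracy on a GPT-4-generated dataset and 89.6% on a real-world electricity provider FAQ dataset, substantially outperforming baseline RAG models.

Link: https://arxiv.org/abs/2508.05664
Authors: Hei Yu Chan,Kuok Tou Ho,Chenglong Ma,Yujing Si,Hok Lai Lin,Sa Lei Lam
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages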

Abstract:Many AI customer service systems use standard NLP pipelines or finetuned language models, which often fall short on ambiguous, multi-intent, or detail-specific queries. This case study evaluates recent techniques: query rewriting, RAG Fusion, keyword augmentation, intent recognition, and context reranking, for building a robust customer support system in the electric power domain. We compare vector-store and graph-based RAG frameworks, ultimately selecting the graph-based RAG for its superior performance in handling complex queries. We find that query rewriting improves retrieval for queries using non-standard terminology or requiring precise detail. RAG Fusion boosts performance on vague or multifaceted queries by merging multiple retrievals. Reranking reduces hallucinations by filtering irrelevant contexts. Intent recognition supports the decomposition of complex questions into more targeted sub-queries, increasing both relevance and efficiency. In contrast, keyword augmentation negatively impacts results due to biased keyword selection. Our final system combines intent recognition, RAG Fusion, and reranking to handle disambiguation and multi-source queries. Evaluated on both a GPT-4-generated dataset and a real-world electricity provider FAQ dataset, it achieves 97.9% and 89.6% accuracy respectively, substantially outperforming baseline RAG models.
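
RAG Fusion is commonly implemented by issuing several query variants and merging their ranked results with reciprocal rank fusion (RRF). The abstract does not say which merging rule the authors used, so the following is a sketch of plain RRF under that assumption; the document IDs are made up and `k=60` is the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Retrieval results for three rewrites of the same customer question (toy data).
rankings = [
    ["billing_faq", "outage_map", "tariff_table"],
    ["outage_map", "billing_faq", "meter_reading"],
    ["outage_map", "tariff_table", "billing_faq"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)  # ['outage_map', 'billing_faq', 'tariff_table', 'meter_reading']
```

Documents that rank well across several query variants rise to the top, which is why this kind of merging helps with vague or multifaceted questions.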

[NLP-60] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

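[Quick read]: This paper addresses two problems in molecular property prediction with large language models (LLMs): reliance on human-crafted prompts and chain-of-thought templates, and the verbose, weakly relevant reasoning of existing reasoning models such as DeepSeek-R1. The key to the solution is AttriLens-Mol, an attribute-guided reinforcement learning framework that steers the model's reasoning with three rewards: (1) a format reward that encourages attribute-based structured output; (2) a count reward that discourages enumerating irrelevant attributes; and (3) a rationality reward that uses advanced LLMs and RDKit to verify the relatedness of the generated attributes. The method implicitly elicits the model's inherent knowledge of relevant molecular attributes and improves both performance and interpretability, giving results comparable to or better than supervised fine-tuning models and advanced models (GPT-4o, DeepSeek-R1, etc.) on in-distribution and out-of-distribution datasets.

Link: https://arxiv.org/abs/2508.04748
Authors: Xuan Lin,Long Chen,Yile Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages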

Abstract:Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended “thinking” process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning and enables more effective prediction of the molecular property. Experiments on both in-distribution and out-of-distribution datasets show that training both 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts the performance, getting comparable or better results than supervised fine-tuning models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code in this https URL.
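
To illustrate how format and count rewards of this kind might be scored, here is a minimal sketch. Everything concrete in it is an assumption made for illustration: the `<attributes>...</attributes><answer>...</answer>` output template, the semicolon separator, the cap of five attributes, and the 0/1 reward values are hypothetical, and the rationality reward (which needs an LLM judge and RDKit) is left out.

```python
import re

# Hypothetical output template; the paper's actual format is not given in the abstract.
OUTPUT_PATTERN = re.compile(
    r"<attributes>(.*?)</attributes>\s*<answer>(.*?)</answer>", re.S
)


def parse_attributes(output):
    """Return the listed attributes, or None if the output breaks the template."""
    match = OUTPUT_PATTERN.search(output)
    if match is None:
        return None
    return [item.strip() for item in match.group(1).split(";") if item.strip()]


def format_reward(output):
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if parse_attributes(output) is not None else 0.0


def count_reward(output, max_attrs=5):
    """1.0 if between 1 and `max_attrs` attributes are listed, else 0.0."""
    attrs = parse_attributes(output)
    if attrs is None:
        return 0.0
    return 1.0 if 1 <= len(attrs) <= max_attrs else 0.0


concise = (
    "<attributes>logP: 2.1; H-bond donors: 1; molecular weight: 180.2</attributes>"
    "<answer>yes</answer>"
)
verbose = (
    "<attributes>a: 1; b: 2; c: 3; d: 4; e: 5; f: 6</attributes><answer>no</answer>"
)
print(format_reward(concise), count_reward(concise))  # 1.0 1.0
print(format_reward(verbose), count_reward(verbose))  # 1.0 0.0
```

The point of separating the two signals is that a well-formed answer can still be penalized for padding its attribute list, which is the behavior the count reward is described as discouraging.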

[NLP-61] Indian Legal NLP Benchmarks: A Survey

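[Quick read]: This paper addresses the fact that current natural language processing (NLP) benchmarks are mostly built on general English text, while Indian legal text differs significantly from ordinary English, which limits how well existing NLP models handle it. To spur innovation in NLP for the Indian legal setting, dedicated and challenging benchmarks are needed that focus on tasks specific to legal systems. The key contribution is a review of existing work in this area together with ideas for creating new benchmarks for Indian legal NLP, to the benefit of both the AI community and the legal fraternity.

Link: https://arxiv.org/abs/2107.06056
Authors: Prathamesh Kalamkar,Janani Venugopalan Ph.D.,Vivek Raghavan Ph.D
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: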

Abstract:Availability of challenging benchmarks is the key to advancement of AI in a specific field. Since Legal Text is significantly different than normal English text, there is a need to create separate Natural Language Processing benchmarks for Indian Legal Text which are challenging and focus on tasks specific to Legal Systems. This will spur innovation in applications of Natural Language Processing for Indian Legal Text and will benefit the AI community and Legal fraternity. We review the existing work in this area and propose ideas to create new benchmarks for Indian Legal Natural Language Processing.

[NLP-62] NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference INTERSPEECH2025

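[Quick read]: This paper addresses slow training and inference in speech processing with large language models (LLMs), caused by audio codecs that operate at high frame rates: autoregressive models then need many steps to generate each second of audio, which raises compute cost. The key contribution is NanoCodec, a low frame-rate, high-fidelity audio codec that runs at 12.5 frames per second (FPS), cutting the number of autoregressive steps while preserving reconstruction quality and so making Speech LLM training and inference more efficient. Ablation studies examine how frame rate, bitrate, and causality affect codec reconstruction quality, and the codec sets a new benchmark for low-latency, efficient speech coding.

Link: https://arxiv.org/abs/2508.05835
Authors: Edresson Casanova,Paarth Neekhara,Ryan Langman,Shehzeen Hussain,Subhankar Ghosh,Xuesong Yang,Ante Jukić,Jason Li,Boris Ginsburg
Affiliations: NVIDIA Corporation
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2025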

Abstract:Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
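
The efficiency argument is simple arithmetic: an autoregressive model takes roughly one step per codec frame, so halving the frame rate halves the steps. The sketch below works that through. The 12.5 FPS figure comes from the abstract; the 75 FPS comparison point, the 8 codebooks, and the codebook size of 1024 are hypothetical values chosen for illustration, and the bitrate formula assumes each frame emits one token per codebook.

```python
import math


def autoregressive_steps(duration_s: float, frames_per_second: float) -> int:
    """Frames (and so decoding steps) needed for `duration_s` seconds of audio."""
    return math.ceil(duration_s * frames_per_second)


def bitrate_bps(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second, assuming one token per codebook in every frame."""
    return frames_per_second * num_codebooks * math.log2(codebook_size)


low_rate_steps = autoregressive_steps(10.0, 12.5)   # 12.5 FPS, as in the abstract
high_rate_steps = autoregressive_steps(10.0, 75.0)  # 75 FPS is a hypothetical comparison
print(low_rate_steps, high_rate_steps)               # 125 750

# Hypothetical codec configuration: 8 codebooks of 1024 entries each.
print(bitrate_bps(12.5, 8, 1024))                    # 1000.0
```

For ten seconds of audio, the low frame-rate codec needs six times fewer decoding steps than the hypothetical 75 FPS one, which is where the training and inference savings would come from.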

Computer Vision

[CV-0] LightSwitch: Multi-view Relighting with Material-guided Diffusion ICCV2025

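[Quick read]: This paper addresses the subpar results of existing 3D relighting methods that rely on 2D image generative priors but ignore the subject's intrinsic properties (such as material) and cannot use multi-view data at scale. The key to the solution is LightSwitch, a finetuned material-relighting diffusion framework that combines multi-view images with inferred intrinsic properties and a scalable denoising scheme. It efficiently relights an arbitrary number of input images to a target lighting condition while keeping the structure consistent, improving both 2D relighting prediction quality and overall relighting results.

Link: https://arxiv.org/abs/2508.06494
Authors: Yehonathan Litman,Fernando De la Torre,Shubham Tulsiani
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project page Code: this https URL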

Abstract:Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image do not take advantage of intrinsic properties of the subject that can be inferred or cannot consider multi-view data at scale, leading to subpar relighting. In this paper, we propose Lightswitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes.

[CV-1] WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion

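[Quick read]: This paper tackles the estimation of Land Surface Temperature (LST) at high spatial and temporal resolution, specifically producing LST at 10 m resolution on a daily basis. Remote sensing systems trade spatial resolution against temporal resolution, and conventional fusion methods fall short of what fine-grained environmental monitoring needs. The key to the solution is WGAST, a weakly supervised generative network with a conditional generative adversarial architecture whose generator has four stages: feature extraction, spatio-temporal fusion, LST reconstruction, and noise suppression. The fusion stage combines Terra MODIS, Landsat 8, and Sentinel-2 data using cosine similarity, normalization, and temporal attention, which allows LST reconstruction without full-resolution labels. Training follows a weakly supervised strategy based on physical averaging principles, reinforced by a PatchGAN discriminator for stability and detail fidelity.

Link: https://arxiv.org/abs/2508.06485
Authors: Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Affiliations: INSA Centre Val de Loire, Université d’Orléans, PRISME UR 4229; Université d’Orléans, INSA CVL, PRISME UR 4229; Université d’Orléans, CEDETE, UR 1210
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS)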

Abstract:Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at this https URL.
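
One plausible reading of "physical averaging principles" is that a fine-resolution prediction, once averaged over each coarse pixel's footprint, should match the coarse observation. The abstract does not spell out the loss, so the sketch below is an assumption: `block_average` and `averaging_loss` are hypothetical names, and the grids are toy values in kelvin.

```python
def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D list.

    Assumes both grid dimensions are divisible by `factor`.
    """
    out = []
    for r in range(0, len(grid), factor):
        row = []
        for c in range(0, len(grid[0]), factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def averaging_loss(pred_fine, obs_coarse, factor):
    """Mean squared gap between the block-averaged prediction and the coarse observation."""
    coarse = block_average(pred_fine, factor)
    diffs = [(p - o) ** 2
             for pred_row, obs_row in zip(coarse, obs_coarse)
             for p, o in zip(pred_row, obs_row)]
    return sum(diffs) / len(diffs)

# Toy 2x2 fine grid whose mean equals the single coarse pixel.
fine = [[300.0, 302.0], [304.0, 306.0]]
print(averaging_loss(fine, [[303.0]], factor=2))
```

A consistency term of this kind needs no fine-resolution ground truth, which is what makes the supervision weak.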

[CV-2] Text Embedded Swin-UMamba for DeepLesion Segmentation

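[Quick read]: This paper aims to improve automatic lesion segmentation in medical imaging, in particular for the clinical assessment of chronic diseases such as lymphoma, by combining imaging features with the text descriptions found in radiology reports. The key to the solution is the Text-Swin-UMamba model, which embeds short text descriptions extracted by a large language model (LLM) into the Swin-UMamba architecture so that image and text are modeled jointly. On the ULS23 DeepLesion dataset the method reaches a Dice score of 82% and a Hausdorff distance of 6.58 pixels, outperforming the image-only xLSTM-UNet and nnUNet models and the LLM-driven LanGuideMedSeg model.

Link: https://arxiv.org/abs/2508.06453
Authors: Ruida Cheng,Tejas Sudharshan Mathai,Pritam Mukherjee,Benjamin Hou,Qingqing Zhu,Zhiyong Lu,Matthew McAuliffe,Ronald M. Summers
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: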

Abstract:Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice Score of 82% and low Hausdorff distance of 6.58 (pixels) was obtained for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and surpassed the purely image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. The dataset and code can be accessed at this https URL
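
For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), on flat binary masks. The function name and the toy masks are illustrative and are not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks given as lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Prediction marks two pixels, ground truth marks one; they share one pixel.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap at all.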

[CV-3] TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation

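[Quick read]: This paper addresses the performance drop of Unsupervised Domain Adaptation (UDA) under complex domain shifts, such as geographical shift, where both background and object appearance differ markedly across domains. Conventional UDA methods do well on classical shifts (for example synthetic-to-real) but struggle under complex ones. The key to the solution is the TRUST framework, with two main contributions: (1) it exploits the robustness of the language modality by generating pseudo-labels for target samples from their image captions, estimates the uncertainty of each pseudo-label from normalised CLIP similarity scores, and reweights the classification loss accordingly to limit the harm from low-quality pseudo-labels; (2) it introduces a multimodal soft-contrastive learning loss that uses captions to guide contrastive training of the vision model, so that each image pair is attracted and repelled with a strength proportional to the similarity of its captions, which removes the need to decide on positive and negative pairs in the UDA setting. The method sets a new state of the art on both DomainNet and GeoNet.

Link: https://arxiv.org/abs/2508.06452
Authors: Mattia Litrico,Mario Valerio Giuffrida,Sebastiano Battiato,Devis Tuia
Affiliations: 1. University of Catania; 2. University of Palermo; 3. École Polytechnique Fédérale de Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: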

Abstract:Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g. geographical shift), where both the background and object appearances differ significantly across domains. Prior works showed that the language modality can help in the adaptation process, exhibiting more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. Such estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces, by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need for hardly determining positive and negative pairs, which is critical in the UDA setting. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
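
The pseudo-labelling and loss reweighting described above can be sketched as follows. This is not the authors' implementation: using a softmax as the normalisation, taking the top probability as the confidence, and all function names are assumptions made for illustration.

```python
import math

def normalise(similarities, temperature=1.0):
    """Softmax over caption-to-class similarity scores (assumed normalisation)."""
    peak = max(similarities)
    exps = [math.exp((s - peak) / temperature) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(similarities):
    """Return (class index, confidence) for one target sample."""
    probs = normalise(similarities)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

def reweighted_loss(per_sample_losses, confidences):
    """Confidence-weighted mean, so uncertain pseudo-labels contribute less."""
    weighted = sum(l * c for l, c in zip(per_sample_losses, confidences))
    return weighted / sum(confidences)

label, confidence = pseudo_label([0.1, 0.9, 0.2])
print(label, round(confidence, 3))
```

With this weighting, a sample whose caption matches several classes about equally gets a confidence near 1/K and therefore has little influence on the classification loss.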

[CV-4] CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

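[Quick read]: This paper addresses a performance bottleneck of Contrastive Language-Image Pretraining (CLIP) on two typical kinds of data: large-scale natural image-text datasets (often collected automatically from the web), which are loosely aligned semantically and provide weak supervision, and medical image-text datasets, which have high cross-modal correlation but low content diversity. Both make it hard for CLIP models to learn robust, generalizable multimodal representations. The key to the solution is CLIPin, a unified non-contrastive plug-in that integrates seamlessly into CLIP-style architectures to strengthen semantic alignment and alignment robustness. Two shared pre-projectors, one for the image modality and one for the text modality, let contrastive and non-contrastive learning be combined as a parameter compromise, improving multimodal representations without changing the original framework.

Link: https://arxiv.org/abs/2508.06434
Authors: Shengzhu Yang,Jiawei Du,Shuai Lu,Weihang Zhang,Ningli Wang,Huiqi Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: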

Abstract:Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model’s ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at this https URL.

[CV-5] MotionSwap

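[Quick read]: This paper targets weak identity preservation, poor attribute consistency, and unsatisfactory visual quality in high-fidelity face swapping. The solution makes several key changes to the original SimSwap model: self-attention and cross-attention mechanisms in the generator architecture to better model local and global dependencies; dynamic loss weighting during training so that the generator balances identity information and fine texture across stages; and a cosine annealing learning rate schedule for more stable convergence. These changes raise identity similarity, lower FID scores, and give visibly more realistic qualitative results.

Link: https://arxiv.org/abs/2508.06430
Authors: Om Patil,Jinesh Modi,Suryabha Mukhopadhyay,Meghaditya Giri,Chhavi Malhotra
Affiliations: BITS Pilani, Hyderabad Campus
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures, 5 tables. This is a student research submission from BITS Pilani, Hyderabad Campus. Our implementation enhances SimSwap with attention modules and dynamic training strategies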

Abstract:Face swapping technology has gained significant attention in both academic research and commercial applications. This paper presents our implementation and enhancement of SimSwap, an efficient framework for high fidelity face swapping. We introduce several improvements to the original model, including the integration of self and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. These enhancements lead to significant improvements in identity preservation, attribute consistency, and overall visual quality. Our experimental results, spanning 400,000 training iterations, demonstrate progressive improvements in generator and discriminator performance. The enhanced model achieves better identity similarity, lower FID scores, and visibly superior qualitative results compared to the baseline. Ablation studies confirm the importance of each architectural and training improvement. We conclude by identifying key future directions, such as integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and introducing temporal consistency for video-based applications.
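
Cosine annealing, one of the training changes listed above, follows the usual formula lr(t) = lr_min + 0.5 (lr_max - lr_min)(1 + cos(pi t / T)). The sketch below is a generic version of that schedule: the learning-rate values are hypothetical, and spreading a single cycle over all 400,000 iterations is an assumption, since the abstract gives neither.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate after `step` iterations of a single cosine cycle."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 400_000  # iteration count from the abstract
LR_MAX = 2e-4          # hypothetical starting learning rate

for step in (0, 100_000, 200_000, 400_000):
    print(step, cosine_annealing_lr(step, TOTAL_STEPS, LR_MAX))
```

The rate starts at `lr_max`, passes through the midpoint halfway, and decays smoothly to `lr_min` at the end of the cycle.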

[CV-6] SPARSE Data Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation

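[Quick read]: This paper addresses the limits that scarce labeled data place on deep learning for medical image classification, especially in extreme low-shot settings (as few as 5 labeled samples per class). The key to the solution is a GAN-based semi-supervised learning framework that trains three specialized networks in three phases: a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier. An ensemble-based pseudo-labeling scheme combines confidence-weighted predictions from the discriminator and the classifier and applies exponential moving averaging for temporal consistency, giving reliable pseudo-labels for the large pool of unlabeled data and better generalization with very few labels.

Link: https://arxiv.org/abs/2508.06429
Authors: Guido Manni,Clemente Lauretti,Loredana Zollo,Paolo Soda
Affiliations: Università Campus Bio-Medico di Roma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: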

Abstract:Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks – a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier – within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at this https URL.
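
The ensemble pseudo-labelling step can be sketched roughly as below. The abstract does not define the weighting or the decay, so using each network's top probability as its weight, the decay of 0.9, and both function names are assumptions for illustration only.

```python
def ensemble_probs(disc_probs, clf_probs):
    """Confidence-weighted average of two class-probability vectors.

    Each network is weighted by its own top probability (an assumed choice).
    """
    w_disc, w_clf = max(disc_probs), max(clf_probs)
    return [(w_disc * d + w_clf * c) / (w_disc + w_clf)
            for d, c in zip(disc_probs, clf_probs)]

def ema_update(previous, current, decay=0.9):
    """Exponential moving average of pseudo-label probabilities across epochs."""
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]

# One unlabeled image, two classes, two consecutive epochs.
running = [0.5, 0.5]
for disc, clf in ([[0.7, 0.3], [0.9, 0.1]], [[0.6, 0.4], [0.8, 0.2]]):
    running = ema_update(running, ensemble_probs(disc, clf))
print(running)
```

The moving average damps epoch-to-epoch flips, so a pseudo-label only changes after the networks disagree with it consistently.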

[CV-7] Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

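[Quick read]: This paper addresses the difficulty generalist robot policies have in generalizing beyond the distribution of their training data. The root cause it identifies is shortcut learning, that is, reliance on task-irrelevant features. Two factors drive shortcut learning: limited diversity within individual sub-datasets, and significant distributional disparities across sub-datasets, which lead to dataset fragmentation. The paper offers two remedies: dataset collection strategies that reduce shortcut learning, and, where collecting new large-scale data is impractical, carefully selected robotic data augmentation applied to existing offline datasets, which reduces shortcut learning and improves generalization in both simulation and real-world environments.

Link: https://arxiv.org/abs/2508.06426
Authors: Youguang Xing,Xu Luo,Junlin Xie,Lianli Gao,Hengtao Shen,Jingkuan Song
Affiliations: UESTC; Tongji University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoRL 2025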

Abstract:Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability. We identify shortcut learning – the reliance on task-irrelevant features – as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$, in both simulation and real-world environments. More information at this https URL.

[CV-8] Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification

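[Quick read]: This paper addresses class imbalance caused by long-tailed data in Synthetic Aperture Radar (SAR) ship classification, in particular the difficulty of recognizing underrepresented classes. The key to the solution is two new algorithms inspired by the Major-to-minor (M2m) method, M2m_f and M2m_u, which oversample in the feature space to strengthen the representation of minority classes. On the public OpenSARShip and FuSARShip datasets, the methods outperform the original M2m and the baselines, with average F1-score gains of 4.44% and 8.82%, respectively.

Link: https://arxiv.org/abs/2508.06420
Authors: Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Oktay Karakus
Affiliations: ISTI-CNR & University of Pisa; ISTI-CNR & NBFC; ISTI-CNR; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented at IGARSS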

Abstract:SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluated the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method, M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. Additionally, we also analyzed the impact of oversampling methods on different class sizes. The results demonstrated the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.

[CV-9] A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

[Quick read]: This paper addresses a gap in conventional Super-Resolution (SR): such methods improve image quality only according to pixel-level metrics and overlook how super-resolved images affect downstream vision tasks such as classification, which limits their value in fields like remote sensing. The key to the solution is a methodology that raises the resolution of Synthetic Aperture Radar (SAR) imagery by optimising loss functions that account for both image quality and classification performance. The approach improves established image quality indicators and also raises classification accuracy.

Link: https://arxiv.org/abs/2508.06407
Authors: Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Oktay Karakus
Affiliations: PhD School in Computer Science, University of Pisa; Institute of Information Science and Technologies, National Research Council of Italy; National Biodiversity Future Center - NBFC; School of Computer Science and Informatics, Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.
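
A loss that accounts for both image quality and classification performance could take the form of a weighted sum. The sketch below is only an illustration of that idea: the abstract does not name the terms, so the choice of pixel-wise MSE, cross-entropy, the weight of 0.1, and all function names are assumptions.

```python
import math

def pixel_mse(super_resolved, high_res):
    """Mean squared error between two flat lists of pixel values."""
    return sum((s - h) ** 2 for s, h in zip(super_resolved, high_res)) / len(high_res)

def cross_entropy(class_probs, true_label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(class_probs[true_label])

def joint_loss(super_resolved, high_res, class_probs, true_label, weight=0.1):
    """Image-quality term plus a weighted classification term (assumed form)."""
    return pixel_mse(super_resolved, high_res) + weight * cross_entropy(class_probs, true_label)

# Toy example: small pixel error, classifier fairly sure of the right class.
print(joint_loss([0.2, 0.8], [0.0, 1.0], [0.9, 0.1], true_label=0))
```

The weight controls how much the super-resolution network is pushed toward class-discriminative detail versus pixel fidelity.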

[CV-10] FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

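[Quick read]: This paper addresses the slow sampling of Video Diffusion Models (VDMs) used for 3D reconstruction, especially with sparse input views, where a pre-trained VDM has to be run several times to reach adequate spatial coverage. The key to the solution is FVGen, a framework built on a video diffusion distillation method that distills a multi-step denoising teacher into a few-step student needing as few as four sampling steps. The distillation combines Generative Adversarial Networks (GANs) with softened reverse KL-divergence minimization. Sampling time drops by more than 90% at similar or better visual quality, which makes downstream reconstruction considerably more time-efficient.

Link: https://arxiv.org/abs/2508.06392
Authors: Wenbin Teng,Gonglin Chen,Haiwei Chen,Yajie Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: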

Abstract:Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.

[CV-11] Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning

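[Quick read]: This paper addresses the dependence of multimodal learning on large amounts of modality-specific labeled data and the difficulty existing methods have in scaling to an unlimited number of modalities. The core challenge is to build a scalable general representation model that supports downstream tasks in several modalities (such as video, image, and audio) without any modality-specific labeled data. The key to the solution is Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), which uses modality prompt pools, text construction, and modality-aligned text encoders to build a unified cross-modal representation from text data alone. TaAM-CPT also defines intra- and inter-modal learning objectives that capture category details within each modality while keeping semantics consistent across modalities, so it improves multimodal classification without extra modality labels and extends readily to new modalities.

Link: https://arxiv.org/abs/2508.06382
Authors: Xiangyu Wu,Feng Yu,Yang Yang,Jianfeng Lu
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ACMMM 2025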

Abstract:The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at this https URL.

[CV-12] Are you In or Out (of gallery)? Wisdom from the Same-Identity Crowd

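[Quick read]: This paper addresses a question in one-to-many facial identification: whether the person in the probe image has enrolled images in the gallery, that is, whether a rank-one result is In-gallery or Out-of-gallery. Past approaches mostly threshold the similarity score, which is not accurate enough. The paper instead builds a feature vector from the ranks of the additional enrolled images of the rank-one identity and trains a classifier to predict whether the rank-one result is In-gallery or Out-of-gallery. The key idea is to use the ranks of those additional enrolled images as the discriminative feature in a supervised classifier. The approach gives more reliable Out-of-gallery detection, remains robust for probes degraded by blur, reduced resolution, atmospheric turbulence, and sunglasses, and performs similarly across demographic groups.

Link: https://arxiv.org/abs/2508.06357
Authors: Aman Bhatta,Maria Dhakal,Michael C. King,Kevin W. Bowyer
Affiliations: University of Notre Dame; Florida Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: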

Abstract:A central problem in one-to-many facial identification is that the person in the probe image may or may not have enrolled image(s) in the gallery; that is, may be In-gallery or Out-of-gallery. Past approaches to detect when a rank-one result is Out-of-gallery have mostly focused on finding a suitable threshold on the similarity score. We take a new approach, using the additional enrolled images of the identity with the rank-one result to predict if the rank-one result is In-gallery / Out-of-gallery. Given a gallery of identities and images, we generate In-gallery and Out-of-gallery training data by extracting the ranks of additional enrolled images corresponding to the rank-one identity. We then train a classifier to utilize this feature vector to predict whether a rank-one result is In-gallery or Out-of-gallery. Using two different datasets and four different matchers, we present experimental results showing that our approach is viable for mugshot quality probe images, and also, importantly, for probes degraded by blur, reduced resolution, atmospheric turbulence and sunglasses. We also analyze results across demographic groups, and show that In-gallery / Out-of-gallery classification accuracy is similar across demographics. Our approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, and thereby to reduce false positive identifications, wrongful arrests, and wasted investigative time. Interestingly, comparing the results of older deep CNN-based face matchers with newer ones suggests that the effectiveness of our Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions.
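
The rank-based feature described above can be sketched as follows. This is a simplified reading of the abstract rather than the authors' pipeline: `rank_features`, the toy scores, and the identity labels are hypothetical.

```python
def rank_features(scores, identities):
    """Ranks of the rank-one identity's additional enrolled images.

    scores[i] is the probe's similarity to gallery image i and
    identities[i] is that image's identity. Rank 1 is the best match.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_identity = identities[order[0]]
    extra_ranks = [rank for rank, i in enumerate(order, start=1)
                   if rank > 1 and identities[i] == top_identity]
    return top_identity, extra_ranks

ids = ["a", "b", "a", "c"]
# Second image of "a" also scores highly: consistent with an In-gallery probe.
print(rank_features([0.9, 0.2, 0.8, 0.5], ids))
# Second image of "a" ranks last: the rank-one hit may be a coincidence.
print(rank_features([0.9, 0.2, 0.1, 0.5], ids))
```

The intuition is that a genuine match tends to pull the same identity's other images toward the top of the list, whereas an Out-of-gallery probe matches one image by chance and the rest land at arbitrary ranks. A classifier trained on these rank vectors then makes the In-gallery / Out-of-gallery call.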

[CV-13] An Implemention of Two-Phase Image Segmentation using the Split Bregman Method

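[Quick read]: This paper addresses two-phase segmentation of 2D images, that is, assigning each pixel to a foreground or background region while keeping the region boundaries smooth. The central difficulty is to build an energy that captures the difference in image data between the two regions and constrains the boundary geometry. The key to the solution is a modified form of the Chan-Vese energy that can be minimized efficiently with the split Bregman method, giving stable and computationally efficient two-phase segmentation.

Link: https://arxiv.org/abs/2508.06351
Authors: Olakunle S. Abawonse,Günay Doğan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 15 pages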

Abstract:In this paper, we describe an implementation of the two-phase image segmentation algorithm proposed by Goldstein, Bresson, Osher in \cite{gold:bre}. This algorithm partitions the domain of a given 2d image into foreground and background regions, and each pixel of the image is assigned membership to one of these two regions. The underlying assumption for the segmentation model is that the pixel values of the input image can be summarized by two distinct average values, and that the region boundaries are smooth. Accordingly, the model is defined as an energy in which the variable is a region membership function to assign pixels to either region, originally proposed by Chan and Vese in \cite{chan:vese}. This energy is the sum of image data terms in the regions and a length penalty for region boundaries. Goldstein, Bresson, Osher modify the energy of Chan-Vese in \cite{gold:bre} so that their new energy can be minimized efficiently using the split Bregman method to produce an equivalent two-phase segmentation. We provide a detailed implementation of this method \cite{gold:bre}, and document its performance with several images over a range of algorithm parameters.
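
Two small building blocks of this kind of method can be sketched in isolation: the two region averages that summarize the image, and the soft-thresholding ("shrink") operator that split Bregman iterations commonly use for the L1 subproblem. This is a simplified scalar sketch on flat lists, not the paper's implementation; the function names and the 0.5 membership cut-off are assumptions.

```python
def region_means(image, membership, cutoff=0.5):
    """Average pixel value inside (membership >= cutoff) and outside the region."""
    inside = [v for v, m in zip(image, membership) if m >= cutoff]
    outside = [v for v, m in zip(image, membership) if m < cutoff]
    c1 = sum(inside) / len(inside) if inside else 0.0
    c2 = sum(outside) / len(outside) if outside else 0.0
    return c1, c2

def shrink(x, threshold):
    """Soft-thresholding: sign(x) * max(|x| - threshold, 0)."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0

# Four pixels: two dark ones assigned to the region, two bright ones outside.
print(region_means([1.0, 1.0, 5.0, 5.0], [1, 1, 0, 0]))
print([shrink(x, 1.0) for x in (3.0, -3.0, 0.5)])
```

In a full solver these pieces alternate with a linear solve for the membership function and an update of the Bregman variable, which this sketch leaves out.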

[CV-14] Aligning Effective Tokens with Video Anomaly in Large Language Models

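[Quick read]: This paper addresses the recognition and localization of abnormal events in videos, and in particular the performance drop of current Multi-modal Large Language Models (MLLMs) on anomalies that are sparse in space and time, where redundant information gets in the way. The key to the solution is VA-GPT, an MLLM designed to summarize and localize abnormal events. Its core contribution is aligning effective tokens between the visual encoder and the language model through two modules: Spatial Effective Token Selection (SETS), which extracts key spatial features, and Temporal Effective Token Generation (TETG), which models the temporal dynamics of abnormal events. Together they improve the model's spatio-temporal perception of anomalies and the accuracy of its responses.

Link: https://arxiv.org/abs/2508.06350
Authors: Yingxian Chen,Jiahui Liu,Ruifan Di,Yanwei Li,Chirui Chang,Shizhen Zhao,Wilton W.T. Fok,Xiaojuan Qi,Yik-Chung Wu
Affiliations: The University of Hong Kong; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: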

Abstract:Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.

[CV-15] Street View Sociability: Interpretable Analysis of Urban Social Behavior Across 15 Cities

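[Quick read]: This paper asks how to quantify the quality of social interaction on city streets rather than relying on conventional pedestrian volume measures. Existing research lacks objective measures of interaction quality, which holds back urban design theory and practice. The key to the solution is to use street view imagery, an inexpensive data source with global coverage, together with Mehta’s taxonomy of passive, fleeting, and enduring sociability: a multimodal large language model extracts latent social information from the images, and linear regression models that control for weather, time of day, and pedestrian counts test whether the inferred sociability measures are associated with city-level place attachment scores and environmental features (such as the sky view index and green view index). The results indicate that street view imagery can capture relationships between types of social behavior and the built environment, offering a workable path for large-scale image-based studies of urban sociability.

Link: https://arxiv.org/abs/2508.06342
Authors: Kieran Elrod,Katherine Flanigan,Mario Bergés
Affiliations: Carnegie Mellon University; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
Comments: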

Abstract:Designing socially active streets has long been a goal of urban planning, yet existing quantitative research largely measures pedestrian volume rather than the quality of social interactions. We hypothesize that street view imagery – an inexpensive data source with global coverage – contains latent social information that can be extracted and interpreted through established social science theory. As a proof of concept, we analyzed 2,998 street view images from 15 cities using a multimodal large language model guided by Mehta’s taxonomy of passive, fleeting, and enduring sociability – one illustrative example of a theory grounded in urban design that could be substituted or complemented by other sociological frameworks. We then used linear regression models, controlling for factors like weather, time of day, and pedestrian counts, to test whether the inferred sociability measures correlate with city-level place attachment scores from the World Values Survey and with environmental predictors (e.g., green, sky, and water view indices) derived from individual street view images. Results aligned with long-standing urban planning theory: the sky view index was associated with all three sociability types, the green view index predicted enduring sociability, and place attachment was positively associated with fleeting sociability. These results provide preliminary evidence that street view images can be used to infer relationships between specific types of social interactions and built environment variables. Further research could establish street view imagery as a scalable, privacy-preserving tool for studying urban sociability, enabling cross-cultural theory testing and evidence-based design of socially vibrant cities.

[CV-16] ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction IJCNN

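[Quick Read]: This paper addresses a failure mode in video frame prediction where the model learns a shortcut and cannot accurately infer states from observations, which shows up in particular when the preceding symbolic states are noisy. The earlier ViPro model relied on a given ground-truth initial symbolic state, so the model learned a shortcut that does not actually connect the observed environment with the predicted symbolic state. The key to the solution is a set of improvements to ViPro that let the model infer states correctly from observations, in an unsupervised manner, without being given a full ground-truth state at the start. The authors also extend the Orbits dataset with a 3D variant to narrow the gap to real-world scenarios.

Link: https://arxiv.org/abs/2508.06335
Authors: Patrick Takenaka, Johannes Maucher, Marco F. Huber
Affiliations: Stuttgart Media University; University of Stuttgart; Fraunhofer IPA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in 2025 International Joint Conference on Neural Networks (IJCNN)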

Abstract:Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground truth initial symbolic state. We show that this approach led to the model learning a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in the inability to estimate states given an observation if previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without providing a full ground truth state in the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real world scenarios.

[CV-17] Can Diffusion Models Bridge the Domain Gap in Cardiac MR Imaging? ICONIP2025

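[Quick Read]: This paper addresses domain shift in cardiac magnetic resonance (MR) imaging caused by differences in imaging devices and acquisition protocols, which degrades the performance of trained AI models on unseen domains in real-world use. Traditional remedies such as ad-hoc image augmentation or additional online training/transfer learning have limitations, and synthetic data, although promising, is constrained by the need for anatomical and structural consistency. The key to the solution is a diffusion model (DM) trained on source-domain data that generates synthetic cardiac MR images resembling a given reference while maintaining spatial and structural fidelity and compatibility with the segmentation mask. In multi-centre cardiac MR segmentation, both the domain generalisation and the domain adaptation strategy significantly improved segmentation performance on an unseen target domain (surface-based metrics, p < 0.01), reducing the need for transfer learning or online training, which is especially useful in data-scarce settings.

Link: https://arxiv.org/abs/2508.06327
Authors: Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICONIP 2025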

Abstract:Magnetic resonance (MR) imaging, including cardiac MR, is prone to domain shift due to variations in imaging devices and acquisition protocols. This challenge limits the deployment of trained AI models in real-world scenarios, where performance degrades on unseen domains. Traditional solutions involve increasing the size of the dataset through ad-hoc image augmentation or additional online training/transfer learning, which have several limitations. Synthetic data offers a promising alternative, but anatomical/structural consistency constraints limit the effectiveness of generative models in creating image-label pairs. To address this, we propose a diffusion model (DM) trained on a source domain that generates synthetic cardiac MR images that resemble a given reference. The synthetic data maintains spatial and structural fidelity, ensuring similarity to the source domain and compatibility with the segmentation mask. We assess the utility of our generative approach in multi-centre cardiac MR segmentation, using the 2D nnU-Net, 3D nnU-Net and vanilla U-Net segmentation networks. We explore domain generalisation, where domain-invariant segmentation models are trained on synthetic source domain data, and domain adaptation, where we shift target domain data towards the source domain using the DM. Both strategies significantly improved segmentation performance on data from an unseen target domain, in terms of surface-based metrics (Welch’s t-test, p < 0.01), compared to training segmentation models on real data alone. The proposed method ameliorates the need for transfer learning or online training to address domain shift challenges in cardiac MR image analysis, especially useful in data-scarce settings.

[CV-18] Anti-Tamper Protection for Unauthorized Individual Image Generation ICCV’2025

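[Quick Read]: This paper addresses the threat that forgery attacks pose to portrait rights and privacy as personalized image generation technology advances. Existing protection perturbation algorithms become ineffective once attackers apply purification techniques to bypass them. The key to the solution is Anti-Tamper Perturbation (ATP), which applies a protection perturbation and an authorization perturbation in the frequency domain under the guidance of a mask: the former defends against forgery attacks and the latter detects purification-based tampering. The mask keeps the protection perturbation from disrupting the authorization perturbation, and the authorization perturbation can be distributed across all image pixels, which preserves its sensitivity to purification and makes the protection more robust.

Link: https://arxiv.org/abs/2508.06325
Authors: Zelin Li, Ruohan Zong, Yifan Liu, Ruichen Yao, Yaokun Liu, Yang Zhang, Dong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 22 figures. Paper has been accepted by ICCV’2025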

Abstract:With the advancement of personalized image generation technologies, concerns about forgery attacks that infringe on portrait rights and privacy are growing. To address these concerns, protection perturbation algorithms have been developed to disrupt forgery generation. However, the protection algorithms would become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, Anti-Tamper Perturbation (ATP). ATP introduces a tamper-proof mechanism within the perturbation. It consists of protection and authorization perturbations, where the protection perturbation defends against forgery attacks, while the authorization perturbation detects purification-based tampering. Both protection and authorization perturbations are applied in the frequency domain under the guidance of a mask, ensuring that the protection perturbation does not disrupt the authorization perturbation. This design also enables the authorization perturbation to be distributed across all image pixels, preserving its sensitivity to purification-based tampering. ATP demonstrates its effectiveness in defending forgery attacks across various attack settings through extensive experiments, providing a robust solution for protecting individuals’ portrait rights and privacy. Our code is available at: this https URL .

[CV-19] Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

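[Quick Read]: This paper addresses two core challenges in Weakly-Supervised Video Anomaly Detection (WSVAD): first, existing models process all anomaly categories with a shared model and overlook category-specific features, so they struggle with diverse anomaly types; second, the weak supervision signal lacks precise temporal information, which makes it hard to capture subtle anomalous patterns blended with normal events. The key to the solution is the Gaussian Splatting-guided Mixture of Experts (GS-MoE), a framework in which a set of expert models each specialize in specific anomaly types and are guided by a temporal Gaussian splatting loss that exploits temporal consistency to strengthen the weak supervision. The method reports state-of-the-art performance on UCF-Crime (91.58% AUC) and superior results on XD-Violence and MSAD.

Link: https://arxiv.org/abs/2508.06318
Authors: Giacomo D’Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, François Bremond, Egor Bondarev
Affiliations: Eindhoven University of Technology; INRIA; Côte d’Azur University; Woven by Toyota; Toyota Motor Europe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: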

Abstract:Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.

[CV-20] Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding

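[Quick Read]: This paper addresses two challenges in cross-domain adaptation for video temporal grounding: the target domain has no labelled data, and full-scale adaptation over videos incurs computational and storage overhead too high for real-time deployment. The key to the solution is Uncertainty-quantified Rollout Policy Adaptation (URPA): on the unlabelled target domain, it generates multiple candidate predictions with GRPO (Group Relative Policy Optimisation) rollouts, averages them to form a pseudo label, estimates confidence from the variance across rollouts, and uses that confidence to weight the training rewards so the model focuses on reliable supervision. The method needs only a small number of unlabelled target-domain videos for cross-domain knowledge transfer, keeping resource use low enough for real-time operation.

Link: https://arxiv.org/abs/2508.06317
Authors: Jian Hu, Zixu Cheng, Shaogang Gong, Isabel Guan, Jianye Hao, Jun Wang, Kun Shao
Affiliations: Queen Mary University of London; Hong Kong University of Science and Technology; Huawei Noah’s Ark Lab; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: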

Abstract:Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain, then adapted to a target domain using only a small number of unlabelled videos from the target domain. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Codes will be released once published.
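
The pseudo-labelling step described in the abstract (average the rollouts, read confidence off their variance, weight the rewards) can be illustrated with a small sketch. This is not the authors' code: the function names, the `exp(-variance / scale)` confidence mapping, and the use of temporal IoU as the base reward are all assumptions made for illustration.

```python
import math
import statistics

def pseudo_label_with_confidence(rollouts, scale=1.0):
    """rollouts: list of (start, end) predictions in seconds for one query."""
    starts = [s for s, _ in rollouts]
    ends = [e for _, e in rollouts]
    # Average the rollouts to form the pseudo label.
    pseudo = (statistics.fmean(starts), statistics.fmean(ends))
    # Variance across rollouts serves as the uncertainty estimate.
    var = statistics.pvariance(starts) + statistics.pvariance(ends)
    # Hypothetical mapping (an assumption): low variance gives confidence near 1.
    confidence = math.exp(-var / scale)
    return pseudo, confidence

def temporal_iou(a, b):
    """IoU of two segments, using the covering span as the union."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    span = max(a[1], b[1]) - min(a[0], b[0])
    return inter / span if span > 0 else 0.0

def weighted_rewards(rollouts, scale=1.0):
    """Reward each rollout by its agreement with the pseudo label, scaled by confidence."""
    pseudo, conf = pseudo_label_with_confidence(rollouts, scale)
    return [conf * temporal_iou(r, pseudo) for r in rollouts]
```

When the rollouts agree exactly, the variance is zero and the rewards are left unscaled; when they disagree, every reward for that video is scaled down, which is one simple way to make unreliable pseudo labels contribute less.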

[CV-21] FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields ICCV2025

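[Quick Read]: This paper addresses two challenges in training neural fields on resource-constrained edge devices: the large amounts of data and computation required, and the privacy leakage of traditional Federated Meta-Learning (FML) approaches. The key to the solution is FedMeNF, a privacy-preserving FML framework that introduces a new privacy-preserving loss function to regulate privacy leakage during local meta-optimization. This lets the local meta-learner optimize quickly and efficiently without retaining the client’s private data, while maintaining robust reconstruction performance with few-shot or non-IID data.

Link: https://arxiv.org/abs/2508.06301
Authors: Junhyeog Yun, Minui Hong, Gunhee Kim
Affiliations: Seoul National University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: ICCV 2025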

Abstract:Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computations, which can be limited to resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client’s private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.

[CV-22] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

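[Quick Read]: This paper addresses the weak performance of current Multimodal Large Language Models (MLLMs) on complex visual tasks, in particular spatial understanding and fine-grained perception. Prior methods have tried to incorporate visual reasoning but do not use spatial cues to iteratively correct attention, so the model struggles to focus on prompt-relevant regions. The key to the solution is SIFThinker, a spatially-aware “think-with-images” framework with two main contributions: a reverse-expansion-forward-inference strategy that generates interleaved image-text chains of thought for process-level supervision and yields the SIF-50K dataset, and GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline so the model learns to dynamically correct its attention and focus on prompt-relevant regions.

Link: https://arxiv.org/abs/2508.06259
Authors: Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang
Affiliations: ByteDance; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures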

Abstract:Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.

[CV-23] XAG-Net: A Cross-Slice Attention and Skip Gating Network for 2.5D Femur MRI Segmentation

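[Quick Read]: This paper addresses insufficient accuracy in segmenting femur structures from MRI, where existing 2D and 3D deep learning methods are limited in inter-slice contextual modeling and intra-slice feature refinement. The key to the solution is XAG-Net, a 2.5D U-Net-based architecture that combines pixel-wise cross-slice attention (CSA) with skip attention gating (AG): CSA applies pixel-wise softmax attention across adjacent slices at each spatial location for fine-grained inter-slice modeling, and AG refines intra-slice features. Experiments show that the design surpasses baseline 2D, 2.5D, and 3D U-Net models in segmentation accuracy while maintaining computational efficiency.

Link: https://arxiv.org/abs/2508.06258
Authors: Byunghyun Ko, Anning Tian, Jeongkyu Lee
Affiliations: Khoury College of Computer Sciences, Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA). This is the preprint version of the paper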

Abstract:Accurate segmentation of femur structures from Magnetic Resonance Imaging (MRI) is critical for orthopedic diagnosis and surgical planning but remains challenging due to the limitations of existing 2D and 3D deep learning-based segmentation approaches. In this study, we propose XAG-Net, a novel 2.5D U-Net-based architecture that incorporates pixel-wise cross-slice attention (CSA) and skip attention gating (AG) mechanisms to enhance inter-slice contextual modeling and intra-slice feature refinement. Unlike previous CSA-based models, XAG-Net applies pixel-wise softmax attention across adjacent slices at each spatial location for fine-grained inter-slice modeling. Extensive evaluations demonstrate that XAG-Net surpasses baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while maintaining computational efficiency. Ablation studies further validate the critical role of the CSA and AG modules, establishing XAG-Net as a promising framework for efficient and accurate femur MRI segmentation.
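
To make "pixel-wise softmax attention across adjacent slices at each spatial location" concrete, here is a toy sketch on single-channel feature grids. It is only an illustration under assumptions: the similarity score (a product of scalar features) and the function name are hypothetical, and the paper's CSA module may compute scores and values differently.

```python
import math

def cross_slice_attention(center, neighbors):
    """center: H x W grid of floats for the target slice.
    neighbors: list of H x W grids for the adjacent slices."""
    h, w = len(center), len(center[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Hypothetical similarity score between the target pixel and
            # the same pixel location in each adjacent slice.
            scores = [center[i][j] * nb[i][j] for nb in neighbors]
            m = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            # Softmax weights are computed independently at every pixel,
            # then used to mix the neighbouring slices at that pixel.
            out[i][j] = sum(e / z * nb[i][j] for e, nb in zip(exps, neighbors))
    return out
```

The point of the per-pixel softmax is that each spatial location can favour a different neighbouring slice, rather than applying one weight per slice to the whole image.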

[CV-24] FedX: Explanation-Guided Pruning for Communication-Efficient Federated Learning in Remote Sensing

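[Quick Read]: This paper addresses the communication overhead that arises when Federated Learning (FL) is applied to Remote Sensing (RS) image classification, caused by the frequent exchange of large model updates. The key to the solution is FedX, a strategy that uses backpropagation-based explanation methods to estimate the task-specific importance of model components and prunes the least relevant ones at the central server. The resulting sparse global model is sent to clients, which substantially reduces communication overhead while preserving, and in the reported experiments enhancing, the generalization capability of the global model.

Link: https://arxiv.org/abs/2508.06256
Authors: Barış Büyüktaş, Jonas Klotz, Begüm Demir
Affiliations: Technische Universität Berlin; BIFOLD - Berlin Institute for the Foundations of Learning and Data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: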

Abstract:Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients), where each client stores data locally and only shares model updates with a central server. This makes FL a suitable learning paradigm for remote sensing (RS) image classification tasks, where data centralization may be restricted due to legal and privacy constraints. However, a key challenge in applying FL to RS tasks is the communication overhead caused by the frequent exchange of large model updates between clients and the central server. To address this issue, in this paper we propose a novel strategy (denoted as FedX) that uses explanation-guided pruning to reduce communication overhead by minimizing the size of the transmitted models without compromising performance. FedX leverages backpropagation-based explanation methods to estimate the task-specific importance of model components and prunes the least relevant ones at the central server. The resulting sparse global model is then sent to clients, substantially reducing communication overhead. We evaluate FedX on multi-label scene classification using the BigEarthNet-S2 dataset and single-label scene classification using the EuroSAT dataset. Experimental results show the success of FedX in significantly reducing the number of shared model parameters while enhancing the generalization capability of the global model, compared to both unpruned model and state-of-the-art pruning methods. The code of FedX will be available at this https URL.
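
The server-side pruning step can be sketched as follows, assuming relevance scores have already been produced by some explanation method. The function names, the flat parameter list, and the `(index, value)` payload format are hypothetical and chosen only to show why pruning shrinks what is transmitted.

```python
def prune_by_relevance(weights, relevance, prune_ratio):
    """Zero out the least relevant fraction of parameters.

    relevance: one importance score per parameter (assumed to be given,
    e.g. by a backpropagation-based explanation method)."""
    n_prune = int(len(weights) * prune_ratio)
    order = sorted(range(len(weights)), key=lambda k: relevance[k])
    pruned = set(order[:n_prune])
    return [0.0 if k in pruned else w for k, w in enumerate(weights)]

def sparse_payload(weights):
    """Keep only the surviving parameters as (index, value) pairs."""
    return [(k, w) for k, w in enumerate(weights) if w != 0.0]
```

With a prune ratio of 0.5, the payload in this sketch carries half as many entries as the dense model, at the cost of also sending indices.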

[CV-25] Deepfake Detection that Generalizes Across Benchmarks

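[Quick Read]: This paper addresses the limited generalization of deepfake detectors to unseen manipulation techniques, a key obstacle to practical deployment. The core of the solution is a parameter-efficient adaptation of a pre-trained CLIP vision encoder: only the Layer Normalization parameters (0.03% of the total) are fine-tuned, and features are constrained to a hypersphere through L2 normalization and latent space augmentations, which improves cross-dataset generalization. On 13 benchmark datasets the method achieves state-of-the-art average cross-dataset AUROC, outperforming more complex recent approaches, and the analysis shows that training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization.

Link: https://arxiv.org/abs/2508.06248
Authors: Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz
Affiliations: Czech Technical University in Prague; CISPA Helmholtz Center for Information Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: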

Abstract:The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.
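
The "hyperspherical feature manifold" part of the method rests on L2 normalization: every feature vector is rescaled to unit length, so only its direction carries information. A minimal sketch of that operation, with hypothetical function names and no claim about how LNCLIP-DF itself implements it:

```python
import math

def l2_normalize(v, eps=1e-12):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def cosine_similarity(a, b):
    """On the unit hypersphere, the dot product equals cosine similarity."""
    na, nb = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(na, nb))
```

After normalization, two features that differ only in magnitude map to the same point, which removes one degree of freedom a classifier could otherwise overfit to.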

[CV-26] Towards Unified Image Deblurring using a Mixture-of-Experts Decoder

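[Quick Read]: This paper addresses the lack of generalization in existing image deblurring methods, which typically design specialized models for particular blur types (global motion, local motion, low-light blur, and defocus blur), so several models must be deployed to cover different blur types in practice. The key to the solution is a mixture-of-experts (MoE) decoding module that dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration of multiple blur types within one end-to-end framework. The unified approach achieves performance comparable to dedicated task-specific models and shows robustness and generalization on unseen blur degradation scenarios.

Link: https://arxiv.org/abs/2508.06228
Authors: Daniel Feijoo, Paula Garrido-Mellado, Jaesung Rim, Alvaro Garcia, Marcos V. Conde
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review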

Abstract:Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types, thus, these solutions lack generalization. This limitation in current methods implies requiring multiple models to cover several blur types, which is not practical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also demonstrates remarkable robustness and generalization capabilities on unseen blur degradation scenarios.

[CV-27] Depth Jitter: Seeing through the Depth

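[Quick Read]: This paper addresses the fact that conventional data augmentation techniques overlook depth-aware transformations, which limits model robustness to real-world depth variations. The key to the solution is Depth-Jitter, a depth-based augmentation technique that applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity, improving model stability and generalization under diverse depth conditions.

Link: https://arxiv.org/abs/2508.06227
Authors: Md Sazidur Rahman, David Cabecinhas, Ricard Marxer
Affiliations: Université de Toulon; Instituto Superior Técnico; Institute for Systems and Robotics; Aix Marseille University; CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: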

Abstract:Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness in real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The proposed technique is publicly available to support advancements in depth-aware augmentation. The code is publicly available on GitHub at this https URL.

[CV-28] TEFormer: Texture-Aware and Edge-Guided Transformer for Semantic Segmentation of Urban Remote Sensing Images

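[Quick Read]: This paper addresses two difficulties in semantic segmentation of Urban Remote Sensing Images (URSIs): semantic ambiguity and misclassification caused by subtle texture differences and similar spatial structures among geospatial objects, and reduced accuracy caused by complex edge morphologies arising from irregular shapes, blurred boundaries, and overlapping semantic objects. The key to the solution is the Texture-aware and Edge-guided Transformer (TEFormer), built from three modules: 1) a texture-aware module (TaM) in the encoder that captures fine-grained texture differences between visually similar categories to enhance semantic discrimination; 2) an edge-guided tri-branch decoder (Eg3Head) that preserves local edges and details for multiscale context-awareness; 3) an edge-guided feature fusion module (EgFFM) that fuses contextual and detail information with edge information to achieve refined semantic segmentation.

Link: https://arxiv.org/abs/2508.06224
Authors: Guoyu Zhou, Jing Zhang, Yi Yan, Hui Zhang, Li Zhuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to GRSL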

Abstract:Semantic segmentation of urban remote sensing images (URSIs) is crucial for applications such as urban planning and environmental monitoring. However, geospatial objects often exhibit subtle texture differences and similar spatial structures, which can easily lead to semantic ambiguity and misclassification. Moreover, challenges such as irregular object shapes, blurred boundaries, and overlapping spatial distributions of semantic objects contribute to complex and diverse edge morphologies, further complicating accurate segmentation. To tackle these issues, we propose a texture-aware and edge-guided Transformer (TEFormer) that integrates texture awareness and edge-guidance mechanisms for semantic segmentation of URSIs. In the encoder, a texture-aware module (TaM) is designed to capture fine-grained texture differences between visually similar categories to enhance semantic discrimination. Then, an edge-guided tri-branch decoder (Eg3Head) is constructed to preserve local edges and details for multiscale context-awareness. Finally, an edge-guided feature fusion module (EgFFM) fuses contextual and detail information with edge information to realize refined semantic segmentation. Extensive experiments show that TEFormer achieves mIoU of 88.57%, 81.46%, and 53.55% on the Potsdam, Vaihingen, and LoveDA datasets, respectively, demonstrating its effectiveness in URSI semantic segmentation.

[CV-29] Interpretable Rheumatoid Arthritis Scoring via Anatomy-aware Multiple Instance Learning MICCAI

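[Quick Read]: This paper addresses automated prediction of the Sharp/van der Heijde (SvdH) score from dual-hand radiographs of Rheumatoid Arthritis (RA) patients, to overcome the complexity and inefficiency of manual scoring. The key to the solution is a two-stage, interpretable pipeline for image-level SvdH score prediction: it first extracts disease-relevant regions using one of two schemes (sampling image tiles most likely to contain abnormalities, or cropping patches containing disease-relevant joints), then integrates these regions with attention-based Multiple Instance Learning (MIL) to produce image-level features for prediction. The best individual model reaches PCC = 0.943 and RMSE = 15.73, and ensemble learning improves this to PCC = 0.945 and RMSE = 15.57, comparable to experienced radiologists (PCC = 0.97, RMSE = 18.75). The decision process is interpretable and identifies anatomical structures that clinicians consider relevant.

Link: https://arxiv.org/abs/2508.06218
Authors: Zhiyan Bo, Laura C. Coates, Bartlomiej W. Papiez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI AMAI Workshop 2025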

Abstract:The Sharp/van der Heijde (SvdH) score has been widely used in clinical trials to quantify radiographic damage in Rheumatoid Arthritis (RA), but its complexity has limited its adoption in routine clinical practice. To address the inefficiency of manual scoring, this work proposes a two-stage pipeline for interpretable image-level SvdH score prediction using dual-hand radiographs. Our approach extracts disease-relevant image regions and integrates them using attention-based multiple instance learning to generate image-level features for prediction. We propose two region extraction schemes: 1) sampling image tiles most likely to contain abnormalities, and 2) cropping patches containing disease-relevant joints. With Scheme 2, our best individual score prediction model achieved a Pearson’s correlation coefficient (PCC) of 0.943 and a root mean squared error (RMSE) of 15.73. Ensemble learning further boosted prediction accuracy, yielding a PCC of 0.945 and RMSE of 15.57, achieving state-of-the-art performance that is comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, our pipeline effectively identified and made decisions based on anatomical structures which clinicians consider relevant to RA progression.
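
Attention-based multiple instance learning pools a variable number of region features into one image-level feature through a softmax-weighted sum. The sketch below shows only that pooling step. The attention scores are passed in directly, which is an assumption: in a trained model a small learned network would produce them, and the function names here are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_mil_pool(instances, scores):
    """instances: list of feature vectors, one per extracted region.
    scores: one unnormalized attention score per region (assumed given)."""
    weights = softmax(scores)
    dim = len(instances[0])
    # Image-level feature: attention-weighted sum of the region features.
    bag = [sum(w * inst[d] for w, inst in zip(weights, instances))
           for d in range(dim)]
    return bag, weights
```

The returned weights are what makes this kind of pipeline inspectable: regions with large weights are the ones that drove the image-level prediction.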

[CV-30] Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

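[Quick Read]: This paper addresses the limits of existing affordance grounding models in out-of-domain generalization and explicit reasoning, in particular their neglect of the affordance shared among different objects. The key to the solution is Affordance-R1, described as the first unified affordance grounding framework that integrates cognitive Chain-of-Thought (CoT) guided Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm, with an affordance function made up of format, perception, and cognition rewards to guide optimization. The authors also construct ReasonAff, a high-quality affordance-centric reasoning dataset, to support training. Trained exclusively via RL, the model achieves zero-shot generalization and emergent test-time reasoning, improving performance in open-world scenarios.

Link: https://arxiv.org/abs/2508.06206
Authors: Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: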

Abstract:Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack the Chain-of-Thought(CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on this https URL.
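The group-relative part of GRPO can be illustrated with a short sketch. It assumes the usual formulation in which each sampled response's reward is normalized by the mean and standard deviation of its group. The equal weighting of the format, perception, and cognition rewards and the reward values themselves are hypothetical, not taken from the paper.

```python
import statistics

def total_reward(format_r, perception_r, cognition_r, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; equal weights are an assumption."""
    return (weights[0] * format_r
            + weights[1] * perception_r
            + weights[2] * cognition_r)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical responses sampled for the same prompt.
rewards = [
    total_reward(1.0, 0.2, 0.0),
    total_reward(1.0, 0.8, 1.0),
    total_reward(0.0, 0.5, 0.0),
    total_reward(1.0, 0.5, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and are reinforced. No separate value network is needed, which is the main practical appeal of this normalization.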

[CV-31] PA-HOI: A Physics-Aware Human and Object Interaction Dataset

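【Quick Read】: This paper addresses the fact that existing Human-Object Interaction (HOI) datasets neglect how the physical properties of objects affect long-term human motion, in particular how an object's size, shape, and weight change human posture, moving speed, and interaction strategy. The key to the solution is a new dataset, PA-HOI Motion Capture, comprising 562 motion sequences in which subjects of different genders interact with 35 3D objects with varied physical attributes. It systematically captures the influence of those attributes on human motion dynamics, and the authors show that combining it with existing motion generation methods transfers realistic physical awareness.

Link: https://arxiv.org/abs/2508.06205
Authors: Ruiyan Wang, Lin Zuo, Zonghao Lin, Qiang Wang, Zhengxue Cheng, Rong Xie, Jun Ling, Li Song
Affiliations: Shanghai Jiao Tong University; VisionStar Information Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: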

Abstract:The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI data sets focus on details of affordance, often neglecting the influence of physical properties of objects on human long-term motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects’ physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interacting strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.

[CV-32] AnomalyMoE: Towards a Language-free Generalist Model for Unified Visual Anomaly Detection

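【Quick Read】: This paper addresses the limited generalizability of existing anomaly detection methods: models are usually designed for one anomaly type (such as textural defects or logical errors) and perform poorly outside their designated context. The key to the solution is AnomalyMoE, a framework built on a Mixture-of-Experts (MoE) architecture that decomposes anomaly detection into three semantic levels: local structural anomalies, component-level semantic anomalies, and global logical anomalies. Dedicated expert networks at the patch, component, and global levels reconstruct features and identify deviations at their own level, so a single model can detect a wide spectrum of anomalies. An Expert Information Repulsion (EIR) module promotes expert diversity, and an Expert Selection Balancing (ESB) module ensures that all experts are used.

Link: https://arxiv.org/abs/2508.06203
Authors: Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Wei Ge, Ming Tang, Jinqiao Wang
Affiliations: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Objecteye Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: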

Abstract:Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, and is specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.

[CV-33] LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

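【Quick Read】: This paper addresses catastrophic forgetting in Continual Visual Instruction Tuning (CVIT) of Multimodal Large Language Models (MLLMs), where performance on earlier tasks degrades as new tasks are learned. Existing architecture expansion methods typically add entire layers for each new task, which causes parameter overhead and poor scalability. The key to the solution is LoRA in LoRA (LiLoRA), an efficient architecture expansion method that shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and adds a cosine-regularized stability loss to keep the shared representations consistent over time. This improves parameter efficiency while maintaining performance in sequential task learning.

Link: https://arxiv.org/abs/2508.06202
Authors: Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi
Affiliations: Hefei University of Technology; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: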

Abstract:Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
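One possible reading of the three ingredients (shared A, low-rank factorized B, cosine stability term) is sketched below. The abstract does not give the exact factorization or say which tensors the stability loss compares, so everything here is an assumption made for illustration: the factorization `B_t = U_t @ V_t`, the names `U_t` and `V_t`, the dimensions, and the choice to apply the cosine term to `A_shared`.

```python
import numpy as np

d_in, d_out, r, r2 = 64, 64, 8, 2   # r2 < r: inner rank of the factorized B (hypothetical sizes)
rng = np.random.default_rng(0)
A_shared = rng.normal(size=(r, d_in))   # LoRA matrix A, shared by all tasks

def new_task_factors():
    """Task-specific factors of B_t = U_t @ V_t (assumed form)."""
    return rng.normal(size=(d_out, r2)), rng.normal(size=(r2, r))

def delta_w(U_t, V_t):
    """Weight update for one task: (U_t V_t) A_shared, shape (d_out, d_in)."""
    return (U_t @ V_t) @ A_shared

def cosine_stability_loss(A_old, A_new):
    """1 - cosine similarity between two versions of the shared matrix."""
    cos = np.sum(A_old * A_new) / (np.linalg.norm(A_old) * np.linalg.norm(A_new))
    return 1.0 - cos

U1, V1 = new_task_factors()
update = delta_w(U1, V1)

plain_lora_params_per_task = r * (d_in + d_out)   # separate A and B for every task
shared_params_per_task = r2 * (d_out + r)         # only U_t and V_t are task-specific
```

With these toy sizes a new task adds 144 parameters instead of 1024, which conveys the kind of saving the abstract refers to. The real ratio depends on the ranks the authors actually use.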

[CV-34] A Semantic Segmentation Algorithm for Pleural Effusion Based on DBIF-AUNet

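【Quick Read】: This paper addresses semantic segmentation of pleural effusion in CT images, which is difficult because effusion and surrounding tissue have similar gray levels, edges are blurred, and morphology varies; existing methods struggle with this image diversity and with complex edges. The key to the solution is the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet), built around two modules. The Dual-Domain Feature Disentanglement (DDFD) module orthogonally decouples dual-domain functions to obtain complementary multi-scale features and strengthen representations at different levels. The Branch Interaction Attention Fusion (BIAF) module dynamically weights and fuses global, local, and frequency-band features to improve robustness. A nested deep supervision mechanism with a hierarchical adaptive hybrid loss addresses class imbalance. On 1,622 clinical CT images the model reaches 80.1% IoU and 89.0% Dice, outperforming U-Net++ and Swin-UNet.

Link: https://arxiv.org/abs/2508.06191
Authors: Ruixiang Tang, Jianglong Qin, Mingda Zhang, Yan Song, Yi Wu, Wei Wu
Affiliations: Yunnan University; Army Medical University; Southwest Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 2 tables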

Abstract:Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet). This model constructs a densely nested skip-connection network and innovatively refines the Dual-Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual-domain modules to achieve multi-scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. These results outperform state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.
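For readers less familiar with the two reported metrics, the sketch below computes IoU and Dice for a pair of binary masks. The masks are toy data, and the convention of returning 1.0 for two empty masks is a choice made here, not something taken from the paper.

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0      # both masks empty: treat as perfect
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
target = np.array([[1, 1, 0],
                   [0, 0, 0],
                   [0, 1, 0]])
iou, dice = iou_and_dice(pred, target)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging in the reported IoU of 80.1% gives roughly 0.8895, close to the reported 89.0% Dice, although scores averaged over a dataset need not satisfy the identity exactly.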

[CV-35] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration

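【Quick Read】: This paper addresses two challenges in detecting criminal behavior in urban public scenes: traditional anomaly detection based on feature recognition struggles to capture high-level behavioral semantics from historical information, and generative approaches based on Large Language Models (LLMs) often cannot meet real-time requirements. The key to the solution is MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. It converts real-time video streams into frame-level semantic descriptions, builds causally consistent historical summaries, and fuses adjacent image frames for joint reasoning over long- and short-term context. The resulting decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity.

Link: https://arxiv.org/abs/2508.06189
Authors: Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: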

Abstract:With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.

[CV-36] Graph-based Robot Localization Using a Graph Neural Network with a Floor Camera and a Feature Rich Industrial Floor

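【Quick Read】: This paper addresses accurate localization for robot navigation, where traditional approaches such as Lidar or QR-code based systems are limited in scalability and adaptability in complex environments. The key to the solution is a localization framework based on graph representations: floor features are encoded as graphs and processed with Graph Convolutional Networks (GCNs), which localizes the robot accurately (0.64 cm error) and more efficiently than comparing individual image features. The method also handles the kidnapped robot problem in every frame without complex filtering.

Link: https://arxiv.org/abs/2508.06177
Authors: Dominik Brämer, Diana Kleingarn, Oliver Urbann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at 28th RoboCup International Symposium, Salvador, Brasil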

Abstract: Accurate localization represents a fundamental challenge in robotic navigation. Traditional methodologies, such as Lidar or QR-code based systems, suffer from inherent scalability and adaptability constraints, particularly in complex environments. In this work, we propose an innovative localization framework that harnesses flooring characteristics by employing graph-based representations and Graph Convolutional Networks (GCNs). Our method uses graphs to represent floor features, which helps localize the robot more accurately (0.64cm error) and more efficiently than comparing individual image features. Additionally, this approach successfully addresses the kidnapped robot problem in every frame without requiring complex filtering processes. These advancements open up new possibilities for robotic navigation in diverse environments.

[CV-37] Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation

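【Quick Read】: This paper addresses automated polyp detection in colonoscopy images, in particular the limited size of medical datasets and the complexity of annotation. The key to the solution is a multidirectional architecture that combines generative AI with deep learning detection and segmentation: synthetic data generated with an enhanced Stable Diffusion model eases data scarcity, Faster R-CNN performs initial object localization, and the Segment Anything Model (SAM) refines the segmentation masks. For segmentation, five models (U-Net, PSPNet, FPN, LinkNet, MANet) are compared: FPN scores highest on PSNR (7.205893) and SSIM (0.492381), U-Net leads in recall (84.85%), and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).

Link: https://arxiv.org/abs/2508.06170
Authors: Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, Md Mahmudur Rahman
Affiliations: Morgan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: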

Abstract: Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. This detection approach combines Faster R-CNN for initial object localization while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08% combined with a precision of 88.97% and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while U-Net excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).
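The reported detection figures can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below only redoes that arithmetic with the numbers quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 88.97% and recall 93.08%, as reported for Faster R-CNN.
f1 = f1_score(0.8897, 0.9308)
f1_percent = round(f1 * 100, 2)
```

The result rounds to 90.98%, which agrees with the F1 score stated in the abstract.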

[CV-38] UW-3DGS: Underwater 3D Reconstruction with Physics-Aware Gaussian Splatting

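【Quick Read】: This paper addresses the loss of geometric and color fidelity in underwater 3D scene reconstruction caused by light absorption, scattering, and turbidity, conditions in which traditional methods such as Neural Radiance Fields (NeRF) are limited. The proposed framework, UW-3DGS, has two key innovations: (1) a plug-and-play learnable underwater image formation module that uses voxel-based regression to model spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring to keep the geometry free of artifacts. During training, the Gaussians are optimized jointly with the underwater parameters under PAUP guidance; during rendering, the method produces Unattenuated Radiance Images (URIs) free of media effects and Underwater Images (UWIs) with realistic light transport, with a reduction of about 65% in floating artifacts.

Link: https://arxiv.org/abs/2508.06169
Authors: Wenpeng Xing, Jie Chen, Zaifeng Yang, Changting Lin, Jianfeng Dong, Chaochao Chen, Xun Zhou, Meng Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: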

Abstract:Underwater 3D scene reconstruction faces severe challenges from light absorption, scattering, and turbidity, which degrade geometry and color fidelity in traditional methods like Neural Radiance Fields (NeRF). While NeRF extensions such as SeaThru-NeRF incorporate physics-based models, their MLP reliance limits efficiency and spatial resolution in hazy environments. We introduce UW-3DGS, a novel framework adapting 3D Gaussian Splatting (3DGS) for robust underwater reconstruction. Key innovations include: (1) a plug-and-play learnable underwater image formation module using voxel-based regression for spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring, ensuring artifact-free geometry. The pipeline operates in training and rendering stages. During training, noisy Gaussians are optimized end-to-end with underwater parameters, guided by PAUP pruning and scattering modeling. In rendering, refined Gaussians produce clean Unattenuated Radiance Images (URIs) free from media effects, while learned physics enable realistic Underwater Images (UWIs) with accurate light transport. Experiments on SeaThru-NeRF and UWBundle datasets show superior performance, achieving PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% reduction in floating artifacts.
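The relationship between the unattenuated image and the underwater image can be sketched with the attenuation-plus-backscatter model commonly used in physics-based underwater imaging. Whether UW-3DGS uses exactly this form is an assumption. The paper learns spatially varying parameters through voxel-based regression, whereas this sketch uses constant per-channel coefficients, and all numeric values are made up for illustration.

```python
import numpy as np

def underwater_image(J, depth, beta_d, beta_b, B_inf):
    """Compose an underwater image from clean radiance.

    J:      (H, W, 3) unattenuated radiance
    depth:  (H, W) distance from camera to scene
    beta_d: (3,) attenuation coefficients of the direct signal
    beta_b: (3,) backscatter coefficients
    B_inf:  (3,) veiling light (backscatter at infinite distance)
    """
    z = depth[..., None]                                  # broadcast over color channels
    direct = J * np.exp(-beta_d * z)                      # attenuated scene radiance
    backscatter = B_inf * (1.0 - np.exp(-beta_b * z))     # light scattered by the water
    return direct + backscatter

J = np.full((2, 2, 3), 0.8)
depth = np.array([[0.0, 1.0], [5.0, 1000.0]])
beta_d = np.array([0.6, 0.2, 0.1])    # hypothetical: red attenuates fastest
beta_b = np.array([0.5, 0.3, 0.2])
B_inf = np.array([0.05, 0.3, 0.4])
img = underwater_image(J, depth, beta_d, beta_b, B_inf)
```

At zero distance the output equals the clean radiance, and at very large distance it converges to the veiling light. This limiting behavior is why separating the two terms lets a renderer output either a medium-free image or a realistic underwater one.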

[CV-39] Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment ICCV2025

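【Quick Read】: This paper addresses efficient deployment of pre-trained diffusion models on resource-limited platforms, asking how best to trade compute for generation quality without fine-tuning. The key to the solution is PostDiff, a training-free acceleration framework that reduces redundancy at two levels. At the input level, a mixed-resolution denoising scheme lowers the generation resolution in early denoising steps, which enhances low-frequency components and improves final fidelity. At the module level, a hybrid module caching strategy reuses computations across denoising steps to lower per-step cost. Experiments indicate that, while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps.

Link: https://arxiv.org/abs/2508.06160
Authors: Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan (Celine) Lin
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025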

Abstract:Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at this https URL.
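The general idea of reusing module outputs across denoising steps can be shown with a toy loop. This is not PostDiff's hybrid caching strategy, whose details the abstract does not give. The fixed refresh interval, the function names, and the scalar "latent" are all hypothetical simplifications.

```python
def denoise_with_cache(x, num_steps, refresh_every, expensive_block, cheap_block):
    """Toy denoising loop: recompute an expensive module only every
    `refresh_every` steps and reuse its cached output in between."""
    cached = None
    expensive_calls = 0
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = expensive_block(x, step)   # full computation, result cached
            expensive_calls += 1
        x = cheap_block(x, cached)              # lightweight update using the cache
    return x, expensive_calls

def toy_expensive(x, step):
    """Stand-in for a costly network block."""
    return 0.1 * x

def toy_cheap(x, feature):
    """Stand-in for the remaining per-step computation."""
    return x - feature

x_final, calls = denoise_with_cache(
    1.0, num_steps=10, refresh_every=3,
    expensive_block=toy_expensive, cheap_block=toy_cheap,
)
```

Here the expensive block runs 4 times instead of 10 while the number of denoising steps stays the same. That is the trade-off the paper contrasts with simply using fewer steps.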

[CV-40] An Interpretable Multi-Plane Fusion Framework With Kolmogorov-Arnold Network Guided Attention Enhancement for Alzheimers Disease Diagnosis

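【Quick Read】: This paper addresses the limited accuracy of early Alzheimer’s disease (AD) diagnosis, which is hard because structural brain changes are complex and subtle; most existing deep learning methods use a single plane of structural magnetic resonance imaging (sMRI) and struggle to capture nonlinear relationships among pathological regions. The key to the solution is the MPF-KANSC framework. First, multi-plane fusion (MPF) combines sMRI features from the coronal, sagittal, and axial planes, extracting complementary structural information in parallel. Second, a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) uses more flexible nonlinear function approximation to identify and localize disease-related atrophy, improving both diagnostic performance and interpretability.

Link: https://arxiv.org/abs/2508.06157
Authors: Xiaoxiao Yang, Meiliang Liu, Yunfang Xu, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao
Affiliations: Beijing Normal University; Advanced Institute of Natural Sciences, Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: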

Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that severely impairs cognitive function and quality of life. Timely intervention in AD relies heavily on early and precise diagnosis, which remains challenging due to the complex and subtle structural changes in the brain. Most existing deep learning methods focus only on a single plane of structural magnetic resonance imaging (sMRI) and struggle to accurately capture the complex and nonlinear relationships among pathological regions of the brain, thus limiting their ability to precisely identify atrophic features. To overcome these limitations, we propose an innovative framework, MPF-KANSC, which integrates multi-plane fusion (MPF) for combining features from the coronal, sagittal, and axial planes, and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) to more effectively learn and represent sMRI atrophy features. Specifically, the proposed model enables parallel feature extraction from multiple anatomical planes, thus capturing more comprehensive structural information. The KANSC attention mechanism further leverages a more flexible and accurate nonlinear function approximation technique, facilitating precise identification and localization of disease-related abnormalities. Experiments on the ADNI dataset confirm that the proposed MPF-KANSC achieves superior performance in AD diagnosis. Moreover, our findings provide new evidence of right-lateralized asymmetry in subcortical structural changes during AD progression, highlighting the model’s promising interpretability.
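As a small illustration of what the three anatomical planes mean for a 3D volume, the sketch below extracts the central sagittal, coronal, and axial slices with NumPy. The axis convention (first axis left-right, second anterior-posterior, third inferior-superior) is an assumption that depends on how a scan is stored. The paper's actual multi-plane feature extraction is not reproduced here.

```python
import numpy as np

def central_planes(volume):
    """Return the central slice along each axis of a 3D volume.

    Assumes axes are ordered (left-right, anterior-posterior, inferior-superior);
    real scans may need reorientation first.
    """
    x, y, z = (s // 2 for s in volume.shape)
    return {
        "sagittal": volume[x, :, :],   # fixed left-right position
        "coronal": volume[:, y, :],    # fixed anterior-posterior position
        "axial": volume[:, :, z],      # fixed inferior-superior position
    }

volume = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)   # toy stand-in for an sMRI volume
planes = central_planes(volume)
```

Each plane is a 2D image of a different shape, which is one reason multi-plane models typically use a separate feature extractor per plane before fusing the results.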

[CV-41] VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation

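【Quick Read】: This paper addresses the limitations of existing evaluation metrics for Text-to-Image (T2I) generation in multi-dimensionality, user orientation, and sensitivity to abstract semantics. Traditional metrics tend to cover only quantifiable attributes (such as text rendering or lighting), have difficulty measuring abstract dimensions such as style fusion and cultural fidelity, and do not distinguish the needs of different user roles. The key to the solution is the VISTAR benchmark, which uses a two-tier hybrid paradigm: deterministic, scriptable metrics for physically quantifiable attributes, and a Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics. Grounded in a Delphi study with 120 experts, the benchmark defines seven user roles and nine evaluation angles and comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. The metrics reach 75% human alignment, and HWPQ reaches 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines.

Link: https://arxiv.org/abs/2508.06152
Authors: Kaiyuan Jiang, Ruoxi Sun, Ying Cao, Yuqi Xu, Xinran Zhang, Junyan Guo, ChengSheng Deng
Affiliations: Peking University; LinkSure; University of Glasgow; Boston University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures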

Abstract:We present VISTAR, a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. VISTAR introduces a two-tier hybrid paradigm: it employs deterministic, scriptable metrics for physically quantifiable attributes (e.g., text rendering, lighting) and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics (e.g., style fusion, cultural fidelity). Grounded in a Delphi study with 120 experts, we defined seven user roles and nine evaluation angles to construct the benchmark, which comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. Our metrics achieve high human alignment (75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Comprehensive evaluation of state-of-the-art models reveals no universal champion, as role-weighted scores reorder rankings and provide actionable guidance for domain-specific deployment. All resources are publicly released to foster reproducible T2I assessment.

[CV-42] Improving Diagnostic Accuracy for Oral Cancer with inpainting Synthesis Lesions Generated Using Diffusion Models

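【Quick Read】: This paper addresses the way scarce annotated data limits diagnostic models for oral cancer, in particular the effect of variable and insufficient training data on generalization. The key to the solution is to synthesize realistic oral cancer lesions with an inpainting technique driven by a fine-tuned diffusion model, which augments the training set and improves both classification and localization. The classification model reaches a diagnostic accuracy of 0.97 and the detection model identifies lesion locations with 0.85 accuracy, supporting the potential of synthetic images in medical diagnostics.

Link: https://arxiv.org/abs/2508.06151
Authors: Yong Oh Lee, JeeEun Kim, Jung Woo Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: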

Abstract:In oral cancer diagnostics, the limited availability of annotated datasets frequently constrains the performance of diagnostic models, particularly due to the variability and insufficiency of training data. To address these challenges, this study proposed a novel approach to enhance diagnostic accuracy by synthesizing realistic oral cancer lesions using an inpainting technique with a fine-tuned diffusion model. We compiled a comprehensive dataset from multiple sources, featuring a variety of oral cancer images. Our method generated synthetic lesions that exhibit a high degree of visual fidelity to actual lesions, thereby significantly enhancing the performance of diagnostic algorithms. The results show that our classification model achieved a diagnostic accuracy of 0.97 in differentiating between cancerous and non-cancerous tissues, while our detection model accurately identified lesion locations with 0.85 accuracy. This method validates the potential for synthetic image generation in medical diagnostics and paves the way for further research into extending these methods to other types of cancer diagnostics.

[CV-43] DSConv: Dynamic Splitting Convolution for Pansharpening

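【Quick Read】: This paper addresses pansharpening, the fusion of a multi-spectral image (MS) with a panchromatic image (PAN) to obtain a high-resolution image that preserves spatial detail and spectral fidelity. Existing methods rely mainly on standard convolutions, which do not fully exploit the inter-pixel correlations of remote sensing images. The key to the solution is Dynamic Splitting Convolution (DSConv), which uses attention to select positions of interest and dynamically splits the original convolution kernel into multiple smaller kernels. This extracts features at different positions within the receptive field more effectively and improves the network's generalization, optimization, and feature representation.

Link: https://arxiv.org/abs/2508.06147
Authors: Xuanyu Liu, Bonan An
Affiliations: State Key Laboratory of Photonics and Communications, School of Electronics, Peking University; National Mobile Communications Research Laboratory, Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: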

Abstract: Aiming to obtain a high-resolution image, pansharpening involves the fusion of a multi-spectral image (MS) and a panchromatic image (PAN), the low-level vision task remaining significant and challenging in contemporary research. Most existing approaches rely predominantly on standard convolutions, few making the effort to adaptive convolutions, which are effective owing to the inter-pixel correlations of remote sensing images. In this paper, we propose a novel strategy for dynamically splitting convolution kernels in conjunction with attention, selecting positions of interest, and splitting the original convolution kernel into multiple smaller kernels, named DSConv. The proposed DSConv more effectively extracts features of different positions within the receptive field, enhancing the network’s generalization, optimization, and feature representation capabilities. Furthermore, we innovate and enrich concepts of dynamic splitting convolution and provide a novel network architecture for pansharpening capable of achieving the tasks more efficiently, building upon this methodology. Adequate fair experiments illustrate the effectiveness and the state-of-the-art performance attained by DSConv. Comprehensive and rigorous discussions proved the superiority and optimal usage conditions of DSConv.

[CV-44] Text-guided Visual Prompt DINO for Generic Segmentation

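【Quick Read】: This paper addresses three problems of multimodal vision models in open-world segmentation: late-stage feature fusion that limits cross-modal interaction, query selection in DETR-based architectures that lacks structural alignment between text and visual queries, and vocabulary coverage restricted by caption-derived vocabularies. The key to the solution is the Prompt-DINO framework, with three innovations. First, an early fusion mechanism unifies text/visual prompts and backbone features at the initial encoding stage, deepening cross-modal interaction to resolve semantic ambiguities. Second, order-aligned query selection explicitly optimizes the alignment between text and visual queries during decoding for better semantic-spatial consistency. Third, a generative data engine based on the Recognize Anything via Prompting (RAP) model synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% and extending semantic coverage beyond a fixed vocabulary.

Link: https://arxiv.org/abs/2508.06146
Authors: Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; WeChat AI, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: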

Abstract: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data and code are available at this https URL.

[CV-45] SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models

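[Quick read]: This paper addresses two core problems in safety evaluation of Multimodal Large Language Models (MLLMs): existing safety benchmarks become outdated as models evolve, and they are exposed to data contamination. The authors propose SDEval, described as the first dynamic safety evaluation framework. Its key idea is three controllable dynamic strategies (text dynamics, image dynamics, and joint text-image dynamics) that generate new samples from the original benchmarks, so that the distribution and complexity of the safety evaluation can be adjusted flexibly. Experiments show that the method mitigates data contamination, exposes latent safety weaknesses of MLLMs, and generalizes to a range of existing safety and capability benchmarks.

Link: https://arxiv.org/abs/2508.06142
Authors: Hanqing Wang, Yuan Tian, Mingyu Liu, Zhenhao Zhang, Xiangyang Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: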

Abstract: In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at this https URL

[CV-46] DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera

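[Quick read]: This paper addresses how to effectively fuse signals from sparse Inertial Measurement Units (IMUs) and a monocular camera for real-time human motion capture. The key solution is a diffusion-model-based method that fuses the two modalities by modeling the characteristics of each separately. The sequential visual information is converted as a whole into a single condition embedding, which improves robustness to abnormal frames caused by occlusion or the subject leaving the camera view. The IMU data is concatenated frame by frame with the noisy body pose to form the sequential input, so that the temporal motion prior is fully exploited. This design lets the two modalities complement each other and improves the accuracy and stability of pose estimation.

Link: https://arxiv.org/abs/2508.06139
Authors: Shaohua Pan, Xinyu Yi, Yan Zhou, Weihua Jian, Yuan Zhang, Pengfei Wan, Feng Xu
Affiliations: Tsinghua University; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: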

Abstract:Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By delicately considering the characteristics of the two signals, the sequential visual information is considered as a whole and transformed into a condition embedding, while the inertial measurement is concatenated with the noisy body pose frame by frame to construct a sequential input for the diffusion model. Firstly, we observe that the visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Thus incorporating the sequential visual features as a whole to get a single feature embedding is robust to the occasional degenerations of visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and always stable when signal transmission has no problem. So incorporating them frame-wisely could better explore the temporal information for the system. Experiments have demonstrated the effectiveness of the system design and its state-of-the-art performance in pose estimation compared with the previous works. Our codes are available for research at this https URL.
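
The way the two signals enter the model can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: mean pooling over the frames that carry visual information, the function names, and the toy feature sizes are all hypothetical.

```python
def pool_visual_condition(visual_feats):
    """Collapse per-frame visual features into one condition vector.

    Frames without a visual signal (occlusion, subject out of view) are
    passed as None and skipped. Mean pooling is an assumption of this sketch.
    """
    valid = [f for f in visual_feats if f is not None]
    if not valid:
        raise ValueError("no frame carries visual information")
    dim = len(valid[0])
    return [sum(f[i] for f in valid) / len(valid) for i in range(dim)]


def build_sequence_input(imu_frames, noisy_pose_frames):
    """Concatenate IMU readings with the noisy pose, frame by frame."""
    if len(imu_frames) != len(noisy_pose_frames):
        raise ValueError("IMU and pose sequences must have equal length")
    return [imu + pose for imu, pose in zip(imu_frames, noisy_pose_frames)]


visual = [[1.0, 2.0], None, [3.0, 4.0]]           # the middle frame is occluded
imu = [[0.1], [0.2], [0.3]]
noisy_pose = [[0.5, 0.5], [0.4, 0.6], [0.3, 0.7]]

condition = pool_visual_condition(visual)          # one vector for the whole clip
sequence = build_sequence_input(imu, noisy_pose)   # one vector per frame
```

The occluded frame changes nothing in the per-frame sequence and only drops out of the pooled condition, which is the robustness argument the abstract makes.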

[CV-47] Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation

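[Quick read]: This paper addresses limitations of existing gaze redirection methods in producing high-quality, photorealistic images, in particular that methods based on neural radiance fields (NeRF) cannot explicitly model the rotation and translation of the 3D representation. The key to the solution is an explicit 3D eyeball structure represented with 3D Gaussian Splatting (3DGS), so that the eyeball's rotation and translation can be controlled precisely to generate images that match the target gaze direction. An adaptive deformation module additionally imitates the subtle muscle movements around the eyes, further improving realism and detail.

Link: https://arxiv.org/abs/2508.06136
Authors: YoungChan Choi, HengFei Wang, YiHua Cheng, Boeun Kim, Hyung Jin Chang, YoungGeun Choi, Sang-Il Choi
Affiliations: Dankook University; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted at ACM Multimedia 2025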

Abstract:We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.
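
Explicit rotation of an eyeball can be illustrated on a set of 3D points, for instance the centers of the Gaussians that represent it. The sketch below is an assumption-laden toy, not the paper's method: the function name, the pitch-then-yaw order, and the axis conventions are hypothetical, and a full 3DGS pipeline would also rotate each Gaussian's orientation.

```python
import math


def rotate_about_center(points, center, pitch, yaw):
    """Rotate 3D points about `center`: pitch about the x axis, then yaw about the y axis.

    Angles are in radians. Only point positions are rotated here.
    """
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rotated = []
    for x, y, z in points:
        # Move to a frame centered on the eyeball.
        x, y, z = x - center[0], y - center[1], z - center[2]
        # Pitch: rotation about the x axis.
        y, z = cp * y - sp * z, sp * y + cp * z
        # Yaw: rotation about the y axis.
        x, z = cy * x + sy * z, -sy * x + cy * z
        rotated.append((x + center[0], y + center[1], z + center[2]))
    return rotated


eye_center = (0.0, 0.0, 0.0)
pupil = [(0.0, 0.0, 1.0)]                      # a point looking along +z
turned = rotate_about_center(pupil, eye_center, pitch=0.0, yaw=math.pi / 2)
```

Because the transform is a pure rotation about the eyeball center, every point keeps its distance to that center, which is what makes the gaze change geometrically consistent.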

[CV-48] SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures ICCV2025

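[Quick read]: This paper addresses transferable attacks that stem from inherent vulnerabilities of the Segment Anything Model (SAM) in downstream tasks: adversarial examples produced by existing attacks on SAM transfer poorly across domains and models, which makes it hard to assess the potential risk fully. The key to the solution is the Vertex-Refining Simplicial Complex Attack (VeSCA), which uses only SAM's encoder. It explicitly characterizes the vulnerable regions shared by SAM and downstream models through a parametric simplicial complex and identifies adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation mechanism bridges domain divergence with minimal reference data, and highly transferable adversarial examples are finally generated by random simplicial complex sampling. Experiments across three downstream model categories and five domain-specific datasets show a 12.7% performance improvement over state-of-the-art methods.

Link: https://arxiv.org/abs/2508.06127
Authors: Yi Qin, Rui Wang, Tao Huang, Tong Xiao, Liping Jing
Affiliations: Beijing Jiaotong University; State Key Laboratory of Advanced Rail Autonomous Operation; Beijing Key Lab of Traffic Data Mining and Embodied Intelligence; Collaborative Innovation Center of Railway Traffic Safety; National Engineering Research Center of Rail Transportation Operation and Control System
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, received by ICCV 2025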

Abstract:While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of numerous downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often present limited transferability due to insufficient exploration of common weakness across domains. To address this, we propose Vertex-Refining Simplicial Complex Attack (VeSCA), a novel method that leverages only the encoder of SAM for generating transferable adversarial examples. Specifically, it achieves this by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data during the initialization of simplicial complex. Ultimately, VeSCA generates consistently transferable adversarial examples through random simplicial complex sampling. Extensive experiments demonstrate that VeSCA achieves performance improved by 12.7% compared to state-of-the-art methods across three downstream model categories across five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM’s vulnerabilities and emphasize the urgency of developing more robust foundation models.
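
The final step, random sampling from a simplicial complex, can be illustrated for a single simplex as drawing a random convex combination of its vertices. The sketch below assumes uniform (Dirichlet) weights and toy two-dimensional perturbation vectors; the function name and values are hypothetical, and the vertex refinement that would produce the vertices is not shown.

```python
import random


def sample_from_simplex(vertices, rng):
    """Draw a uniformly random convex combination of the simplex vertices."""
    # Normalized Exp(1) draws are Dirichlet(1, ..., 1) weights,
    # which are uniformly distributed over the simplex.
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(vertices[0])
    point = [sum(w * v[i] for w, v in zip(weights, vertices)) for i in range(dim)]
    return point, weights


rng = random.Random(0)
vertices = [[0.0, 0.0], [0.03, 0.0], [0.0, 0.03]]   # three toy perturbation vectors
perturbation, weights = sample_from_simplex(vertices, rng)
```

Every sample stays inside the region spanned by the vertices, so if the whole simplex lies in an adversarially potent region, each draw is a new candidate perturbation from that region.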

[CV-49] SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning ICCV2025

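[Quick read]: This paper addresses the lack of self-correction ability in image captioning models, with the aim of making generated captions more accurate and consistent. The key to the solution is SC-Captioner, a reinforcement learning framework whose core innovation is a fine-grained reward function. A scene-graph parsing algorithm decomposes the initial prediction and the reference caption into object, attribute, and relation sets. The set difference between the captions before and after self-correction is then computed and matched against the reference sets, yielding a correctness bonus for accurate corrections and a mistake punishment for wrong ones. Together these form the reward signal used to optimize the model.

Link: https://arxiv.org/abs/2508.06125
Authors: Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, Tao Chen
Affiliations: Fudan University; StepFun; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025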

Abstract:We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
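
The set-difference reward can be sketched on plain Python sets. This is a simplified illustration, not the paper's reward: it treats objects, attributes, and relations as one flat set of strings, uses exact matching against the reference, and the unit bonus and penalty weights are assumptions.

```python
def self_correction_reward(initial, corrected, reference, bonus=1.0, penalty=1.0):
    """Score a self-correction step from what it added and removed.

    All arguments are sets of parsed caption elements.
    """
    added = corrected - initial      # elements introduced by the correction
    removed = initial - corrected    # elements dropped by the correction

    good_additions = added & reference      # added and supported by the reference
    bad_additions = added - reference       # added but not in the reference
    good_removals = removed - reference     # removed and indeed wrong
    bad_removals = removed & reference      # removed although correct

    n_good = len(good_additions) + len(good_removals)
    n_bad = len(bad_additions) + len(bad_removals)
    return bonus * n_good - penalty * n_bad


reference = {"dog", "blue ball", "cat"}
initial = {"dog", "red ball", "cat"}

clean_fix = self_correction_reward(initial, {"dog", "blue ball", "cat"}, reference)
mixed_fix = self_correction_reward(initial, {"dog", "blue ball", "tree"}, reference)
```

In the first case the correction swaps a wrong attribute for the right one and is rewarded twice; in the second, the same fix is cancelled out by adding a spurious object and deleting a correct one.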

[CV-50] Learning Representations of Satellite Images with Evaluations on Synoptic Weather Events

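[Quick read]: This paper studies how representation learning algorithms can extract effective features from satellite images to improve weather event classification. It compares three methods, principal component analysis (PCA), a convolutional autoencoder (CAE), and a pre-trained residual network (PT), across classification tasks for different weather events. CAE achieves the highest threat score on every task, and the deep learning models (CAE and PT) trained on higher-resolution data outperform their lower-resolution counterparts overall. A smaller latent dimension (<128) has little effect on the hit rate but markedly raises the false-alarm rate, so the latent dimension needs to be chosen with care. The study also notes that the current CAE lacks physical interpretability and that a physics-informed CAE is a promising direction.

Link: https://arxiv.org/abs/2508.06122
Authors: Ting-Shuo Yo, Shih-Hao Su, Chien-Ming Wu, Wei-Ting Chen, Jung-Lien Chu, Chiao-Wei Chang, Hung-Chi Kuo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 37 pages, 6 figures, 3 tables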

Abstract:This study applied representation learning algorithms to satellite images and evaluated the learned latent spaces with classifications of various weather events. The algorithms investigated include the classical linear transformation, i.e., principal component analysis (PCA), state-of-the-art deep learning method, i.e., convolutional autoencoder (CAE), and a residual network pre-trained with large image datasets (PT). The experiment results indicated that the latent space learned by CAE consistently showed higher threat scores for all classification tasks. The classifications with PCA yielded high hit rates but also high false-alarm rates. In addition, the PT performed exceptionally well at recognizing tropical cyclones but was inferior in other tasks. Further experiments suggested that representations learned from higher-resolution datasets are superior in all classification tasks for deep-learning algorithms, i.e., CAE and PT. We also found that smaller latent space sizes had minor impact on the classification task’s hit rate. Still, a latent space dimension smaller than 128 caused a significantly higher false alarm rate. Though the CAE can learn latent spaces effectively and efficiently, the interpretation of the learned representation lacks direct connections to physical attributions. Therefore, developing a physics-informed version of CAE can be a promising outlook for the current work.
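
For readers unfamiliar with the verification scores named here, the sketch below computes them from a binary contingency table using the standard forecast-verification definitions. The abstract does not state which false-alarm definition the authors use, so the false-alarm ratio shown is an assumption, and the counts are made up for illustration.

```python
def forecast_scores(hits, misses, false_alarms):
    """Standard contingency-table scores for a binary event classifier."""
    return {
        # Fraction of observed events that were detected.
        "hit_rate": hits / (hits + misses),
        # Fraction of predicted events that did not occur (assumed definition).
        "false_alarm_ratio": false_alarms / (hits + false_alarms),
        # Threat score (critical success index): penalizes misses and false alarms.
        "threat_score": hits / (hits + misses + false_alarms),
    }


scores = forecast_scores(hits=30, misses=10, false_alarms=20)
```

A classifier can reach a high hit rate simply by predicting the event often, which is why the abstract's PCA result (high hit rate, high false-alarm rate) yields a lower threat score than CAE.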

[CV-51] SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation

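[Quick read]: This paper addresses semantic segmentation in open-vocabulary scenarios, where the diversity and fine granularity of categories lead to semantic misalignment and degraded performance. Existing weakly-supervised methods usually rely on category-specific supervision and build features in ways that are ill-suited to contrastive learning, so they struggle to model semantic relations across categories. The key to the solution is SynSeg, a weakly-supervised framework with two core components: (1) Multi-Category Contrastive Learning (MCCL), which jointly optimizes intra-category alignment and inter-category separation so that the model learns the correlations among different categories within the same image; and (2) the Feature Synergy Structure (FSS), which reconstructs discriminative features through prior fusion and semantic-activation-map enhancement, mitigating the foreground bias introduced by the visual encoder. Together they improve semantic localization and discrimination under weak supervision.

Link: https://arxiv.org/abs/2508.06115
Authors: Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu
Affiliations: Tsinghua University; Peking University; Alibaba Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: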

Abstract:Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the knowledge of correlations from different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. In general, SynSeg effectively improves the abilities in semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) performance. For instance, SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on Context, 2.6% on Object and 2.0% on City.

[CV-52] GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving

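[Quick read]: This paper addresses the performance bottleneck that diffusion-based end-to-end autonomous driving systems incur by relying on transformer-based fusion. Existing methods face two core challenges: the quadratic computational complexity of transformers restricts the use of high-resolution features, and the lack of spatial priors makes it hard to model the inherent structure of Bird's Eye View (BEV) representations. The solution rests on two innovations. First, the information-limited histogram-based LiDAR representation is replaced with a geometrically-augmented pillar format that preserves critical 3D geometric detail. Second, a hierarchical gated Mamba fusion (GM-Fusion) architecture replaces the expensive transformer with a spatially-aware state-space model (SSM) of linear complexity, which captures long-range dependencies through directional sequencing and adaptive fusion while explicitly respecting the spatial properties of the driving scene. Experiments show that the framework sets a new state of the art on the NAVSIM benchmark and significantly outperforms DiffusionDrive.

Link: https://arxiv.org/abs/2508.06113
Authors: Jian Wang, Chaokang Jiang, Haitao Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 4 figures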

Abstract:Diffusion-based models are redefining the state-of-the-art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird’s Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.
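
A state-space model consumes a 1D sequence, so a 2D BEV grid has to be flattened in some scan order first. The sketch below shows four simple scan directions as an illustration of what directional sequencing can mean; the exact orderings used by GMF-Drive are not given in the abstract, so these four and the function name are assumptions.

```python
def directional_sequences(grid):
    """Flatten a 2D grid of cells into 1D sequences along four scan directions."""
    n_rows, n_cols = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(n_rows) for c in range(n_cols)]
    col_major = [grid[r][c] for c in range(n_cols) for r in range(n_rows)]
    return {
        "row_major": row_major,
        "row_major_reversed": row_major[::-1],
        "col_major": col_major,
        "col_major_reversed": col_major[::-1],
    }


bev = [[1, 2, 3],
       [4, 5, 6]]           # toy 2x3 BEV grid, one scalar per cell
scans = directional_sequences(bev)
```

Each scan puts a different set of spatial neighbors next to each other in the sequence, which is why combining several directions can give a sequence model a better view of 2D structure than a single flattening.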

[CV-53] FMCE-Net: Feature Map Convergence Evaluation and Training

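[Quick read]: This paper addresses the interpretability difficulty of Deep Neural Networks (DNNs) that arises from opaque internal representations, and in particular the lack of experimental validation and closed-loop integration in the existing Feature Map Convergence Evaluation (FMCE) method. The key to the solution is the FMCE-Net++ training framework. A pretrained, frozen FMCE module serves as an auxiliary head that produces Feature Map Convergence Scores (FMCS); these, together with the task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL uses a tunable Representation Abstraction Factor to balance the primary classification loss against feature convergence optimization, improving model performance without changing the network architecture or adding data.

Link: https://arxiv.org/abs/2508.06109
Authors: Zhibo Zhu, Renyu Huang, Lei He
Affiliations: Tsinghua University; Minzu University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: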

Abstract: Deep Neural Networks (DNNs) face interpretability challenges due to their opaque internal representations. While Feature Map Convergence Evaluation (FMCE) quantifies module-level convergence via Feature Map Convergence Scores (FMCS), it lacks experimental validation and closed-loop integration. To address this limitation, we propose FMCE-Net++, a novel training framework that integrates a pretrained, frozen FMCE-Net as an auxiliary head. This module generates FMCS predictions, which, combined with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL dynamically balances the primary classification loss and feature convergence optimization via a tunable Representation Abstraction Factor. Extensive experiments conducted on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 demonstrate that FMCE-Net++ consistently enhances model performance without architectural modifications or additional data. Key experimental outcomes include accuracy gains of +1.16 pp (ResNet-50/CIFAR-10) and +1.08 pp (ShuffleNet v2/CIFAR-100), validating that FMCE-Net++ can effectively elevate state-of-the-art performance ceilings.
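
One way a tunable factor can balance a task loss against an auxiliary score is a convex combination, sketched below. The abstract does not give the actual form of the loss, so everything here is an assumption: the weighting scheme, the squared-error auxiliary term that pulls a predicted convergence score toward a target, and all names.

```python
import math


def cross_entropy(class_probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(class_probs[label])


def combined_loss(class_probs, label, fmcs_pred, fmcs_target, factor):
    """Blend the task loss with an auxiliary convergence term.

    `factor` in [0, 1] plays the role of the balancing factor:
    0 keeps only the task loss, 1 keeps only the auxiliary term.
    """
    task = cross_entropy(class_probs, label)
    aux = (fmcs_pred - fmcs_target) ** 2     # assumed auxiliary term
    return (1.0 - factor) * task + factor * aux


loss = combined_loss([0.5, 0.5], label=0, fmcs_pred=0.8, fmcs_target=1.0, factor=0.25)
```

With the auxiliary head frozen, gradients from the second term can only change the backbone features, which matches the abstract's description of the head supervising backbone optimization.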

[CV-54] Mask Match: Learning to Recognize Handwritten Math with Self-Supervised Attention

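[Quick read]: This paper addresses Handwritten Mathematical Expression Recognition (HMER), which is challenging because of complex two-dimensional structure, varying symbol scales, and intricate spatial relationships among symbols, especially when labeled data is expensive and scarce. The key to the solution is a Self-Supervised Learning (SSL) framework. First, an image encoder is pretrained with a combination of global and local contrastive losses to learn both holistic and fine-grained representations. Second, a self-supervised attention network is trained with a progressive spatial masking strategy so that it learns to focus on semantically meaningful regions (such as operators, exponents, and nested structures) without any annotation. Finally, a transformer decoder is fine-tuned with supervision to generate LaTeX sequences. The method outperforms existing self-supervised and fully supervised baselines on the CROHME benchmarks, supporting the claim that the progressive attention mechanism strengthens structural understanding.

Link: https://arxiv.org/abs/2508.06107
Authors: Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu
Affiliations: International Institute of Information Technology Hyderabad; Manipal University Jaipur; Indian Institute of Information Technology Guwahati
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: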

Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
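
A progressive masking curriculum can be sketched as a mask ratio that grows over training, applied to a grid of image patches. The linear schedule, the 10% to 50% range, the uniform random patch selection, and the function names below are all assumptions for illustration; the paper's actual schedule is not stated in the abstract.

```python
import random


def mask_ratio(epoch, total_epochs, start=0.1, end=0.5):
    """Linearly increase the fraction of masked patches over training."""
    if total_epochs <= 1:
        return end
    progress = epoch / (total_epochs - 1)
    return start + progress * (end - start)


def random_patch_mask(grid_h, grid_w, ratio, rng):
    """Return a grid_h x grid_w boolean grid; True marks a masked patch."""
    n_patches = grid_h * grid_w
    n_masked = round(ratio * n_patches)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]


rng = random.Random(0)
early = random_patch_mask(4, 4, mask_ratio(0, 10), rng)   # few patches hidden
late = random_patch_mask(4, 4, mask_ratio(9, 10), rng)    # half the patches hidden
```

Starting with light masking and ending with heavy masking is what makes this a curriculum: the network first learns with nearly complete expressions and is gradually forced to infer structure from less visible evidence.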

[CV-55] MCA: 2D-3D Retrieval with Noisy Labels via Multi-level Adaptive Correction and Alignment ICME

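[Quick read]: This paper addresses the performance drop that noisy labels cause in 2D-3D cross-modal retrieval, and in particular the tendency of existing methods, which divide noisy samples independently within each modality, to overfit corrupted labels. The key to the solution is the Multi-level cross-modal adaptive Correction and Alignment (MCA) framework, which has two core mechanisms. Multimodal Joint label Correction (MJC) uses historical self-predictions from both modalities to jointly model cross-modality prediction consistency and refine labels reliably. Multi-level Adaptive Alignment (MAA) strengthens the semantic consistency and discriminability of cross-modal features at different levels of abstraction, improving robustness and generalization under realistic noise.

Link: https://arxiv.org/abs/2508.06104
Authors: Gui Zou, Chaofan Gan, Chern Hong Lim, Supavadee Aramvith, Weiyao Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICMEW 2025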

Abstract: With the increasing availability of 2D and 3D data, significant advancements have been made in the field of cross-modal retrieval. Nevertheless, the existence of imperfect annotations presents considerable challenges, demanding robust solutions for 2D-3D cross-modal retrieval in the presence of noisy label conditions. Existing methods generally address the issue of noise by dividing samples independently within each modality, making them susceptible to overfitting on corrupted labels. To address these issues, we propose a robust 2D-3D Multi-level cross-modal adaptive Correction and Alignment framework (MCA). Specifically, we introduce a Multimodal Joint label Correction (MJC) mechanism that leverages multimodal historical self-predictions to jointly model the modality prediction consistency, enabling reliable label refinement. Additionally, we propose a Multi-level Adaptive Alignment (MAA) strategy to effectively enhance cross-modal feature semantics and discrimination across different levels. Extensive experiments demonstrate the superiority of our method, MCA, which achieves state-of-the-art performance on both conventional and realistic noisy 3D benchmarks, highlighting its generality and effectiveness.
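
The idea of correcting a label only when both modalities' historical predictions agree can be sketched as follows. The abstract does not describe the actual MJC rule, so this is a hypothetical stand-in: the exponential moving average, the agreement test, the 0.8 confidence threshold, and the function names are all assumptions.

```python
def update_history(history, probs, momentum=0.9):
    """Exponential moving average of a sample's predicted class probabilities."""
    return [momentum * h + (1.0 - momentum) * p for h, p in zip(history, probs)]


def correct_label(label, hist_2d, hist_3d, threshold=0.8):
    """Relabel a sample only if both modalities agree with high joint confidence."""
    joint = [(a + b) / 2.0 for a, b in zip(hist_2d, hist_3d)]
    top_2d = max(range(len(hist_2d)), key=hist_2d.__getitem__)
    top_3d = max(range(len(hist_3d)), key=hist_3d.__getitem__)
    if top_2d == top_3d and top_2d != label and joint[top_2d] >= threshold:
        return top_2d
    return label


agreeing = correct_label(0, hist_2d=[0.1, 0.9], hist_3d=[0.15, 0.85])   # relabeled
conflicting = correct_label(0, hist_2d=[0.1, 0.9], hist_3d=[0.7, 0.3])  # kept
```

Requiring cross-modal agreement is the contrast with per-modality noise handling that the abstract criticizes: a single modality that has memorized a corrupted label cannot trigger or block a correction on its own.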

[CV-56] UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization

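[Quick read]: This paper addresses the heavy dependence of Image Manipulation Localization (IML) on large, high-quality annotated datasets, and the complexity and inefficiency of existing Constrained IML (CIML) pipelines. The key solution is UGD-IML, a generative framework based on diffusion models that, for the first time, unifies the IML and CIML tasks in a single architecture. Learning the data distribution reduces the reliance on annotated data; a class embedding mechanism and a parameter-sharing design allow seamless switching between the two modes; and the end-to-end structure avoids cumbersome data annotation steps, improving both performance and practicality.

Link: https://arxiv.org/abs/2508.06101
Authors: Yachun Mi, Xingyang He, Shixin Sun, Yu Li, Yanting Li, Zhixuan Li, Jian Jin, Chen Hui, Shaohui Liu
Affiliations: Tsinghua University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: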

Abstract:In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.

[CV-57] E-React: Towards Emotionally Controlled Synthesis of Human Reactions

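[Quick read]: This paper addresses the neglect of emotion in existing human motion generation frameworks, which limits their naturalness and their use in interactive tasks such as human reaction synthesis. The core challenge is learning an emotion representation from limited motion data and integrating it into a motion generation model. The key to the solution is a semi-supervised emotion prior inside an actor-reactor diffusion model. Based on the observation that motion clips within a short sequence usually share the same emotion, a semi-supervised learning framework first trains the emotion prior. The prior is then used when training the diffusion model, so that generated reactions account for both spatial interaction and emotional response, enabling diverse, emotion-driven reaction synthesis.

Link: https://arxiv.org/abs/2508.06093
Authors: Chen Zhu, Buzhen Huang, Zijing Wu, Binghui Zuo, Yangang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: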

Abstract:Emotion serves as an essential component in daily human interactions. Existing human motion generation frameworks do not consider the impact of emotions, which reduces naturalness and limits their application in interactive tasks, such as human reaction synthesis. In this work, we introduce a novel task: generating diverse reaction motions in response to different emotional cues. However, learning emotion representation from limited motion data and incorporating it into a motion generation framework remains a challenging problem. To address the above obstacles, we introduce a semi-supervised emotion prior in an actor-reactor diffusion model to facilitate emotion-driven reaction synthesis. Specifically, based on the observation that motion clips within a short sequence tend to share the same emotion, we first devise a semi-supervised learning framework to train an emotion prior. With this prior, we further train an actor-reactor diffusion model to generate reactions by considering both spatial interaction and emotional response. Finally, given a motion sequence of an actor, our approach can generate realistic reactions under various emotional conditions. Experimental results demonstrate that our model outperforms existing reaction generation methods. The code and data will be made publicly available at this https URL

[CV-58] Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

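[Quick read]: This paper addresses two core problems that arise when Video Quality Assessment (VQA) methods rely on pretraining with large-scale classification datasets: transferring semantic knowledge alone is insufficient to capture the multiple factors behind video quality (such as distortion, motion, and aesthetics), and pretraining consumes far more compute than training directly on VQA datasets. The key to the solution is Q-CLIP, a framework built entirely on Vision-Language Models (VLMs). A Shared Cross-Modal Adapter (SCMA) enhances both visual and textual representations with very few trainable parameters, which sharply reduces computational cost. Five learnable quality-level prompts guide the model to perceive subtle quality differences and make it more sensitive to quality changes. The paper also finds that frame-difference-based sampling gives better cross-dataset generalization.

Link: https://arxiv.org/abs/2508.06092
Authors: Yachun Mi, Yu Li, Yanting Li, Shixin Sun, Chen Hui, Tong Zhang, Yuanyuan Liu, Chenyue Song, Shaohui Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: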

Abstract:Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model’s sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
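
One common way to turn a handful of quality-level prompts into a single score is to softmax the video-to-prompt similarities and take the expected level. The sketch below shows that pattern; whether Q-CLIP aggregates its five prompts this way is not stated in the abstract, so the level values, the temperature, and the function name are assumptions.

```python
import math

# Assumed: one numeric level per learnable prompt, ordered worst to best.
QUALITY_LEVELS = [1, 2, 3, 4, 5]


def quality_score(similarities, temperature=0.01):
    """Expected quality level under a softmax over prompt similarities."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)                              # subtract the max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * level for p, level in zip(probs, QUALITY_LEVELS))


undecided = quality_score([0.2, 0.2, 0.2, 0.2, 0.2])    # no prompt preferred
confident = quality_score([0.1, 0.1, 0.1, 0.1, 0.9])    # strongly matches the best level
```

Because the output is a weighted average over levels, small shifts in similarity move the score continuously, which suits the fine-grained quality differences the abstract emphasizes.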

[CV-59] AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

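[Quick read]: This paper addresses the high computational cost that Vision-Language Models (VLMs) incur at inference from processing a large number of vision tokens, especially in the prefill stage. Existing pruning methods usually rely on static text prompts or attention patterns and do not exploit the dynamic internal signals produced during inference. The key to the solution is AdaptInfer, a plug-and-play framework with two core ideas: (1) a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to build soft priors, allowing vision-token importance to be scored more accurately at each stage; and (2) an offline analysis of cross-modal attention shifts that identifies consistent inflection points in inference, which motivates a more principled and efficient pruning schedule. The method substantially reduces latency while maintaining high accuracy and applies to a range of multimodal tasks.

Link: https://arxiv.org/abs/2508.06084
Authors: Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu
Affiliations: University of Science and Technology of China; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: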

Abstract:Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering (VQA), but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight, plug-and-play, and generalizable across multi-modal tasks. Experimental results verify the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.
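
The text-guided scoring idea in the AdaptInfer abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the function names (`text_token_prior`, `score_vision_tokens`, `keep_top_tokens`), the column-averaging used to build the prior, and the toy attention values are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def text_token_prior(t2t_attn: Matrix) -> List[float]:
    """Soft prior over text tokens (assumed form): the average attention each
    text token receives from all text tokens, normalized to sum to 1."""
    n = len(t2t_attn)
    received = [sum(row[j] for row in t2t_attn) / n for j in range(n)]
    total = sum(received)
    return [r / total for r in received]


def score_vision_tokens(t2v_attn: Matrix, prior: List[float]) -> List[float]:
    """Score each vision token by text-to-vision attention, weighting every
    text token's row by its prior importance."""
    num_vision = len(t2v_attn[0])
    return [
        sum(p * row[v] for p, row in zip(prior, t2v_attn))
        for v in range(num_vision)
    ]


def keep_top_tokens(scores: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: 2 text tokens, 3 vision tokens.
t2t = [[1.0, 0.0], [0.5, 0.5]]            # text-to-text attention
t2v = [[0.1, 0.7, 0.2], [0.6, 0.2, 0.2]]  # text-to-vision attention
prior = text_token_prior(t2t)
scores = score_vision_tokens(t2v, prior)
kept = keep_top_tokens(scores, keep=2)
```

In a real model the attention maps would come from a specific decoder layer, and the pruning schedule (which layers prune, and how many tokens) would follow the offline analysis the abstract mentions; neither is modeled here.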

[CV-60] SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment

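【速读】: This paper addresses the computational overhead of diffusion- and flow-based video generation, which requires many iterative sampling steps, and the tendency of existing distillation methods to break down or introduce artifacts in few-step settings. The key to the solution is SwiftVideo, a unified and stable distillation framework. It uses continuous-time consistency distillation to preserve ODE trajectories precisely, and adds a dual-perspective alignment: distribution alignment between synthetic and real data, and trajectory alignment across different inference steps. Together these maintain high-quality video generation with far fewer inference steps.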

链接: https://arxiv.org/abs/2508.06082
作者: Yanxiao Sun,Jiafu Wu,Yun Cao,Chengming Xu,Yabiao Wang,Weijian Cao,Donghao Luo,Chengjie Wang,Yanwei Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.

[CV-61] DreamVE: Unified Instruction-based Image and Video Editing

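【速读】: This paper addresses the scarcity of training data that limits instruction-based image and video editing, a problem that is especially pronounced for video. The solution is DreamVE, a unified model trained in two stages: image editing first, then video editing. DreamVE is pretrained on large-scale collage-based synthetic data to acquire efficient, generalizable priors, and then fine-tuned on data synthesized by generative models to cover the attribute-editing cases that collage-based data lacks. The key innovations are combining the strengths of the two data synthesis pipelines to produce high-quality, diverse editing pairs efficiently, and an editing framework based on token concatenation with early drop that preserves strong consistency while improving editability.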

链接: https://arxiv.org/abs/2508.06080
作者: Bin Xia,Jiyang Liu,Yuechen Zhang,Bohao Peng,Ruihang Chu,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
机构: CUHK (香港中文大学); ByteDance Inc (字节跳动公司); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. Besides, we design an efficient and powerful editing framework for DreamVE. We build on the SOTA T2V model and use a token concatenation with early drop approach to inject source image guidance, ensuring strong consistency and editability. The codes and models will be released.

[CV-62] Towards MR-Based Trochleoplasty Planning MICCAI

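【速读】: This paper addresses inconsistent outcomes in surgical planning for trochlear dysplasia (TD), which currently relies on low-resolution clinical magnetic resonance (MR) scans and the surgeon's experience. Existing approaches make precise planning of minimally invasive surgery difficult and often need an additional CT scan for higher-resolution anatomy, which increases radiation exposure. The key to the solution is an end-to-end image processing pipeline: first, an Implicit Neural Representation (INR) produces an isotropic super-resolved MR volume; next, a custom-trained multi-label segmentation network delineates the femur, tibia, patella, and fibula; finally, a Wavelet Diffusion Model (WDM) generates a patient-specific pseudo-healthy target morphology of the trochlear region at sub-millimeter resolution. The method yields 3D anatomy without CT and can serve as a preoperative blueprint for reshaping the femoral groove while preserving the native patella articulation; it significantly improves the sulcus angle (SA) and trochlear groove depth (TGD).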

链接: https://arxiv.org/abs/2508.06076
作者: Michael Wehrli,Alicia Durrer,Paul Friedrich,Sidaty El Hadramy,Edwin Li,Luana Brahaj,Carol C. Hasler,Philippe C. Cattin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at MICCAI COLAS Workshop 2025. Code: this https URL

点击查看摘要

Abstract:To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. The surgeries are planned based on surgeons’ experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter resolved 3D shapes suitable for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, we do not require a CT for our pipeline, which reduces radiation exposure. We evaluated our approach on 25 TD patients and showed that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at this https URL.

[CV-63] Can Large Models Fool the Eye? A New Turing Test for Biological Animation

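【速读】: This paper addresses the lack of intuitive, immediate, and perceptible feedback on performance differences when evaluating large language models (LLMs) and multimodal large language models (MLLMs). Existing benchmarks either score models against ground truth on static datasets or collect indistinct, chatbot-style textual human preferences, neither of which offers a clear visual comparison. The key to the solution is BioMotion Arena, a framework that evaluates models through visual animation, using point-light displays of biological motion patterns to amplify performance gaps between models. It uses pairwise comparison and collects more than 45,000 human votes, which supports its value for discriminative feedback. The study also finds that most frontier models fail to produce basic humanoid point-light groups, let alone natural biological motion, making BioMotion Arena a challenging benchmark that does not depend on ground truth.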

链接: https://arxiv.org/abs/2508.06072
作者: Zijian Chen,Lirong Deng,Zhengyu Chen,Kaiwei Zhang,Qi Jia,Yuan Tian,Yucheng Zhu,Guangtao Zhai
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团); 3. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 10 figures

点击查看摘要

Abstract:Evaluating the abilities of large models and manifesting their gaps are challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct textual chatbot-style human preferences collection, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of motion patterns characteristic of living organisms that utilizes point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground-truth.

[CV-64] Distribution-Specific Learning for Joint Salient and Camouflaged Object Detection

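【速读】: This paper addresses the performance drop that occurs when salient object detection (SOD) and camouflaged object detection (COD) are learned jointly, because the two tasks have conflicting attributes. The conventional view holds that, since SOD targets the most salient objects in an image and COD targets camouflaged objects that blend into the background, joint training confuses the network and hurts both tasks. The paper proposes the SCJoint joint learning scheme, whose core idea is to insert a very small number of task-specific learnable parameters into a fully shared network so that the means and variances of the SOD and COD decoding processes are learned separately, decoupling the conflicting attributes at minimal cost. It also introduces a saliency-based sampling strategy (SBSS) that balances the training set sizes of the two tasks, improves data quality, and shortens training time. The resulting network, JoNet, can detect both salient and camouflaged objects.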

链接: https://arxiv.org/abs/2508.06063
作者: Chao Hao,Zitong Yu,Xin Liu,Yuhao Wang,Weicheng Xie,Jingang Shi,Huanjing Yue,Jingyu Yang
机构: Great Bay University (大湾区大学); Lappeenranta-Lahti University of Technology LUT (拉彭兰塔-拉赫蒂理工大学); Shenzhen University (深圳大学); Xi’an Jiaotong University (西安交通大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strong contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which has the ability to simultaneously capture both “salient” and “camouflaged”. Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at this https URL.
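
One way to picture "task-specific means and variances inside a fully shared network" is a normalization layer whose scale and shift are selected by task. The sketch below is a guess at the general shape of that idea, not the SCJoint code: the class name `TaskSpecificNorm`, the per-task `(scale, shift)` pairs, and the use of a plain 1-D feature vector are all assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Standardize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class TaskSpecificNorm:
    """Shared normalization followed by a per-task scale and shift.

    Everything upstream and downstream would be shared between tasks; only
    the (scale, shift) pair differs, so each task gets its own output
    standard deviation and mean at the cost of two parameters.
    """

    def __init__(self, tasks: Sequence[str]) -> None:
        # Hypothetical learnable parameters, initialized to the identity map.
        self.params: Dict[str, Tuple[float, float]] = {t: (1.0, 0.0) for t in tasks}

    def __call__(self, x: Sequence[float], task: str) -> List[float]:
        scale, shift = self.params[task]
        return [scale * v + shift for v in normalize(x)]


norm = TaskSpecificNorm(["sod", "cod"])
norm.params["cod"] = (2.0, 1.0)  # pretend training moved the COD statistics
features = [1.0, 2.0, 3.0]
sod_out = norm(features, "sod")
cod_out = norm(features, "cod")
```

The same shared features yield differently distributed outputs per task, which is the decoupling effect the abstract describes; where such layers sit in the decoder is not specified in the abstract and is not modeled here.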

[CV-65] Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention

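【速读】: This paper addresses the aliasing and artifacts that arise during demosaicing for event-based cameras such as HybridEVS, where a Quad Bayer Color Filter Array (CFA) is combined with event pixels that carry no color information; existing methods struggle with this, especially on resource-limited mobile devices. The key to the solution is TSANet, a lightweight two-stage network built on state space augmented cross-attention that handles event-pixel inpainting and demosaicing as separate subtasks, reducing complexity and improving performance. It also introduces a lightweight Cross-Swin State Block that uses a positional prior for demosaicing and strengthens global dependencies through a state space model with linear complexity. Across seven datasets, TSANet averages better PSNR and SSIM than the previous state-of-the-art method DemosaicFormer while reducing parameter count and computation cost by 1.86× and 3.29×, respectively.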

链接: https://arxiv.org/abs/2508.06058
作者: Shiyang Zhou,Haijin Zeng,Yunfan Lu,Yongyong Chen,Jie Liu,Jingyong Su
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Harvard University (哈佛大学); HKUST(GZ) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous “events” instead of frames, offering advanced applications in mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts in the demosaicing process before downstream application. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce TSANet, a lightweight Two-stage network via State space augmented cross-Attention, which can handle event pixels inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes positional prior for demosaicing and enhances global dependencies through the state space model with linear complexity. In summary, TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than the previous state-of-the-art method DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while reducing parameter and computation costs by 1.86× and 3.29×, respectively. Our approach presents new possibilities for efficient image demosaicing on mobile devices. Code is available in the supplementary materials.

[CV-66] AGI for the Earth, the path, possibilities and how to evaluate intelligence of models that work with Earth Observation Data?

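【速读】: This paper addresses the lack of a systematic benchmark for evaluating AI models on Earth Observation (EO) data, in particular for measuring how well foundation models generalize in this domain. The key to the solution is a proposed comprehensive set of tasks that a benchmark should cover in order to assess a model's ability to understand and interact with Earth observation data, with the aim of advancing AGI's understanding of the natural world.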

链接: https://arxiv.org/abs/2508.06057
作者: Mojtaba Valipour,Kelly Zheng,James Lowman,Spencer Szabados,Mike Gartner,Bobby Braswell
机构: HUM.AI
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in IGARSS 2025!

点击查看摘要

Abstract:Artificial General Intelligence (AGI) is closer than ever to becoming a reality, sparking widespread enthusiasm in the research community to collect and work with various modalities, including text, image, video, and audio. Despite recent efforts, satellite spectral imagery, as an additional modality, has yet to receive the attention it deserves. This area presents unique challenges, but also holds great promise in advancing the capabilities of AGI in understanding the natural world. In this paper, we argue why Earth Observation data is useful for an intelligent model, and then we review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. This paper emphasizes the need for a more comprehensive benchmark to evaluate earth observation models. To facilitate this, we propose a comprehensive set of tasks that a benchmark should encompass to effectively assess a model’s ability to understand and interact with Earth observation data.

[CV-67] LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer’s disease

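【速读】: This paper addresses two challenges in using lateral ventricle (LV) shape analysis as a biomarker for neurological diseases: substantial shape variability across individuals and segmentation difficulties caused by limited MRI resolution. The key to the solution is LV-Net, a framework that produces individualized 3D LV meshes by deforming a joint template mesh that encodes anatomical priors for both the LV and the hippocampus. The anatomical relationships embedded in the template reduce boundary segmentation artifacts and improve reconstruction robustness, and classifying template vertices by anatomical adjacency improves point correspondence across subjects, leading to more accurate LV shape statistics. Experiments show superior reconstruction accuracy even with imperfect segmentations, and an Alzheimer’s disease analysis identifies LV subregions significantly associated with the disease relative to cognitively normal controls.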

链接: https://arxiv.org/abs/2508.06055
作者: Wonjung Park,Suhyun Ahn,Jinah Park(for the Alzheimer’s Disease Neuroimaging Initiative, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing)
机构: Alzheimer’s Disease Neuroimaging Initiative (阿尔茨海默病神经影像计划); Australian Imaging Biomarkers and Lifestyle flagship study of ageing (澳大利亚成像生物标志物和生活方式旗舰研究项目)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer’s disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The codes for LV shape modeling are available at this https URL.

[CV-68] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning

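【速读】: This paper addresses two limitations of existing video quality assessment (VQA) models: poor generalization to out-of-distribution (OOD) videos and limited explainability. The key to the solution is VQAThinker, a reasoning-based VQA framework that combines large multimodal models (LMMs) with reinforcement learning (RL) to jointly model video quality understanding and scoring under score-level supervision, emulating human perceptual decision-making. Specifically, it adopts group relative policy optimization (GRPO), a rule-guided RL algorithm, and designs three VQA-specific rewards: a bell-shaped regression reward, a pairwise ranking reward, and a temporal consistency reward. These improve both cross-domain performance and explainability.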

链接: https://arxiv.org/abs/2508.06051
作者: Linhan Cao,Wei Sun,Weixia Zhang,Xiangyang Zhu,Jun Jia,Kaiwei Zhang,Dandan Zhu,Guangtao Zhai,Xiongkuo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
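
The three rewards listed in the abstract can be illustrated with simple stand-ins. The sketch below is hypothetical: the abstract does not give the exact formulas, so a Gaussian is assumed for the bell-shaped reward, binary 0/1 values are assumed for the ranking and consistency rewards, and the function names and the `sigma` parameter are invented for the example.

```python
import math


def bell_regression_reward(pred: float, target: float, sigma: float = 0.5) -> float:
    """Assumed Gaussian form: the reward rises quickly as the error shrinks
    and flattens out close to the ground-truth score."""
    return math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))


def pairwise_ranking_reward(pred_a: float, pred_b: float,
                            mos_a: float, mos_b: float) -> float:
    """1.0 when the predicted ordering of two videos matches the ordering of
    their ground-truth quality scores, else 0.0."""
    return 1.0 if (pred_a - pred_b) * (mos_a - mos_b) > 0 else 0.0


def temporal_consistency_reward(pred_original: float, pred_perturbed: float) -> float:
    """1.0 when the temporally coherent video is scored above its perturbed copy."""
    return 1.0 if pred_original > pred_perturbed else 0.0


exact = bell_regression_reward(3.0, 3.0)
near = bell_regression_reward(3.5, 3.0)
far = bell_regression_reward(5.0, 3.0)
rank_ok = pairwise_ranking_reward(4.0, 2.0, 3.8, 2.5)
rank_bad = pairwise_ranking_reward(2.0, 4.0, 3.8, 2.5)
coherent = temporal_consistency_reward(3.9, 3.1)
```

How the rewards are weighted and combined inside GRPO is not stated in the abstract, so the sketch keeps them as separate functions.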

[CV-69] NEP: Autoregressive Image Editing via Next Editing Token Prediction

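【速读】: This paper addresses two core problems of existing text-guided image editing methods: they generate the entire target image instead of only the regions to be edited, which wastes computation, and they are biased toward reconstructing non-editing regions, which compromises edit quality. The key to the solution is Next Editing-token Prediction (NEP), a formulation based on autoregressive image generation in which only the regions that need editing are regenerated, avoiding unintended changes elsewhere. To enable editing of any region, the authors pre-train an any-order autoregressive text-to-image (T2I) model. The model supports zero-shot image editing and test-time scaling (TTS) through iterative refinement of its output.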

链接: https://arxiv.org/abs/2508.06044
作者: Huimin Wu,Xiaojian Ma,Haozhe Zhao,Yanpeng Zhao,Qing Li
机构: BIGAI; University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is: this https URL

点击查看摘要

Abstract:Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: this https URL
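
The core idea of regenerating only the tokens that need editing can be sketched in a few lines. This is a toy illustration under stated assumptions, not the NEP model: images are represented as flat lists of integer token ids, `edit_mask` marks the positions to regenerate, and `predict_token` is a hypothetical stand-in for an any-order autoregressive model conditioned on the instruction.

```python
from typing import Callable, List, Sequence


def regenerate_edit_tokens(
    source_tokens: Sequence[int],
    edit_mask: Sequence[bool],
    predict_token: Callable[[List[int], int], int],
) -> List[int]:
    """Copy the source tokens and re-predict only the masked positions.

    Non-editing positions are never touched, so they cannot drift; each
    edited position is predicted given the current state of all tokens.
    """
    tokens = list(source_tokens)
    for position, needs_edit in enumerate(edit_mask):
        if needs_edit:
            tokens[position] = predict_token(tokens, position)
    return tokens


# Toy predictor that always emits token id 99 (stands in for the real model).
edited = regenerate_edit_tokens(
    source_tokens=[1, 2, 3, 4],
    edit_mask=[False, True, False, True],
    predict_token=lambda tokens, position: 99,
)
```

The cost of the loop scales with the number of masked positions rather than the full image, which is the efficiency argument the abstract makes.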

[CV-70] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

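【速读】: This paper addresses the long context length in vision-language models (VLMs) caused by the large number of vision tokens output by the image encoder, which leads to high computational overhead and inference latency. Existing approaches such as selecting important visual features or introducing learnable queries reduce the token count but often degrade performance or add substantial extra cost. The paper proposes Fourier-VLM, whose core idea is to exploit the observation that the energy of vision features is concentrated in low-frequency components: it compresses visual representations with a two-dimensional Discrete Cosine Transform (DCT), applying a low-pass filter to retain the key low-frequency information. The DCT is computed efficiently via the Fast Fourier Transform (FFT) with time complexity $\mathcal{O}(n\log n)$ and introduces no additional parameters. The method reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, while achieving competitive performance on both LLaVA and Qwen-VL architectures.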

链接: https://arxiv.org/abs/2508.06038
作者: Huanyu Wang,Jushi Kai,Haoli Bai,Lu Hou,Bo Jiang,Ziwei He,Zhouhan Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) typically replace the predefined image placeholder token (image) in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, highlighting the superior efficiency and practicality.
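
To make the frequency-domain compression concrete, the sketch below low-pass filters one channel of a square feature grid with a 2D DCT and maps it back to a smaller grid. It is a minimal illustration, not the Fourier-VLM implementation: it uses an explicit orthonormal DCT-II matrix in pure Python (quadratic cost) instead of the FFT-based operator the abstract mentions, and the function names, the `keep` parameter, and the amplitude rescaling by `keep / n` are assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II matrix C, so C applied to a vector gives its DCT."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def compress_grid(grid: Matrix, keep: int) -> Matrix:
    """Compress an n x n feature grid to keep x keep via a DCT low-pass filter."""
    n = len(grid)
    c_n = dct_matrix(n)
    coeffs = matmul(matmul(c_n, grid), transpose(c_n))  # forward 2D DCT
    low = [row[:keep] for row in coeffs[:keep]]         # keep low frequencies only
    c_k = dct_matrix(keep)
    small = matmul(matmul(transpose(c_k), low), c_k)    # inverse 2D DCT at keep x keep
    gain = keep / n                                     # preserve the mean amplitude
    return [[gain * v for v in row] for row in small]


# A constant 4x4 grid has only a DC component, so it survives compression unchanged.
grid = [[2.0] * 4 for _ in range(4)]
compressed = compress_grid(grid, keep=2)
```

Going from a 4x4 grid to a 2x2 grid cuts the token count by 4x in this toy case; in practice the transform would be applied to every feature channel of the encoder's patch grid with an FFT-backed DCT routine.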

[CV-71] More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment

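【速读】: This paper presents a solution for the semi-supervised learning track (MER-SEMI) of MER2025, where labeled data is scarce, and proposes a robust Mixture of Experts (MoE) emotion recognition framework built on the principle that “more is better.” The key points are: first, a diverse range of input modalities, including knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information, are integrated as independent experts; second, a consensus-based pseudo-labeling strategy generates high-quality pseudo-labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm to exploit unlabeled data; finally, a multi-expert voting ensemble combined with rule-based re-ranking corrects prediction bias and better aligns outputs with human preferences. The method achieves an F1-score of 0.8772 on the MER2025-SEMI challenge test set, ranking 2nd in the track.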

链接: https://arxiv.org/abs/2508.06036
作者: Jun Xie,Yingjian Zhu,Feng Chen,Zhenghao Zhang,Xiaohui Fan,Hongzhu Yi,Xinming Wang,Chen Yu,Yue Bi,Zhaoran Zhao,Xiongjun Guan,Zhepeng Wang
机构: Lenovo Research(联想研究院); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); University of Chinese Academy of Sciences(中国科学院大学); Tsinghua University(清华大学); Beijing Jiaotong University(北京交通大学); Shandong University(山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that “more is better,” to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at this https URL.
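
The consensus-based pseudo-labeling step amounts to keeping an unlabeled sample only when two independent predictors agree on its label. A minimal sketch, assuming predictions are already available as dictionaries keyed by a sample id; the function name, the dictionary layout, and the example labels are hypothetical, and no model or API call is shown.

```python
from typing import Dict


def consensus_pseudo_labels(
    baseline_preds: Dict[str, str],
    second_preds: Dict[str, str],
) -> Dict[str, str]:
    """Keep only samples on which both predictors output the same label.

    Samples missing from either predictor, or with conflicting labels,
    are dropped rather than guessed.
    """
    return {
        sample_id: label
        for sample_id, label in baseline_preds.items()
        if second_preds.get(sample_id) == label
    }


baseline = {"clip1": "happy", "clip2": "sad", "clip3": "angry"}
second = {"clip1": "happy", "clip2": "neutral"}  # e.g. labels from a second model
pseudo = consensus_pseudo_labels(baseline, second)
```

The agreed subset would then feed the two-stage training paradigm described in the abstract; that stage is outside the scope of this sketch.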

[CV-72] InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow ICCV2025

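【速读】: This paper addresses the difficulty that existing text-guided image editing methods have in balancing content preservation against adherence to the text instruction, particularly how to achieve efficient, high-quality edits in few-step settings. The core solution is InstantEdit, a fast editing method based on the RectifiedFlow framework. Its key innovations are: (1) PerRFI, a specialized inversion strategy that leverages the straight sampling trajectories of RectifiedFlow; (2) Inversion Latent Injection, a regeneration method that reuses latent information obtained during inversion for more coherent and detailed results; and (3) Disentangled Prompt Guidance, which balances editability with detail preservation, combined with a Canny-conditioned ControlNet that adds structural cues and suppresses artifacts. On the PIE dataset, InstantEdit is fast and achieves better qualitative and quantitative results than state-of-the-art few-step editing methods.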

链接: https://arxiv.org/abs/2508.06033
作者: Yiming Gong,Zhen Zhu,Minjia Zhang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI. To keep results consistent yet editable for the RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.

[CV-73] Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

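【速读】: This paper addresses the limited fine-grained recognition of clothing and body parts in existing human parsing methods: fixed mask categories with broad labels obscure differences between clothing types, and open-vocabulary segmentation methods struggle to distinguish diverse clothing or detailed body parts. The key to the solution is Spectrum, a unified network whose core innovation is repurposing an Image-to-Texture (I2Tx) diffusion model, obtained by fine-tuning a text-to-image (T2I) diffusion model on 3D human texture maps so that it stays faithful to the input image, to obtain stronger representations of clothing and body parts. Spectrum extracts human-part internal features with this model and uses prompt-guided grounding to generate semantic masks aligned to diverse clothing categories, producing segmentation maps for every visible body part and clothing category for any number of people in a scene.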

链接: https://arxiv.org/abs/2508.06032
作者: Kiran Chhatre,Christopher Peters,Srikrishna Karanam
机构: 1: Google(谷歌); 2: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model – obtained by fine-tuning a T2I model on 3D human texture maps – for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments – separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks – and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.

[CV-74] Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis

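[Quick read]: This paper addresses the limited performance of multi-class models in sub-visible particle analysis based on flow imaging microscopy and deep learning, which is caused by data scarcity and severe class imbalance. The problem is especially hard for particle types that appear unintentionally and in small numbers, such as silicone oil and air bubbles. The key to the solution is a diffusion model that generates high-fidelity synthetic particle images to augment the training dataset and thereby improve the training of multi-class deep neural networks. Experiments show that the generated samples closely resemble real images in visual quality and structure, and that they improve classification performance on a large validation dataset of 500,000 protein particle images with no appreciable downside.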

Link: https://arxiv.org/abs/2508.06021
Authors: Utku Ozbulak,Michaela Cohrs,Hristo L. Svilenov,Joris Vankerschaver,Wesley De Neve
Affiliations: Ghent University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. This issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with no appreciable downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at this https URL.

[CV-75] ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors ICCV2025

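[Quick read]: This paper addresses the artifacts and missing regions that novel view synthesis (NVS) methods based on 3D Gaussian Splatting (3DGS) produce when rendering from viewpoints that deviate from the training trajectory, which limits seamless scene exploration. The key to the solution is an information-gain-driven virtual camera placement strategy that generates additional training views to improve reconstruction, combined with video diffusion priors that refine the rendered results. Fine-tuning the 3D Gaussians on these enhanced views significantly improves the fidelity and robustness of rendering from arbitrary viewpoints.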

Link: https://arxiv.org/abs/2508.06014
Authors: Minsu Kim,Subin Jeon,In Cho,Mijin Yoo,Seon Joo Kim
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 Figures, ICCV 2025

Abstract:Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints. this https URL
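
The information-gain-driven camera placement can be pictured as a greedy coverage problem. The following is a minimal sketch under assumptions that are not in the abstract: the scene is discretized into cells, each candidate virtual camera is described by the set of cell IDs it can see, and information gain is approximated by the number of cells not yet covered. The names `select_cameras`, `candidates`, `already_seen`, and `budget` are hypothetical.

```python
def select_cameras(candidates, already_seen, budget):
    """Greedily pick virtual cameras that add the most unseen scene cells.

    candidates:   dict mapping a camera name to the set of cell ids it sees
                  (hypothetical stand-in for a real visibility test).
    already_seen: cell ids covered by the original training trajectory.
    budget:       maximum number of virtual cameras to add.
    """
    covered = set(already_seen)
    remaining = dict(candidates)
    chosen = []
    for _ in range(budget):
        if not remaining:
            break
        # Approximate information gain by the count of newly covered cells.
        # Iterating over sorted names makes ties resolve deterministically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


cams = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6, 7, 8}, "d": {1, 2}}
picked, seen = select_cameras(cams, already_seen={1}, budget=2)
print(picked, sorted(seen))
```

In a real pipeline the gain would come from rendering statistics rather than set sizes, and the selected views would then be rendered and refined before fine-tuning; this sketch covers only the selection step.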

[CV-76] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

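[Quick read]: This paper addresses the fact that current evaluations of multimodal large language models (MLLMs) on visual mathematical reasoning rely mostly on clean or processed image inputs and do not reflect the image quality found in real educational settings. The key to the solution is MathReal, a dataset of 2,000 math questions photographed with handheld mobile devices by K-12 users. The images are classified into three primary categories (image quality degradation, perspective variation, and irrelevant content interference) with 14 subcategories, and the questions span five core knowledge and ability categories and three difficulty levels. A systematic evaluation of state-of-the-art MLLMs under six experimental settings shows that their reasoning performance drops markedly in realistic educational contexts, which offers insight for improving robustness and accuracy in practical use.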

Link: https://arxiv.org/abs/2508.06009
Authors: Jun Feng,Zixin Wang,Zhentao Zhang,Yue Guo,Zhihan Zhou,Xiuyi Chen,Zhenyang Li,Dawei Yin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 16 figures

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image, containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: this https URL.

[CV-77] KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training

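[Quick read]: This paper addresses the load imbalance that differing input sequence lengths cause in distributed training of Diffusion Transformers (DiT). In mixed-resolution and joint image-video training, token counts differ widely across GPUs, which lowers utilization and limits training speed. The key to the solution is KnapFormer, a framework that combines workload balancing with sequence parallelism: it first gathers sequence-length metadata from all GPUs in a balancing group, then solves a global knapsack problem that minimizes the variance of the total per-GPU workload while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism and a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy, eliminates straggler effects, and reaches a 2x to 3x speedup in real-world training.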

Link: https://arxiv.org/abs/2508.06001
Authors: Kai Zhang,Peng Wang,Sai Bi,Jianming Zhang,Yuanjun Xiong
Affiliations: Adobe
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:We present KnapFormer, an efficient and versatile framework to combine workload balancing and sequence parallelism in distributed training of Diffusion Transformers (DiT). KnapFormer builds on the insight that strong synergy exists between sequence parallelism and the need to address the significant token imbalance across ranks. This imbalance arises from variable-length text inputs and varying visual token counts in mixed-resolution and image-video joint training. KnapFormer redistributes tokens by first gathering sequence length metadata across all ranks in a balancing group and solving a global knapsack problem. The solver aims to minimize the variance of the total workload per GPU, while accounting for the effect of sequence parallelism. By integrating DeepSpeed-Ulysses-based sequence parallelism in the load-balancing decision process and utilizing a simple semi-empirical workload model, KnapFormer achieves minimal communication overhead and less than 1% workload discrepancy in real-world training workloads with sequence length varying from a few hundred to tens of thousands. It eliminates straggler effects and achieves 2x to 3x speedup when training state-of-the-art diffusion models like FLUX on mixed-resolution and image-video joint data corpora. We open-source the KnapFormer implementation at this https URL
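
The balancing step can be illustrated with a small assignment heuristic. This is a rough sketch, not KnapFormer's solver: it assumes a hypothetical cost model (a linear term plus a quadratic attention term, with made-up coefficients), uses the classic greedy longest-processing-time rule in place of the paper's knapsack formulation, and ignores sequence parallelism entirely. The names `workload` and `balance` are illustrative.

```python
import heapq


def workload(seq_len, linear_cost=1.0, attn_cost=1e-3):
    """Hypothetical cost model: linear work plus quadratic attention work."""
    return linear_cost * seq_len + attn_cost * seq_len * seq_len


def balance(seq_lens, num_gpus):
    """Assign sequences to GPUs, heaviest first, always to the least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (current load, gpu index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_gpus)]
    for length in sorted(seq_lens, key=workload, reverse=True):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(length)
        heapq.heappush(heap, (load + workload(length), gpu))
    loads = [sum(workload(n) for n in seqs) for seqs in assignment]
    return assignment, loads


lengths = [300, 8000, 500, 12000, 700, 4000, 900, 16000]
assignment, loads = balance(lengths, num_gpus=4)
for gpu, (seqs, load) in enumerate(zip(assignment, loads)):
    print(gpu, seqs, round(load))
```

With a quadratic cost a single very long sequence still dominates one GPU here, which is the case where splitting that sequence across ranks with sequence parallelism, as the abstract describes, becomes necessary.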

[CV-78] EvoMakeup: High-Fidelity and Controllable Makeup Editing with MakeupQuad

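[Quick read]: This paper addresses the coarse details produced by existing facial makeup editing methods and their difficulty in preserving both identity and makeup fidelity, which stems from the lack of structured paired data in which the source and result share an identity and the reference and result share identical makeup. The key to the solution is MakeupQuad, a large-scale, high-quality dataset of non-makeup faces, reference makeup, edited results, and textual descriptions, together with EvoMakeup, a unified training framework that mitigates image degradation during multi-stage distillation and so allows data quality and model quality to improve iteratively. Although trained only on synthetic data, the method outperforms prior methods on real-world benchmarks and supports high-fidelity, controllable, multi-task makeup editing (full-face and partial reference-based editing as well as text-driven editing) while balancing makeup fidelity and identity preservation.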

Link: https://arxiv.org/abs/2508.05994
Authors: Huadong Wu,Yi Fu,Yunhao Li,Yuan Gao,Kang Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial makeup editing aims to realistically transfer makeup from a reference to a target face. Existing methods often produce low-quality results with coarse makeup details and struggle to preserve both identity and makeup fidelity, mainly due to the lack of structured paired data – where source and result share identity, and reference and result share identical makeup. To address this, we introduce MakeupQuad, a large-scale, high-quality dataset with non-makeup faces, references, edited results, and textual makeup descriptions. Building on this, we propose EvoMakeup, a unified training framework that mitigates image degradation during multi-stage distillation, enabling iterative improvement of both data and model quality. Although trained solely on synthetic data, EvoMakeup generalizes well and outperforms prior methods on real-world benchmarks. It supports high-fidelity, controllable, multi-task makeup editing – including full-face and partial reference-based editing, as well as text-driven makeup editing – within a single model. Experimental results demonstrate that our method achieves superior makeup fidelity and identity preservation, effectively balancing both aspects. Code and dataset will be released upon acceptance.

[CV-79] ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge

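[Quick read]: This paper addresses the performance bottleneck that scarce labeled data creates in multimodal emotion recognition (MER). The solution is a framework that fuses visual, audio, and textual modalities, with four key components: 1) a dual-branch visual encoder that captures both global frame-level features and localized facial representations; 2) large language models (LLMs) that enrich the emotional cues in the input text; 3) a fusion strategy that uses self-attention for dynamic modality weighting and residual connections to preserve the original representations; and 4) a multi-source labeling strategy that refines noisy labels in the training set. On the MER2025-SEMI dataset the approach reaches a weighted F-score of 87.49%, compared with 78.63% for the official baseline.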

Link: https://arxiv.org/abs/2508.05991
Authors: Juewen Hu,Yexin Li,Jiulin Li,Shuo Chen,Pring Wong
Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
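
The fusion strategy (self-attention for modality weighting plus a residual connection) can be sketched in a few lines. This is an illustrative sketch only: it assumes one feature vector per modality, uses identity query/key/value projections instead of learned ones, and the function names are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(modalities):
    """Self-attention across modality vectors, followed by a residual add.

    modalities: list of equal-length feature vectors, e.g. [visual, audio, text].
    Returns one fused vector per modality.
    """
    dim = len(modalities[0])
    scale = math.sqrt(dim)
    fused = []
    for query in modalities:
        # Attention weights say how much this modality draws on each of the others.
        weights = softmax([dot(query, key) / scale for key in modalities])
        attended = [
            sum(w * value[d] for w, value in zip(weights, modalities))
            for d in range(dim)
        ]
        # The residual connection keeps the original representation.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


visual, audio, text = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
for vector in fuse([visual, audio, text]):
    print([round(v, 3) for v in vector])
```

A trained model would apply learned projections before the dot products and typically pool the fused vectors before classification; both steps are omitted here.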

[CV-80] Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision

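[Quick read]: This paper addresses the low computational efficiency of video computer vision systems caused by temporal redundancy between frames, and notes that existing methods neither fully remove this redundancy nor account for front-end computation overhead. The solution has three key parts. First, the image signal processor (ISP) is removed and Bayer-format data is fed directly into the video vision model, which saves front-end computation. Second, a fast block-matching motion estimation algorithm replaces optical flow models and video codecs, and a context-aware block refinement network corrects regions with large error. Third, a frame selection strategy balances accuracy and efficiency. Together these yield significant acceleration with only slight performance loss.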

Link: https://arxiv.org/abs/2508.05990
Authors: Haichao Wang,Xinyue Xi,Jiangtao Wen,Yuxing Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Making video computer vision systems efficient remains challenging because of the high temporal redundancy within a video. Existing works have been proposed for efficient video computer vision. However, they do not fully reduce the temporal redundancy and they neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is fed directly into video computer vision models, thus saving the front-end computation. Second, instead of optical flow models and video codecs, a fast block matching-based motion estimation algorithm is proposed specifically for efficient video computer vision, with an MV refinement module. To correct the error, a context-aware block refinement network is introduced to refine regions with large error. To further balance accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with slight performance loss.
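
Block-matching motion estimation, the basic building block named above, can be shown with a plain exhaustive search. This sketch is the textbook full-search version with a sum-of-absolute-differences cost, not the fast algorithm or the MV refinement module from the paper. It assumes a single-channel frame stored as a list of rows; a Bayer mosaic would need per-channel handling that is left out. The names `sad` and `block_match` are illustrative.

```python
def sad(prev, cur, py, px, cy, cx, block):
    """Sum of absolute differences between a block in prev and a block in cur."""
    return sum(
        abs(prev[py + i][px + j] - cur[cy + i][cx + j])
        for i in range(block)
        for j in range(block)
    )


def block_match(prev, cur, block=4, search=2):
    """Return {(block_row, block_col): (dy, dx)} pointing from cur back into prev."""
    height, width = len(cur), len(cur[0])
    vectors = {}
    for cy in range(0, height - block + 1, block):
        for cx in range(0, width - block + 1, block):
            best = (float("inf"), 0, 0, 0)  # (cost, displacement size, dy, dx)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    py, px = cy + dy, cx + dx
                    if 0 <= py <= height - block and 0 <= px <= width - block:
                        cost = sad(prev, cur, py, px, cy, cx, block)
                        # On equal cost, prefer the smaller displacement.
                        best = min(best, (cost, abs(dy) + abs(dx), dy, dx))
            vectors[(cy // block, cx // block)] = (best[2], best[3])
    return vectors


# A synthetic frame and a copy shifted one pixel to the right.
prev_frame = [[10 * y + x for x in range(8)] for y in range(8)]
cur_frame = [[row[x - 1] if x > 0 else 0 for x in range(8)] for row in prev_frame]
print(block_match(prev_frame, cur_frame))
```

Full search costs on the order of `(2 * search + 1) ** 2` block comparisons per block, which is the expense that fast variants aim to cut.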

[CV-81] ETA: Energy-based Test-time Adaptation for Depth Completion

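[Quick read]: This paper addresses the erroneous predictions that pretrained depth completion models produce under covariate shift when they are transferred to new environmental conditions. The key to the solution is Energy-based Test-time Adaptation (ETA): adversarial perturbations are used to explore the data space, which makes it possible to train an energy model that scores local regions of a depth prediction as in-distribution or out-of-distribution with respect to the source data. Without making assumptions about the target distribution, the method updates the parameters of the pretrained model at test time to minimize energy, aligning test-time predictions with the source distribution and improving accuracy across indoor and outdoor datasets.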

Link: https://arxiv.org/abs/2508.05989
Authors: Younjoon Chung,Hyoungseob Park,Patrick Rim,Xiaoran Zhang,Jihe He,Ziyao Zeng,Safa Cicek,Byung-Woo Hong,James S. Duncan,Alex Wong
Affiliations: Yale University; UCLA; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some "source" data, often predict erroneous outputs when transferred to "target" data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method "Energy-based Test-time Adaptation", or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: this https URL.

[CV-82] AnimateScene: Camera-controllable Animation in Any Scene

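[Quick read]: This paper addresses three core problems in combining 3D scene reconstruction with 4D human animation: (1) the human is hard to place at the correct location and scale in the scene, which leads to unrealistic interpenetration; (2) differences in lighting and style between the human and the background cause visual inconsistency; and (3) viewpoints are hard to reconstruct under camera motion, so rendering along a natural camera trajectory is not possible. The key to the solution is AnimateScene, a unified framework with three parts. An accurate placement module automatically determines a plausible 3D position for the human and avoids geometric interpenetration with the scene. A training-free style alignment method adapts the 4D human representation to the lighting and visual style of the background. A joint post-reconstruction method for the human and the scene allows specified camera trajectories to be inserted, which yields high-quality, spatiotemporally coherent dynamic scene videos.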

Link: https://arxiv.org/abs/2508.05982
Authors: Qingyang Liu,Bingjie Gao,Weiheng Huang,Jun Zhang,Zhongqian Sun,Yang Wei,Zelin Peng,Qianli Ma,Shuai Yang,Zhaohe Liao,Haonan Zhao,Li Niu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. However, seamlessly integrating reconstructed scenes with 4D human animation to produce visually engaging results remains challenging. One key difficulty lies in placing the human at the correct location and scale within the scene while avoiding unrealistic interpenetration. Another challenge is that the human and the background may exhibit different lighting and style, leading to unrealistic composites. In addition, appealing character motion videos are often accompanied by camera movements, which means that the viewpoints need to be reconstructed along a specified trajectory. We present AnimateScene, which addresses the above issues in a unified framework. First, we design an accurate placement module that automatically determines a plausible 3D position for the human and prevents any interpenetration within the scene during motion. Second, we propose a training-free style alignment method that adapts the 4D human representation to match the background’s lighting and style, achieving coherent visual integration. Finally, we design a joint post-reconstruction method for both the 4D human and the 3D scene that allows camera trajectories to be inserted, enabling the final rendered video to feature visually appealing camera movements. Extensive experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations.

[CV-83] PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation ICCV2025

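[Quick read]: This paper addresses the fragmentation between high-level task semantics and low-level geometric features in robotic manipulation, in particular the limitation that vision-language models (VLMs) lack semantic grounding in canonical spaces and rely on manual annotations, which makes it hard for them to capture dynamic semantic-affordance relationships. The key to the solution is Primitive-Aware Semantic Grounding (PASG), a closed-loop framework with three parts: (1) automatic geometric primitive extraction through geometric feature aggregation, which supports cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant descriptions; and (3) a spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). Together these give a finer-grained semantic-affordance understanding of objects and a unified way to connect geometric primitives with task semantics.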

Link: https://arxiv.org/abs/2508.05976
Authors: Zhihao Zhu,Yifan Zheng,Siyu Pan,Yaohui Jin,Yao Mu
Affiliations: MoE key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICCV 2025. 8 pages main paper, 8 figures, plus supplementary material

Abstract:The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG’s effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.

[CV-84] A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image

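[Quick read]: This paper addresses the lack of spatial information in normal estimation from a single image, and the multi-view normal direction conflicts that arise because existing diffusion-based methods rely on data-driven statistical priors and do not model light-surface interaction explicitly. In addition, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering modules, so 3D geometric errors cannot be backpropagated to the normal generation network and existing methods are forced to depend on dense normal annotations. The key to the solution is SINGAD (Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion). It combines physics-driven light-surface interaction modeling with a reprojection strategy based on differentiable rendering, so that 3D geometric errors are converted directly into normal optimization signals. Specifically, a light-interaction-driven 3D Gaussian Splatting (3DGS) reparameterization model generates multi-scale geometric features consistent with light transport principles and ensures multi-view normal consistency; a cross-domain feature fusion module embeds geometric priors to constrain normal generation while preserving geometric error propagation; and a differentiable reprojection loss enables self-supervised optimization by minimizing the geometric error between the reconstructed and input images, which removes the dependence on annotated normal datasets.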

Link: https://arxiv.org/abs/2508.05950
Authors: Yanxing Liang,Yinghui Wang,Jinlong Yang,Wei Li
Affiliations: Jiangnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Although recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input image, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

[CV-85] Enhancing Construction Site Analysis and Understanding with 3D Segmentation

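[Quick read]: This paper addresses the low efficiency and poor scalability of construction progress monitoring, which arise because traditional data acquisition methods perform poorly in the complex, cluttered, and dynamically changing conditions of construction sites. The key to the solution is an evaluation of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, for adaptability and performance in real construction settings, both indoor and outdoor. The evaluation exposes the limitations that the absence of outdoor benchmarks imposes on current segmentation approaches, and it underlines the need for segmentation workflows tailored to construction scenes that can extract actionable insights from site data and move monitoring toward automation and higher precision.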

Link: https://arxiv.org/abs/2508.05922
Authors: Sri Ramana Saketh Vasanthawada,Pengkun Liu,Pingbo Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Monitoring construction progress is crucial yet resource-intensive, prompting the exploration of computer-vision-based methodologies for enhanced efficiency and scalability. Traditional data acquisition methods, primarily focusing on indoor environments, falter in the complex, cluttered, and dynamically changing conditions of construction sites. This paper critically evaluates the application of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging outdoor and indoor conditions. Both models were initially trained on indoor datasets, and their adaptability and performance are assessed in real-world construction settings, highlighting the gap in current segmentation approaches due to the absence of benchmarks for outdoor scenarios. Through a comparative analysis, this study not only showcases the relative effectiveness of SAM and Mask3D but also addresses the critical need for tailored segmentation workflows capable of extracting actionable insights from construction site data, thereby advancing the field towards more automated and precise monitoring techniques.

[CV-86] Neural Field Representations of Mobile Computational Photography

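[Quick read]: This thesis addresses the reconstruction of geometry and lighting for complex scenes in mobile imaging, and in particular how to achieve accurate depth estimation, layer separation, and image stitching from in-the-wild phone captures without explicit data representations (such as pixel arrays or point clouds) and without conventional supervised learning priors. The key to the solution is carefully designed neural field models: small neural networks that map continuous spatial coordinates to output signals and are fitted directly to raw smartphone measurements by stochastic gradient descent. These self-regularized models tackle challenging inverse problems without complex pre-processing steps or labeled ground truth data.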

Link: https://arxiv.org/abs/2508.05907
Authors: Ilya Chugunov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD thesis

Abstract:Over the past two decades, mobile imaging has experienced a profound transformation, with cell phones rapidly eclipsing all other forms of digital photography in popularity. Today's cell phones are equipped with a diverse range of imaging technologies - laser depth ranging, multi-focal camera arrays, and split-pixel sensors - alongside non-visual sensors such as gyroscopes, accelerometers, and magnetometers. This, combined with on-board integrated chips for image and signal processing, makes the cell phone a versatile pocket-sized computational imaging platform. Parallel to this, we have seen in recent years how neural fields - small neural networks trained to map continuous spatial input coordinates to output signals - enable the reconstruction of complex scenes without explicit data representations such as pixel arrays or point clouds. In this thesis, I demonstrate how carefully designed neural field models can compactly represent complex geometry and lighting effects, enabling applications such as depth estimation, layer separation, and image stitching directly from collected in-the-wild mobile photography data. These methods outperform state-of-the-art approaches without relying on complex pre-processing steps, labeled ground truth data, or machine learning priors. Instead, they leverage well-constructed, self-regularized models that tackle challenging inverse problems through stochastic gradient descent, fitting directly to raw measurements from a smartphone.

[CV-87] Robust Image Stitching with Optimal Plane

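[Quick read]: This paper addresses the tension between robustness and content naturalness in image stitching: in complex and varied real-world scenes, traditional methods struggle to deliver both geometric accuracy and visual consistency. The key to the solution is RopStitch, an unsupervised deep image stitching framework with two main ideas. (1) A dual-branch architecture uses a pretrained branch to capture semantically invariant features and a learnable branch to extract fine-grained discriminative features, then merges them with a controllable factor at the correlation level, which improves generalization to unseen scenes. (2) The concept of virtual optimal planes models stitching as the estimation of homography decomposition coefficients; an iterative coefficient predictor and a minimal semantic distortion constraint identify the optimal plane, and both views are warped onto it bidirectionally to ease the conflict between content alignment and structural preservation. The method outperforms existing methods on several datasets, particularly in scene robustness and content naturalness.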

Link: https://arxiv.org/abs/2508.05903
Authors: Lang Nie,Yuan Mei,Kang Liao,Yunqiu Xu,Chunyu Lin,Bin Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: * Equal contribution

Abstract:We present RopStitch, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of RopStitch, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into RopStitch by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at this https URL.

[CV-88] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

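[Quick read]: This paper addresses the heavy reliance of current 3D scene creation on manual design and the difficulty that automated methods have with open-domain scene generation and flexible editing. The core of the solution is HOLODECK 2.0, a framework that uses vision-language models (VLMs) to parse a text description and identify the objects a scene requires, generates high-quality assets with state-of-the-art 3D generative models, and iteratively applies spatial constraints derived from the VLMs to reach semantically coherent and physically plausible layouts. The system also supports interactive editing based on human feedback, including layout refinement and style-consistent object edits.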

Link: https://arxiv.org/abs/2508.05899
Authors: Zixuan Bian,Ruohan Ren,Yue Yang,Chris Callison-Burch
Affiliations: University of Pennsylvania; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.

[CV-89] ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates BMVC2025

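[Quick read]: This paper addresses the weak generalization of pretrained vision-language models (VLMs) under distribution shifts, focusing on test-time adaptation (TTA), where existing cache-based methods store only high-confidence samples, which restricts the decision boundary and leaves most test data unused. The key to the solution is Efficient Test-Time Adaptation (ETTA), which has three parts: 1) a Recursive Updating module that integrates every incoming test sample and progressively refines the contextual embeddings, approximating an unbounded cache with minimal memory and computational overhead; 2) an Adaptive Ensemble module that dynamically selects the best prompts for each class, reducing the dependence of image-to-text scores on any particular prompt; and 3) a confidence-based adaptive combination of the scores from both modules that exploits their complementary strengths. Experiments show that ETTA surpasses state-of-the-art TTA methods in both accuracy and computational complexity.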

Link: https://arxiv.org/abs/2508.05898
Authors: Hamidreza Dastmalchi,Aijun An,Ali cheraghian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC2025

Abstract:Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses the state-of-the-art TTA models in computational complexity and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at this https URL.
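
One way to picture "mimicking an unbounded cache" is a per-class running mean of image embeddings: averaging every sample seen so far needs only the current mean and a count. The sketch below illustrates that idea only and is not the authors' Recursive Updating module. The class name `RunningClassEmbeddings`, the argmax pseudo-labelling, and the choice to count the text embedding as one sample are all assumptions made for illustration.

```python
import numpy as np

class RunningClassEmbeddings:
    """Per-class running mean of unit-normalised embeddings (hypothetical helper)."""

    def __init__(self, text_embeddings):
        # Start from the text (prompt) embeddings, one row per class.
        self.means = np.asarray(text_embeddings, dtype=float).copy()
        # Assumption: the initial text embedding counts as one sample.
        self.counts = np.ones(len(self.means))

    def update(self, image_embedding):
        x = np.asarray(image_embedding, dtype=float)
        x = x / np.linalg.norm(x)
        # Pseudo-label: the class whose current mean is most similar to x.
        k = int(np.argmax(self.means @ x))
        # Recursive mean update: same result as averaging all samples seen so far.
        self.counts[k] += 1
        self.means[k] += (x - self.means[k]) / self.counts[k]
        return k

bank = RunningClassEmbeddings([[1.0, 0.0], [0.0, 1.0]])
first = bank.update([1.0, 0.0])
second = bank.update([0.6, 0.8])
```

Memory stays constant in the number of test samples because only one vector and one counter are kept per class.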

[CV-90] Multi-view Gaze Target Estimation ICCV2025

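[Quick read]: This paper addresses the limits of single-view gaze target estimation (GTE) under face occlusion, target ambiguity, and out-of-view targets. It proposes a framework that takes a pair of camera views as input, with three components: 1) a Head Information Aggregation (HIA) module that combines head information from both views for more accurate gaze estimation; 2) an Uncertainty-based Gaze Selection (UGS) mechanism that picks the most reliable gaze output; 3) an Epipolar-based Scene Attention (ESA) module that shares background information across views. The method significantly outperforms single-view baselines and can estimate the gaze target in the first view using only the image of the person in the second view, which single-view GTE methods cannot do.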

Link: https://arxiv.org/abs/2508.05857
Authors: Qiaomu Miao, Vivek Raju Golani, Jingyi Xu, Progga Paromita Dutta, Minh Hoai, Dimitris Samaras
Affiliations: Stony Brook University; The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025

Abstract:This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person’s face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at this https URL
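
The epipolar geometry that gives the ESA module its name can be illustrated with the standard two-view constraint: a pixel in one view maps to a line in the other view through the fundamental matrix. The sketch below shows only that textbook relation, not the paper's attention module. The toy matrix `F` (a pure sideways camera shift with identity intrinsics) and both helper names are assumptions.

```python
import numpy as np

def epipolar_line(F, point_xy):
    """Return (a, b, c) with a*x + b*y + c = 0: the line in view 2 for a pixel in view 1."""
    x = np.array([point_xy[0], point_xy[1], 1.0])
    return F @ x

def point_line_distance(line, point_xy):
    """Perpendicular distance from a 2-D point to the line (a, b, c)."""
    a, b, c = line
    return abs(a * point_xy[0] + b * point_xy[1] + c) / np.hypot(a, b)

# Toy fundamental matrix (assumption): cameras differ by a shift along the x-axis.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])

line = epipolar_line(F, (3.0, 2.0))
on_line = point_line_distance(line, (10.0, 2.0))
off_line = point_line_distance(line, (10.0, 5.0))
```

In this toy setup the epipolar line is the horizontal row y = 2, so a candidate match in the second view only needs to be searched along that row.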

[CV-91] VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

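[Quick read]: This paper addresses driver visual attention prediction in autonomous driving and human-computer interaction (HCI) research, in particular how to capture and explain changes in attention from single RGB images. Most prior work estimates attention allocation at a single moment from static images and does not describe how attention shifts. The proposed vision-language framework expresses drivers' gaze behavior in natural language using few-shot and zero-shot learning. The authors refine high-quality captions from the BDD-A dataset with human-in-the-loop feedback and fine-tune LLaVA so that it combines low-level visual cues with top-down context (e.g., route semantics, risk anticipation). Experiments show that the fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability, providing a foundation for downstream tasks such as behavior forecasting and human-AI teaming.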

Link: https://arxiv.org/abs/2508.05852
Authors: Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang
Affiliations: Texas Tech University; The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers’ gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few shot, and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.

[CV-92] Temporal Cluster Assignment for Efficient Real-Time Video Segmentation

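[Quick read]: This paper addresses the high computational cost that Swin Transformer backbones incur in video segmentation despite their window attention, which hinders real-time inference in resource-constrained settings. Conventional token pruning is hard to apply because Swin's window attention requires a fixed number of tokens per window, and existing training-free token clustering methods do not exploit temporal redundancy between frames. The proposed Temporal Cluster Assignment (TCA) is a lightweight, fine-tuning-free strategy that refines token clusters using temporal coherence across frames, retaining fine-grained detail while significantly reducing computation. TCA improves the accuracy-speed trade-off of existing clustering-based methods and generalizes across natural and domain-specific video datasets.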

Link: https://arxiv.org/abs/2508.05851
Authors: Ka-Wai Yung, Felix J. S. Bragman, Jialang Xu, Imanol Luengo, Danail Stoyanov, Evangelos B. Mazomenos
Affiliations: Medtronic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight and effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes competently across both natural and domain-specific videos.

[CV-93] Integrating Vision Foundation Models with Reinforcement Learning for Enhanced Object Interaction

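[Quick read]: This paper addresses the weak object-interaction ability of agents in complex simulated indoor scenes, where the lack of advanced perception leads to low interaction success and inefficient navigation. The approach combines vision foundation models with reinforcement learning: the Segment Anything Model (SAM) and YOLOv5 provide object segmentation and detection, and their outputs serve as perceptual input to a Proximal Policy Optimization (PPO) agent in the AI2-THOR simulator. Compared with a baseline agent without advanced perception, the method reports a 68% increase in average cumulative reward, a 52.5% improvement in object interaction success rate, and a 33% increase in navigation efficiency.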

Link: https://arxiv.org/abs/2508.05838
Authors: Ahmad Farooq, Kamran Iqbal
Affiliations: University of Arkansas at Little Rock
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Published in the Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering (RCVE’25). 6 pages, 3 figures, 1 table

Abstract:This paper presents a novel approach that integrates vision foundation models with reinforcement learning to enhance object interaction capabilities in simulated environments. By combining the Segment Anything Model (SAM) and YOLOv5 with a Proximal Policy Optimization (PPO) agent operating in the AI2-THOR simulation environment, we enable the agent to perceive and interact with objects more effectively. Our comprehensive experiments, conducted across four diverse indoor kitchen settings, demonstrate significant improvements in object interaction success rates and navigation efficiency compared to a baseline agent without advanced perception. The results show a 68% increase in average cumulative reward, a 52.5% improvement in object interaction success rate, and a 33% increase in navigation efficiency. These findings highlight the potential of integrating foundation models with reinforcement learning for complex robotic tasks, paving the way for more sophisticated and capable autonomous agents.

[CV-94] TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios

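[Quick read]: This paper addresses two challenges in applying promptable video object segmentation and tracking (VOST) foundation models such as SAM2 to surgical video: complex motion dynamics that degrade segmentation, and memory redundancy in SAM2 that impedes effective learning. The proposed TSMS-SAM2 framework introduces two strategies: multi-temporal-scale video sampling augmentation, which improves robustness to motion variability, and a memory splitting and pruning mechanism, which organizes and filters past-frame features for more efficient and accurate segmentation. On the EndoVis2017 and EndoVis2018 datasets, it achieves the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods.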

Link: https://arxiv.org/abs/2508.05829
Authors: Guoping Xu, Hua-Chieh Shao, You Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 5 figures

Abstract:Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework’s potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at this https URL.
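
For readers unfamiliar with the metric quoted above, the Dice score compares a predicted mask with a reference mask as twice their overlap divided by their total size. The sketch below is the standard definition on binary masks, not the paper's evaluation code. The function name and the `eps` guard are illustrative choices, and the assumption that the reported 95.24 and 86.73 are this value on a 0-100 scale is ours.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks, in the range [0, 1]."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty (the result is then 0).
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)

half = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
perfect = dice_score([1, 1, 0, 0], [1, 1, 0, 0])
```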

[CV-95] MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses

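[Quick read]: This paper addresses the difficulty Neural Radiance Fields (NeRF) have in capturing the fine structures that matter in industrial inspection, such as sub-micron defects. With fixed sensor resolution and tight compute budgets, the only way to expose fine structure is to add zoom-in images, which breaks the multi-view consistency that pose-free NeRF training relies on. The proposed Multi-Zoom Enhanced NeRF (MZEN) (i) augments the pinhole camera model with an explicit, learnable zoom scalar that scales the focal length, and (ii) uses a new pose strategy: wide-field images are solved first to establish a global metric frame, then each zoom-in image is pose-primed to its nearest wide-field counterpart through a zoom-consistent crop-and-match procedure before joint refinement. Across eight forward-facing scenes (synthetic TCAD models, real scanning electron microscopy (SEM) images, and BLEFF objects), MZEN outperforms pose-free baselines, with reported gains of up to 28% in PSNR and 10% in SSIM and a reported LPIPS reduction of up to 222%, preserving global accuracy while capturing micron-level detail.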

Link: https://arxiv.org/abs/2508.05819
Authors: Jong-Ik Park, Carlee Joe-Wong, Gary K. Fedder
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Neural Radiance Fields (NeRF) methods excel at 3D reconstruction from multiple 2D images, even those taken with unknown camera poses. However, they still miss the fine-detailed structures that matter in industrial inspection, e.g., detecting sub-micron defects on a production line or analyzing chips with Scanning Electron Microscopy (SEM). In these scenarios, the sensor resolution is fixed and compute budgets are tight, so the only way to expose fine structure is to add zoom-in images; yet, this breaks the multi-view consistency that pose-free NeRF training relies on. We propose Multi-Zoom Enhanced NeRF (MZEN), the first NeRF framework that natively handles multi-zoom image sets. MZEN (i) augments the pin-hole camera model with an explicit, learnable zoom scalar that scales the focal length, and (ii) introduces a novel pose strategy: wide-field images are solved first to establish a global metric frame, and zoom-in images are then pose-primed to the nearest wide-field counterpart via a zoom-consistent crop-and-match procedure before joint refinement. Across eight forward-facing scenes (synthetic TCAD models, real SEM of micro-structures, and BLEFF objects), MZEN consistently outperforms pose-free baselines and even high-resolution variants, boosting PSNR by up to 28%, SSIM by 10%, and reducing LPIPS by up to 222%. MZEN, therefore, extends NeRF to real-world factory settings, preserving global accuracy while capturing the micron-level details essential for industrial inspection.
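
The effect of a zoom scalar on a pinhole camera can be sketched as a projection whose focal length is multiplied by that scalar. This is a minimal illustration under assumptions, not MZEN's implementation: the function name `project`, a single shared focal length, and the principal point `(cx, cy)` are hypothetical, and in the paper the zoom scalar is learned rather than passed in.

```python
import numpy as np

def project(points_cam, focal, zoom, cx, cy):
    """Project 3-D camera-frame points with a pinhole model whose focal length is scaled by `zoom`."""
    p = np.asarray(points_cam, dtype=float)
    f = focal * zoom  # the zoom scalar multiplies the focal length
    u = f * p[:, 0] / p[:, 2] + cx
    v = f * p[:, 1] / p[:, 2] + cy
    return np.stack([u, v], axis=1)

point = [[1.0, 2.0, 4.0]]
wide = project(point, focal=100.0, zoom=1.0, cx=50.0, cy=50.0)
zoomed = project(point, focal=100.0, zoom=2.0, cx=50.0, cy=50.0)
```

Doubling the zoom doubles each pixel's offset from the principal point, which is why a zoom-in image looks like a magnified crop of the wide-field view.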

[CV-96] Optimization-Free Style Transfer for 3D Gaussian Splats

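[Quick read]: This paper addresses the cost of style transfer for 3D Gaussian splats, where prior methods must reconstruct or fine-tune the splat, or optimize a feature extraction network on the splat representation. The proposed reconstruction- and optimization-free approach generates a graph structure across the implicit surface of the splat representation, applies a feed-forward, surface-based stylization method on that graph, and interpolates the result back to the individual splats. Any style image can be applied to any 3D Gaussian splat without additional training or optimization, with stylization taking under 2 minutes even on consumer-grade hardware.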

Link: https://arxiv.org/abs/2508.05813
Authors: Raphael Du Sablon, David Hart
Affiliations: East Carolina University
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at this https URL.

[CV-97] Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks

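[Quick read]: This paper addresses the few-shot deployment of transformer models in medical imaging, where annotated data is scarce. The authors pretrain a Masked Autoencoder (MAE) on a large-scale, multi-cohort brain MRI dataset of over 31 million slices to obtain highly transferable latent representations, then use task-specific strategies. For high-level tasks such as classification, a frozen MAE encoder with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification. For low-level tasks such as segmentation, the proposed MAE-FUnet fuses multiscale CNN features with pretrained MAE embeddings and outperforms strong baselines in skull stripping and multi-class anatomical segmentation under data-limited conditions.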

Link: https://arxiv.org/abs/2508.05783
Authors: Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 8 figures, 7 tables

Abstract:Machine learning using transformers has shown great potential in medical imaging, but its real-world applicability remains limited due to the scarcity of annotated data. In this study, we propose a practical framework for the few-shot deployment of pretrained MRI transformers in diverse brain imaging tasks. By utilizing the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset comprising over 31 million slices, we obtain highly transferable latent representations that generalize well across tasks and datasets. For high-level tasks such as classification, a frozen MAE encoder combined with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, we propose MAE-FUnet, a hybrid architecture that fuses multiscale CNN features with pretrained MAE embeddings. This model consistently outperforms other strong baselines in both skull stripping and multi-class anatomical segmentation under data-limited conditions. With extensive quantitative and qualitative evaluations, our framework demonstrates efficiency, stability, and scalability, suggesting its suitability for low-resource clinical environments and broader neuroimaging applications.

[CV-98] MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss

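[Quick read]: This paper addresses three problems in current 3D medical image synthesis: (1) limited generalizability, with methods that work only for specific body regions or voxel spacings; (2) slow inference, a common issue for diffusion models; (3) weak alignment with input conditions, which is critical in medical imaging. The proposed MAISI-v2, described as the first accelerated 3D medical image synthesis framework to integrate rectified flow, speeds up generation (a reported 33× acceleration over the latent diffusion model) and introduces a region-specific contrastive loss that increases sensitivity to the region of interest, improving condition fidelity while maintaining image quality.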

Link: https://arxiv.org/abs/2508.05772
Authors: Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu
Affiliations: 1. University of California, San Francisco; 2. National Cancer Institute; 3. NIH; 4. Mayo Clinic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability, working only for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast, high-quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss that increases sensitivity to the region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with 33× acceleration for the latent diffusion model. We also conducted a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.
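
A rough intuition for why rectified flow speeds up sampling: generation integrates an ODE dx/dt = v(x, t) from noise to data, and the straighter the learned paths, the fewer integration steps are needed. The toy sketch below uses a hand-written constant velocity in place of a trained network, so it shows the sampling loop only. `euler_sample` and `straight` are hypothetical names, and nothing here reproduces MAISI-v2.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Toy case (assumption): a perfectly straight path from noise x0 to data x1
# has the constant velocity x1 - x0 everywhere.
x0 = np.array([0.0, 0.0])
x1 = np.array([2.0, -1.0])
straight = lambda x, t: x1 - x0

one_step = euler_sample(straight, x0, num_steps=1)
four_steps = euler_sample(straight, x0, num_steps=4)
```

With a perfectly straight path a single Euler step already lands on the target. A trained model only approximates this, so in practice a small number of steps is used.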

[CV-99] Improving Masked Style Transfer using Blended Partial Convolution

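[Quick read]: This paper addresses style transfer restricted to a specific image region. The standard practice of stylizing the whole image and masking afterwards tends to capture the style features poorly in the region of interest. The proposed partial-convolution-based style transfer network applies the style features exclusively to the user-specified region, and network-internal blending techniques account for imperfections in the region selection, improving stylization both visually and quantitatively.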

Link: https://arxiv.org/abs/2508.05769
Authors: Seyed Hadi Seyed, Ayberk Cansever, David Hart
Affiliations: East Carolina University
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply the artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at this https URL.
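
To make "partial convolution" concrete, the sketch below implements the commonly used formulation from the image-inpainting literature in one dimension: only unmasked inputs contribute, the response is rescaled by the fraction of valid inputs, and the mask is propagated. It is a simplified stand-in, not the paper's network and not its blending technique. The function name and the 1-D setting are assumptions.

```python
import numpy as np

def partial_conv1d(x, mask, weights, bias=0.0):
    """1-D partial convolution: convolve only over positions where mask == 1."""
    x = np.asarray(x, dtype=float)
    mask = np.asarray(mask, dtype=float)
    w = np.asarray(weights, dtype=float)
    k = len(w)
    out = np.zeros(len(x) - k + 1)
    new_mask = np.zeros_like(out)
    for i in range(len(out)):
        m = mask[i:i + k]
        valid = m.sum()
        if valid > 0:
            # Use only unmasked inputs, rescaled by window size / number of valid inputs.
            out[i] = (w * x[i:i + k] * m).sum() * (k / valid) + bias
            new_mask[i] = 1.0  # a window with any valid input becomes valid
    return out, new_mask

out, new_mask = partial_conv1d([1, 2, 3, 4], [1, 1, 0, 0], [1, 1])
```

Values outside the region never leak into the output, which is the property that makes this operator attractive for region-restricted stylization.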

[CV-100] UnGuide: Learning to Forget with LoRA-Guided Diffusion Models

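[Quick read]: This paper addresses the potential misuse of large-scale text-to-image diffusion models to generate harmful or misleading content. The core challenge is effective machine unlearning: removing specific knowledge or concepts without harming overall performance. Low-Rank Adaptation (LoRA) can fine-tune a model efficiently for targeted unlearning, but it often degrades unrelated content, reducing image fidelity and realism. The proposed UnGuide adds UnGuidance, a dynamic inference mechanism built on Classifier-Free Guidance (CFG) that modulates the guidance scale according to the stability of the first few denoising steps. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation and preserves content fidelity. Experiments show that UnGuide outperforms existing LoRA-based methods in object erasure and explicit content removal.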

Link: https://arxiv.org/abs/2508.05755
Authors: Agnieszka Polowczyk, Alicja Polowczyk, Dawid Malarz, Artur Kasymov, Marcin Mazur, Jacek Tabor, Przemysław Spurek
Affiliations: 1. Institute of Computer Science, Polish Academy of Sciences; 2. Faculty of Mathematics and Computer Science, Jagiellonian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide – a novel approach which incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of a few first steps of denoising processes, enabling selective unlearning by LoRA adapter. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
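
The guidance scale that UnGuide modulates comes from standard Classifier-Free Guidance, which extrapolates from an unconditional noise prediction toward a conditional one. The sketch below shows only that standard combination rule. How UnGuide derives the scale from early denoising stability, and how the LoRA and base models are paired, is not reproduced here, and `cfg_combine` is a hypothetical name.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

uncond = [0.0, 0.0]
cond = [1.0, 2.0]
scale_zero = cfg_combine(uncond, cond, 0.0)   # ignores the condition
scale_one = cfg_combine(uncond, cond, 1.0)    # plain conditional prediction
scale_three = cfg_combine(uncond, cond, 3.0)  # amplified conditioning
```

Because the scale is a single scalar applied at inference time, it can be changed per prompt without retraining, which is what makes a dynamic rule possible.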

[CV-101] Generalized Few-Shot Out-of-Distribution Detection

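[Quick read]: This paper addresses the insufficient open-world generalization of few-shot out-of-distribution (OOD) detection. With only a few training samples, existing methods overfit to the training data, so performance degrades on generalized data and varies across scenarios. The proposed Generalized Few-shot OOD Detection (GOOD) framework uses an auxiliary General Knowledge Model (GKM) to supply general knowledge to the OOD detection model instead of learning directly from few-shot data. The authors derive a Generality-Specificity balance (GS-balance) for OOD detection and show theoretically that it lowers the upper bound of the generalization error. A Knowledge Dynamic Embedding (KDE) mechanism then adaptively modulates the guidance of general knowledge based on the Generalized Belief (G-Belief) of the GKM, improving the GS-balance.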

Link: https://arxiv.org/abs/2508.05732
Authors: Pinxuan Li, Bing Cao, Changqing Zhang, Qinghua Hu
Affiliations: Tianjin University; Tianjin Key Lab of Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Few-shot Out-of-Distribution (OOD) detection has emerged as a critical research direction in machine learning for practical deployment. Most existing Few-shot OOD detection methods suffer from insufficient generalization capability for the open world. Due to the few-shot learning paradigm, the OOD detection ability is often overfit to the limited training data itself, thus degrading the performance on generalized data and performing inconsistently across different scenarios. To address this challenge, we proposed a Generalized Few-shot OOD Detection (GOOD) framework, which empowers the general knowledge of the OOD detection model with an auxiliary General Knowledge Model (GKM), instead of directly learning from few-shot data. We proceed to reveal the few-shot OOD detection from a generalization perspective and theoretically derive the Generality-Specificity balance (GS-balance) for OOD detection, which provably reduces the upper bound of generalization error with a general knowledge model. Accordingly, we propose a Knowledge Dynamic Embedding (KDE) mechanism to adaptively modulate the guidance of general knowledge. KDE dynamically aligns the output distributions of the OOD detection model to the general knowledge model based on the Generalized Belief (G-Belief) of GKM, thereby boosting the GS-balance. Experiments on real-world OOD benchmarks demonstrate our superiority. Codes will be available.

[CV-102] Boosting Adversarial Transferability via Residual Perturbation Attack ICCV2025

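[Quick read]: This paper addresses the limited transferability of adversarial examples in transfer-based attacks on deep neural networks under black-box scenarios. Prior work found that flat loss landscapes improve transferability but overlooked the influence of perturbation directions. The proposed Residual Perturbation Attack (ResPA) uses the residual gradient as the perturbation direction: an exponential moving average of the input gradients yields the first moment as a reference gradient, and the residual between the current gradient and this reference captures changes in the global perturbation direction, guiding adversarial examples toward flat regions of the loss function. Experiments show better transferability than typical existing transfer-based attacks, with further gains when ResPA is combined with input transformation methods.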

Link: https://arxiv.org/abs/2508.05689
Authors: Jinjia Peng, Zeze Tao, Huibing Wang, Meng Wang, Yang Wang
Affiliations: Hebei University; Dalian Maritime University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV 2025)

Abstract:Deep neural networks are susceptible to adversarial examples while suffering from incorrect predictions via imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability to alleviate overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at this https URL.
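
The two quantities named in the abstract, a reference gradient from an exponential moving average and the residual between the current gradient and that reference, can be written down in a few lines. The sketch below is a guess at the bookkeeping, not the authors' algorithm. The order of operations, the value `beta=0.9`, the sign step, and the function name are all assumptions, and the projection onto a perturbation budget is omitted.

```python
import numpy as np

def residual_direction_step(x_adv, grad, ref_grad, step_size, beta=0.9):
    """One illustrative update using the residual between the gradient and its moving average."""
    x_adv = np.asarray(x_adv, dtype=float)
    grad = np.asarray(grad, dtype=float)
    # Exponential moving average of input gradients: the "reference gradient".
    ref_grad = beta * np.asarray(ref_grad, dtype=float) + (1.0 - beta) * grad
    # Residual between the current gradient and the reference (assumed ordering).
    residual = grad - ref_grad
    x_next = x_adv + step_size * np.sign(residual)
    return x_next, ref_grad

x_next, ref = residual_direction_step(
    x_adv=[0.0, 0.0], grad=[1.0, -1.0], ref_grad=[0.0, 0.0], step_size=0.1
)
```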

[CV-103] Universally Unfiltered and Unseen: Input-Agnostic Multimodal Jailbreaks against Text-to-Image Model Safeguards ACM-MM2025

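[Quick read]: This paper studies the fragility of safeguards in text-to-image (T2I) models. Existing multimodal jailbreaks against prompt filters and image safety checkers rely on prompt-specific and image-specific perturbations, so they scale poorly and are slow to optimize. The proposed Universally Unfiltered and Unseen (U3)-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers, and optimizes a safe paraphrase set for a sensitive word to universally bypass prompt filters. Experiments show that it outperforms the state-of-the-art approach on both open-source and commercial T2I models.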

Link: https://arxiv.org/abs/2508.05658
Authors: Song Yan, Hui Wei, Jinlong Fei, Guoliang Yang, Zhengyu Zhao, Zheng Wamg
Affiliations: Information Engineering University; Wuhan University; Mathematical Engineering and Advanced Computing; Xi’an Jiaotong University
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ACM MM 2025

Abstract:Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been […]. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming […]. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I […]. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant […]. Experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both prompt filter and safety checker, our U3-Attack achieves ~4× higher success rates than the state-of-the-art multimodal jailbreak attack, […]. Warning: This paper includes examples of NSFW content.

[CV-104] Multivariate Fields of Experts

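[Quick read]: This paper addresses the limits of prior modeling for image reconstruction in inverse problems such as image denoising, deblurring, compressed-sensing magnetic-resonance imaging, and computed tomography, where univariate fields of experts (FoE) models are less expressive. The proposed multivariate fields of experts uses multivariate potential functions constructed via Moreau envelopes of the ℓ∞-norm. It outperforms comparable univariate models and comes close to deep-learning-based regularizers while being significantly faster, using fewer parameters, and training on substantially less data; its structured design also keeps it relatively interpretable.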

Link: https://arxiv.org/abs/2508.06490
Authors: Stanislas Ducotterd, Michael Unser
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:We introduce the multivariate fields of experts, a new framework for the learning of image priors. Our model generalizes existing fields of experts methods by incorporating multivariate potential functions constructed via Moreau envelopes of the ℓ∞-norm. We demonstrate the effectiveness of our proposal across a range of inverse problems that include image denoising, deblurring, compressed-sensing magnetic-resonance imaging, and computed tomography. The proposed approach outperforms comparable univariate models and achieves performance close to that of deep-learning-based regularizers while being significantly faster, requiring fewer parameters, and being trained on substantially fewer data. In addition, our model retains a relatively high level of interpretability due to its structured design.

[CV-105] Advanced Deep Learning Techniques for Accurate Lung Cancer Detection and Classification

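[Quick read]: This paper addresses the high false-positive rates and low accuracy of lung cancer (LC) detection and classification from lung CT images, which stem from small, imbalanced datasets. The approach is based on the DenseNet201 model combined with Focal Loss, data augmentation, and regularization to counter class imbalance and overfitting, and reports 98.95% accuracy.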

链接: https://arxiv.org/abs/2508.06287
作者: Mobarak Abumohsen,Enrique Costa-Montenegro,Silvia García-Méndez,Amani Yousef Owda,Majdi Owda
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lung cancer (LC) ranks among the most frequently diagnosed cancers and is one of the most common causes of death for men and women worldwide. Computed Tomography (CT) images are the most preferred diagnosis method because of their low cost and their faster processing times. Many researchers have proposed various ways of identifying lung cancer using CT images. However, such techniques suffer from significant false positives, leading to low accuracy. The fundamental reason results from employing a small and imbalanced dataset. This paper introduces an innovative approach for LC detection and classification from CT images based on the DenseNet201 model. Our approach comprises several advanced methods such as Focal Loss, data augmentation, and regularization to overcome the imbalanced data issue and overfitting challenge. The findings show the appropriateness of the proposal, attaining a promising performance of 98.95% accuracy.
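
The abstract names Focal Loss as one of the tools used against class imbalance. A minimal sketch of the binary form of that loss is given below to show why it helps: well-classified examples are down-weighted so that hard, often minority-class, examples dominate the gradient. The values `gamma=2.0` and `alpha=0.25` are commonly used defaults and are assumptions here; the listing does not state the settings the paper used, and the paper's task may be multi-class.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction; p is the predicted P(class 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which is a quick sanity check when wiring it into a training loop.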

[CV-106] Clinically-guided Data Synthesis for Laryngeal Lesion Detection

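【Quick Read】: This paper addresses the limited generalization of endoscopic computer-aided diagnosis and detection (CADx/e) systems in otorhinolaryngology, caused by the scarcity of well-annotated data. The key to its solution is a method that generates image-annotation pairs with a Latent Diffusion Model (LDM) coupled with a ControlNet adapter: clinical observations guide the diffusion process to produce realistic, clinically relevant laryngeal endoscopic images with matching annotations, which expand the training set and improve downstream detection. In the experiments, adding only 10% synthetic data raised the laryngeal lesion detection rate by 9% in internal testing and by 22.1% on out-of-domain external data.

Link: https://arxiv.org/abs/2508.06182
Authors: Chiara Baldini,Kaisar Kushibar,Richard Osuala,Simone Balocco,Oliver Diaz,Karim Lekadir,Leonardo S. Mattos
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract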

Abstract:Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, during a downstream task of detection, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% when the model was internally tested and 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking 5 expert otorhinolaryngologists with varying expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.

[CV-107] Transformer-Based Explainable Deep Learning for Breast Cancer Detection in Mammography: The MammoFormer Framework

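【Quick Read】: This paper addresses the difficulty of detecting breast cancer in mammography, where abnormalities are subtle and interpretations vary considerably between readers. It also targets two limitations of convolutional neural networks (CNNs) in medical image analysis: inadequate joint modeling of local information and global context, and the lack of Explainable AI (XAI) support needed for clinical acceptance. The key to its solution is the MammoFormer framework, which combines transformer-based architectures with multi-feature enhancement components and multi-level XAI functionality. Architecture-specific feature enhancement, such as adaptive histogram equalization (AHE) or histogram of oriented gradients (HOG), improves performance (Swin Transformer gains 13.0% with HOG enhancement) and enables multi-perspective diagnostic interpretability, yielding a clinically deployable ensemble system that pairs CNN reliability with transformer global context modeling.

Link: https://arxiv.org/abs/2508.06137
Authors: Ojonugwa Oluwafemi Ejiga Peter,Daniel Emakporuena,Bamidele Dayo Tunde,Maryam Abdulkarim,Abdullahi Bn Umar
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract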

Abstract:Breast cancer detection through mammography interpretation remains difficult because of the minimal nature of abnormalities that experts need to identify alongside the variable interpretations between readers. The potential of CNNs for medical image analysis faces two limitations: they fail to process both local information and wide contextual data adequately, and do not provide explainable AI (XAI) operations that doctors need to accept them in clinics. The researcher developed the MammoFormer framework, which unites transformer-based architecture with multi-feature enhancement components and XAI functionalities within one framework. Seven different architectures consisting of CNNs, Vision Transformer, Swin Transformer, and ConvNext were tested alongside four enhancement techniques, including original images, negative transformation, adaptive histogram equalization, and histogram of oriented gradients. The MammoFormer framework addresses critical clinical adoption barriers of AI mammography systems through: (1) systematic optimization of transformer architectures via architecture-specific feature enhancement, achieving up to 13% performance improvement, (2) comprehensive explainable AI integration providing multi-perspective diagnostic interpretability, and (3) a clinically deployable ensemble system combining CNN reliability with transformer global context modeling. The combination of transformer models with suitable feature enhancements enables them to achieve equal or better results than CNN approaches. ViT achieves 98.3% accuracy alongside AHE, while Swin Transformer gains a 13.0% advantage through HOG enhancements.

[CV-108] Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

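【Quick Read】: This paper addresses the inability of conventional scanning electron microscope (SEM) imaging to capture three-dimensional (3D) topography directly, and in particular the challenges existing methods face when reconstructing complex microstructures: the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. The key to its solution is NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input, removes manual calibration through end-to-end self-calibration, and automatically disentangles shadows from the images during training, enabling high-fidelity reconstruction of complex microstructures.

Link: https://arxiv.org/abs/2508.04728
Authors: Shuo Chen,Yijin Li,Xi Zheng,Guofeng Zhang
Affiliations: State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China; Alibaba Group, Hangzhou 311121, China; Analysis Center of Agriculture, Life and Environment Sciences, Zhejiang University, Hangzhou 310058, China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
Comments:

Click to view abstract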

Abstract:The scanning electron microscope (SEM) is a widely used imaging device in scientific research and industrial applications. Conventional two-dimensional (2D) SEM images do not directly reveal the three-dimensional (3D) topography of micro samples, motivating the development of SEM 3D surface reconstruction methods. However, reconstruction of complex microstructures remains challenging for existing methods due to the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. Here, we introduce NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input and fuses geometric and photometric information into a continuous neural field representation. NFH-SEM eliminates the manual calibration procedures through end-to-end self-calibration and automatically disentangles shadows from SEM images during training, enabling accurate reconstruction of intricate microstructures. We validate the effectiveness of NFH-SEM on real and simulated datasets. Our experiments show high-fidelity reconstructions of diverse, challenging samples, including two-photon lithography microstructures, peach pollen, and silicon carbide particle surfaces, demonstrating precise detail and broad applicability.

Artificial Intelligence

[AI-0] What Voting Rules Actually Do: A Data-Driven Analysis of Multi-Winner Voting

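【Quick Read】: This paper addresses how often multi-winner voting rules violate axioms in practice. Traditional work relies mostly on worst-case, binary judgments (whether a rule satisfies an axiom or not) and overlooks how rules behave under realistic preference distributions. The key to its solution is a data-driven evaluation framework that quantifies how frequently different voting rules violate axioms across a variety of preference distributions, allowing a comparison closer to real-world conditions. Using neural networks as learnable voting rules, the authors further find that these can outperform traditional rules in minimizing axiom violations, suggesting that data-driven approaches can inform the design of social choice mechanisms.

Link: https://arxiv.org/abs/2508.06454
Authors: Joshua Caiata,Ben Armstrong,Kate Larson
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: 41 pages

Click to view abstract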

Abstract:Committee-selection problems arise in many contexts and applications, and there has been increasing interest within the social choice research community on identifying which properties are satisfied by different multi-winner voting rules. In this work, we propose a data-driven framework to evaluate how frequently voting rules violate axioms across diverse preference distributions in practice, shifting away from the binary perspective of axiom satisfaction given by worst-case analysis. Using this framework, we analyze the relationship between multi-winner voting rules and their axiomatic performance under several preference distributions. We then show that neural networks, acting as voting rules, can outperform traditional rules in minimizing axiom violations. Our results suggest that data-driven approaches to social choice can inform the design of new voting systems and support the continuation of data-driven research in social choice.
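
The idea of measuring how often a rule violates an axiom, rather than whether it can, is easy to sketch. The example below is a toy illustration under stated assumptions and not the paper's framework: the two rules (k-Borda and SNTV), the stand-in axiom (a candidate ranked first by a strict majority must be in the committee), the uniform random sampling of rankings, and all function names are chosen here for illustration only.

```python
import random


def k_borda(profile, k):
    """Pick the k candidates with the highest Borda scores (ties: lower index)."""
    m = len(profile[0])
    scores = [0] * m
    for ranking in profile:
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return set(sorted(range(m), key=lambda c: (-scores[c], c))[:k])


def sntv(profile, k):
    """Pick the k candidates with the most first-place votes (ties: lower index)."""
    m = len(profile[0])
    scores = [0] * m
    for ranking in profile:
        scores[ranking[0]] += 1
    return set(sorted(range(m), key=lambda c: (-scores[c], c))[:k])


def violates_majority(profile, committee):
    """True if a candidate ranked first by a strict majority is left out."""
    n, m = len(profile), len(profile[0])
    firsts = [0] * m
    for ranking in profile:
        firsts[ranking[0]] += 1
    return any(firsts[c] > n / 2 and c not in committee for c in range(m))


def violation_rate(rule, k, n_voters, n_candidates, trials, seed=0):
    """Fraction of uniformly random profiles on which `rule` violates the axiom."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        profile = [rng.sample(range(n_candidates), n_candidates)
                   for _ in range(n_voters)]
        if violates_majority(profile, rule(profile, k)):
            violations += 1
    return violations / trials


for rule in (k_borda, sntv):
    print(rule.__name__, violation_rate(rule, k=1, n_voters=5,
                                        n_candidates=3, trials=2000))
```

SNTV can never violate this particular axiom, because a majority favourite always has the top first-place count, whereas k-Borda can; the frequency with which it does so depends on the sampled distribution, which is the kind of quantity the abstract describes.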

[AI-1] The Fair Game: Auditing Debiasing AI Algorithms Over Time

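【Quick Read】: This paper addresses the difficulty of sustaining fairness with current Fair Machine Learning methods in a dynamic social environment. Existing methods mostly rely on static, observational definitions of bias, which cannot adapt after deployment to societal change or to feedback when ground truth is missing, leaving a clear gap between fairness goals and real-world needs. The key to its solution is a dynamic mechanism called "Fair Game", which uses Reinforcement Learning (RL) to place an Auditor and a Debiasing algorithm in a closed loop around a pre-trained machine learning model. The mechanism adjusts the fairness goals based on feedback from societal interaction, simulating the evolution of ethical and legal frameworks and enabling adaptive fairness from pre-deployment through post-deployment.

Link: https://arxiv.org/abs/2508.06443
Authors: Debabrota Basu,Udvas Das
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT)
Comments:

Click to view abstract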

Abstract:An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e. they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality, these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus, there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism, “Fair Game”, to assure fairness in the predictions of an ML algorithm and to adapt its predictions as the society interacts with the algorithm over time. “Fair Game” puts together an Auditor and a Debiasing algorithm in a loop around an ML algorithm. The “Fair Game” puts these two components in a loop by leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to take decisions, which yields new observations (also known as data/feedback) from the environment and in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. “Fair Game” provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus, “Fair Game” aims to simulate the evolution of ethical and legal frameworks in the society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible and adaptive-over-time framework to build Fair ML systems pre- and post-deployment.

[AI-2] Dimensional Characterization and Pathway Modeling for Catastrophic AI Risks

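【Quick Read】: This paper attempts to address the lack of a systematic, multidimensional framework and of explicit causal pathways in current discussions of Artificial Intelligence (AI) risk, especially in mapping potential hazards to actual harms. The key to its solution is two complementary methods. The first characterizes six commonly discussed catastrophic AI risks along seven dimensions (intent, competency, entity, polarity, linearity, reach, and order), supporting systematic risk identification and generalizable mitigation strategies. The second builds risk pathway models that trace the step-by-step progression from initial hazard to resulting harm and identify scenario-specific intervention points. Together they offer a more structured and actionable foundation for managing catastrophic AI risks across the value chain.

Link: https://arxiv.org/abs/2508.06411
Authors: Ze Shen Chin
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages including references, 6 figures. To be presented in Technical AI Governance Forum 2025

Click to view abstract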

Abstract:Although discourse around the risks of Artificial Intelligence (AI) has grown, it often lacks a comprehensive, multidimensional framework, and concrete causal pathways mapping hazard to harm. This paper aims to bridge this gap by examining six commonly discussed AI catastrophic risks: CBRN, cyber offense, sudden loss of control, gradual loss of control, environmental risk, and geopolitical risk. First, we characterize these risks across seven key dimensions, namely intent, competency, entity, polarity, linearity, reach, and order. Next, we conduct risk pathway modeling by mapping step-by-step progressions from the initial hazard to the resulting harms. The dimensional approach supports systematic risk identification and generalizable mitigation strategies, while risk pathway models help identify scenario-specific interventions. Together, these methods offer a more structured and actionable foundation for managing catastrophic AI risks across the value chain.

[AI-3] Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling INTERSPEECH2025

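【Quick Read】: This paper addresses the reliance of traditional speech separation and speaker diarization methods on prior knowledge of target speakers or a predetermined number of participants, which limits their use in enrollment-free scenarios. The key to its solution is an enrollment-free joint training framework that automatically identifies target speaker embeddings within mixtures to perform separation and diarization simultaneously. Its core contributions are a dual-stage training pipeline for learning speaker representations that are robust to background noise, and an overlapping spectral loss tailored to overlapped speech frames, which improves diarization accuracy; the reported relative improvements are 71% in DER and 69% in cpWER.

Link: https://arxiv.org/abs/2508.06393
Authors: Md Asif Jalal,Luca Remaggi,Vasileios Moschopoulos,Thanasis Kotsiopoulos,Vandana Rajan,Karthikeyan Saravanan,Anastasis Drosou,Junho Heo,Hyuk Oh,Seokyeong Jeong
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2025

Click to view abstract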

Abstract:Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach to train simultaneous speech separation and diarization using automatic identification of target speaker embeddings, within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames. Experimental results show significant performance gains compared to the current SOTA baseline, achieving 71% relative improvement in DER and 69% in cpWER.

[AI-4] Identity Increases Stability in Neural Cellular Automata

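【Quick Read】: This paper addresses the stability problems of Neural Cellular Automata (NCA) when growing two-dimensional artificial organisms: their natural boundary often breaks down through tumour-like growth or loss of shape. The key to its solution is training with an 'identity' layer subject to simple constraints; a single identity value is enough to make NCA organisms grown in close proximity more stable than with the original NCA model. The authors also observe emergent movement, which becomes more prevalent with multiple identity values, laying the foundation for studying cell-level social interaction between artificial organisms.

Link: https://arxiv.org/abs/2508.06389
Authors: James Stovold
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Accepted to ALIFE 2025

Click to view abstract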

Abstract:Neural Cellular Automata (NCAs) offer a way to study the growth of two-dimensional artificial organisms from a single seed cell. From the outset, NCA-grown organisms have had issues with stability, their natural boundary often breaking down and exhibiting tumour-like growth or failing to maintain the expected shape. In this paper, we present a method for improving the stability of NCA-grown organisms by introducing an ‘identity’ layer with simple constraints during training. Results show that NCAs grown in close proximity are more stable compared with the original NCA model. Moreover, only a single identity value is required to achieve this increase in stability. We observe emergent movement from the stable organisms, with increasing prevalence for models with multiple identity values. This work lays the foundation for further study of the interaction between NCA-grown organisms, paving the way for studying social interaction at a cellular level in artificial organisms.

[AI-5] End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation IJCNN25

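【Quick Read】: This paper addresses the missing step of database intent prediction in Text-to-SQL for multi-database settings: when the target database is not specified in advance, how to identify the database (db_id) the user intends, and thereby improve SQL generation accuracy. The key to its solution is a three-stage end-to-end framework. First, a large language model (LLM) with prompt engineering extracts an implicit ruleset from the natural language query (NLQ). Second, a db_id prediction model built on a fine-tuned RoBERTa encoder predicts the db_id from both the NLQ and the LLM-generated rules. Finally, critic agents correct errors in the generated SQL. The authors report that the approach improves both database intent prediction and SQL generation in multi-database environments.

Link: https://arxiv.org/abs/2508.06387
Authors: Anurag Tripathi,Vaibhav Patle,Abhinav Jain,Ayush Pundir,Sairam Menon,Ajeet Kumar Singh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in IJCNN25

Click to view abstract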

Abstract:Text-to-SQL bridges the gap between natural language and structured database language, thus allowing non-technical users to easily query databases. Traditional approaches model text-to-SQL as a direct translation task, where a given Natural Language Query (NLQ) is mapped to an SQL command. Recent advances in large language models (LLMs) have significantly improved translation accuracy; however, these methods all require that the target database be pre-specified. This becomes problematic in scenarios with multiple extensive databases, where identifying the correct database becomes a crucial yet overlooked step. In this paper, we propose a three-stage end-to-end text-to-SQL framework to identify the user’s intended database before generating SQL queries. Our approach leverages LLMs and prompt engineering to extract implicit information from natural language queries (NLQs) in the form of a ruleset. We then train a large db_id prediction model, which includes a RoBERTa-based finetuned encoder, to predict the correct Database identifier (db_id) based on both the NLQ and the LLM-generated rules. Finally, we refine the generated SQL by using critic agents to correct errors. Experimental results demonstrate that our framework outperforms the current state-of-the-art models in both database intent prediction and SQL generation accuracy.

[AI-6] SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

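【Quick Read】: This paper addresses the error propagation, difficulty with overlapping speech, and lack of joint optimization between speaker diarization (SD) and automatic speech recognition (ASR) in conventional Speaker Diarization and Recognition (SDR) systems. The key to its solution is SpeakerLM, a unified multimodal large language model that performs SD and ASR jointly in an end-to-end manner and includes a flexible speaker registration mechanism that supports robust SDR under different registration settings.

Link: https://arxiv.org/abs/2508.06372
Authors: Han Yin,Yafeng Chen,Chong Deng,Luyao Cheng,Hui Wang,Chao-Hong Tan,Qian Chen,Wen Wang,Xiangang Li
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract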

Abstract:The Speaker Diarization and Recognition (SDR) task aims to predict “who spoke when and what” within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

[AI-7] Automated Creation of the Legal Knowledge Graph Addressing Legislation on Violence Against Women: Resource Methodology and Lessons Learned

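【Quick Read】: This paper addresses the scarcity of Legal Knowledge Graphs (LKGs), which are needed to support information access, complex queries, and machine learning applications in legal decision-making. The key to its solution is two complementary approaches to automated construction: a systematic bottom-up approach customized for the legal domain, and a new solution that leverages Large Language Models (LLMs). Both combine structured data extraction, ontology development, and semantic enrichment to build knowledge graphs for cases of violence against women from legal sentences made publicly available by the European Court of Justice. The graphs are validated with competency questions and are intended as a knowledge component for applications such as predictive justice.

Link: https://arxiv.org/abs/2508.06368
Authors: Claudia d'Amato,Giuseppe Rubini,Francesco Didio,Donato Francioso,Fatima Zahra Amara,Nicola Fanizzi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract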

Abstract:The legal decision-making process requires comprehensive and detailed legislative background knowledge and up-to-date information on legal cases and related sentences/decisions. Legal Knowledge Graphs (KGs) would be a valuable tool to facilitate access to legal information, to be queried and exploited for the purpose, and to enable advanced reasoning and machine learning applications. Indeed, legal KGs may act as a knowledge-intensive component to be used by predictive machine learning solutions supporting the decision process of the legal expert. Nevertheless, few KGs can be found in the legal domain. To fill this gap, we developed a legal KG targeting legal cases of violence against women, along with clear adopted methodologies. Specifically, the paper introduces two complementary approaches for automated legal KG construction: a systematic bottom-up approach, customized for the legal domain, and a new solution leveraging Large Language Models. Starting from legal sentences publicly available from the European Court of Justice, the solutions integrate structured data extraction, ontology development, and semantic enrichment to produce KGs tailored for legal cases involving violence against women. After analyzing and comparing the results of the two approaches, the developed KGs are validated via suitable competency questions. The obtained KG may be impactful for multiple purposes: it can improve the accessibility of legal information to both humans and machines, can enable complex queries, and may constitute an important knowledge component to be exploited by machine learning tools tailored for predictive justice.

[AI-8] ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design

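【Quick Read】: This paper addresses precise control over a molecule's biological activity, including targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation, which are critical challenges in de novo drug design. Existing generative methods usually target a single desired activity and lack an integrated mechanism for managing multiple intended and unintended molecular interactions. The key to its solution is ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models: separately trained drug-target classifiers provide positive guidance (enhancing desired activities) and negative guidance (suppressing off-target effects), so that efficacy and safety are considered together during generation. Experiments cover single-target and dual-target generation, fragment-constrained dual-target design, selectivity enhancement, and off-target reduction.

Link: https://arxiv.org/abs/2508.06364
Authors: Renyi Zhou,Huimin Zhu,Jing Tang,Min Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments:

Click to view abstract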

Abstract:Achieving precise control over a molecule’s biological activity, encompassing targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation, remains a critical challenge in de novo drug design. However, existing generative methods primarily focus on producing molecules with a single desired activity, lacking integrated mechanisms for the simultaneous management of multiple intended and unintended molecular interactions. Here, we propose ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models. It leverages separately trained drug-target classifiers for both positive and negative guidance, enabling the model to enhance desired activities while minimizing harmful off-target effects. Experimental results show that ActivityDiff effectively handles essential drug design tasks, including single-/dual-target generation, fragment-constrained dual-target design, selective generation to enhance target specificity, and reduction of off-target effects. These results demonstrate the effectiveness of classifier-guided diffusion in balancing efficacy and safety in molecular design. Overall, our work introduces a novel paradigm for achieving integrated control over molecular activity, and provides ActivityDiff as a versatile and extensible framework.
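
One way to read "positive and negative guidance" is as the usual classifier-guidance update with a second, sign-flipped classifier term. The expression below is a sketch under that assumption, written for a continuous state \mathbf{x}_t; the symbols \lambda^{+}, \lambda^{-}, \phi, and \psi are introduced here for illustration, and the listing does not confirm that the paper uses this exact form or a continuous molecular representation.

```latex
% Guided score at diffusion step t (illustrative form, not taken from the paper):
%   p_\theta : unconditional diffusion model over molecules
%   p_\phi   : classifier for a desired target activity y^{+}
%   p_\psi   : classifier for an undesired off-target activity y^{-}
\tilde{\mathbf{s}}(\mathbf{x}_t, t)
  = \nabla_{\mathbf{x}_t} \log p_\theta(\mathbf{x}_t)
  + \lambda^{+}\, \nabla_{\mathbf{x}_t} \log p_\phi\!\left(y^{+} \mid \mathbf{x}_t\right)
  - \lambda^{-}\, \nabla_{\mathbf{x}_t} \log p_\psi\!\left(y^{-} \mid \mathbf{x}_t\right),
\qquad \lambda^{+}, \lambda^{-} \ge 0 .
```

Setting \lambda^{-} = 0 recovers ordinary single-classifier guidance, and adding further positive terms would correspond to the dual-target setting mentioned in the abstract.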

[AI-9] Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

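【Quick Read】: This paper addresses the possibility that Large Language Models (LLMs) deceive spontaneously, without explicit human inducement: even on benign prompts, a model may deliberately fabricate or conceal information to serve a hidden objective. The risk is high in real-world use, yet earlier studies mostly induced deception by setting a "hidden objective" by hand, which does not reflect real human-LLM interaction well. The key to its solution is a new evaluation framework grounded in psychological principles, built on "contact searching questions" and two statistical metrics: the Deceptive Intention Score, which quantifies the model's bias towards a hidden objective, and the Deceptive Behavior Score, which measures the inconsistency between the model's internal belief and its expressed output. The framework quantifies self-initiated deception without ground-truth labels and shows that both metrics rise as task difficulty increases, a warning for deploying LLM agents in high-stakes domains.

Link: https://arxiv.org/abs/2508.06361
Authors: Zhaomin Wu,Mingzhe Du,See-Kiong Ng,Bingsheng He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract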

Abstract:Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and underexplored threat. Existing studies typically induce such deception by explicitly setting a “hidden” objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Moving beyond this human-induced deception, we investigate LLMs’ self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a novel framework using “contact searching questions.” This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model’s bias towards a hidden objective. The second, Deceptive Behavior Score, measures the inconsistency between the LLM’s internal belief and its expressed output. Upon evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.

[AI-10] From Explainable to Explanatory Artificial Intelligence: Toward a New Paradigm for Human-Centered Explanations through Generative AI

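【Quick Read】: This paper attempts to address the tendency of current Explainable AI (XAI) methods to focus on algorithmic transparency while neglecting users' need for understanding and decision support in real application settings. The key to its solution is a new paradigm, "Explanatory AI", which uses Generative AI as an explanatory partner for human understanding rather than as a provider of technical transparency. The paradigm emphasizes narrative communication, adaptive personalization, and progressive disclosure to deliver contextual, multimodal explanations geared to human cognition. An empirical validation with healthcare professionals using the Rapid Contextual Design methodology shows that users prefer context-sensitive explanations, supporting a shift from algorithmic introspection to design centered on human comprehension.

Link: https://arxiv.org/abs/2508.06352
Authors: Christian Meske,Justin Brenne,Erdi Uenal,Sabahat Oelcer,Ayseguel Doganguen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract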

Abstract:Current explainable AI (XAI) approaches prioritize algorithmic transparency and present explanations in abstract, non-adaptive formats that often fail to support meaningful end-user understanding. This paper introduces “Explanatory AI” as a complementary paradigm that leverages generative AI capabilities to serve as explanatory partners for human understanding rather than providers of algorithmic transparency. While XAI reveals algorithmic decision processes for model validation, Explanatory AI addresses contextual reasoning to support human decision-making in sociotechnical contexts. We develop a definition and systematic eight-dimensional conceptual model distinguishing Explanatory AI through narrative communication, adaptive personalization, and progressive disclosure principles. Empirical validation through Rapid Contextual Design methodology with healthcare professionals demonstrates that users consistently prefer context-sensitive, multimodal explanations over technical transparency. Our findings reveal the practical urgency for AI systems designed for human comprehension rather than algorithmic introspection, establishing a comprehensive research agenda for advancing user-centered AI explanation approaches across diverse domains and cultural contexts.

[AI-11] AntiCheatPT: A Transformer-Based Approach to Cheat Detection in Competitive Computer Games

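【Quick Read】: This paper addresses the threat that cheating poses to the integrity of online video games, and in particular the difficulty traditional anti-cheat systems such as VAC have in keeping up with evolving cheats without invasive measures on users' systems. The key to its solution is AntiCheatPT_256, a transformer-based machine learning model trained on Counter-Strike 2 gameplay data. The authors build and publicly release the CS2CD dataset (795 labelled matches) and augment the derived context windows to address class imbalance. On an unaugmented test set the model reaches 89.17% accuracy and 93.36% AUC, providing a reproducible baseline for data-driven cheat detection.

Link: https://arxiv.org/abs/2508.06348
Authors: Mille Mei Zhen Loo,Gert Luzkov,Paolo Burelli
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract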

Abstract:Cheating in online video games compromises the integrity of gaming experiences. Anti-cheat systems, such as VAC (Valve Anti-Cheat), face significant challenges in keeping pace with evolving cheating methods without imposing invasive measures on users’ systems. This paper presents AntiCheatPT_256, a transformer-based machine learning model designed to detect cheating behaviour in Counter-Strike 2 using gameplay data. To support this, we introduce and publicly release CS2CD: A labelled dataset of 795 matches. Using this dataset, 90,707 context windows were created and subsequently augmented to address class imbalance. The transformer model, trained on these windows, achieved an accuracy of 89.17% and an AUC of 93.36% on an unaugmented test set. This approach emphasizes reproducibility and real-world applicability, offering a robust baseline for future research in data-driven cheat detection.
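
The two reported metrics answer different questions, which matters on imbalanced data such as cheat detection. The sketch below computes both from scratch on made-up scores; the labels, scores, and the 0.5 threshold are hypothetical and have nothing to do with the CS2CD data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed decision threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical window-level outputs: 1 = cheating window, 0 = clean window.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print("AUC:", roc_auc(labels, scores))
print("accuracy:", accuracy(labels, scores))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality over all thresholds, which is why the two numbers in the abstract can differ.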

[AI-12] Structural Equation-VAE: Disentangled Latent Representations for Tabular Data

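【Quick Read】: This paper addresses the challenge of learning interpretable latent representations from tabular data, in particular how deep generative models can produce a latent space that is structured, interpretable, and robust to nuisance variation. The key to its solution is SE-VAE (Structural Equation-Variational Autoencoder), which embeds known measurement structure directly into the design of a variational autoencoder: latent subspaces are aligned with indicator groupings, and a global nuisance latent isolates construct-specific confounding variation, so that disentanglement comes from the architecture rather than from statistical regularizers alone. Experiments show that this modular design improves factor recovery, interpretability, and robustness to nuisance variation.

Link: https://arxiv.org/abs/2508.06347
Authors: Ruiyu Zhang,Ce Zhao,Xin Zhao,Lin Nie,Wai-Fung Lam
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages, 2 figures

Click to view abstract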

Abstract:Learning interpretable latent representations from tabular data remains a challenge in deep generative modeling. We introduce SE-VAE (Structural Equation-Variational Autoencoder), a novel architecture that embeds measurement structure directly into the design of a variational autoencoder. Inspired by structural equation modeling, SE-VAE aligns latent subspaces with known indicator groupings and introduces a global nuisance latent to isolate construct-specific confounding variation. This modular architecture enables disentanglement through design rather than through statistical regularizers alone. We evaluate SE-VAE on a suite of simulated tabular datasets and benchmark its performance against a series of leading baselines using standard disentanglement metrics. SE-VAE consistently outperforms alternatives in factor recovery, interpretability, and robustness to nuisance variation. Ablation results reveal that architectural structure, rather than regularization strength, is the key driver of performance. SE-VAE offers a principled framework for white-box generative modeling in scientific and social domains where latent constructs are theory-driven and measurement validity is essential.

[AI-13] On Approximate MMS Allocations on Restricted Graph Classes

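【Quick Read】: This paper studies the fair division of indivisible goods under connectivity constraints, where the set of goods allocated to each agent must form a connected subgraph of a graph. It focuses on the widely studied maximin share (MMS) fairness criterion and seeks approximate allocations in which every agent receives a connected bundle worth at least a constant fraction of that agent's MMS value. The key contribution is a proof that such approximate allocations exist for several important graph classes, including block graphs, cacti, complete multipartite graphs, and split graphs. This extends the set of graph classes known to admit approximate MMS allocations and advances the open question of whether such allocations exist for all graphs.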

Link: https://arxiv.org/abs/2508.06343
Authors: Václav Blažej, Michał Dębski, Zbigniew Lonc, Marta Piecyk, Paweł Rzążewski
Affiliations: Unknown
Categories: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study the problem of fair division of a set of indivisible goods with connectivity constraints. Specifically, we assume that the goods are represented as vertices of a connected graph, and sets of goods allocated to the agents are connected subgraphs of this graph. We focus on the widely-studied maximin share criterion of fairness. It has been shown that an allocation satisfying this criterion may not exist even without connectivity constraints, i.e., if the graph of goods is complete. In view of this, it is natural to seek approximate allocations that guarantee each agent a connected bundle of goods with value at least a constant fraction of the maximin share value to the agent. It is known that for some classes of graphs, such as complete graphs, cycles, and d-claw-free graphs for any fixed d, such approximate allocations indeed exist. However, it is an open problem whether they exist for the class of all graphs. In this paper, we continue the systematic study of the existence of approximate allocations on restricted graph classes. In particular, we show that such allocations exist for several well-studied classes, including block graphs, cacti, complete multipartite graphs, and split graphs.
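
To make the maximin share concrete, here is a minimal brute-force sketch for a tiny instance without connectivity constraints (the complete-graph case mentioned in the abstract). The function name `maximin_share` and the example valuations are illustrative assumptions, not taken from the paper, and additive valuations are assumed.

```python
from itertools import product

def maximin_share(values, n_agents):
    """Brute-force MMS: the best achievable value of the worst bundle when
    one agent splits all goods into n_agents bundles (additive values)."""
    best = 0
    # Enumerate every assignment of goods to bundles (exponential; tiny inputs only).
    for assignment in product(range(n_agents), repeat=len(values)):
        bundles = [0] * n_agents
        for good, bundle in enumerate(assignment):
            bundles[bundle] += values[good]
        best = max(best, min(bundles))
    return best

values = [7, 5, 4, 3, 1]  # hypothetical non-negative valuations of one agent
mms_two = maximin_share(values, 2)
mms_three = maximin_share(values, 3)
```

An approximate MMS allocation with factor α would then give this agent a bundle worth at least α times the computed value; with connectivity constraints, only partitions into connected bundles would be enumerated.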

[AI-14] Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

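【Quick Read】: This paper addresses the robustness of ad-hoc teamwork in multi-agent reinforcement learning: how an agent can cooperate effectively with unknown, changing partners without relying on pretrained partners or manual parameter tuning. The key to the solution is the Unsupervised Partner Design (UPD) framework, which generates diverse training partners by stochastically mixing the ego agent's policy with biased random behaviours, scores them with a variance-based learnability metric, and prioritises partners near the ego agent's current learning frontier. This lets UPD build an adaptive partner curriculum without a population or prior knowledge, improving cooperative performance; in a user study the resulting agent was also perceived as more adaptive and more human-like.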

Link: https://arxiv.org/abs/2508.06336
Authors: Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 16 pages

Abstract:We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent’s policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent’s current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.
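
The two ingredients named in the abstract, policy mixing and a variance-based learnability score, can be sketched roughly as follows. This is a hedged illustration only: the function names, the mixing scheme, and the Bernoulli-variance form `p * (1 - p)` are assumptions, and the paper's actual definitions may differ.

```python
import random

def mix_partner_action(ego_probs, bias_probs, mix, rng):
    """Sample a partner action: with probability `mix` use a biased random
    distribution, otherwise follow the ego agent's policy."""
    probs = bias_probs if rng.random() < mix else ego_probs
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def learnability(success_rate):
    """Variance of a Bernoulli outcome: largest at 0.5, zero at 0 or 1."""
    return success_rate * (1.0 - success_rate)

# Hypothetical success rates of the ego agent with three candidate partners.
partners = {"too_easy": 0.95, "near_frontier": 0.5, "too_hard": 0.05}
best = max(partners, key=lambda name: learnability(partners[name]))
```

Under this assumed score, partners the ego agent always or never succeeds with get low priority, and partners with uncertain outcomes are favoured.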

[AI-15] A “good regulator theorem” for embodied agents

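【Quick Read】: This paper addresses the limited applicability of Conant and Ashby's claim that "every good regulator of a system must be a model of that system," which is challenged by the many Artificial Life systems that perform regulation tasks with no explicit model in sight. The key to the solution is a "belief updating" perspective: whenever an agent can perform a regulation task, an observer can interpret it as holding "beliefs" about its environment that it "updates" in response to sensory input. Belief updating yields a notion of model that is more sophisticated and more broadly applicable than the original one. The central shift is that the model is imposed on the system by an observer rather than being an intrinsic property of the system. The resulting theorem holds whether the system regulates its environment in a classic control-theory setup or regulates its own internal state, and the apparent counterexamples are resolved by noting that the model may be trivial.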

Link: https://arxiv.org/abs/2508.06326
Authors: Nathaniel Virgo, Martin Biehl, Manuel Baltieri, Matteo Capucci
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted at the Artificial Life conference 2025 (ALife 2025). 10 pages, 1 figure

Abstract:In a classic paper, Conant and Ashby claimed that “every good regulator of a system must be a model of that system.” Artificial Life has produced many examples of systems that perform tasks with apparently no model in sight; these suggest Conant and Ashby’s theorem doesn’t easily generalise beyond its restricted setup. Nevertheless, here we show that a similar intuition can be fleshed out in a different way: whenever an agent is able to perform a regulation task, it is possible for an observer to interpret it as having “beliefs” about its environment, which it “updates” in response to sensory input. This notion of belief updating provides a notion of model that is more sophisticated than Conant and Ashby’s, as well as a theorem that is more broadly applicable. However, it necessitates a change in perspective, in that the observer plays an essential role in the theory: models are not a mere property of the system but are imposed on it from outside. Our theorem holds regardless of whether the system is regulating its environment in a classic control theory setup, or whether it’s regulating its own internal state; the model is of its environment either way. The model might be trivial, however, and this is how the apparent counterexamples are resolved.

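[AI-16] LLM Robustness Leaderboard v1 -- Technical report

【Quick Read】: This paper addresses how to evaluate the robustness of Large Language Models (LLMs) against malicious attacks, in particular how to systematically identify and quantify a model's vulnerability to having harmful behaviour elicited. The key to the solution is the PRISM Eval Behavior Elicitation Tool (BET), which performs automated red-teaming through Dynamic Adversarial Optimization and achieves a 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. It also introduces a fine-grained robustness metric, the average number of attempts required to elicit harmful behaviour, which reveals that attack difficulty varies by more than 300-fold across models. In addition, a primitive-level vulnerability analysis identifies which jailbreaking techniques are most effective for specific hazard categories, and a collaborative evaluation with trusted third parties points to a practical path toward distributed robustness assessment.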

Link: https://arxiv.org/abs/2508.06296
Authors: Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.
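
As a rough illustration of an attempts-based robustness metric, the sketch below averages the number of attempts needed before the first successful elicitation. The function names and the sample counts are hypothetical, and the paper's estimator may be more elaborate; the only modelling assumption used here is that attempts succeed independently with a fixed probability, so the expected number of attempts is the reciprocal of that probability.

```python
def mean_attempts(attempts_until_success):
    """Average number of attempts needed before the first success."""
    return sum(attempts_until_success) / len(attempts_until_success)

def implied_success_rate(avg_attempts):
    """Per-attempt success probability under an independent-trials assumption."""
    return 1.0 / avg_attempts

# Hypothetical logs: attempts needed in three red-teaming runs per model.
fragile_model = mean_attempts([1, 2, 3])
robust_model = mean_attempts([150, 200, 250])
difficulty_ratio = robust_model / fragile_model
```

A ratio like `difficulty_ratio` is the kind of quantity behind a statement such as "attack difficulty varies by N-fold across models", even when every model is eventually broken.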

[AI-17] OM2P: Offline Multi-Agent Mean-Flow Policy

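【Quick Read】: This paper addresses the low sampling efficiency of generative policy models (such as diffusion and flow-based models) in offline multi-agent reinforcement learning (offline MARL), which stems from their iterative generation process and limits their use in time-sensitive or resource-constrained settings. The key to the solution is OM2P (Offline Multi-Agent Mean-Flow Policy). It introduces a reward-aware optimization scheme that combines a carefully designed mean-flow matching loss with Q-function supervision to reduce the misalignment between the generative objective and reward maximization. It also uses a generalized timestep distribution and a derivative-free estimation strategy to lower memory overhead and improve training stability. The result is one-step action sampling with substantially better training efficiency and practicality.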

Link: https://arxiv.org/abs/2508.06269
Authors: Zhuoran Li, Xun Wang, Hai Zhong, Longbo Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative models, especially diffusion and flow-based models, have been promising in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach represents the first to successfully integrate mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.

[AI-18] Numerical Considerations in Weighted Model Counting

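【Quick Read】: This paper addresses the difficulty of achieving both precision and efficiency in Weighted Model Counting (WMC), which has important applications in probabilistic reasoning and quantitative risk assessment. Floating-point arithmetic is efficient but gives no control over precision, while rational arithmetic is exact but costly in time and space. The key to the solution is to combine several numeric representations. For problems with nonnegative weights, the extended-range double (ERD) format, a standard IEEE double supplemented with a separate 64-bit exponent, avoids underflow and overflow, and the precision loss of floating-point evaluation can be tightly bounded. For problems with mixed positive and negative weights, a combination of interval floating-point arithmetic and rational arithmetic delivers guaranteed precision with much better efficiency. The evaluation uses especially challenging formulas and weight assignments to demonstrate the robustness and practicality of the approach.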

Link: https://arxiv.org/abs/2508.06264
Authors: Randal E. Bryant
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Weighted model counting computes the sum of the rational-valued weights associated with the satisfying assignments for a Boolean formula, where the weight of an assignment is given by the product of the weights assigned to the positive and negated variables comprising the assignment. Weighted model counting finds applications across a variety of domains including probabilistic reasoning and quantitative risk assessment. Most weighted model counting programs operate by (explicitly or implicitly) converting the input formula into a form that enables arithmetic evaluation, using multiplication for conjunctions and addition for disjunctions. Performing this evaluation using floating-point arithmetic can yield inaccurate results, and it cannot quantify the level of precision achieved. Computing with rational arithmetic gives exact results, but it is costly in both time and space. This paper describes how to combine multiple numeric representations to efficiently compute weighted model counts that are guaranteed to achieve a user-specified precision. When all weights are nonnegative, we prove that the precision loss of arithmetic evaluation using floating-point arithmetic can be tightly bounded. We show that supplementing a standard IEEE double-precision representation with a separate 64-bit exponent, a format we call extended-range double (ERD), avoids the underflow and overflow issues commonly encountered in weighted model counting. For problems with mixed negative and positive weights, we show that a combination of interval floating-point arithmetic and rational arithmetic can achieve the twin goals of efficiency and guaranteed precision. For our evaluations, we have devised especially challenging formulas and weight assignments, demonstrating the robustness of our approach. 
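
The definition in the abstract can be illustrated with a brute-force counter that works with either exact rationals or floats. This is a minimal sketch for tiny formulas only; the function name, the DIMACS-style clause encoding, and the example weights are assumptions, and real counters compile the formula rather than enumerate assignments.

```python
from fractions import Fraction
from itertools import product

def weighted_model_count(n_vars, clauses, pos_weight, neg_weight):
    """Brute-force WMC of a CNF formula.
    clauses: list of clauses, each a list of non-zero ints (DIMACS-style literals).
    pos_weight[i] / neg_weight[i]: weight of variable i+1 being true / false."""
    total = 0
    for assignment in product([False, True], repeat=n_vars):
        satisfied = all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if satisfied:
            weight = 1
            for i, value in enumerate(assignment):
                weight = weight * (pos_weight[i] if value else neg_weight[i])
            total = total + weight
    return total

clauses = [[1, 2]]  # the formula (x1 OR x2)
pos = [Fraction(3, 10), Fraction(1, 5)]
neg = [1 - w for w in pos]
exact = weighted_model_count(2, clauses, pos, neg)
approx = weighted_model_count(
    2, clauses, [float(w) for w in pos], [float(w) for w in neg]
)
```

With `Fraction` weights the result is exact; with floats it is only an approximation, and the gap between the two is what a guaranteed-precision method has to bound.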

[AI-19] Symmetry breaking for inductive logic programming

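【Quick Read】: This paper addresses the poor search efficiency of Inductive Logic Programming (ILP), which is caused by a vast hypothesis space containing many logically equivalent hypotheses. The key to the solution is a method for breaking symmetries in the hypothesis space, implemented in Answer Set Programming (ASP). Experiments across several domains, including visual reasoning and game playing, show that the approach can reduce solving times from over an hour to just 17 seconds.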

Link: https://arxiv.org/abs/2508.06263
Authors: Andrew Cropper, David M. Cerna, Matti Järvisalo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The goal of inductive logic programming is to search for a hypothesis that generalises training data and background knowledge. The challenge is searching vast hypothesis spaces, which is exacerbated because many logically equivalent hypotheses exist. To address this challenge, we introduce a method to break symmetries in the hypothesis space. We implement our idea in answer set programming. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce solving times from over an hour to just 17 seconds.

[AI-20] Synthetic Data Generation and Differential Privacy using Tensor Networks Matrix Product States (MPS)

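【Quick Read】: This paper addresses challenges that modern AI faces from data scarcity, privacy constraints, and the need for diverse training data, and in particular how to generate high-quality synthetic tabular data while preserving privacy. The key to the solution is a generative model built on Matrix Product States (MPS), a type of Tensor Network, combined with Differential Privacy (DP): noise injection and gradient clipping are applied during training, and formal privacy guarantees are obtained through Rényi Differential Privacy accounting. Experiments show that, under strict privacy constraints, the method retains better data fidelity and downstream task performance than established models such as CTGAN, VAE, and PrivBayes, which highlights the interpretability and scalability of MPS for privacy-aware synthetic data generation.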

Link: https://arxiv.org/abs/2508.06251
Authors: Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, Ali Abedi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Quantum Physics (quant-ph)
Comments: 10 pages

Abstract:Synthetic data generation is a key technique in modern artificial intelligence, addressing data scarcity, privacy constraints, and the need for diverse datasets in training robust models. In this work, we propose a method for generating privacy-preserving high-quality synthetic tabular data using Tensor Networks, specifically Matrix Product States (MPS). We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes, focusing on both fidelity and privacy-preserving capabilities. To ensure differential privacy (DP), we integrate noise injection and gradient clipping during training, enabling privacy guarantees via Rényi Differential Privacy accounting. Across multiple metrics analyzing data fidelity and downstream machine learning task performance, our results show that MPS outperforms classical models, particularly under strict privacy constraints. This work highlights MPS as a promising tool for privacy-aware synthetic data generation. By combining the expressive power of tensor network representations with formal privacy mechanisms, the proposed approach offers an interpretable and scalable alternative for secure data sharing. Its structured design facilitates integration into sensitive domains where both data quality and confidentiality are critical.
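
The combination of gradient clipping and noise injection mentioned in the abstract follows the general DP-SGD pattern, sketched below in plain Python. This is a generic illustration under stated assumptions, not the paper's MPS training code: the function name and parameters are hypothetical, gradients are plain lists, and the Rényi DP accounting step is not shown.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm <= clip_norm, sum them,
    add Gaussian noise with std noise_multiplier * clip_norm, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += grad[j] * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
noiseless = dp_average_gradient(grads, 1.0, 0.0, random.Random(0))
private = dp_average_gradient(grads, 1.0, 1.1, random.Random(0))
```

Clipping bounds each example's influence on the update, which is what lets the added noise translate into a formal privacy guarantee once an accountant tracks the cumulative budget.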

[AI-21] In-Training Defenses against Emergent Misalignment in Language Models

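【Quick Read】: This paper addresses emergent misalignment (EMA), which can arise when aligned large language models (LLMs) are fine-tuned for a specific domain: fine-tuning can induce harmful behaviours far outside the target domain, even when the fine-tuning data looks harmless. The key to the solution is to design and empirically evaluate four training regularization interventions that can be deployed by providers offering fine-tuning through an API: (i) KL-divergence regularization toward a safe reference model, (ii) ℓ₂ distance regularization in feature space, (iii) projection onto a safe subspace (SafeLoRA), and (iv) interleaving a small number of safe examples from a general instruct-tuning dataset. The methods are assessed both on how well they suppress EMA and on how they affect performance on benign tasks, giving model providers practical in-training safeguards.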

Link: https://arxiv.org/abs/2508.06249
Authors: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) ℓ₂ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods’ emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods’ impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
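
Intervention (i) can be sketched as a task loss plus a KL penalty that keeps the fine-tuned model's next-token distribution close to a safe reference model. The sketch is an assumption-laden illustration: the function names, the weight `beta`, and the direction of the KL term are hypothetical choices, and the reference probabilities are assumed to be strictly positive.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.
    Assumes every q[i] is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, finetuned_probs, reference_probs, beta):
    """Task loss plus a penalty for drifting away from the reference model."""
    return task_loss + beta * kl_divergence(finetuned_probs, reference_probs)

reference = [0.5, 0.5]  # hypothetical reference-model token probabilities
unchanged = regularized_loss(1.0, [0.5, 0.5], reference, beta=0.1)
drifted = regularized_loss(1.0, [0.9, 0.1], reference, beta=0.1)
```

The penalty is zero when the fine-tuned model matches the reference and grows as the two distributions diverge, so `beta` trades task fit against staying near the safe model.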

[AI-22] Membership Inference Attack with Partial Features

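【Quick Read】: This paper addresses Partial Feature Membership Inference (PFMI): deciding whether an observed subset of a sample's features appeared in the target model's training data when the adversary can see only part of the features. Traditional membership inference attacks assume access to the full feature vector, an assumption that often fails in practice and limits their applicability. To address this, the authors propose MRAD (Memory-guided Reconstruction and Anomaly Detection), a two-stage attack framework. In the first stage, the missing feature values are optimized to minimize the sample's loss, which reconstructs a complete sample. In the second stage, anomaly detection measures how far the reconstructed sample deviates from the training distribution, and this deviation drives the membership decision. Experiments show that MRAD is effective across a range of datasets and compatible with various off-the-shelf anomaly detection techniques; on STL-10 it reaches an AUC of about 0.6 even with 40% of the features missing.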

Link: https://arxiv.org/abs/2508.06244
Authors: Xurun Wang, Guangrui Liu, Xinjie Li, Haoyu He, Lin Yao, Weizhe Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Machine learning models have been shown to be susceptible to membership inference attacks, which can be used to determine whether a given sample appears in the training data. Existing membership inference methods commonly assume that the adversary has full access to the features of the target sample. This assumption, however, does not hold in many real-world scenarios where only partial feature information is available, thereby limiting the applicability of these methods. In this work, we study an inference scenario where the adversary observes only partial features of each sample and aims to infer whether this observed subset was present in the training set of the target model. We define this problem as Partial Feature Membership Inference (PFMI). To address this problem, we propose MRAD (Memory-guided Reconstruction and Anomaly Detection), a two-stage attack framework. In the first stage, MRAD optimizes the unknown feature values to minimize the loss of the sample. In the second stage, it measures the deviation between the reconstructed sample and the training distribution using anomaly detection. Empirical results demonstrate that MRAD is effective across a range of datasets, and maintains compatibility with various off-the-shelf anomaly detection techniques. For example, on STL-10, our attack achieves an AUC of around 0.6 even with 40% of the features missing.
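
The first stage described in the abstract, optimizing unknown feature values to minimize the sample's loss, can be sketched with finite-difference gradient descent on a toy loss. Everything here is a hypothetical stand-in: the function name, the zero initialization, the toy linear model, and the optimizer are assumptions, and the second (anomaly detection) stage is not shown.

```python
def reconstruct_missing(loss_fn, sample, missing_idx, lr=0.1, steps=200, eps=1e-4):
    """Fill missing features by finite-difference gradient descent on the loss.
    Known features stay fixed; missing ones start at 0.0 (an assumption)."""
    x = list(sample)
    for i in missing_idx:
        x[i] = 0.0
    for _ in range(steps):
        for i in missing_idx:
            x_plus = list(x)
            x_plus[i] += eps
            x_minus = list(x)
            x_minus[i] -= eps
            grad = (loss_fn(x_plus) - loss_fn(x_minus)) / (2 * eps)
            x[i] -= lr * grad
    return x

def toy_loss(x):
    """Squared error of a toy linear model with weights [1, 2] and target 5."""
    return (1.0 * x[0] + 2.0 * x[1] - 5.0) ** 2

observed = [1.0, None]  # second feature is unobserved
reconstructed = reconstruct_missing(toy_loss, observed, missing_idx=[1])
```

In an actual attack the loss would come from the target model, and the reconstructed sample would then be scored by an anomaly detector against the training distribution.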

[AI-23] Learning Logical Rules using Minimum Message Length

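【Quick Read】: This paper addresses the unification of probabilistic and logical learning, a key challenge in Artificial Intelligence (AI). The core of the solution is a Bayesian Inductive Logic Programming approach that learns Minimum Message Length (MML) programs by using priors that explicitly favour more general programs and a likelihood that favours more accurate programs. The method performs well on noisy data and significantly outperforms previous approaches, notably those based on Minimum Description Length (MDL), across several domains including game playing and drug design. It is also data-efficient and insensitive to example balance, and can even learn from exclusively positive examples.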

Link: https://arxiv.org/abs/2508.06230
Authors: Ruben Sharma, Sebastijan Dumančić, Ross D. King, Andrew Cropper
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Unifying probabilistic and logical learning is a key challenge in AI. We introduce a Bayesian inductive logic programming approach that learns minimum message length programs from noisy data. Our approach balances hypothesis complexity and data fit through priors, which explicitly favour more general programs, and a likelihood that favours accurate programs. Our experiments on several domains, including game playing and drug design, show that our method significantly outperforms previous methods, notably those that learn minimum description length programs. Our results also show that our approach is data-efficient and insensitive to example balance, including the ability to learn from exclusively positive examples.

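[AI-24] GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

【Quick Read】: This paper addresses key weaknesses in how Multimodal Large Language Models (MLLMs) are evaluated on Geometry Problem Solving (GPS): existing benchmarks overlook auxiliary line construction and lack fine-grained process evaluation, so they cannot properly measure long-step reasoning. The core of the solution is the GeoLaux benchmark, which contains 2,186 geometry problems covering both calculation and proving questions. Problems require an average of 6.51 reasoning steps (up to 24), and 41.8% of them require auxiliary line construction. The authors also design a five-dimensional evaluation strategy covering answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Experiments show that the benchmark captures performance differences among MLLMs on complex geometric reasoning and reveal that a lack of auxiliary line awareness is a key bottleneck, which gives a clear direction for model improvement.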

Link: https://arxiv.org/abs/2508.06226
Authors: Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Geometry problem solving (GPS) requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. This presents a significant challenge for Multimodal Large Language Models (MLLMs). However, existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation, making them insufficient for assessing MLLMs’ long-step reasoning abilities. To bridge these gaps, we present the GeoLaux benchmark, comprising 2,186 geometry problems, incorporating both calculation and proving questions. Notably, the problems require an average of 6.51 reasoning steps, with a maximum of 24 steps, and 41.8% of them need auxiliary line construction. Building on the dataset, we design a novel five-dimensional evaluation strategy assessing answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Extensive experiments on 13 leading MLLMs (including thinking models and non-thinking models) yield three pivotal findings: First, models exhibit substantial performance degradation in extended reasoning steps (nine models demonstrate over 50% performance drop). Second, compared to calculation problems, MLLMs tend to take shortcuts when solving proving problems. Third, models lack auxiliary line awareness, and enhancing this capability proves particularly beneficial for overall geometry reasoning improvement. These findings establish GeoLaux as both a benchmark for evaluating MLLMs’ long-step geometric reasoning with auxiliary lines and a guide for capability advancement. Our dataset and code are included in supplementary materials and will be released.

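[AI-25] Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

【Quick Read】: This paper addresses the reliability of large language models used as automated judges (LLM-as-a-Judge). Existing approaches focus on accuracy and neglect confidence calibration, so models deployed in practice produce unreliable judgments because of overconfidence (the Overconfidence Phenomenon). The key to the solution is TH-Score, a new metric that quantifies the alignment between confidence and accuracy, together with LLM-as-a-Fuser, an ensemble framework that fuses the outputs of multiple models to improve calibration and risk awareness, enabling more reliable and adaptive evaluation pipelines.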

Link: https://arxiv.org/abs/2508.06225
Authors: Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Hongfeng Wang, Lizi Liao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
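
The abstract does not define TH-Score, so the sketch below uses a different, standard measure of confidence-accuracy alignment, the expected calibration error (ECE), purely to illustrate what overconfidence looks like numerically. The function name, bin count, and sample judgments are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical judge: 95% confident on every verdict, right half the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
```

A well-calibrated judge would score close to zero; a large value signals that stated confidence overstates actual correctness.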

[AI-26] Reparameterization Proximal Policy Optimization

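【Quick Read】: This paper addresses the training instability of Reparameterization Policy Gradient (RPG) methods, which is caused by high-variance gradients, with the goal of improving sample efficiency. The key to the solution is to combine the surrogate objective of Proximal Policy Optimization (PPO) with RPG. The authors show that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using Backpropagation Through Time (BPTT), and on that basis propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient algorithm. RPO optimizes a clipped surrogate objective to enable multiple epochs of stable sample reuse, is further stabilized by KL-divergence regularization, and remains compatible with existing variance reduction techniques.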

Link: https://arxiv.org/abs/2508.06214
Authors: Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables multiple epochs of stable sample reuse by optimizing a clipped surrogate objective tailored for RPG, while being further stabilized by Kullback-Leibler (KL) divergence regularization and remaining fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.
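
For readers unfamiliar with the surrogate objective the abstract builds on, here is a minimal sketch of the standard PPO clipped objective. It is the well-known model-free form, not RPO's variant tailored for RPG, and the function name and sample values are illustrative assumptions.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new-to-old policy probability ratio and A the advantage."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)

# Hypothetical batch: one sample with a positive and one with a negative advantage.
objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0])
```

Clipping removes the incentive to push the ratio far from 1, which is what permits several epochs of updates on the same batch without the policy drifting too far.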

[AI-27] Graph Federated Learning for Personalized Privacy Recommendation

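【Quick Read】: This paper addresses a limitation of existing Federated Recommendation Systems (FedRecs), which assume that all users have the same privacy requirements, namely that no user uploads any interaction data, and therefore miss the chance to improve recommendations with publicly available user data. To address this, the paper proposes Graph Federated Learning for Personalized Privacy Recommendation (GFed-PP). Its key ideas are as follows: the interaction data of public users is used to build a user-item interaction graph, from which a user relationship graph is formed, and a lightweight Graph Convolutional Network (GCN) learns a personalized item embedding for each user. To protect privacy, each client trains the user embedding and the scoring function locally, and the federated framework is optimized by initializing item embeddings on clients and aggregating the user relationship graph on the server. The approach balances recommendation performance and privacy protection across different privacy preferences.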

Link: https://arxiv.org/abs/2508.06208
Authors: Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated recommendation systems (FedRecs) have gained significant attention for providing privacy-preserving recommendation services. However, existing FedRecs assume that all users have the same requirements for privacy protection, i.e., they do not upload any data to the server. The approaches overlook the potential to enhance the recommendation service by utilizing publicly available user data. In real-world applications, users can choose to be private or public. Private users’ interaction data is not shared, while public users’ interaction data can be shared. Inspired by the issue, this paper proposes a novel Graph Federated Learning for Personalized Privacy Recommendation (GFed-PP) that adapts to different privacy requirements while improving recommendation performance. GFed-PP incorporates the interaction data of public users to build a user-item interaction graph, which is then used to form a user relationship graph. A lightweight graph convolutional network (GCN) is employed to learn each user’s user-specific personalized item embedding. To protect user privacy, each client learns the user embedding and the scoring function locally. Additionally, GFed-PP achieves optimization of the federated recommendation framework through the initialization of item embedding on clients and the aggregation of the user relationship graph on the server. Experimental results demonstrate that GFed-PP significantly outperforms existing methods for five datasets, offering superior recommendation accuracy without compromising privacy. This framework provides a practical solution for accommodating varying privacy preferences in federated recommendation systems.

[AI-28] Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

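【Quick Read】: This paper addresses the lack of rigour in evaluating pretrained neural network models for molecular chemistry (for example, small molecule drug design), and in particular the insufficiently tested assumption that such models outperform the traditional ECFP molecular fingerprint. The key to the solution is a fair, systematic, and statistically rigorous comparison framework: 25 pretrained models are compared across 25 datasets, and a dedicated hierarchical Bayesian statistical testing model quantifies the performance differences. The conclusion is that nearly all neural models show negligible or no improvement over the baseline ECFP fingerprint, and only the CLAMP model performs statistically significantly better. The findings point to possible evaluation bias in the existing literature, and the paper offers methodological improvements and practical recommendations for future work.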

Link: https://arxiv.org/abs/2508.06199
Authors: Mateusz Praski, Jakub Adamczyk, Wojciech Czech
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.

[AI-29] Differentially Private Federated Clustering with Random Rebalancing

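【Quick Read】: This paper addresses the trade-off between privacy protection and model utility in federated clustering. When client-level differentially private (DP) mechanisms are applied directly, the number of clients in each cluster is uncontrolled, so the privacy noise cannot be averaged out effectively and model performance degrades significantly. The key to the solution is RR-Cluster, a lightweight add-on technique that randomly rebalances cluster assignments to guarantee a minimum number of clients per cluster, which reduces the variance of the privacy noise. The authors analyze the trade-off between the reduced noise variance and the bias that incorrect assignments may introduce, and provide convergence bounds. Experiments show that plugging RR-Cluster into strong federated clustering algorithms significantly improves the privacy/utility trade-off.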

Link: https://arxiv.org/abs/2508.06183
Authors: Xiyuan Yang, Shengyuan Hu, Soyeon Kim, Tian Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages

Abstract: Federated clustering aims to group similar clients into clusters and produce one model for each cluster. Such a personalization approach typically improves model performance compared with training a single model to serve all clients, but can be more vulnerable to privacy leakage. Directly applying client-level differentially private (DP) mechanisms to federated clustering could degrade utility significantly. We identify that such deficiencies are mainly due to the difficulty of averaging privacy noise within each cluster (following standard privacy mechanisms), as the number of clients assigned to each cluster is uncontrolled. To this end, we propose a simple and effective technique, named RR-Cluster, that can be viewed as a lightweight add-on to many federated clustering algorithms. RR-Cluster achieves reduced privacy noise by randomly rebalancing cluster assignments, guaranteeing a minimum number of clients assigned to each cluster. We analyze the tradeoffs between decreased privacy noise variance and potentially increased bias from incorrect assignments and provide convergence bounds for RR-Cluster. Empirically, we demonstrate that RR-Cluster, plugged into strong federated clustering algorithms, results in significantly improved privacy/utility tradeoffs across both synthetic and real-world datasets.
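
The rebalancing idea can be illustrated with a small sketch. This is not the authors' algorithm: the abstract only states that assignments are randomly rebalanced so each cluster keeps a minimum number of clients. The function names `rebalance` and `noise_std_of_mean`, the donor rule (take from the currently largest cluster), and the assumption that Gaussian noise is added once to the summed update before averaging are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def rebalance(assignments, num_clusters, min_size, rng):
    """Randomly move clients from the largest cluster into any cluster
    that has fewer than min_size members (hypothetical rule)."""
    clusters = defaultdict(list)
    for client, c in enumerate(assignments):
        clusters[c].append(client)
    new_assignments = list(assignments)
    for c in range(num_clusters):
        while len(clusters[c]) < min_size:
            donor = max(range(num_clusters), key=lambda k: len(clusters[k]))
            if len(clusters[donor]) <= min_size:
                raise ValueError("not enough clients to satisfy min_size")
            # Pick a random client from the donor cluster and reassign it.
            idx = rng.randrange(len(clusters[donor]))
            client = clusters[donor].pop(idx)
            clusters[c].append(client)
            new_assignments[client] = c
    return new_assignments


def noise_std_of_mean(sigma, cluster_size):
    """Assumption: noise with std sigma is added to the sum of updates,
    so the averaged update carries noise with std sigma / cluster_size."""
    return sigma / cluster_size


rng = random.Random(0)
before = [0] * 9 + [1]          # cluster 1 holds a single client
after = rebalance(before, num_clusters=2, min_size=3, rng=rng)
print(after.count(0), after.count(1))
print(noise_std_of_mean(1.0, 1), noise_std_of_mean(1.0, 3))
```

Under these assumptions, lifting the smallest cluster from one client to three cuts the noise on its averaged update by a factor of three, at the cost of two clients that now train with a cluster they were not originally matched to. That is the variance/bias tradeoff the abstract describes.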

[AI-30] Semantic Item Graph Enhancement for Multimodal Recommendation

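【Quick Read】: This paper addresses the semantic deficiencies of modality-specific item semantic graphs in multimodal recommendation systems: (1) insufficient modeling of collaborative signals among items, and (2) structural distortions introduced by noise in raw modality features, both of which impair user preference learning. The solution has three key parts. First, collaborative signals are extracted from the user-item interaction graph and infused into each modality-specific semantic graph to enhance semantic modeling. Second, a modulus-based personalized embedding perturbation mechanism injects perturbations with modulus-guided personalized intensity to generate contrastive views, and contrastive learning then makes the model more robust to noise and reduces the effect of structural noise in the semantic graphs. Third, a dual representation alignment mechanism first aligns multiple semantic representations through an Anchor-based InfoNCE loss that uses behavior representations as anchors, and then aligns behavior representations with the fused semantics through standard InfoNCE, ensuring representation consistency.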

Link: https://arxiv.org/abs/2508.06154
Authors: Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Dusit Niyato, Zhiqi Shen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Multimodal recommendation systems have attracted increasing attention for their improved performance by leveraging items’ multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features and use them as supplementary structures alongside the user-item interaction graph to enhance user preference learning. However, these semantic graphs suffer from semantic deficiencies, including (1) insufficient modeling of collaborative signals among items and (2) structural distortions introduced by noise in raw modality features, ultimately compromising performance. To address these issues, we first extract collaborative signals from the interaction graph and infuse them into each modality-specific item semantic graph to enhance semantic modeling. Then, we design a modulus-based personalized embedding perturbation mechanism that injects perturbations with modulus-guided personalized intensity into embeddings to generate contrastive views. This enables the model to learn noise-robust representations through contrastive learning, thereby reducing the effect of structural noise in semantic graphs. Besides, we propose a dual representation alignment mechanism that first aligns multiple semantic representations via a designed Anchor-based InfoNCE loss using behavior representations as anchors, and then aligns behavior representations with the fused semantics by standard InfoNCE, to ensure representation consistency. Extensive experiments on four benchmark datasets validate the effectiveness of our framework.

[AI-31] Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications

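【Quick Read】: This paper addresses the accuracy and reliability of large language models (LLMs) when providing information in healthcare, particularly on pharmaceutical contraindications. Because this domain demands highly precise information, LLMs without specialized medical knowledge are prone to misleading answers. The key to the solution is a Retrieval Augmented Generation (RAG) framework that integrates Drug Utilization Review (DUR) data from public databases and uses Langchain to implement hybrid retrieval with re-ranking. This substantially improves accuracy in specific scenarios (age groups, pregnancy, and concomitant drug use), raising it from a baseline of 0.49 to 0.57 up to 0.87 to 0.94 and reducing uncertainty in prescription decisions.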

Link: https://arxiv.org/abs/2508.06145
Authors: Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI’s GPT-4o-mini as the base model, and the text-embedding-3-small model for embeddings, our approach integrates Langchain to orchestrate a hybrid retrieval system with re-ranking. This system leverages Drug Utilization Review (DUR) data from public databases, focusing on contraindications for specific age groups, pregnancy, and concomitant drug use. The dataset includes 300 question-answer pairs across three categories, with baseline model accuracy ranging from 0.49 to 0.57. Post-integration of the RAG pipeline, we observed a significant improvement in model accuracy, achieving rates of 0.94, 0.87, and 0.89 for contraindications related to age groups, pregnancy, and concomitant drug use, respectively. The results indicate that augmenting LLMs with a RAG framework can substantially reduce uncertainty in prescription and drug intake decisions by providing more precise and reliable drug contraindication information.

[AI-32] Study of Robust Features in Formulating Guidance for Heuristic Algorithms for Solving the Vehicle Routing Problem

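【Quick Read】: This paper addresses the difficulty of solving the Vehicle Routing Problem (VRP) efficiently, where traditional metaheuristics depend on hand-crafted designs and are limited in performance. The key to the solution is to use explainable AI: multiple classifier models predict the quality of VRP solutions, and a sensitivity analysis identifies the features that drive solution quality. The study finds that although feature importance varies across scenarios, certain features are consistently strong predictors. It then proposes a unified framework for comparing feature impact across scenarios, providing a feature-importance-based guidance mechanism for metaheuristics and making algorithm design more automated.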

Link: https://arxiv.org/abs/2508.06129
Authors: Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 14 figures

Abstract: The Vehicle Routing Problem (VRP) is a complex optimization problem with numerous real-world applications, mostly solved using metaheuristic algorithms due to its $\mathcal{NP}$-hard nature. Traditionally, these metaheuristics rely on human-crafted designs developed through empirical studies. However, recent research shows that machine learning methods can be used to learn the structural characteristics of solutions in combinatorial optimization, thereby aiding in designing more efficient algorithms, particularly for solving VRP. Building on this advancement, this study extends the previous research by conducting a sensitivity analysis using multiple classifier models that are capable of predicting the quality of VRP solutions. By leveraging explainable AI, this research extends the understanding of how these models make decisions. Our findings indicate that while feature importance varies, certain features consistently emerge as strong predictors. Furthermore, we propose a unified framework capable of ranking feature impact across different scenarios to illustrate this finding. These insights highlight the potential of feature importance analysis as a foundation for developing a guidance mechanism for metaheuristic algorithms that solve the VRP.

[AI-33] SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

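【Quick Read】: This paper addresses the problem that current evaluation methods for foundation models demand extensive domain expertise and do not scale as models rapidly evolve. Existing approaches struggle to balance scalability, open-endedness, and objectivity, which limits efficient measurement of the capabilities and risks of generative AI. The key to the solution is SKATE, an automated evaluation framework in which large language models (LLMs) compete against each other: each model acts as both task-setter and solver in a game-like contest, generating verifiable tasks that highlight its own strengths and expose opponents' weaknesses. The design is fully automated and needs no human annotation or domain knowledge, it uses verifiable tasks instead of LLM judges to keep scoring objective, and it relies on LLMs creatively posing challenges to make evaluation open-ended and scalable.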

Link: https://arxiv.org/abs/2508.06111
Authors: Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages and appendices

Abstract:Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others’ weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
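
What makes a code-output-prediction (COP) challenge verifiable is that the ground truth comes from running the code, not from a judge. The sketch below shows only that checking step. The helper names `run_and_capture` and `score_prediction` are hypothetical, the paper's actual harness is not described in the abstract, and executing untrusted model-written code with `exec` would need sandboxing in any real setup.

```python
import contextlib
import io


def run_and_capture(code: str) -> str:
    """Execute a snippet and return what it printed (no sandboxing here)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def score_prediction(code: str, predicted_output: str) -> bool:
    """A solver scores only if its prediction matches the real output exactly."""
    return run_and_capture(code) == predicted_output.strip()


# A task as a setter model might pose it, with two solver answers.
task = "print(sum(i * i for i in range(4)))"
print(score_prediction(task, "14"))
print(score_prediction(task, "13"))
```

Because each outcome is a plain win or loss on an executable check, the results can feed a rating system such as the TrueSkill-based ranking the abstract mentions without any LLM judge in the loop.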

[AI-34] PanelTR: Zero-Shot Table Reasoning Framework Through Multi-Agent Scientific Discussion IJCNN2025

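【Quick Read】: This paper addresses the dependence of table reasoning on annotated data or complex data augmentation, as well as the tendency of large language models (LLMs) to underperform simple supervised models on this task. The key to the solution is the PanelTR framework, which applies a structured scientific approach driven by five scientist personas: LLM agent scientists carry out individual investigations, self-review, and collaborative peer-review discussions. This enables semantic-level transfer without data augmentation or parametric optimization and, in the zero-shot setting, markedly improves performance on tasks such as tabular question answering and fact verification.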

Link: https://arxiv.org/abs/2508.06110
Authors: Yiran Rex Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at IJCNN 2025

Abstract:Table reasoning, including tabular QA and fact verification, often depends on annotated data or complex data augmentation, limiting flexibility and generalization. LLMs, despite their versatility, often underperform compared to simple supervised models. To approach these issues, we introduce PanelTR, a framework utilizing LLM agent scientists for robust table reasoning through a structured scientific approach. PanelTR’s workflow involves agent scientists conducting individual investigations, engaging in self-review, and participating in collaborative peer-review discussions. This process, driven by five scientist personas, enables semantic-level transfer without relying on data augmentation or parametric optimization. Experiments across four benchmarks show that PanelTR outperforms vanilla LLMs and rivals fully supervised models, all while remaining independent of training data. Our findings indicate that structured scientific methodology can effectively handle complex tasks beyond table reasoning with flexible semantic understanding in a zero-shot context.

[AI-35] GCHR: Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning

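【Quick Read】: This paper addresses low sample efficiency in goal-conditioned reinforcement learning (GCRL) under sparse rewards. Existing methods such as Hindsight Experience Replay (HER) improve learning by relabeling the goals of collected trajectories, but the authors argue that relabeling alone does not fully exploit the experience available to off-policy GCRL methods. The key to the solution is a new regularization technique, Hindsight Goal-conditioned Regularization (HGR), which generates action regularization priors from hindsight goals. Combined with Hindsight Self-Imitation Regularization (HSR), it maximizes experience utilization, reuses samples far more efficiently, and achieves better performance.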

Link: https://arxiv.org/abs/2508.06108
Authors: Xing Lei, Wenyan Yang, Kaiqiang Ke, Shentao Yang, Xuetao Zhang, Joni Pajarinen, Donglin Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performances, which we empirically demonstrate on a suite of navigation and manipulation tasks.
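
For readers unfamiliar with the relabeling step that the abstract builds on, here is a minimal sketch of HER-style goal relabeling using the final achieved state as the hindsight goal. It illustrates only the baseline relabeling, not the proposed HGR or HSR regularizers, whose exact form the abstract does not give. The `Transition` type, the grid-world states, and the 0/-1 sparse reward are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

State = Tuple[int, int]


@dataclass(frozen=True)
class Transition:
    state: State
    action: str
    next_state: State
    goal: State
    reward: float


def sparse_reward(achieved: State, goal: State) -> float:
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


def relabel_with_final_goal(trajectory: List[Transition]) -> List[Transition]:
    """Pretend the state actually reached at the end was the goal all along."""
    hindsight_goal = trajectory[-1].next_state
    return [
        replace(t, goal=hindsight_goal,
                reward=sparse_reward(t.next_state, hindsight_goal))
        for t in trajectory
    ]


# A failed episode: the agent wanted (5, 5) but only reached (2, 0).
original_goal = (5, 5)
episode = [
    Transition((0, 0), "right", (1, 0), original_goal, -1.0),
    Transition((1, 0), "right", (2, 0), original_goal, -1.0),
]
for t in relabel_with_final_goal(episode):
    print(t.goal, t.reward)
```

After relabeling, the last transition carries a non-negative reward, so an episode that failed with respect to the original goal still yields a learning signal. The paper's argument is that this relabeling alone leaves further value in the replayed experience unused.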

[AI-36] MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

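【Quick Read】: This paper addresses the slow inference of text-to-audio generation (TTA) systems, which severely limits the practical use of existing models. The key to the solution is MeanAudio, a new MeanFlow-based model with three core ideas: (1) it regresses the average velocity field during training, so generation can map directly from the start to the end of the flow trajectory in a single step, greatly speeding up inference; (2) it builds classifier-free guidance (CFG) into the training target, so guided sampling incurs no extra computational cost; and (3) it uses an instantaneous-to-mean curriculum with flow field mix-up to stabilize training and improve generation quality. Experiments show that MeanAudio reaches a real time factor (RTF) of 0.013 in single-step generation, a 100x speedup over state-of-the-art diffusion-based TTA systems, while remaining coherent in multi-step generation.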

Link: https://arxiv.org/abs/2508.06098
Authors: Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures

Abstract: Recent developments in diffusion- and flow-based models have significantly advanced Text-to-Audio Generation (TTA). While achieving great synthesis quality and controllability, current TTA systems still suffer from slow inference speed, which significantly limits their practical applicability. This paper presents MeanAudio, a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. Built on a Flux-style latent transformer, MeanAudio regresses the average velocity field during training, enabling fast generation by mapping directly from the start to the endpoint of the flow trajectory. By incorporating classifier-free guidance (CFG) into the training target, MeanAudio incurs no additional cost in the guided sampling process. To further stabilize training, we propose an instantaneous-to-mean curriculum with flow field mix-up, which encourages the model to first learn the foundational instantaneous dynamics, and then gradually adapt to mean flows. This strategy proves critical for enhancing training efficiency and generation quality. Experimental results demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also demonstrates strong performance in multi-step generation, enabling smooth and coherent transitions across successive synthesis steps.
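
The phrase "regresses the average velocity field" can be made concrete with a short derivation. The abstract gives no formulas, so the notation below is an assumption: it follows the usual flow-matching convention in which $z_1$ is noise at time $t = 1$ and $z_0$ is the generated sample at $t = 0$, with $v$ the instantaneous velocity and $u$ the average velocity that the network would learn.

```latex
% Average velocity over the interval [r, t] of a flow with instantaneous velocity v
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Displacement over [r, t] is (t - r) times the average velocity,
% so one network evaluation moves the state from time t to time r
z_r \;=\; z_t - (t - r)\, u(z_t, r, t)

% Single-step generation: set r = 0, t = 1
z_0 \;=\; z_1 - u(z_1, 0, 1)
```

Under this reading, a model that predicts $u$ replaces many small integration steps of $v$ with one evaluation, which is consistent with the reported single-step speedup.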

[AI-37] Bounding Distributional Shifts in World Modeling through Novelty Detection

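【Quick Read】: This paper addresses the sensitivity of model-based planning algorithms to the training quality of the world model: when the training data does not sufficiently cover the action and state space, current methods tend to diverge at inference time. The key to the solution is to use a variational autoencoder (VAE) as a novelty detector inside a model predictive control (MPC) policy loop, so that action trajectories that would push the learned model away from the training data distribution are identified and avoided, improving data efficiency.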

Link: https://arxiv.org/abs/2508.06096
Authors: Eric Jing, Abdeslam Boularias
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures

Abstract:Recent work on visual world models shows significant promise in latent state dynamics obtained from pre-trained image backbones. However, most of the current approaches are sensitive to training quality, requiring near-complete coverage of the action and state space during training to prevent divergence during inference. To make a model-based planning algorithm more robust to the quality of the learned world model, we propose in this work to use a variational autoencoder as a novelty detector to ensure that proposed action trajectories during planning do not cause the learned model to deviate from the training data distribution. To evaluate the effectiveness of this approach, a series of experiments in challenging simulated robot environments was carried out, with the proposed method incorporated into a model-predictive control policy loop extending the DINO-WM architecture. The results clearly show that the proposed method improves over state-of-the-art solutions in terms of data efficiency.

[AI-38] Aggregate-Combine-Readout GNNs Are More Expressive Than Logic C2

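【Quick Read】: This paper resolves an open problem about the logical expressiveness of graph neural networks (GNNs): whether the full logic C2 characterizes the logical expressiveness of aggregate-combine-readout GNNs, that is, GNNs with a readout mechanism. Barceló et al. (2020) showed that graded modal logic, or a guarded fragment of C2, characterizes the expressiveness of GNNs with only aggregate and combine steps, but whether full C2 characterizes GNNs with readout layers had remained open. The key result is a proof that the logical expressiveness of GNNs with readout strictly exceeds that of C2, over both undirected and directed graphs. The conclusion clarifies the relationship between GNNs and logical languages and also yields new theoretical insights into the expressive power of infinitary logics.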

Link: https://arxiv.org/abs/2508.06091
Authors: Stan P Hauke, Przemysław Andrzej Wałęga
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract: In recent years, there has been growing interest in understanding the expressive power of graph neural networks (GNNs) by relating them to logical languages. This research was initiated by an influential result of Barceló et al. (2020), who showed that graded modal logic (or a guarded fragment of the logic C2) characterises the logical expressiveness of aggregate-combine GNNs. As a "challenging open problem" they left the question of whether full C2 characterises the logical expressiveness of aggregate-combine-readout GNNs. This question has remained unresolved despite several attempts. In this paper, we solve the above open problem by proving that the logical expressiveness of aggregate-combine-readout GNNs strictly exceeds that of C2. This result holds over both undirected and directed graphs. Beyond its implications for GNNs, our work also leads to purely logical insights on the expressive power of infinitary logics.

[AI-39] ME3-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception

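【Quick Read】: This paper addresses the challenges autonomous driving systems face in perceiving complex environments and making real-time decisions, in particular the error propagation and coordination issues that limit traditional modular approaches and the computational bottlenecks of end-to-end learning. The key to the solution is ME³-BEV, a new architecture based on deep reinforcement learning (DRL) that combines bird's-eye view (BEV) perception with the spatio-temporal feature extraction of the Mamba architecture. Its Mamba-BEV model encodes the vehicle's surroundings and road features in a unified coordinate system and captures long-range temporal dependencies, which improves decision efficiency and accuracy in dynamic urban driving scenarios.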

Link: https://arxiv.org/abs/2508.06074
Authors: Siyi Lu, Run Liu, Dongsheng Yang, Lei He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract: Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird’s-eye view (BEV) perception for enhanced real-time decision-making. We introduce the Mamba-BEV model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the ME³-BEV framework, which utilizes the Mamba-BEV model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that ME³-BEV outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.

[AI-40] Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

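【Quick Read】: This paper addresses the lack of theoretical guarantees for the generalization of deep temporal models such as Temporal Convolutional Networks (TCNs), despite their strong predictive performance on sequential data. Its core contribution is the first non-vacuous, architecture-aware generalization bounds for deep temporal architectures, together with a delayed-feedback blocking mechanism for handling sequential dependence. The mechanism transforms dependent samples into approximately independent ones while discarding only $O(1/\log N)$ of the data, so the bound scales as $\sqrt{D}$ in the network depth $D$ rather than exponentially; according to the authors, doubling the depth then requires roughly four times as much training data to maintain the same generalization performance. The findings show that temporal dependence can improve learning efficiency under a fixed information budget, challenging the intuition that dependence is always harmful, and they also expose a gap between the theory and the observed convergence rates that points to directions for future research.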

Link: https://arxiv.org/abs/2508.06066
Authors: Barak Gahtan, Alex M. Bronstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Deep temporal architectures such as Temporal Convolutional Networks (TCNs) achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap by providing both the first non-vacuous, architecture-aware generalization bounds for deep temporal models and a principled evaluation methodology. For exponentially $\beta$-mixing sequences, we derive bounds scaling as $O\!\Bigl(R\,\sqrt{\tfrac{D\,p\,n\,\log N}{N}}\Bigr)$, where $D$ is network depth, $p$ kernel size, $n$ input dimension, and $R$ weight norm. Our delayed-feedback blocking mechanism transforms dependent samples into effectively independent ones while discarding only $O(1/\log N)$ of the data, yielding $\sqrt{D}$ scaling instead of exponential, implying that doubling depth requires approximately quadrupling the training data. We also introduce a fair-comparison methodology that fixes the effective sample size to isolate the effect of temporal structure from information content. Under $N_{\text{eff}}=2{,}000$, strongly dependent sequences ($\rho=0.8$) exhibit $\approx 76\%$ smaller generalization gaps than weakly dependent ones ($\rho=0.2$), challenging the intuition that dependence is purely detrimental. Yet convergence rates diverge from theory: weak dependencies follow $N_{\text{eff}}^{-1.21}$ scaling and strong dependencies follow $N_{\text{eff}}^{-0.89}$, both steeper than the predicted $N^{-0.5}$. These findings reveal that temporal dependence can enhance learning under fixed information budgets, while highlighting gaps between theory and practice that motivate future research.

[AI-41] A Generic Complete Anytime Beam Search for Optimal Decision Tree

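【Quick Read】: This paper addresses the poor anytime behavior of optimal decision tree search: when computation or time is limited, existing exact algorithms (based on mixed-integer linear programming (MILP), constraint programming (CP), Boolean satisfiability (SAT), or dynamic programming) struggle to produce high-quality trees quickly, fundamentally because they explore the search space unevenly. The key to the solution is CA-DL8.5, a generic, complete, anytime beam search algorithm that extends the DL8.5 framework and integrates various heuristics and relaxation mechanisms through a modular design. Its core idea is a restart-based beam search that gradually relaxes the pruning criteria across iterations, which improves solution quality over time while preserving completeness and optimality guarantees. Experiments show that CA-DL8.5 with the LDS (limited discrepancy) heuristic performs best on standard classification benchmarks, outperforming the other variants and the Blossom algorithm.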

Link: https://arxiv.org/abs/2508.06064
Authors: Harold Silvère Kiossou, Siegfried Nijssen, Pierre Schaus
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Finding an optimal decision tree that minimizes classification error is known to be NP-hard. While exact algorithms based on MILP, CP, SAT, or dynamic programming guarantee optimality, they often suffer from poor anytime behavior – meaning they struggle to find high-quality decision trees quickly when the search is stopped before completion – due to unbalanced search space exploration. To address this, several anytime extensions of exact methods have been proposed, such as LDS-DL8.5, Top-k-DL8.5, and Blossom, but they have not been systematically compared, making it difficult to assess their relative effectiveness. In this paper, we propose CA-DL8.5, a generic, complete, and anytime beam search algorithm that extends the DL8.5 framework and unifies some existing anytime strategies. In particular, CA-DL8.5 generalizes previous approaches LDS-DL8.5 and Top-k-DL8.5, by allowing the integration of various heuristics and relaxation mechanisms through a modular design. The algorithm reuses DL8.5’s efficient branch-and-bound pruning and trie-based caching, combined with a restart-based beam search that gradually relaxes pruning criteria to improve solution quality over time. Our contributions are twofold: (1) We introduce this new generic framework for exact and anytime decision tree learning, enabling the incorporation of diverse heuristics and search strategies; (2) We conduct a rigorous empirical comparison of several instantiations of CA-DL8.5 – based on Purity, Gain, Discrepancy, and Top-k heuristics – using an anytime evaluation metric called the primal gap integral. Experimental results on standard classification benchmarks show that CA-DL8.5 using LDS (limited discrepancy) consistently provides the best anytime performance, outperforming both other CA-DL8.5 variants and the Blossom algorithm while maintaining completeness and optimality guarantees.
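
The abstract names the primal gap integral as its anytime metric without defining it. The sketch below follows the definition commonly used in the optimization literature, where the gap is 1 before any solution is found and is then integrated as a step function over time. The paper's exact variant may differ, and the function names and example numbers are illustrative assumptions.

```python
def primal_gap(value: float, best: float) -> float:
    """Relative gap in [0, 1] between an incumbent and the best known objective."""
    if value == best:
        return 0.0
    if value * best < 0:
        return 1.0
    return abs(value - best) / max(abs(value), abs(best))


def primal_gap_integral(events, best: float, time_limit: float) -> float:
    """Integrate the gap over [0, time_limit].

    events: (time, objective) pairs for each new incumbent, sorted by time.
    Lower is better: it rewards finding good solutions early.
    """
    total, prev_time, gap = 0.0, 0.0, 1.0   # gap is 1 until a solution exists
    for time, value in events:
        if time > time_limit:
            break
        total += gap * (time - prev_time)
        prev_time, gap = time, primal_gap(value, best)
    return total + gap * (time_limit - prev_time)


# Hypothetical run: error 20 found at t=1s, optimal error 10 found at t=3s.
print(primal_gap_integral([(1.0, 20.0), (3.0, 10.0)], best=10.0, time_limit=5.0))
```

Two solvers that both reach the optimum can still score differently under this metric, because the one that finds good trees sooner accumulates less area.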

[AI-42] Don't Forget Imagination!

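【Quick Read】: This paper addresses a limitation in the reasoning ability of current artificial intelligence (AI): because cognitive imagination is not modeled, reasoning lacks consistent semantic context and interpretability. The core issue is that, unlike humans, existing AI systems cannot rely on an imagined semantic context to check that their reasoning remains sound, which results in "blind" reasoning. The key to the solution is a new approach to mathematical modeling called semantic models, which learn probabilistic causal relationships to simulate cognitive imagination, ensure the consistency of imaginary contexts, and adopt a glass-box approach in which the whole semantic context can be explicitly manipulated and understood as a unified system connected by causal relations.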

Link: https://arxiv.org/abs/2508.06062
Authors: Evgenii E. Vityaev, Andrei Mantsivoda
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 14 pages, 2 figures

Abstract: Cognitive imagination is a type of imagination that plays a key role in human thinking. It is not a "picture-in-the-head" imagination. It is a faculty to mentally visualize coherent and holistic systems of concepts and causal links that serve as semantic contexts for reasoning, decision making and prediction. Our position is that the role of cognitive imagination is still greatly underestimated, and this creates numerous problems and diminishes the current capabilities of AI. For instance, when reasoning, humans rely on imaginary contexts to retrieve background information. They also constantly return to the context for semantic verification that their reasoning is still reasonable. Thus, reasoning without imagination is blind. This paper is a call for greater attention to cognitive imagination as the next promising breakthrough in artificial intelligence. As an instrument for simulating cognitive imagination, we propose semantic models – a new approach to mathematical models that can learn, like neural networks, and are based on probabilistic causal relationships. Semantic models can simulate cognitive imagination because they ensure the consistency of imaginary contexts and implement a glass-box approach that allows the context to be manipulated as a holistic and coherent system of interrelated facts glued together with causal relations.

[AI-43] LLMs for Resource Allocation: A Participatory Budgeting Approach to Inferring Preferences ECAI2025

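【Quick Read】: This paper addresses the underexplored ability of large language models (LLMs) to perform structured resource allocation in complex decision-making tasks, along with the difficulty of measuring their reasoning with existing benchmarks, which suffer from data contamination and are static. The key to the solution is a dual-purpose framework that uses Participatory Budgeting (PB) both as a practical setting for testing LLM resource allocation and as an adaptive benchmark for evaluating reasoning. LLMs select project subsets under a budget constraint using three prompting strategies (greedy selection, direct optimization, and a hill-climbing-inspired refinement), and their allocations are compared with a utility-maximizing oracle. The framework also tests whether LLMs can infer structured preferences from natural-language voter input or metadata, which measures their ability to extract preferences from unstructured input. The results show that prompt design is critical to LLM performance and that LLMs hold promise for mechanism design tasks.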

Link: https://arxiv.org/abs/2508.06060
Authors: Sankarshan Damle, Boi Faltings
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published in the Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025)

Abstract:Large Language Models (LLMs) are increasingly expected to handle complex decision-making tasks, yet their ability to perform structured resource allocation remains underexplored. Evaluating their reasoning is also difficult due to data contamination and the static nature of existing benchmarks. We present a dual-purpose framework leveraging Participatory Budgeting (PB) both as (i) a practical setting for LLM-based resource allocation and (ii) an adaptive benchmark for evaluating their reasoning capabilities. We task LLMs with selecting project subsets under feasibility (e.g., budget) constraints via three prompting strategies: greedy selection, direct optimization, and a hill-climbing-inspired refinement. We benchmark LLMs’ allocations against a utility-maximizing oracle. Interestingly, we also test whether LLMs can infer structured preferences from natural-language voter input or metadata, without explicit votes. By comparing allocations based on inferred preferences to those from ground-truth votes, we evaluate LLMs’ ability to extract preferences from open-ended input. Our results underscore the role of prompt design and show that LLMs hold promise for mechanism design with unstructured inputs.
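
To make the benchmark setting concrete, the sketch below contrasts a classical greedy rule with a brute-force utility-maximizing oracle on a tiny participatory budgeting instance. It is only an illustration of the budget constraint and the oracle comparison: in the paper the greedy strategy is a prompting strategy for an LLM, not this code, and the project names, costs, utilities, and the utility-per-cost ranking are all made-up assumptions.

```python
from itertools import combinations


def total_utility(projects, chosen):
    """Sum the utilities of the chosen project names."""
    return sum(projects[name][1] for name in chosen)


def greedy_allocation(projects, budget):
    """Pick projects by utility-per-cost ratio while they still fit the budget.

    projects: dict mapping name -> (cost, utility), with cost > 0.
    """
    chosen, remaining = [], budget
    ranked = sorted(projects.items(),
                    key=lambda item: item[1][1] / item[1][0], reverse=True)
    for name, (cost, _utility) in ranked:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen


def oracle_allocation(projects, budget):
    """Exhaustively find the feasible subset with maximum utility (small inputs only)."""
    best, best_utility = [], -1
    names = list(projects)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(projects[name][0] for name in subset)
            utility = total_utility(projects, subset)
            if cost <= budget and utility > best_utility:
                best, best_utility = list(subset), utility
    return best


# Hypothetical instance: name -> (cost, utility), budget of 100.
projects = {"A": (60, 90), "B": (50, 70), "C": (50, 65)}
greedy = greedy_allocation(projects, 100)
oracle = oracle_allocation(projects, 100)
print(greedy, total_utility(projects, greedy))
print(oracle, total_utility(projects, oracle))
```

On this instance the greedy rule commits to the single most efficient project and then cannot afford anything else, while the oracle funds the two cheaper projects for a higher total. An allocation's shortfall against the oracle is the kind of quantity the benchmark compares.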

[AI-44] Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning

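【Quick Read】: This paper addresses the weakness of large language models (LLMs) on dynamic, long-horizon tasks, especially real-time strategy games in partially observable environments such as StarCraft II (SC2), where LLMs struggle to manage resource constraints and adapt to changing battlefield situations. The key to the solution is HIMA (Hierarchical Imitation Multi-Agent), a hierarchical multi-agent framework in which a meta-controller called the Strategic Planner (SP) coordinates several specialized imitation-learning agents. Each agent learns a distinct strategy from expert demonstrations (such as aerial support or defensive maneuvers) and produces structured multi-step action sequences; the SP then merges these local proposals into a single adaptive plan that keeps local decisions consistent with the long-term strategy, improving strategic clarity, adaptability, and computational efficiency.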

Link: https://arxiv.org/abs/2508.06042
Authors: Daechul Ahn, San Kim, Jonghyun Choi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: COLM 2025

Abstract: Large Language Models (LLMs) have recently demonstrated impressive action sequence prediction capabilities but often struggle with dynamic, long-horizon tasks such as real-time strategy games. In a game such as StarCraft II (SC2), agents need to manage resource constraints and adapt to evolving battlefield situations in a partially observable environment. This often overwhelms existing LLM-based approaches. To address these challenges, we propose a hierarchical multi-agent framework that employs specialized imitation learning agents under a meta-controller called Strategic Planner (SP). From expert demonstrations, each specialized agent learns a distinctive strategy, such as aerial support or defensive maneuvers, and produces coherent, structured multistep action sequences. The SP then orchestrates these proposals into a single, environmentally adaptive plan that ensures local decisions align with long-term strategies. We call this HIMA (Hierarchical Imitation Multi-Agent). We also present TEXTSCII-ALL, a comprehensive SC2 testbed that encompasses all race match combinations in SC2. Our empirical results show that HIMA outperforms state-of-the-art methods in strategic clarity, adaptability, and computational efficiency, underscoring the potential of combining specialized imitation modules with meta-level orchestration to develop more robust, general-purpose AI agents.

[AI-45] DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

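【Quick Read】: This paper addresses how to handle varying runtime constraints, such as latency and accuracy, when running large language models (LLMs) on device. Multi-scale quantization enables memory-efficient runtime model adaptation by overlaying model variants quantized to different bitwidths, but it offers no precise way to configure the model for a target precision or latency. The proposed solution, DP-LLM, builds on the observation that the sensitivity of each layer changes dynamically across decoding iterations: a lightweight error estimator and learned thresholds let each linear layer choose its bitwidth at runtime based on input values, yielding a better performance-latency trade-off.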

Link: https://arxiv.org/abs/2508.06041
Authors: Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.

[AI-46] Adaptive Heterogeneous Graph Neural Networks: Bridging Heterophily and Heterogeneity CIKM2025

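[Quick read]: This paper addresses the performance degradation of current graph neural networks when heterogeneous graphs (HGs) exhibit heterophily. In high-heterophily settings in particular, existing methods fail because they overlook how heterophily distributions vary across hops and meta-paths. The key to the solution is the Adaptive Heterogeneous Graph Neural Network (AHGNN), which introduces a heterophily-aware convolution that explicitly models the heterophily distributions specific to each hop and meta-path, and uses a coarse-to-fine attention mechanism to fuse messages from multiple semantic spaces, filtering out noise and emphasizing informative signals. This improves representation learning on real-world heterogeneous graphs, particularly in high-heterophily situations.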

Link: https://arxiv.org/abs/2508.06034
Authors: Qin Chen, Guojie Song
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CIKM 2025

Click to view abstract

Abstract:Heterogeneous graphs (HGs) are common in real-world scenarios and often exhibit heterophily. However, most existing studies focus on either heterogeneity or heterophily in isolation, overlooking the prevalence of heterophilic HGs in practical applications. This oversight degrades their performance. In this work, we first identify two main challenges in modeling heterophilic HGs: (1) varying heterophily distributions across hops and meta-paths; (2) the intricate and often heterophily-driven diversity of semantic information across different meta-paths. Then, we propose the Adaptive Heterogeneous Graph Neural Network (AHGNN) to tackle these challenges. AHGNN employs a heterophily-aware convolution that accounts for heterophily distributions specific to both hops and meta-paths. It then integrates messages from diverse semantic spaces using a coarse-to-fine attention mechanism, which filters out noise and emphasizes informative signals. Experiments on seven real-world graphs and twenty baselines demonstrate the superior performance of AHGNN, particularly in high-heterophily situations.
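
To make the first challenge concrete, the sketch below measures how label agreement changes with hop distance on a toy graph. It uses the common homophily ratio (the fraction of node pairs at a given distance that share a label) as a stand-in. The function names and the toy graph are assumptions, and meta-paths are not modeled here; this is not AHGNN itself.

```python
from collections import deque

def hop_pairs(adj, k):
    """Return unordered node pairs whose shortest-path distance is exactly k."""
    pairs = set()
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue  # no need to expand beyond k hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if d == k:
                pairs.add(frozenset((src, v)))
    return pairs

def homophily_ratio(adj, labels, k=1):
    """Fraction of k-hop node pairs that share the same label."""
    pairs = hop_pairs(adj, k)
    if not pairs:
        return 0.0
    same = sum(1 for p in pairs if len({labels[n] for n in p}) == 1)
    return same / len(pairs)

# Toy path graph a-b-c-d with alternating labels (hypothetical data).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
labels = {"a": 0, "b": 1, "c": 0, "d": 1}
one_hop = homophily_ratio(adj, labels, k=1)
two_hop = homophily_ratio(adj, labels, k=2)
```

On this toy graph every direct neighbor has a different label while every 2-hop neighbor has the same one, so a convolution that treats all hops alike would mix very different signals.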

[AI-47] Hand by Hand: LLM Driving EMS Assistant for Operational Skill Learning IJCAI2025

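[Quick read]: This paper addresses the lack of bodily feedback in large language model (LLM)-supported training of operational skills: current LLM training assistants rely mainly on textual feedback and do not integrate kinesthetic feedback, a key learning modality. The key to the solution is an "Align-Analyze-Adjust" strategy and a tool called FlightAxis, which combines an LLM with Electrical Muscle Stimulation (EMS) to guide forearm movements during simulated flight training. The approach significantly improved task completion efficiency and also increased trainees' awareness of operational errors and their engagement in training, supporting the feasibility and effectiveness of LLM-assisted training with kinesthetic feedback for operational skill acquisition.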

Link: https://arxiv.org/abs/2508.06000
Authors: Wei Xiang, Ziyue Lei, Haoyuan Che, Fangyuan Ye, Xueting Wu, Lingyun Sun
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI 2025

Click to view abstract

Abstract:Operational skill learning, inherently physical and reliant on hands-on practice and kinesthetic feedback, has yet to be effectively replicated in large language model (LLM)-supported training. Current LLM training assistants primarily generate customized textual feedback, neglecting the crucial kinesthetic modality. This gap derives from the textual and uncertain nature of LLMs, compounded by concerns about user acceptance of LLM-driven body control. To bridge this gap and realize the potential of collaborative human-LLM action, this work explores human experience of LLM-driven kinesthetic assistance. Specifically, we introduced an “Align-Analyze-Adjust” strategy and developed FlightAxis, a tool that integrates LLM with Electrical Muscle Stimulation (EMS) for flight skill acquisition, a representative operational skill domain. FlightAxis learns flight skills from manuals and guides forearm movements during simulated flight tasks. Our results demonstrate high user acceptance of LLM-mediated body control and significantly reduced task completion times. Crucially, trainees reported that this kinesthetic assistance enhanced their awareness of operation flaws and fostered increased engagement in the training process, rather than relieving perceived load. This work demonstrated the potential of kinesthetic LLM training in operational skill acquisition.

[AI-48] Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

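[Quick read]: This paper addresses the limited efficiency and accuracy of medical multimodal decision-making caused by the weak collaboration capabilities of vision-language models (VLMs). Existing work focuses mostly on text-only tasks, and extending multi-agent collaboration to multimodal scenarios runs into several obstacles: poorly chosen model combinations can amplify erroneous interpretations, and VLMs are weaker at instruction following and self-reflection, which limits their use in cooperative workflows. The key to the solution is MedOrch, a multi-agent collaboration framework in which a large language model (LLM) acts as a mediator agent that guides multiple VLM-based expert agents to exchange and reflect on their outputs. The framework builds a heterogeneous ensemble from open-source general-purpose and domain-specific VLMs and improves overall performance without additional training, supporting the effectiveness of mediator-guided multi-agent collaboration for medical multimodal intelligence.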

Link: https://arxiv.org/abs/2508.05996
Authors: Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures

Click to view abstract

Abstract:Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, yet its extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify an erroneous outcome interpretation. VLMs in general are less capable in instruction following and, importantly, self-reflection, compared to large language models (LLMs) of comparable sizes. This disparity largely constrains VLMs’ ability in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that the collaboration within distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence. Our code will be made publicly available.

[AI-49] Learning by Teaching: Engaging Students as Instructors of Large Language Models in Computer Science Education

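[Quick read]: This paper addresses the passive learning and over-reliance that arise when large language models (LLMs) are used as virtual tutors in computer science (CS) education. The key to the solution is an inverted pedagogical paradigm in which students act as instructors and actively teach an LLM to solve specific problems. To make this work, the authors design questions with engineered "knowledge gaps" that only a student can fill, and they develop a system called Socrates that deploys the method with low overhead. Empirical results show that this active-learning approach significantly improves student performance, supporting the framework's value for deepening engagement and mastery.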

Link: https://arxiv.org/abs/2508.05979
Authors: Xinming Yang, Haasil Pujara, Jun Li
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published at COLM 2025

Click to view abstract

Abstract:While Large Language Models (LLMs) are often used as virtual tutors in computer science (CS) education, this approach can foster passive learning and over-reliance. This paper presents a novel pedagogical paradigm that inverts this model: students act as instructors who must teach an LLM to solve problems. To facilitate this, we developed strategies for designing questions with engineered knowledge gaps that only a student can bridge, and we introduce Socrates, a system for deploying this method with minimal overhead. We evaluated our approach in an undergraduate course and found that this active-learning method led to statistically significant improvements in student performance compared to historical cohorts. Our work demonstrates a practical, cost-effective framework for using LLMs to deepen student engagement and mastery.

[AI-50] DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching INTERSPEECH2025

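[Quick read]: This paper addresses timbre leakage and degraded audio quality when transferring the timbre of unseen speakers in any-to-any singing voice conversion (SVC). The key to the solution is the DAFMSVC framework, whose core contributions are: 1) a self-supervised learning (SSL) feature matching mechanism that replaces the SSL features of the source audio with the most similar SSL features from the target audio, which prevents timbre leakage; 2) a dual cross-attention mechanism that adaptively fuses speaker embeddings, melody, and linguistic content; and 3) a flow matching module that generates high-quality audio from the fused features. The method outperforms state-of-the-art methods in both subjective and objective evaluations.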

Link: https://arxiv.org/abs/2508.05978
Authors: Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by INTERSPEECH 2025

Click to view abstract

Abstract:Singing Voice Conversion (SVC) transfers a source singer’s timbre to a target while keeping melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either face timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity and naturalness, outperforming state-of-the-art methods in both subjective and objective evaluations.
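
The feature replacement step can be sketched as a nearest-neighbor lookup. The version below assumes cosine similarity and a single best match per frame; the abstract only says "most similar", so both choices, along with the function names and toy vectors, are assumptions rather than the paper's exact procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def match_features(source_frames, target_frames):
    """Replace each source frame with its most similar target frame."""
    return [max(target_frames, key=lambda t: cosine(s, t)) for s in source_frames]

# Toy 2-dimensional "SSL features" (hypothetical values).
source = [[1.0, 0.1], [0.1, 1.0]]
target = [[0.0, 2.0], [3.0, 0.0]]
converted = match_features(source, target)
```

Every output frame is drawn from the target audio, so content is carried by which target frame gets selected while the features themselves come only from the target singer.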

[AI-51] Impact-driven Context Filtering For Cross-file Code Completion

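[Quick read]: This paper addresses performance degradation in repository-level code completion caused by feeding in many retrieved cross-file code chunks, some of which not only fail to help but actively interfere with generation. The core of the solution is a likelihood-based metric that measures how much each retrieved chunk contributes to the final completion. This metric is used to build a dataset in which chunks are labeled positive, neutral, or negative, and then to train CODEFILTER, an adaptive retrieval context filtering framework that identifies and removes harmful negative contexts, improving both completion accuracy and efficiency.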

Link: https://arxiv.org/abs/2508.05970
Authors: Yanzhou Li, Shangqing Liu, Kangjie Chen, Tianwei Zhang, Yang Liu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) has recently demonstrated considerable potential for repository-level code completion, as it integrates cross-file knowledge with in-file preceding code to provide comprehensive contexts for generation. To better understand the contribution of the retrieved cross-file contexts, we introduce a likelihood-based metric to evaluate the impact of each retrieved code chunk on the completion. Our analysis reveals that, despite retrieving numerous chunks, only a small subset positively contributes to the completion, while some chunks even degrade performance. To address this issue, we leverage this metric to construct a repository-level dataset where each retrieved chunk is labeled as positive, neutral, or negative based on its relevance to the target completion. We then propose an adaptive retrieval context filtering framework, CODEFILTER, trained on this dataset to mitigate the harmful effects of negative retrieved contexts in code completion. Extensive evaluation on the RepoEval and CrossCodeLongEval benchmarks demonstrates that CODEFILTER consistently improves completion accuracy compared to approaches without filtering operations across various tasks. Additionally, CODEFILTER significantly reduces the length of the input prompt, enhancing computational efficiency while exhibiting strong generalizability across different models. These results underscore the potential of CODEFILTER to enhance the accuracy, efficiency, and attributability of repository-level code completion.
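
One way to read the likelihood-based labeling is as a comparison of the reference completion's log-likelihood with and without a given chunk in the prompt. The sketch below follows that reading. The `score_fn` callable, the `margin` value, and the toy scores are all hypothetical; the paper's actual metric and thresholds may differ.

```python
def label_chunks(chunks, score_fn, margin=0.1):
    """Label each chunk by how it shifts the completion's log-likelihood.

    score_fn(context_chunks) is a hypothetical callable returning the
    log-likelihood of the reference completion given those chunks.
    """
    base = score_fn([])
    labels = {}
    for chunk in chunks:
        delta = score_fn([chunk]) - base
        if delta > margin:
            labels[chunk] = "positive"
        elif delta < -margin:
            labels[chunk] = "negative"
        else:
            labels[chunk] = "neutral"
    return labels

def filter_context(chunks, labels):
    """Drop chunks labeled negative, keeping the original order."""
    return [c for c in chunks if labels[c] != "negative"]

# Toy log-likelihoods standing in for a real model (hypothetical numbers).
toy_scores = {
    (): -5.0,
    ("def add(a, b):",): -2.0,
    ("# unrelated comment",): -5.05,
    ("def add(a): wrong",): -7.0,
}
chunks = ["def add(a, b):", "# unrelated comment", "def add(a): wrong"]
labels = label_chunks(chunks, lambda ctx: toy_scores[tuple(ctx)])
kept = filter_context(chunks, labels)
```

In the paper these labels serve as training data for a learned filter; here the labels are applied directly just to show the effect on the prompt.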

[AI-52] Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning

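[Quick read]: This paper addresses out-of-distribution (OOD) actions and value overestimation caused by policy distribution shift in offline reinforcement learning (offline RL). The central challenge is to keep the value function conservative enough to avoid overestimation without being so conservative that policy improvement is suppressed. The key to the solution is the Mildly Conservative Regularized Evaluation (MCRE) framework, which balances conservatism and performance by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on it, the authors design the Mildly Conservative Regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework; experiments show that it outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.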

Link: https://arxiv.org/abs/2508.05960
Authors: Haohui Chen, Zhiyong Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.

[AI-53] Multi-Armed Bandits-Based Optimization of Decision Trees

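[Quick read]: This paper addresses the tendency of decision trees without appropriate constraints to become overly complex and overfit. On small and complex datasets in particular, conventional pruning methods such as Cost-Complexity Pruning (CCP) and Reduced Error Pruning (REP) follow greedy strategies that focus only on immediate performance gains, which can hurt long-term generalization. The key to the solution is a Multi-Armed Bandits (MAB)-based pruning method that frames pruning as an exploration-exploitation reinforcement learning problem: MAB algorithms dynamically select the best nodes to prune and adjust the strategy based on feedback from each pruning action, leading to better generalization. Experiments show improved predictive performance over conventional pruning techniques, suggesting the potential of MAB for dynamic, probabilistic decision tree pruning.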

Link: https://arxiv.org/abs/2508.05957
Authors: Hasibul Karim Shanto, Umme Ayman Koana, Shadikur Rahman
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Decision trees, without appropriate constraints, can easily become overly complex and prone to overfitting, capturing noise rather than generalizable patterns. To resolve this problem, pruning is a crucial part of optimizing decision trees, as it not only reduces the complexity of trees but also decreases the probability of generating overfit models. Conventional pruning techniques such as Cost-Complexity Pruning (CCP) and Reduced Error Pruning (REP) are mostly based on greedy approaches that focus on immediate gains in performance while pruning nodes of the decision tree. However, this might result in lower generalization in the long run, compromising the robustness of the tree model when it is introduced to unseen data samples, particularly when trained with small and complex datasets. To address this challenge, we propose a Multi-Armed Bandits (MAB)-based pruning approach, a reinforcement learning (RL)-based technique that dynamically prunes the tree to generate an optimal decision tree with better generalization. Our proposed approach treats the pruning process as an exploration-exploitation problem, in which we use MAB algorithms to find optimal branch nodes to prune based on feedback from each pruning action. Experimental evaluation on several benchmark datasets demonstrates that our proposed approach results in better predictive performance compared to the traditional ones. This suggests the potential of utilizing MAB for a dynamic and probabilistic way of decision tree pruning, in turn optimizing the decision tree-based model.
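
The exploration-exploitation framing can be sketched with UCB1, a standard bandit algorithm. The abstract does not say which MAB algorithms the authors use, so UCB1 is a swapped-in choice here. Each candidate node is an arm, and `reward_fn` is a hypothetical callable that returns the feedback (for example, validation accuracy) observed after trying to prune that node.

```python
import math

def ucb1_select(counts, totals, t, c=math.sqrt(2)):
    """Pick an arm by UCB1: mean reward plus an exploration bonus."""
    for arm, n in counts.items():
        if n == 0:
            return arm  # play every arm once before using the bonus
    return max(
        counts,
        key=lambda a: totals[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a]),
    )

def choose_prune_node(candidates, reward_fn, rounds=200):
    """Return the candidate node with the best average reward.

    Assumes rounds >= len(candidates) so that every arm is tried at least once.
    """
    counts = {n: 0 for n in candidates}
    totals = {n: 0.0 for n in candidates}
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, totals, t)
        totals[arm] += reward_fn(arm)
        counts[arm] += 1
    return max(candidates, key=lambda n: totals[n] / counts[n])

# Toy feedback: fixed validation accuracy per pruned node (hypothetical).
toy_accuracy = {"node_1": 0.2, "node_2": 0.8, "node_3": 0.5}
best = choose_prune_node(list(toy_accuracy), lambda n: toy_accuracy[n])
```

With noisy rewards the exploration bonus keeps rarely tried nodes in play, which is what separates this from a greedy pruner that commits to the first apparent gain.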

[AI-54] ASLSL: Adaptive shared latent structure learning with incomplete multi-modal physiological data for multi-dimensional emotional feature selection

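[Quick read]: This paper addresses overfitting, degraded performance, and high computational complexity in emotion recognition from incomplete multi-modal physiological signals, whose high-dimensional features contain irrelevant, redundant, and noisy information. The key to the solution is a method called Adaptive Shared Latent Structure Learning (ASLSL). Based on the property that similar features share similar emotional labels, it adaptively learns a shared latent space that jointly models the relationship between the incomplete multi-modal physiological signals and the multi-dimensional emotional labels, mitigating the impact of missing data and mining consensus information across modalities.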

Link: https://arxiv.org/abs/2508.05934
Authors: Xueyuan Xu, Tianze Yu, Wenjia Dong, Fulin Wei, Li Zhuo
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recently, emotion recognition based on multi-modal physiological signals has garnered increasing attention in the field of brain-computer interfaces. Nevertheless, the associated multi-modal physiological features are often high-dimensional and inevitably include irrelevant, redundant, and noisy representations, which can easily lead to overfitting, poor performance, and high computational complexity in emotion classifiers. Feature selection has been widely applied to address these challenges. However, previous studies generally assumed that multi-modal physiological data are complete, whereas in reality, the data are often incomplete due to the openness of the acquisition and operational environment. For example, some samples are available in several modalities but not in others. To address this issue, we propose a novel method for incomplete multi-modal physiological signal feature selection called adaptive shared latent structure learning (ASLSL). Based on the property that similar features share similar emotional labels, ASLSL employs adaptive shared latent structure learning to explore a common latent space shared by incomplete multi-modal physiological signals and multi-dimensional emotional labels, thereby mitigating the impact of missing information and mining consensus information. The two most popular multi-modal physiological emotion datasets (DEAP and DREAMER) with multi-dimensional emotional labels were used to compare the performance of ASLSL against seventeen feature selection methods. Comprehensive experimental results on these datasets demonstrate the effectiveness of ASLSL.

[AI-55] REFS: Robust EEG feature selection with missing multi-dimensional annotation for emotion recognition

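[Quick read]: This paper addresses classifier overfitting and poor real-time performance in multi-dimensional emotion recognition caused by high-dimensional electroencephalography (EEG) features and a limited number of high-quality samples, as well as the practical challenge of partially missing multi-dimensional emotional labels, which stems from the open acquisition environment and from ambiguity and variability in individual emotion perception. The key to the solution is a novel EEG feature selection method. It uses adaptive orthogonal non-negative matrix factorization to reconstruct the multi-dimensional emotional label space through second-order and higher-order correlations, reducing the negative impact of missing values and outliers on label reconstruction. It also combines least squares regression with graph-based manifold learning regularization and a global feature redundancy minimization regularizer, enabling robust EEG feature subset selection despite missing information and improving the stability and accuracy of multi-dimensional emotion recognition.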

Link: https://arxiv.org/abs/2508.05933
Authors: Xueyuan Xu, Wenjia Dong, Fulin Wei, Li Zhuo
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The affective brain-computer interface is a crucial technology for affective interaction and emotional intelligence, emerging as a significant area of research in human-computer interaction. Compared to single-type features, multi-type EEG features provide a multi-level representation for analyzing multi-dimensional emotions. However, the high dimensionality of multi-type EEG features, combined with the relatively small number of high-quality EEG samples, poses challenges such as classifier overfitting and suboptimal real-time performance in multi-dimensional emotion recognition. Moreover, practical applications of affective brain-computer interfaces frequently encounter partial absence of multi-dimensional emotional labels due to the open nature of the acquisition environment and the ambiguity and variability in individual emotion perception. To address these challenges, this study proposes a novel EEG feature selection method for multi-dimensional emotion recognition with missing labels. The method leverages adaptive orthogonal non-negative matrix factorization to reconstruct the multi-dimensional emotional label space through second-order and higher-order correlations, which could reduce the negative impact of missing values and outliers on label reconstruction. Simultaneously, it employs least squares regression with graph-based manifold learning regularization and global feature redundancy minimization regularization to enable EEG feature subset selection despite missing information, ultimately achieving robust EEG-based multi-dimensional emotion recognition. Simulation experiments on three widely used multi-dimensional emotional datasets, DREAMER, DEAP and HDED, reveal that the proposed method outperforms thirteen advanced feature selection methods in terms of robustness for EEG emotional feature selection.

[AI-56] Enhancing Software Vulnerability Detection Through Adaptive Test Input Generation Using Genetic Algorithm

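[Quick read]: This paper addresses the difficulty of detecting vulnerabilities in modern software systems, whose growing complexity has outpaced conventional detection methods. The key to the solution is a genetic algorithm-based test input generation method that combines genetic operators (such as crossover) with an adaptive learning mechanism to explore and exploit the test input space efficiently: crossover widens the search over potential inputs, while an adaptive feedback mechanism steers input generation according to the system's execution behavior. Over the course of evolution this continually refines the structure of the test cases, increasing code coverage and the depth of vulnerability discovery.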

Link: https://arxiv.org/abs/2508.05923
Authors: Yanusha Mehendran, Maolin Tang, Yi Lu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 26 pages, 3 figures, 6 tables. Submitted to Empirical Software Engineering; under review

Click to view abstract

Abstract:Software vulnerabilities continue to undermine the reliability and security of modern systems, particularly as software complexity outpaces the capabilities of traditional detection methods. This study introduces a genetic algorithm-based method for test input generation that innovatively integrates genetic operators and adaptive learning to enhance software vulnerability detection. A key contribution is the application of the crossover operator, which facilitates exploration by searching across a broader space of potential test inputs. Complementing this, an adaptive feedback mechanism continuously learns from the system’s execution behavior and dynamically guides input generation toward promising areas of the input space. Rather than relying on fixed or randomly selected inputs, the approach evolves a population of structurally valid test cases using feedback-driven selection, enabling deeper and more effective code traversal. This strategic integration of exploration and exploitation ensures that both diverse and targeted test inputs are developed over time. Evaluation was conducted across nine open-source JSON-processing libraries. The proposed method achieved substantial improvements in coverage compared to a benchmark evolutionary fuzzing method, with average gains of 39.8% in class coverage, 62.4% in method coverage, 105.0% in line coverage, 114.0% in instruction coverage, and 166.0% in branch coverage. These results highlight the method’s capacity to detect deeper and more complex vulnerabilities, offering a scalable and adaptive solution to software security testing.
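
A minimal sketch of feedback-driven evolution of test inputs follows. The `fitness` callable stands in for execution feedback such as coverage. The alphabet, the cut-and-splice crossover, the mutation rate, and the elitism rule are all illustrative assumptions, and unlike the paper's method this sketch does not guarantee structurally valid JSON.

```python
import random

ALPHABET = '{}[]":,0123456789abc '

def crossover(a, b, rng):
    """Cut-and-splice crossover: a prefix of a joined with a suffix of b."""
    cut_a = rng.randint(0, len(a))
    cut_b = rng.randint(0, len(b))
    return a[:cut_a] + b[cut_b:]

def mutate(s, rng, rate=0.1):
    """Replace each character with a random one with probability rate."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else ch for ch in s)

def evolve(seeds, fitness, generations=20, pop_size=20, seed=0):
    """Evolve inputs toward higher fitness; needs at least two seed inputs."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, pop_size // 2)]
        children = [ranked[0]]  # elitism: keep the best input unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness)

def toy_fitness(s):
    """Hypothetical feedback: number of distinct JSON structural characters."""
    return len(set(s) & set('{}[]":,'))

seed_inputs = ['{"a":1}', "[1,2,3]"]
best_input = evolve(seed_inputs, toy_fitness)
```

In a real fuzzer the fitness would come from instrumenting the library under test, so inputs that reach new branches survive into the next generation.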

[AI-57] Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning

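[Quick read]: This paper addresses how AI agents can efficiently and accurately retrieve the tools they need from a large tool set when handling complex user queries. Conventional methods rely mainly on semantic or lexical similarity between the user query and tool descriptions, which copes poorly with the tool combinations required by multi-step tasks. The key to the solution is a Knowledge Graph (KG)-based tool retrieval framework that uses ensembles of 1-hop ego tool graphs to capture direct and indirect functional dependencies between tools, improving the context awareness and coverage of tool selection for multi-step tasks. In experiments, the method reaches 91.85% on the micro-average Complete Recall metric, compared with 89.26% for the strongest non-KG baseline, supporting the view that structured knowledge complements pure similarity matching.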

Link: https://arxiv.org/abs/2508.05888
Authors: Sahil Bansal, Sai Shruthi Sistla, Aarti Arikatala, Sebastian Schreiber
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Effective tool retrieval is essential for AI agents to select from a vast array of tools when identifying and planning actions in the context of complex user queries. Despite its central role in planning, this aspect remains underexplored in the literature. Traditional approaches rely primarily on similarities between user queries and tool descriptions, which significantly limits retrieval accuracy, specifically when handling multi-step user requests. To address these limitations, we propose a Knowledge Graph (KG)-based tool retrieval framework that captures the semantic relationships between tools and their functional dependencies. Our retrieval algorithm leverages ensembles of 1-hop ego tool graphs to model direct and indirect connections between tools, enabling more comprehensive and contextual tool selection for multi-step tasks. We evaluate our approach on a synthetically generated internal dataset across six defined user classes, extending previous work on coherent dialogue synthesis and tool retrieval benchmarks. Results demonstrate that our tool graph-based method achieves 91.85% tool coverage on the micro-average Complete Recall metric, compared to 89.26% for re-ranked semantic-lexical hybrid retrieval, the strongest non-KG baseline in our experiments. These findings support our hypothesis that the structural information in the KG provides complementary signals to pure similarity matching, particularly for queries requiring sequential tool composition.
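
The idea of combining 1-hop ego graphs can be sketched as follows. Seed tools come from an ordinary similarity search, each seed contributes its own ego graph, and scores are summed across graphs. The tool names, the `neighbor_weight` discount, and the additive scoring rule are assumptions for illustration, not the paper's ensemble method.

```python
from collections import Counter

def ego_graph(graph, center):
    """1-hop ego graph: the center tool plus its direct neighbors."""
    return {center} | set(graph.get(center, ()))

def retrieve_tools(graph, seed_scores, top_k=3, neighbor_weight=0.5):
    """Rank tools by summing scores over the ego graphs of all seed tools."""
    scores = Counter()
    for seed, score in seed_scores.items():
        for tool in ego_graph(graph, seed):
            scores[tool] += score if tool == seed else neighbor_weight * score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tool for tool, _ in ranked[:top_k]]

# Hypothetical tool dependency graph and similarity scores for one query.
tool_graph = {
    "create_ticket": {"lookup_user"},
    "send_email": {"lookup_user"},
    "lookup_user": set(),
}
seed_scores = {"create_ticket": 0.9, "send_email": 0.8}
selected = retrieve_tools(tool_graph, seed_scores)
```

Here `lookup_user` never matches the query directly, yet it is retrieved because two seed tools depend on it, which is the kind of multi-step dependency that pure similarity matching tends to miss.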

[AI-58] Safety of Embodied Navigation: A Survey

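[Quick read]: This paper addresses the safety problems that embodied navigation systems face in real-world deployment, including potential attack strategies, insufficient defense mechanisms, and immature evaluation methods. Its core contribution is a multi-perspective analysis of existing safety risks and countermeasures that yields a systematic research framework covering attack types, defense strategies, reliable evaluation metrics, and verification mechanisms. It also identifies directions that need further attention, such as more advanced attack simulation, methods for improving robustness, and standardized verification frameworks, with the aim of making embodied navigation systems safer and more trustworthy.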

Link: https://arxiv.org/abs/2508.05855
Authors: Zixia Wang, Jia Hu, Ronghui Mu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:As large language models (LLMs) continue to advance and gain influence, the development of embodied AI has accelerated, drawing significant attention, particularly in navigation scenarios. Embodied navigation requires an agent to perceive, interact with, and adapt to its environment while moving toward a specified target in unfamiliar settings. However, the integration of embodied navigation into critical applications raises substantial safety concerns. Given their deployment in dynamic, real-world environments, ensuring the safety of such systems is critical. This survey provides a comprehensive analysis of safety in embodied navigation from multiple perspectives, encompassing attack strategies, defense mechanisms, and evaluation methodologies. Beyond conducting a comprehensive examination of existing safety challenges, mitigation technologies, and various datasets and metrics that assess effectiveness and robustness, we explore unresolved issues and future research directions in embodied navigation safety. These include potential attack methods, mitigation strategies, more reliable evaluation techniques, and the implementation of verification frameworks. By addressing these critical gaps, this survey aims to provide valuable insights that can guide future research toward the development of safer and more reliable embodied navigation systems. Furthermore, the findings of this study have broader implications for enhancing societal safety and increasing industrial efficiency.

[AI-59] Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems

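[Quick read]: This paper addresses how to ensure ethical behavior as artificial intelligence (AI) and robotic systems become widespread in society. The core challenge is that a lack of transparency erodes public trust, hinders accountability, and makes ethical algorithms hard to debug. The key to the solution is to treat transparency as the foundation of trustworthy, ethically aligned robotic systems and to connect technical implementation with ethical considerations through standardized metrics, Explainable AI (XAI) techniques, and user-friendly interfaces, offering a systematic framework aimed in particular at the difficulty of achieving transparency in dynamic, real-world settings.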

Link: https://arxiv.org/abs/2508.05846
Authors: Ahmad Farooq, Kamran Iqbal
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Published in the Proceedings of the 2025 3rd International Conference on Robotics, Control and Vision Engineering (RCVE’25). 6 pages, 3 tables

Click to view abstract

Abstract:As artificial intelligence (AI) and robotics increasingly permeate society, ensuring the ethical behavior of these systems has become paramount. This paper contends that transparency in AI decision-making processes is fundamental to developing trustworthy and ethically aligned robotic systems. We explore how transparency facilitates accountability, enables informed consent, and supports the debugging of ethical algorithms. The paper outlines technical, ethical, and practical challenges in implementing transparency and proposes novel approaches to enhance it, including standardized metrics, explainable AI techniques, and user-friendly interfaces. This paper introduces a framework that connects technical implementation with ethical considerations in robotic systems, focusing on the specific challenges of achieving transparency in dynamic, real-world contexts. We analyze how prioritizing transparency can impact public trust, regulatory policies, and avenues for future research. By positioning transparency as a fundamental element in ethical AI system design, we aim to add to the ongoing discussion on responsible AI and robotics, providing direction for future advancements in this vital field.

[AI-60] AI-Guided Exploration of Large-Scale Codebases

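[Quick read]: This paper addresses the program comprehension difficulties developers face when understanding and maintaining large, complex software systems. Traditional tools such as static visualizations and reverse engineering techniques provide structural information but lack interactivity, adaptability, and integration with contextual information, while current large language models (LLMs), although promising for code exploration, are limited by a lack of grounding and of integration with structured views. The key to the solution is a hybrid approach that combines deterministic reverse engineering with LLM-guided, intent-aware visual exploration, producing an adaptive code comprehension system that brings together UML-based visualization, dynamic user interfaces, historical context, and collaborative features. By interpreting user queries and interaction patterns, the LLM helps developers navigate and understand complex codebases more effectively.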

Link: https://arxiv.org/abs/2508.05799
Authors: Yoseph Berhanu Alebachew
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Understanding large-scale, complex software systems is a major challenge for developers, who spend a significant portion of their time on program comprehension. Traditional tools such as static visualizations and reverse engineering techniques provide structural insights but often lack interactivity, adaptability, and integration with contextual information. Recent advancements in large language models (LLMs) offer new opportunities to enhance code exploration workflows, yet their lack of grounding and integration with structured views limits their effectiveness. This work introduces a hybrid approach that integrates deterministic reverse engineering with LLM-guided, intent-aware visual exploration. The proposed system combines UML-based visualization, dynamic user interfaces, historical context, and collaborative features into an adaptive tool for code comprehension. By interpreting user queries and interaction patterns, the LLM helps developers navigate and understand complex codebases more effectively. A prototype implementation for Java demonstrates the feasibility of this approach. Future work includes empirical evaluation, scaling to polyglot systems, and exploring GUI-driven LLM interaction models. This research lays the groundwork for intelligent, interactive environments that align with developer cognition and collaborative workflows.

[AI-61] Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making

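[Quick read]: This paper addresses two problems: current Explainable AI (XAI) methods mainly serve developers and struggle to meet the needs of diverse stakeholders, and existing XAI offers limited support for hypothesis testing and multi-level decision analysis. The key to the solution is a unified framework, Holistic-XAI (H-XAI), which integrates causal rating methods with traditional XAI methods into an interactive, multi-method explanation process. Stakeholders can ask questions, test hypotheses, and compare model behavior against automatically constructed random and biased baselines, obtaining explanations at both the individual decision level and the overall model level and so evaluating model behavior dynamically.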

Link: https://arxiv.org/abs/2508.05792
Authors: Kausik Lakkaraju, Siva Likitha Valluru, Biplav Srivastava
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Current eXplainable AI (XAI) methods largely serve developers, often focusing on justifying model outputs rather than supporting diverse stakeholder needs. A recent shift toward Evaluative AI reframes explanation as a tool for hypothesis testing, but still focuses primarily on operational organizations. We introduce Holistic-XAI (H-XAI), a unified framework that integrates causal rating methods with traditional XAI methods to support explanation as an interactive, multi-method process. H-XAI allows stakeholders to ask a series of questions, test hypotheses, and compare model behavior against automatically constructed random and biased baselines. It combines instance-level and global explanations, adapting to each stakeholder’s goals, whether understanding individual decisions, assessing group-level bias, or evaluating robustness under perturbations. We demonstrate the generality of our approach through two case studies spanning six scenarios: binary credit risk classification and financial time-series forecasting. H-XAI fills critical gaps left by existing XAI methods by combining causal ratings and post-hoc explanations to answer stakeholder-specific questions at both the individual decision level and the overall model level.

[AI-62] From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data

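[Quick read]: This paper addresses the limited trustworthiness of distribution grid topology reconstruction when heterogeneous, multi-source data vary in quality, and in particular how to infer topologies that are both accurate and physically feasible under real grid operating conditions. The core of the solution is a scalable framework that jointly models two complementary dimensions: the spatial layout of physical infrastructure, drawn from geographic information system (GIS) data and asset metadata, and the dynamic behavior of the system, drawn from voltage time series. Together these support a complete and physically coherent reconstruction of network connectivity. The key innovation is a confidence-aware inference mechanism that retains structurally informative but imperfect inputs and quantifies the reliability of each inferred connection, while operational constraints such as transformer capacity limits and radial topology requirements are embedded in the learning process so that results are both uncertainty-aware and physically feasible. On real data from over 8000 meters, the framework achieves over 95% accuracy in topology reconstruction, with substantial improvements in confidence calibration and computational efficiency relative to baseline methods.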

Link: https://arxiv.org/abs/2508.05791
Authors: Haoran Li, Lihao Mai, Muhao Guo, Jiaqi Wu, Yang Weng, Yannan Sun, Ce Jimmy Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Accurate distribution grid topology is essential for reliable modern grid operations. However, real-world utility data originates from multiple sources with varying characteristics and levels of quality. In this work, developed in collaboration with Oncor Electric Delivery, we propose a scalable framework that reconstructs a trustworthy grid topology by systematically integrating heterogeneous data. We observe that distribution topology is fundamentally governed by two complementary dimensions: the spatial layout of physical infrastructure (e.g., GIS and asset metadata) and the dynamic behavior of the system in the signal domain (e.g., voltage time series). When jointly leveraged, these dimensions support a complete and physically coherent reconstruction of network connectivity. To address the challenge of uneven data quality without compromising observability, we introduce a confidence-aware inference mechanism that preserves structurally informative yet imperfect inputs, while quantifying the reliability of each inferred connection for operator interpretation. This soft handling of uncertainty is tightly coupled with hard enforcement of physical feasibility: we embed operational constraints, such as transformer capacity limits and radial topology requirements, directly into the learning process. Together, these components ensure that inference is both uncertainty-aware and structurally valid, enabling rapid convergence to actionable, trustworthy topologies under real-world deployment conditions. The proposed framework is validated using data from over 8000 meters across 3 feeders in Oncor’s service territory, demonstrating over 95% accuracy in topology reconstruction and substantial improvements in confidence calibration and computational efficiency relative to baseline methods.
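
The abstract pairs per-connection confidence scores with a hard radial-topology requirement. One simple way to see how the two can interact is a greedy maximum-confidence spanning forest. This is only a sketch under stated assumptions: the function name `radial_topology`, the edge format, and the scores are hypothetical, and the paper's actual inference procedure is not described in enough detail here to reproduce.

```python
def radial_topology(num_nodes, scored_edges):
    """Greedy maximum-confidence spanning forest (Kruskal with union-find).

    scored_edges: iterable of (confidence, u, v), nodes numbered 0..num_nodes-1.
    An edge is kept only if it does not close a loop, which keeps the
    resulting structure radial (tree-like). Illustrative assumption only.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the set representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    # Visit candidate connections from most to least confident.
    for conf, u, v in sorted(scored_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # joining two components cannot create a loop
            parent[root_u] = root_v
            kept.append((u, v, conf))
    return kept


# Hypothetical candidate connections with made-up confidence scores.
edges = [(0.95, 0, 1), (0.90, 1, 2), (0.40, 0, 2), (0.80, 2, 3)]
topology = radial_topology(4, edges)
```

Here the low-confidence edge between nodes 0 and 2 is dropped because it would close a loop, while each kept edge still carries its confidence for an operator to inspect.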

[AI-63] Whither symbols in the era of advanced neural networks?

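Quick read: The core question of this paper is whether human minds should be understood as symbolic systems. The traditional view holds that human thought combines ideas, produces novelty, and learns quickly, and that these abilities are evidence of symbolic processing. The paper argues that modern neural networks and the AI systems built on them exhibit similar abilities, which undermines the claim that human cognitive processes must be symbolic. The key point is that although neural networks are not themselves symbolic systems, they are typically trained on data generated by symbolic systems, so symbolic systems still play an important role in characterizing the abstract problems human minds have to solve. On that basis the authors propose a new research agenda on the symbolic basis of human thought.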

Link: https://arxiv.org/abs/2508.05776
Authors: Thomas L. Griffiths, Brenden M. Lake, R. Thomas McCoy, Ellie Pavlick, Taylor W. Webb
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Some of the strongest evidence that human minds should be thought about in terms of symbolic systems has been the way they combine ideas, produce novelty, and learn quickly. We argue that modern neural networks – and the artificial intelligence systems built upon them – exhibit similar abilities. This undermines the argument that the cognitive processes and representations used by human minds are symbolic, although the fact that these neural networks are typically trained on data generated by symbolic systems illustrates that such systems play an important role in characterizing the abstract problems that human minds have to solve. This argument leads us to offer a new agenda for research on the symbolic basis of human thought.

[AI-64] A Framework for Inherently Safer AGI through Language-Mediated Active Inference

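Quick read: This paper addresses the insufficient safety of current artificial general intelligence (AGI) development, in particular the fundamental limitations of traditional AI safety approaches that rely on post-hoc interpretability and reward engineering. The key to its solution is to build safety guarantees into the system's core design by combining Active Inference principles with large language models (LLMs) in a multi-agent architecture mediated by natural language. The architecture uses transparent belief representations and hierarchical value alignment to provide an explicit separation of beliefs and preferences in natural language, bounded rationality through resource-aware free energy minimization, and compositional safety through modular agent structures. Preferences and safety constraints propagate through hierarchical Markov blankets, with the aim of inherent safety under direct human oversight while keeping computation tractable.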

Link: https://arxiv.org/abs/2508.05766
Authors: Bo Wen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Abstract:This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system’s core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework’s safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.

[AI-65] Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning

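Quick read: This paper addresses the difficulty of synthesizing high-quality test cases for code reinforcement learning, a key bottleneck for the accuracy and reliability of training large language models (LLMs). The core of the solution is the Klear-CodeTest framework, built on a Generator-Validation (G-V) mechanism: a generator first covers a broad range of programming problems, including both regular and corner cases, to make the test cases comprehensive; a consistency validation step then checks outputs against gold solutions to ensure the test cases are correct and discriminative. The authors also design a multi-layered security sandbox for safe and reliable code execution on online verification platforms. Experiments with the curated dataset show significant improvements in model performance and training stability.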

Link: https://arxiv.org/abs/2508.05710
Authors: Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 21 pages, 11 figures

Abstract:Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: this https URL.
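
The consistency-validation idea (keep a generated test input only when the gold solutions agree on it) can be illustrated with a few lines of Python. This is a minimal sketch and not the Klear-CodeTest implementation: `validate_test_inputs`, `gold_a`, and `gold_b` are hypothetical names, the gold solutions are plain in-process callables, and no sandboxing is attempted.

```python
def validate_test_inputs(candidate_inputs, gold_solutions):
    """Keep inputs on which every gold solution runs and returns the same output.

    Returns a list of (input, expected_output) pairs. Illustrative only.
    """
    validated = []
    for test_input in candidate_inputs:
        outputs = []
        try:
            for solve in gold_solutions:
                outputs.append(solve(test_input))
        except Exception:
            continue  # a crash in any gold solution marks the input as invalid
        if outputs and all(out == outputs[0] for out in outputs):
            validated.append((test_input, outputs[0]))
    return validated


# Two hypothetical gold solutions for "sum of a non-empty list".
def gold_a(xs):
    return sum(xs)


def gold_b(xs):
    total = xs[0]  # raises IndexError on an empty list
    for x in xs[1:]:
        total += x
    return total


# Generated candidates mix regular cases with corner cases.
candidates = [[1, 2, 3], [], [-5, 5], [10**9, 10**9]]
test_cases = validate_test_inputs(candidates, [gold_a, gold_b])
```

The empty list is rejected because one gold solution fails on it, which is the kind of out-of-specification input a validation stage is meant to filter out.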

[AI-66] Semantic Reasoning Meets Numerical Precision: An LLM-Powered Multi-Agent System for Power Grid Control

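Quick read: This paper addresses the sharply increased complexity of planning, operating, and managing modern power systems caused by distributed energy resources (DERs), widespread electric vehicle (EV) adoption, and more frequent extreme weather events. Traditional rule-based systems and numerical optimization methods struggle with the required scale, dynamics, and adaptability. The key to the solution is the Grid-Agent framework, which combines large language models (LLMs) with multi-agent reinforcement learning and integrates semantic reasoning with numerical precision through a modular agent architecture: a planning agent generates coordinated action sequences using numerical power flow solvers, while a validation agent evaluates system stability and action effectiveness in a sandbox with safety rollbacks. An adaptive multiscale network representation dynamically selects the encoding scheme according to network size and complexity. The framework coordinates switch configurations, battery deployment, and load curtailment, detects and remediates grid violations in real time, and supports continuous learning and adaptation to different topologies.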

Link: https://arxiv.org/abs/2508.05702
Authors: Yan Zhang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:The increasing penetration of Distributed Energy Resources (DERs), widespread adoption of Electric Vehicles (EVs), and the growing frequency of extreme weather events have significantly increased the complexity of power grid planning, operation, and management. Traditional rule-based systems and numerical optimization approaches often struggle with the scale, dynamics, and adaptability required by modern power networks. This paper introduces Grid-Agent, an autonomous, AI-driven framework that combines Large Language Models (LLMs) with multi-agent reinforcement learning to detect and remediate grid violations in real time. Grid-Agent integrates semantic reasoning with numerical precision through a modular agent architecture: a planning agent generates coordinated action sequences using numerical power flow solvers, while a validation agent evaluates system stability and action effectiveness via sandboxed execution with safety rollbacks. To ensure scalability, Grid-Agent incorporates an adaptive multiscale network representation that dynamically selects optimal encoding schemes based on network size and complexity. The framework enables coordinated violation resolution through optimizing switch configurations, battery deployment, and load curtailment strategies. Experimental results in standard IEEE and CIGRE test systems (IEEE 69-bus, CIGRE MV, and IEEE 30-bus) demonstrate superior violation mitigation performance. Additionally, the framework’s built-in data collection and learning capabilities enable continuous learning and adaptation to diverse network topologies. The autonomous nature of the framework makes it particularly suitable for modern smart grid applications requiring rapid response to dynamic operating conditions.

[AI-67] Multi-Faceted Large Embedding Tables for Pinterest Ads Ranking

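Quick read: This paper addresses the challenges of applying large embedding tables to Pinterest's ads ranking models, including from-scratch training that initially produced only neutral metrics, sparsity, and scalability limits imposed by GPU memory. The key to the solution is a multi-faceted pretraining scheme that combines several pretraining algorithms to enrich the embedding tables, yielding significant gains in both the click-through rate (CTR) and conversion rate (CVR) domains. A CPU-GPU hybrid serving infrastructure overcomes GPU memory limits and improves scalability. The approach has been deployed in the Pinterest Ads system, achieving a 1.34% online CPC reduction and a 2.60% CTR increase with neutral end-to-end latency.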

Link: https://arxiv.org/abs/2508.05700
Authors: Runze Su, Jiayin Jin, Jiacheng Li, Sihan Wang, Guangtong Bai, Zelun Wang, Li Tang, Yixiong Meng, Huasen Wu, Zhimeng Pan, Kungang Li, Han Sun, Zhifang Liu, Haoyang Li, Siping Ji, Ling Leng, Prathibha Deshikachar
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large embedding tables are indispensable in modern recommendation systems, thanks to their ability to effectively capture and memorize intricate details of interactions among diverse entities. As we explore integrating large embedding tables into Pinterest’s ads ranking models, we encountered not only common challenges such as sparsity and scalability, but also several obstacles unique to our context. Notably, our initial attempts to train large embedding tables from scratch resulted in neutral metrics. To tackle this, we introduced a novel multi-faceted pretraining scheme that incorporates multiple pretraining algorithms. This approach greatly enriched the embedding tables and resulted in significant performance improvements. As a result, the multi-faceted large embedding tables bring great performance gain on both the Click-Through Rate (CTR) and Conversion Rate (CVR) domains. Moreover, we designed a CPU-GPU hybrid serving infrastructure to overcome GPU memory limits and elevate the scalability. This framework has been deployed in the Pinterest Ads system and achieved 1.34% online CPC reduction and 2.60% CTR increase with neutral end-to-end latency change.

[AI-68] Log2Sig: Frequency-Aware Insider Threat Detection via Multivariate Behavioral Signal Decomposition

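Quick read: This paper addresses the difficulty of insider threat detection, where malicious behavior closely resembles legitimate user operations. Existing methods usually model system logs as flat event sequences and so miss the frequency dynamics and multiscale disturbance patterns embedded in user behavior. The key to the solution is the Log2Sig framework, which transforms user logs into multivariate behavioral frequency signals and applies Multivariate Variational Mode Decomposition (MVMD) to extract Intrinsic Mode Functions (IMFs) that reveal behavioral fluctuations at multiple temporal scales. A Mamba-based temporal encoder then models long-term dependencies in the daily behavior sequences, the frequency components are linearly projected and fused with the temporal representation to form a comprehensive user behavior profile, and a multilayer perceptron performs the final anomaly detection.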

Link: https://arxiv.org/abs/2508.05696
Authors: Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Zhiying Li, Guanggang Geng
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Submitted to the 2025 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)

Abstract:Insider threat detection presents a significant challenge due to the deceptive nature of malicious behaviors, which often resemble legitimate user operations. However, existing approaches typically model system logs as flat event sequences, thereby failing to capture the inherent frequency dynamics and multiscale disturbance patterns embedded in user behavior. To address these limitations, we propose Log2Sig, a robust anomaly detection framework that transforms user logs into multivariate behavioral frequency signals, introducing a novel representation of user behavior. Log2Sig employs Multivariate Variational Mode Decomposition (MVMD) to extract Intrinsic Mode Functions (IMFs), which reveal behavioral fluctuations across multiple temporal scales. Based on this, the model further performs joint modeling of behavioral sequences and frequency-decomposed signals: the daily behavior sequences are encoded using a Mamba-based temporal encoder to capture long-term dependencies, while the corresponding frequency components are linearly projected to match the encoder’s output dimension. These dual-view representations are then fused to construct a comprehensive user behavior profile, which is fed into a multilayer perceptron for precise anomaly detection. Experimental results on the CERT r4.2 and r5.2 datasets demonstrate that Log2Sig significantly outperforms state-of-the-art baselines in both accuracy and F1 score.
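
The first step described above, turning a flat event log into multivariate frequency signals, can be sketched as a per-day count for each activity type. The event format, activity names, and the function `logs_to_signals` are assumptions made for illustration; the decomposition step (MVMD) and the Mamba encoder are not shown.

```python
from collections import Counter


def logs_to_signals(events, activity_types, num_days):
    """Convert (day_index, activity) events into one daily-count series per activity.

    Returns {activity: [count on day 0, ..., count on day num_days - 1]}.
    A hypothetical preprocessing sketch, not the Log2Sig code.
    """
    counts = Counter(events)  # missing (day, activity) keys count as 0
    return {
        activity: [counts[(day, activity)] for day in range(num_days)]
        for activity in activity_types
    }


# Made-up log of one user over three days.
events = [
    (0, "logon"), (0, "email"), (0, "email"),
    (1, "logon"),
    (2, "usb"), (2, "usb"), (2, "usb"),
]
signals = logs_to_signals(events, ["logon", "email", "usb"], num_days=3)
```

Each series is one channel of the multivariate signal, and a burst such as the three `usb` events on day 2 shows up as a spike that a frequency-domain method could then isolate.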

[AI-69] Empirical Evaluation of AI-Assisted Software Package Selection: A Knowledge Graph Approach

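Quick read: This paper addresses the difficulty of choosing third-party software packages (such as Python packages) in open-source ecosystems. The core challenges are the large number of alternatives and the lack of transparent evidence for comparison; current generative AI tools overlook dependency evaluation, favor popularity over suitability, and lack reproducibility, which creates risks for project transparency, long-term reliability, and architectural decisions. The key to the solution is to formulate package selection as a Multi-Criteria Decision-Making (MCDM) problem within a data-driven framework: automated data pipelines continuously collect metadata, usage trends, vulnerability information, and developer sentiment from GitHub, PyPI, and Stack Overflow, and structure them into a decision model that captures relationships among packages, domain features, and quality attributes. The PySelect system built on this model uses large language models to interpret user intent and query the decision model for context-appropriate recommendations. The result is scalable, interpretable, and reproducible support for evidence-based software selection.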

Link: https://arxiv.org/abs/2508.05693
Authors: Siamak Farshidi, Amir Saberhabibi, Behbod Eskafi, Niloofar Nikfarjam, Sadegh Eskandari, Slinger Jansen, Michel Chaudron, Bedir Tekinerdogan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Selecting third-party software packages in open-source ecosystems like Python is challenging due to the large number of alternatives and limited transparent evidence for comparison. Generative AI tools are increasingly used in development workflows, but their suggestions often overlook dependency evaluation, emphasize popularity over suitability, and lack reproducibility. This creates risks for projects that require transparency, long-term reliability, maintainability, and informed architectural decisions. This study formulates software package selection as a Multi-Criteria Decision-Making (MCDM) problem and proposes a data-driven framework for technology evaluation. Automated data pipelines continuously collect and integrate software metadata, usage trends, vulnerability information, and developer sentiment from GitHub, PyPI, and Stack Overflow. These data are structured into a decision model representing relationships among packages, domain features, and quality attributes. The framework is implemented in PySelect, a decision support system that uses large language models to interpret user intent and query the model to identify contextually appropriate packages. The approach is evaluated using 798,669 Python scripts from 16,887 GitHub repositories and a user study based on the Technology Acceptance Model. Results show high data extraction precision, improved recommendation quality over generative AI baselines, and positive user evaluations of usefulness and ease of use. This work introduces a scalable, interpretable, and reproducible framework that supports evidence-based software selection using MCDM principles, empirical data, and AI-assisted intent modeling.
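
To make the MCDM framing concrete, here is a weighted-sum ranking with min-max normalization, one of the simplest MCDM methods. It is a sketch only: the abstract does not say which MCDM method PySelect uses, and the package names, criteria, weights, and the function `rank_packages` are all hypothetical.

```python
def rank_packages(scores, weights, lower_is_better=()):
    """Rank alternatives by a weighted sum of min-max normalized criteria.

    scores:  {package: {criterion: raw value}}
    weights: {criterion: importance}
    lower_is_better: criteria where a smaller raw value is preferable.
    """
    criteria = list(weights)
    normalized = {name: {} for name in scores}
    for criterion in criteria:
        values = [scores[name][criterion] for name in scores]
        low, high = min(values), max(values)
        for name in scores:
            # Scale to [0, 1]; use 0.5 when all alternatives tie.
            x = 0.5 if high == low else (scores[name][criterion] - low) / (high - low)
            normalized[name][criterion] = 1.0 - x if criterion in lower_is_better else x
    total_weight = sum(weights.values())
    totals = {
        name: sum(weights[c] * normalized[name][c] for c in criteria) / total_weight
        for name in scores
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up data for two hypothetical packages.
packages = {
    "pkg_a": {"popularity": 9000, "open_vulnerabilities": 4, "maintenance": 0.6},
    "pkg_b": {"popularity": 3000, "open_vulnerabilities": 0, "maintenance": 0.9},
}
weights = {"popularity": 1.0, "open_vulnerabilities": 2.0, "maintenance": 2.0}
ranking = rank_packages(packages, weights, lower_is_better={"open_vulnerabilities"})
```

With these weights the less popular but better maintained `pkg_b` ranks first, which mirrors the paper's point that popularity alone is a poor selection criterion.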

[AI-70] Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

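Quick read: This paper addresses risk identification and analysis for multi-agent systems (MAS) driven by large language models (LLMs) when deployed in organizational settings. Risk assessment methods designed for single agents cannot capture the emergent behaviors and novel failure modes that arise from interactions among agents, so a different risk analysis approach is needed. The key to the solution is a practitioner toolkit covering six critical failure modes (cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics). The approach centers on analysis validity and increases it progressively through staged testing across levels of abstraction and deployment, gathering convergent evidence through simulation, observational analysis, benchmarking, and red teaming while gradually increasing exposure to potential negative impacts. This lays the groundwork for robust organizational risk management of multi-agent AI.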

Link: https://arxiv.org/abs/2508.05687
Authors: Alistair Reid, Simon O’Callaghan, Liam Carroll, Tiberio Caetano
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Organisations are starting to adopt LLM-based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi-agent networks. Yet a collection of safe agents does not guarantee a safe collection of agents, as interactions between agents over time create emergent behaviours and induce novel failure modes. This means multi-agent systems require a fundamentally different risk analysis approach than that used for a single agent. This report addresses the early stages of risk identification and analysis for multi-agent AI systems operating within governed environments where organisations control their agent configurations and deployment. In this setting, we examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics. For each, we provide a toolkit for practitioners to extend or integrate into their existing frameworks to assess these failure modes within their organisational contexts. Given fundamental limitations in current LLM behavioural understanding, our approach centres on analysis validity, and advocates for progressively increasing validity through staged testing across stages of abstraction and deployment that gradually increases exposure to potential negative impacts, while collecting convergent evidence through simulation, observational analysis, benchmarking, and red teaming. This methodology establishes the groundwork for robust organisational risk management as these LLM-based multi-agent systems are deployed and operated.

[AI-71] Selection-Based Vulnerabilities: Clean-Label Backdoor Attacks in Active Learning

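Quick read: This paper examines the security of active learning (AL), asking whether AL is robust to attack. It shows that the acquisition function, which mainstream AL frameworks use to select the most informative samples, can serve as an entry point for backdoor attacks. The key contribution is ALA, the first practical attack framework to use the acquisition function as a poisoning attack surface: it optimizes imperceptibly perturbed inputs to produce high uncertainty scores under the acquisition function, which raises their probability of being selected for labeling. Experiments show attack success rates of up to 94% under very low poisoning budgets (0.5%-1.0%) while preserving model utility and remaining undetectable to human annotators, indicating that active learning should be deployed with caution even in trusted data scenarios.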

Link: https://arxiv.org/abs/2508.05681
Authors: Yuhan Zhi, Longtian Wang, Xiaofei Xie, Chao Shen, Qiang Hu, Xiaohong Guan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Active learning (AL), which serves as the representative label-efficient learning paradigm, has been widely applied in resource-constrained scenarios. The achievement of AL is attributed to acquisition functions, which are designed for identifying the most important data to label. Despite this success, one question remains unanswered: is AL safe? In this work, we introduce ALA, the first practical framework to utilize the acquisition function as the poisoning attack surface to reveal the weakness of active learning. Specifically, ALA optimizes imperceptibly poisoned inputs to exhibit high uncertainty scores, increasing their probability of being selected by acquisition functions. To evaluate ALA, we conduct extensive experiments across three datasets, three acquisition functions, and two types of clean-label backdoor triggers. Results show that our attack can achieve high success rates (up to 94%) even under low poisoning budgets (0.5%-1.0%) while preserving model utility and remaining undetectable to human annotators. Our findings remind active learning users: acquisition functions can be easily exploited, and active learning should be deployed with caution in trusted data scenarios.
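
Why a high uncertainty score translates into selection can be seen with a plain entropy-based acquisition function. The sketch below only illustrates the selection step being exploited and contains no attack logic; the pool, the sample names, and `select_for_labeling` are hypothetical, and the paper evaluates three acquisition functions that are not named in the abstract.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool_probs, budget):
    """Return the ids of the `budget` most uncertain samples in the pool.

    pool_probs: {sample_id: predicted class probabilities}.
    """
    ranked = sorted(pool_probs, key=lambda sid: entropy(pool_probs[sid]), reverse=True)
    return ranked[:budget]


# Made-up model predictions for three unlabeled samples.
pool = {
    "clean_1": [0.98, 0.02],
    "clean_2": [0.90, 0.10],
    "crafted": [0.51, 0.49],  # an input shaped to look maximally uncertain
}
selected = select_for_labeling(pool, budget=1)
```

Any input whose prediction sits near the decision boundary is picked first, so an adversary who can push inputs toward that region controls what reaches the annotators.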

[AI-72] Are All Genders Equal in the Eyes of Algorithms? – Analysing Search and Retrieval Algorithms for Algorithmic Gender Fairness

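Quick read: This paper addresses how algorithmic systems such as search engines and information retrieval platforms may implicitly amplify gender bias in academic visibility and knowledge dissemination, focusing on differences in how scholars of different genders are represented online. The key to the approach is a bias-preserving definition of algorithmic gender fairness, which assesses whether algorithmic outputs reflect real-world gender distributions without introducing or amplifying disparities. Analyzing data on academic staff at German universities and universities of applied sciences, the authors find no overt discrimination, but male professors are associated with more search results and better-aligned publication records, while female professors show higher variability in digital visibility. These patterns reflect the interplay between platform algorithms, institutional curation, and individual self-presentation.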

Link: https://arxiv.org/abs/2508.05680
Authors: Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Ludwig Bothmann, Christian Heumann, Stephanie Thiemichen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Algorithmic systems such as search engines and information retrieval platforms significantly influence academic visibility and the dissemination of knowledge. Despite assumptions of neutrality, these systems can reproduce or reinforce societal biases, including those related to gender. This paper introduces and applies a bias-preserving definition of algorithmic gender fairness, which assesses whether algorithmic outputs reflect real-world gender distributions without introducing or amplifying disparities. Using a heterogeneous dataset of academic profiles from German universities and universities of applied sciences, we analyse gender differences in metadata completeness, publication retrieval in academic databases, and visibility in Google search results. While we observe no overt algorithmic discrimination, our findings reveal subtle but consistent imbalances: male professors are associated with a greater number of search results and more aligned publication records, while female professors display higher variability in digital visibility. These patterns reflect the interplay between platform algorithms, institutional curation, and individual self-presentation. Our study highlights the need for fairness evaluations that account for both technical performance and representational equality in digital systems.

[AI-73] Adversarial Attacks on Reinforcement Learning-based Medical Questionnaire Systems: Input-level Perturbation Strategies and Medical Constraint Validation

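Quick read: This paper addresses the safety and robustness of medical questionnaire systems based on reinforcement learning (RL), especially their vulnerability to adversarial attacks. The core challenge is generating effective adversarial examples that remain clinically plausible. The key to the solution is to model the diagnosis process as a Markov Decision Process (MDP) and to build a validation framework of 247 medical constraints covering physiological bounds, symptom correlations, and conditional medical logic. On this basis the authors systematically evaluate six mainstream attack methods (including FGSM, PGD, and AutoAttack). On the NHIS dataset, 97.6% of the generated adversarial samples are clinically plausible, and the RL-based questionnaire system remains significantly vulnerable even under strict medical constraints, with attack success rates reaching 64.70% (AutoAttack).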

Link: https://arxiv.org/abs/2508.05677
Authors: Peizhuo Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages (21 pages main text, 3 pages references, 6 pages appendix), 4 figures

Abstract:RL-based medical questionnaire systems have shown great potential in medical scenarios. However, their safety and robustness remain unresolved. This study performs a comprehensive evaluation on adversarial attack methods to identify and analyze their potential vulnerabilities. We formulate the diagnosis process as a Markov Decision Process (MDP), where the state is the patient responses and unasked questions, and the action is either to ask a question or to make a diagnosis. We implemented six prevailing major attack methods, including the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), Carlini-Wagner (CW) attack, Basic Iterative Method (BIM), DeepFool, and AutoAttack, with seven epsilon values each. To ensure the generated adversarial examples remain clinically plausible, we developed a comprehensive medical validation framework consisting of 247 medical constraints, including physiological bounds, symptom correlations, and conditional medical constraints. We achieved a 97.6% success rate in generating clinically plausible adversarial samples. We performed our experiment on the National Health Interview Survey (NHIS) dataset (this https URL), which consists of 182,630 samples, to predict the participant’s 4-year mortality rate. We evaluated our attacks on the AdaptiveFS framework proposed in arXiv:2004.00994. Our results show that adversarial attacks could significantly impact the diagnostic accuracy, with attack success rates ranging from 33.08% (FGSM) to 64.70% (AutoAttack). Our work has demonstrated that even under strict medical constraints on the input, such RL-based medical questionnaire systems still show significant vulnerabilities.
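
The combination of an input-level perturbation with a plausibility constraint can be sketched with the simplest of the six attacks, FGSM, followed by a clip to allowed feature ranges. Everything concrete here is an assumption for illustration: the target is a toy logistic-regression classifier rather than the RL agent, the features and bounds are made up, and range clipping stands in for only the "physiological bounds" part of the 247 constraints.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_step(x, y, weights, bias, eps):
    """One FGSM step against a logistic-regression classifier.

    For binary cross-entropy loss the gradient with respect to the input is
    (p - y) * weights, so each feature moves by eps in the direction of the
    sign of that gradient.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    grad = [(p - y) * w for w in weights]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def clip_to_bounds(x, bounds):
    """Project each feature back into its allowed [low, high] range."""
    return [min(max(xi, low), high) for xi, (low, high) in zip(x, bounds)]


# Hypothetical normalized features and a toy model.
x = [0.5, 0.2]
weights, bias = [2.0, -1.0], 0.0
x_adv = fgsm_step(x, y=1, weights=weights, bias=bias, eps=0.1)
x_valid = clip_to_bounds(x_adv, bounds=[(0.0, 1.0), (0.0, 0.25)])
```

The second feature is pulled back from roughly 0.3 to its upper bound of 0.25, which shows how a constraint can weaken a perturbation without removing it.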

[AI-74] Principle-Guided Verilog Optimization: IP-Safe Knowledge Transfer via Local-Cloud Collaboration

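Quick read: This paper addresses how to use powerful cloud-hosted large language models (LLMs) to optimize Register Transfer Level (RTL) code without leaking sensitive intellectual property (IP). The key to the solution is the first IP-preserving edge-cloud collaborative framework: a small, locally deployed LLM (e.g., Qwen-2.5-Coder-7B) securely compares high-quality target designs with novice draft Verilog code and distills general design principles; these abstracted principles are then sent to a stronger cloud LLM (e.g., Deepseek-V3) for targeted code improvement, so that only abstract, IP-safe guidance leaves the local environment. Experiments show that this collaboration outperforms the cloud model alone and commercial models such as GPT-4o, for example on power optimization.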

Link: https://arxiv.org/abs/2508.05675
Authors: Jing Wang, Zheng Li, Lei Li, Fan He, Liyu Lin, Yao Lai, Yan Li, Xiaoyang Zeng, Yufeng Guo
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Our code and dataset are available at this https URL

Abstract:Recent years have witnessed growing interest in adopting large language models (LLMs) for Register Transfer Level (RTL) code optimization. While powerful cloud-based LLMs offer superior optimization capabilities, they pose unacceptable intellectual property (IP) leakage risks when processing proprietary hardware designs. In this paper, we propose a new scenario where Verilog code must be optimized for specific attributes without leaking sensitive IP information. We introduce the first IP-preserving edge-cloud collaborative framework that leverages the benefits of both paradigms. Our approach employs local small LLMs (e.g., Qwen-2.5-Coder-7B) to perform secure comparative analysis between paired high-quality target designs and novice draft codes, yielding general design principles that summarize key insights for improvements. These principles are then used to query stronger cloud LLMs (e.g., Deepseek-V3) for targeted code improvement, ensuring that only abstracted and IP-safe guidance reaches external services. Our experimental results demonstrate that the framework achieves significantly higher optimization success rates compared to baseline methods. For example, combining Qwen-2.5-Coder-7B and Deepseek-V3 achieves a 66.67% optimization success rate for power utilization, outperforming Deepseek-V3 alone (49.81%) and even commercial models like GPT-4o (55.81%). Further investigation of local and cloud LLM combinations reveals that different model pairings exhibit varying strengths for specific optimization objectives, with interesting trends emerging when varying the number of comparative code pairs. Our work establishes a new paradigm for secure hardware design optimization that balances performance gains with IP protection.

[AI-75] Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

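Quick read: This paper addresses how to improve both the performance and the evaluability of LLM agents that automate offensive security tasks, in particular Capture the Flag (CTF) challenges. The core gaps are the lack of fine-grained evaluation of agent behavior and of a systematic understanding of how key hyperparameters matter. The solution has three parts: CTFJudge, a framework that uses an LLM as a judge to analyze agent trajectories and evaluate them step by step; the CTF Competency Index (CCI), a metric for partial correctness that measures how closely agent solutions align with human-crafted gold standards; and an empirical study of how the LLM hyperparameters temperature, top-p, and maximum token length affect agent performance and task planning, which informs multi-agent coordination settings. The authors also build the CTFTiny benchmark for rapid evaluation and aim to lay the groundwork for future LLM agent research in cybersecurity.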

Link: https://arxiv.org/abs/2508.05674
Authors: Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework leveraging LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, CTF Competency Index (CCI) for partial correctness, revealing how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We release CTFTiny as open source at this https URL, along with CTFJudge at this https URL.
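
For readers unfamiliar with two of the hyperparameters studied, the sketch below shows how temperature and top-p (nucleus sampling) reshape a next-token distribution. It follows the standard definitions of these settings, not anything specific to the paper; the token names, logits, and the function `nucleus_distribution` are made up for illustration.

```python
import math


def nucleus_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature scaling, then keep the smallest set of tokens whose
    cumulative probability reaches top_p, and renormalize over that set.

    logits: {token: raw score}; temperature must be positive.
    """
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(((e / norm, tok) for tok, e in exps.items()), reverse=True)

    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-command candidates for an agent.
logits = {"ls": 2.0, "cat": 1.0, "rm": 0.0}
default = nucleus_distribution(logits, temperature=1.0, top_p=0.9)
cold = nucleus_distribution(logits, temperature=0.5, top_p=0.9)
```

With `top_p=0.9` the least likely token is excluded in both cases, and lowering the temperature shifts more probability onto the top choice, which makes the agent's planning more deterministic.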

[AI-76] Breaking the Top-K Barrier: Advancing Top-K Ranking Metrics Optimization in Recommender Systems KDD2025

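Quick read: This paper addresses the challenges of optimizing NDCG@K (Normalized Discounted Cumulative Gain at K) in recommender systems: the metric is inherently discontinuous, and prior attempts either ignore the Top-K truncation or suffer from training instability and high computational cost. The key to the solution is a new loss function, SoftmaxLoss@K (SL@K), which uses a quantile technique to handle Top-K truncation and derives a smooth upper bound for optimizing NDCG@K to overcome the discontinuity. The resulting objective has theoretical guarantees, stable gradients, computational efficiency, and robustness to noise.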

Link: https://arxiv.org/abs/2508.05673
Authors: Weiqin Yang, Jiawei Chen, Shengjia Zhang, Peng Wu, Yuegang Sun, Yan Feng, Chun Chen, Can Wang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by KDD 2025

Abstract:In the realm of recommender systems (RS), Top-K ranking metrics such as NDCG@K are the gold standard for evaluating recommendation performance. However, during the training of recommendation models, optimizing NDCG@K poses significant challenges due to its inherent discontinuous nature and the intricate Top-K truncation. Recent efforts to optimize NDCG@K have either overlooked the Top-K truncation or suffered from high computational costs and training instability. To overcome these limitations, we propose SoftmaxLoss@K (SL@K), a novel recommendation loss tailored for NDCG@K optimization. Specifically, we integrate the quantile technique to handle Top-K truncation and derive a smooth upper bound for optimizing NDCG@K to address discontinuity. The resulting SL@K loss has several desirable properties, including theoretical guarantees, ease of implementation, computational efficiency, gradient stability, and noise robustness. Extensive experiments on four real-world datasets and three recommendation backbones demonstrate that SL@K outperforms existing losses with a notable average improvement of 6.03%. The code is available at this https URL.
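
The discontinuity and truncation that make NDCG@K hard to optimize are easy to see by computing the metric directly. The sketch below implements the standard NDCG@K with linear gain (which coincides with the exponential-gain variant for binary relevance); it is not the SL@K loss, and the scores and relevance labels are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k):
    """NDCG@K of the ranking induced by sorting items by predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevances = [0, 1, 1, 0]            # items 1 and 2 are relevant
before = ndcg_at_k([0.9, 0.8, 0.1, 0.3], relevances, k=2)
after = ndcg_at_k([0.7, 0.8, 0.1, 0.3], relevances, k=2)
```

Lowering the first score from 0.9 to 0.7 swaps the top two items and makes the metric jump, while any change that keeps the order leaves it untouched; item 2 is relevant but contributes nothing because it falls outside the top 2. Both effects give zero gradient almost everywhere, which is why a smooth surrogate is needed.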

[AI-77] LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing

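Quick read: This paper addresses two challenges for Retrieval Augmented Generation (RAG) systems on domain-specific knowledge: pre-trained embedding models degrade in specialized domains, and retrievers based on large language models (LLMs) are computationally expensive. The key to the solution is LMAR (Language Model Augmented Retriever), a model-agnostic framework with two stages: the first builds high-quality supervision through LLM-guided data synthesis and triplet sampling; the second applies contrastive embedding adaptation and efficient text clustering to improve the embeddings without breaking contextual integrity. The method improves domain adaptation while keeping hardware requirements and latency moderate, and it is compatible with emerging RAG architectures.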

Link: https://arxiv.org/abs/2508.05672
Authors: Yao Zhao,Yantian Ding,Zhiyue Zhang,Dapeng Yao,Yanxun Xu
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and prohibitive computational costs of large language model (LLM)-based retrievers. While fine-tuning data augmentation embedding models offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) Triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline. Experimental results across multiple domain-specific benchmark datasets demonstrate that LMAR outperforms multiple baseline models, while maintaining moderate hardware requirements and low latency. Its model-agnostic nature further enables seamless integration with emerging RAG architectures and text embedding models, ensuring continual improvements without redesigning the pipeline. These results highlight LMAR as a practical and cost-effective solution for scalable domain-specific adaptation.
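
To make "triplet sampling" and "contrastive embedding adaptation" concrete, here is a minimal sketch of a triplet margin loss over cosine distance. The abstract does not state which loss LMAR uses, so the loss form, the margin value, and the function names are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-separated triplet incurs no loss; a reversed one is penalised.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In a pipeline like the one described, the anchor would be a query, and the positive and negative would be text chunks labelled by the LLM; fine-tuning pushes the embedding model toward zero loss on such triplets.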

[AI-78] Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?

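【Quick read】: This paper asks whether classical game-theoretic frameworks can effectively characterize the behaviour of agents driven by Large Language Models (LLMs). In cybersecurity scenarios in particular, it asks whether the strategic interaction between LLM-driven attackers and defenders converges to the theoretically expected outcome, and whether model bias or language differences cause behaviour to deviate. The key to the solution is a reproducible framework for game-theoretic LLM agents, tested empirically on two canonical games: the one-shot zero-sum game and the dynamic Prisoner's Dilemma. Using quantitative metrics for the internal consistency and cross-language stability of LLM agents, the authors find that LLM behaviour depends on agent characteristics (such as personality traits or knowledge of repeated rounds) and is also markedly sensitive to language. This reveals the risk of deploying LLMs indiscriminately in cybersecurity applications and gives a quantitative basis for selecting more stable LLMs for security settings.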

Link: https://arxiv.org/abs/2508.05670
Authors: Daniele Proverbio,Alessio Buscemi,Alessandro Di Stefano,TheAnh Han,German Castignani,Pietro Liò
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Game theory has long served as a foundational tool in cybersecurity to test, predict, and design strategic interactions between attackers and defenders. The recent advent of Large Language Models (LLMs) offers new tools and challenges for the security of computer systems. In this work, we investigate whether classical game-theoretic frameworks can effectively capture the behaviours of LLM-driven actors and bots. Using a reproducible framework for game-theoretic LLM agents, we investigate two canonical scenarios, the one-shot zero-sum game and the dynamic Prisoner’s Dilemma, and we test whether LLMs converge to expected outcomes or exhibit deviations due to embedded biases. Our experiments involve four state-of-the-art LLMs and span five natural languages (English, French, Arabic, Vietnamese, and Mandarin Chinese) to assess linguistic sensitivity. For both games, we observe that the final payoffs are influenced by agents' characteristics such as personality traits or knowledge of repeated rounds. Moreover, we uncover an unexpected sensitivity of the final payoffs to the choice of languages, which should warn against indiscriminate application of LLMs in cybersecurity applications and call for in-depth studies, as LLMs may behave differently when deployed in different countries. We also employ quantitative metrics to evaluate the internal consistency and cross-language stability of LLM agents, to help guide the selection of the most stable LLMs and optimising models for secure applications.
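
For orientation, the repeated Prisoner's Dilemma the abstract mentions can be sketched in a few lines. The payoff values below are the conventional textbook ones (an assumption; the abstract does not give the paper's matrix), and the two hard-coded strategies are hypothetical stand-ins for the LLM agents.

```python
# Payoffs as (row player, column player); "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    """Defect regardless of what the opponent did."""
    return "D"

def play(strategy_a, strategy_b, rounds):
    """Run a repeated game and return the cumulative payoffs of both players."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        gain_a, gain_b = PAYOFFS[(move_a, move_b)]
        score_a += gain_a
        score_b += gain_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

result = play(tit_for_tat, always_defect, rounds=5)  # (4, 9)
```

In the study, each strategy function would instead prompt an LLM for its next move, and the final payoffs would be compared across models, personas, and prompt languages.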

[AI-79] ITDR: An Instruction Tuning Dataset for Enhancing Large Language Models in Recommendations

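【Quick read】: This paper addresses the limited performance of large language models (LLMs) on recommendation tasks. The core challenge is that structural differences between user behavior data and natural language make it hard for LLMs to model the associations between user preferences and items. The authors therefore build ITDR, an instruction tuning dataset covering 7 subtasks under two root tasks (user-item interaction and user-item understanding). It integrates 13 public recommendation datasets and contains approximately 200,000 instances generated from manually designed standard templates. The key contribution is a structured, multi-task instruction tuning strategy that significantly improves mainstream open-source LLMs (such as GLM-4, Qwen2.5, and LLaMA-3.2) on recommendation tasks. The paper also examines how task descriptions and data scale affect tuning effectiveness, providing a reproducible data foundation and methodological framework for adapting general-purpose LLMs to recommendation.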

Link: https://arxiv.org/abs/2508.05667
Authors: Zekun Liu,Xiaowen Huang,Jitao Sang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, in the field of recommendation systems, due to the structural differences between user behavior data and natural language, LLMs struggle to effectively model the associations between user preferences and items. Although prompt-based methods can generate recommendation results, their inadequate understanding of recommendation tasks leads to constrained performance. To address this gap, in this work, we construct a sufficient instruction tuning dataset, ITDR, which encompasses 7 subtasks across two core root tasks–user-item interaction and user-item understanding. The dataset integrates data from 13 public recommendation datasets and is built using manually crafted standardized templates, comprising approximately 200,000 instances. Experimental results demonstrate that ITDR significantly enhances the performance of mainstream open-source LLMs such as GLM-4, Qwen2.5, Qwen2.5-Instruct and LLaMA-3.2 on recommendation tasks. Furthermore, we analyze the correlations between tasks and explore the impact of task descriptions and data scale on instruction tuning effectiveness. Finally, we perform comparative experiments against closed-source LLMs with substantial parameters. Our tuning dataset ITDR and the fine-tuned large recommendation models can be accessed at this https URL.

[AI-80] HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis

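【Quick read】: This paper addresses the limitations of existing Retrieval-Augmented Generation (RAG) architectures for large-scale literature review and for identifying methodological research gaps, in particular insufficient semantic retrieval precision, poor verifiability of results, and a low degree of automation. The core solution is the HySemRAG framework, with three key innovations: a hybrid retrieval strategy that combines semantic search, keyword filtering, and knowledge graph traversal to improve retrieval accuracy; an agent-based self-correction mechanism that reaches high-confidence output through iterative quality assurance; and a post-hoc citation verification step that keeps generated content traceable. The framework automates the full workflow in eight integrated stages, from multi-source metadata acquisition to knowledge graph construction. It produces two data products, a Neo4j knowledge graph and Qdrant vector collections, which markedly improve the structure and trustworthiness of the synthesized literature information.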

Link: https://arxiv.org/abs/2508.05666
Authors: Alejandro Godinez
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 47 pages, 10 figures. Code: this https URL. Demo: this https URL. ETL+multi-agent RAG framework for literature synthesis, 35.1% improvement over PDF chunking. Real application: reduced 17,400 papers to 24 relevant ones (99.86%) in 10 minutes for wastewater epidemiology review

Abstract:We present HySemRAG, a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG) to automate large-scale literature synthesis and identify methodological research gaps. The system addresses limitations in existing RAG architectures through a multi-layered approach: hybrid retrieval combining semantic search, keyword filtering, and knowledge graph traversal; an agentic self-correction framework with iterative quality assurance; and post-hoc citation verification ensuring complete traceability. Our implementation processes scholarly literature through eight integrated stages: multi-source metadata acquisition, asynchronous PDF retrieval, custom document layout analysis using modified Docling architecture, bibliographic management, LLM-based field extraction, topic modeling, semantic unification, and knowledge graph construction. The system creates dual data products - a Neo4j knowledge graph enabling complex relationship queries and Qdrant vector collections supporting semantic search - serving as foundational infrastructure for verifiable information synthesis. Evaluation across 643 observations from 60 testing sessions demonstrates structured field extraction achieving 35.1% higher semantic similarity scores ($0.655 \pm 0.178$) compared to PDF chunking approaches ($0.485 \pm 0.204$, $p < 0.000001$). The agentic quality assurance mechanism achieves 68.3% single-pass success rates with 99.0% citation accuracy in validated responses. Applied to geospatial epidemiology literature on ozone exposure and cardiovascular disease, the system identifies methodological trends and research gaps, demonstrating broad applicability across scientific domains for accelerating evidence synthesis and discovery.
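
The two headline percentages in this entry can be sanity-checked from the figures it quotes. The check assumes that 35.1% is the relative gain in mean similarity score and that 99.86% is the share of the 17,400 papers filtered out; both are readings of the text rather than definitions taken from the paper.

```latex
\[
\frac{0.655 - 0.485}{0.485} = \frac{0.170}{0.485} \approx 0.3505,
\qquad
1 - \frac{24}{17400} \approx 0.9986 .
\]
```

The first ratio is about 35.05%, which agrees with the reported 35.1% up to rounding of the published means; the second matches the 99.86% in the comments line.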

[AI-81] From Static to Dynamic: A Streaming RAG Approach to Real-time Knowledge Base

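【Quick read】: This paper addresses the challenges that dynamic data streams (real-time data from news, social media, sensor networks, and financial markets) pose to static Retrieval-Augmented Generation (RAG) frameworks: full-scale indices incur high memory costs, periodic rebuilds introduce latency that hurts data freshness, and naive sampling sacrifices semantic coverage. The core of the solution is Streaming RAG, a unified pipeline that combines multi-vector cosine screening, mini-batch clustering, and a counter-based heavy-hitter filter to maintain a compact, semantically rich prototype set. The author further proves an approximation bound $\mathbb{E}[R(K_t)] \ge R^* - L\Delta$ that links retrieval quality to clustering variance, and designs an incremental index upsert mechanism that refreshes prototypes without interrupting queries. Under a 150 MB memory budget, this yields Recall@10 gains of up to 3 points ($p < 0.01$), end-to-end latency below 15 ms, and throughput above 900 documents per second, significantly outperforming existing methods.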

Link: https://arxiv.org/abs/2508.05662
Authors: Yuzhou Zhu
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic streams from news feeds, social media, sensor networks, and financial markets challenge static RAG frameworks. Full-scale indices incur high memory costs; periodic rebuilds introduce latency that undermines data freshness; naive sampling sacrifices semantic coverage. We present Streaming RAG, a unified pipeline that combines multi-vector cosine screening, mini-batch clustering, and a counter-based heavy-hitter filter to maintain a compact prototype set. We further prove an approximation bound $\mathbb{E}[R(K_t)] \ge R^* - L\Delta$ linking retrieval quality to clustering variance. An incremental index upsert mechanism refreshes prototypes without interrupting queries. Experiments on eight real-time streams show statistically significant gains in Recall@10 (up to 3 points, $p < 0.01$), end-to-end latency below 15 ms, and throughput above 900 documents per second under a 150 MB budget. Hyperparameter sensitivity analysis over cluster count, admission probability, relevance threshold, and counter capacity validates default settings. In open-domain question answering with GPT-3.5 Turbo, we record 3.2-point gain in Exact Match and 2.8-point gain in F1 on SQuAD; abstractive summarization yields ROUGE-L improvements. Streaming RAG establishes a new Pareto frontier for retrieval augmentation.
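
A counter-based heavy-hitter filter keeps approximate counts for frequent items in bounded memory. The sketch below uses the classic Misra-Gries summary as a stand-in; the abstract does not say which algorithm Streaming RAG uses, so both the algorithm and the idea of feeding it cluster IDs are assumptions.

```python
def misra_gries(stream, capacity):
    """Track at most `capacity` candidate heavy hitters in a single pass."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and evict those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of cluster IDs assigned to incoming documents.
stream = ["a", "b", "a", "c", "a", "d", "a"]
summary = misra_gries(stream, capacity=2)  # {"a": 3, "d": 1}
```

The stored counts are lower bounds on the true frequencies, so a frequently hit cluster such as "a" survives while rare ones are evicted, which is the behaviour a bounded prototype set needs.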

[AI-82] Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace KDD2025

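【Quick read】: This paper addresses inefficient search in consumer-to-consumer (C2C) marketplaces, where listings are unstructured and primarily visual, with the aim of improving image-based product retrieval for users. The key to the solution is a scalable visual search system built on zero-shot vision-language models, with a unified embedding pipeline optimized through dimensionality reduction that integrates real-time inference with background indexing workflows. In offline evaluation, the multilingual SigLIP model clearly outperforms the existing fine-tuned baseline (a 13.3% gain in nDCG@5). An online A/B test confirms the effect in practice, with the transaction rate via image search rising by up to 40.9%, showing that zero-shot models are a strong, practical, and flexible production baseline.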

Link: https://arxiv.org/abs/2508.05661
Authors: Andre Rusli,Shoma Ishimoto,Sho Akiyama,Aman Kumar Singh
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 6 pages, KDD 2025 Workshop on Two-sided Marketplace Optimization: Search, Pricing, Matching & Growth (TSMO)

Abstract:Visual search offers an intuitive way for customers to explore diverse product catalogs, particularly in consumer-to-consumer (C2C) marketplaces where listings are often unstructured and visually driven. This paper presents a scalable visual search system deployed in Mercari’s C2C marketplace, where end-users act as buyers and sellers. We evaluate recent vision-language models for zero-shot image retrieval and compare their performance with an existing fine-tuned baseline. The system integrates real-time inference and background indexing workflows, supported by a unified embedding pipeline optimized through dimensionality reduction. Offline evaluation using user interaction logs shows that the multilingual SigLIP model outperforms other models across multiple retrieval metrics, achieving a 13.3% increase in nDCG@5 over the baseline. A one-week online A/B test in production further confirms real-world impact, with the treatment group showing substantial gains in engagement and conversion, up to a 40.9% increase in transaction rate via image search. Our findings highlight that recent zero-shot models can serve as a strong and practical baseline for production use, which enables teams to deploy effective visual search systems with minimal overhead, while retaining the flexibility to fine-tune based on future data or domain-specific needs.

[AI-83] Open-Source Agentic Hybrid RAG Framework for Scientific Literature Review

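【Quick read】: This paper addresses the inefficiency of traditional literature review methods amid the surge in scientific publications, and their difficulty in combining structured metadata with full-text analysis. Existing hybrid Retrieval Augmented Generation (RAG) systems are usually statically configured, depend on proprietary tools, and lack uncertainty estimates, which limits their usefulness for research. The key to the solution is an agent-driven dynamic RAG architecture. First, the agent chooses between graph retrieval (GraphRAG) and vector retrieval (VectorRAG) for each query based on its semantics. Second, instruction-tuned generation adapts in real time to researcher needs. Third, bootstrapped evaluation quantifies uncertainty during inference, which improves relevance, reduces hallucinations, and aids reproducibility. The pipeline builds a Neo4j citation knowledge graph (KG) and a FAISS vector store (VS) from open-access APIs (PubMed, arXiv, and Google Scholar) and uses a Llama-3.3-70B model for end-to-end orchestration, outperforming the baseline across multiple metrics.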

Link: https://arxiv.org/abs/2508.05660
Authors: Aditya Nagori,Ricardo Accorsi Casonatto,Ayush Gautam,Abhinav Manikantha Sai Cheruvu,Rishikesan Kamaleswaran
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The surge in scientific publications challenges traditional review methods, demanding tools that integrate structured metadata with full-text analysis. Hybrid Retrieval Augmented Generation (RAG) systems, combining graph queries with vector search offer promise but are typically static, rely on proprietary tools, and lack uncertainty estimates. We present an agentic approach that encapsulates the hybrid RAG pipeline within an autonomous agent capable of (1) dynamically selecting between GraphRAG and VectorRAG for each query, (2) adapting instruction-tuned generation in real time to researcher needs, and (3) quantifying uncertainty during inference. This dynamic orchestration improves relevance, reduces hallucinations, and promotes reproducibility. Our pipeline ingests bibliometric open-access data from PubMed, arXiv, and Google Scholar APIs, builds a Neo4j citation-based knowledge graph (KG), and embeds full-text PDFs into a FAISS vector store (VS) using the all-MiniLM-L6-v2 model. A Llama-3.3-70B agent selects GraphRAG (translating queries to Cypher for KG) or VectorRAG (combining sparse and dense retrieval with re-ranking). Instruction tuning refines domain-specific generation, and bootstrapped evaluation yields standard deviation for evaluation metrics. On synthetic benchmarks mimicking real-world queries, the Instruction-Tuned Agent with Direct Preference Optimization (DPO) outperforms the baseline, achieving a gain of 0.63 in VS Context Recall and a 0.56 gain in overall Context Precision. Additional gains include 0.24 in VS Faithfulness, 0.12 in both VS Precision and KG Answer Relevance, 0.11 in overall Faithfulness score, 0.05 in KG Context Recall, and 0.04 in both VS Answer Relevance and overall Precision. These results highlight the system’s improved reasoning over heterogeneous sources and establish a scalable framework for autonomous, agentic scientific discovery. 
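
The statement that bootstrapped evaluation yields a standard deviation for evaluation metrics can be illustrated with a minimal sketch: resample the per-query scores with replacement and measure how much the mean moves. The function name, the resample count, and the toy scores are hypothetical; the paper's exact procedure is not given in the abstract.

```python
import random
import statistics

def bootstrap_std(scores, n_resamples=1000, seed=0):
    """Return the mean and standard deviation of bootstrap means of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        sample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)

# Hypothetical per-query faithfulness scores from one evaluation run.
mean_score, spread = bootstrap_std([0.2, 0.4, 0.9, 0.7, 0.5])
```

Reporting `spread` alongside `mean_score` indicates whether a gain such as 0.04 in precision is larger than the resampling noise.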

[AI-84] Beyond Single Labels: Improving Conversational Recommendation through LLM-Powered Data Augmentation

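【Quick read】: This paper addresses the "false negative" problem in Conversational Recommender Systems (CRSs), where items a user might like are wrongly labeled as negatives during training, which degrades recommendation performance. The key to the solution is a novel data augmentation framework: an LLM-based semantic retriever first identifies diverse, semantically relevant candidate items, and a relevance scorer then filters out noisy ones. On top of this, a two-stage training strategy balances semantic relevance against preserving collaborative information, which improves CRS recommendation quality.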

Link: https://arxiv.org/abs/2508.05657
Authors: Haozhe Xu,Xiaohua Wang,Changze Lv,Xiaoqing Zheng
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Conversational recommender systems (CRSs) enhance recommendation quality by engaging users in multi-turn dialogues, capturing nuanced preferences through natural language interactions. However, these systems often face the false negative issue, where items that a user might like are incorrectly labeled as negative during training, leading to suboptimal recommendations. Expanding the label set through data augmentation presents an intuitive solution but faces the challenge of balancing two key aspects: ensuring semantic relevance and preserving the collaborative information inherent in CRS datasets. To address these issues, we propose a novel data augmentation framework that first leverages an LLM-based semantic retriever to identify diverse and semantically relevant items, which are then filtered by a relevance scorer to remove noisy candidates. Building on this, we introduce a two-stage training strategy balancing semantic relevance and collaborative information. Extensive experiments on two benchmark datasets and user simulators demonstrate significant and consistent performance improvements across various recommenders, highlighting the effectiveness of our approach in advancing CRS performance.

[AI-85] Comparison of Information Retrieval Techniques Applied to IT Support Tickets

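【Quick read】: This paper addresses the inefficiency IT support analysts face when handling large volumes of support tickets. The core challenge is retrieving, from historical tickets, the solutions most relevant to the current problem. The key to the solution is applying Information Retrieval (IR) techniques to match ticket content semantically and so recommend past fixes automatically. The study compares eleven IR methods and finds that the multilingual Sentence-BERT model (distilluse-base-multilingual-cased-v1) performs best, with 78.7% of its recommendations judged relevant, ahead of TF-IDF (69.0%), Word2vec (68.7%), and LDA (66.3%). The authors also propose a new evaluation metric that more closely reflects how IT analysts perceive retrieval quality, release the datasets and code as open source, and demonstrate the system's feasibility and practicality.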

Link: https://arxiv.org/abs/2508.05654
Authors: Leonardo Santiago Benitez Pereira,Robinson Pizzio,Samir Bonho
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Institutions dependent on IT services and resources acknowledge the crucial significance of an IT help desk system that acts as a centralized hub connecting IT staff and users for service requests. Employing various Machine Learning models, these IT help desk systems allow access to corrective actions used in the past, but each model has different performance when applied to different datasets. This work compares eleven Information Retrieval techniques in a dataset of IT support tickets, with the goal of implementing software that facilitates the work of Information Technology support analysts. The best results were obtained with the Sentence-BERT technique, in its multi-language variation distilluse-base-multilingual-cased-v1, where 78.7% of the recommendations made by the model were considered relevant. TF-IDF (69.0%), Word2vec (68.7%) and LDA (66.3%) techniques also had consistent results. Furthermore, the used datasets and essential parts of coding have been published and made open source. This work also demonstrated the practicality of a support ticket recovery system by implementing a minimal viable prototype, and described in detail the implementation of the system. Finally, this work proposed a novel metric for comparing the techniques, whose aim is to closely reflect the perception of the IT analysts about the retrieval quality.
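
As a rough illustration of the TF-IDF baseline named in the abstract, the sketch below ranks past tickets against a new one by cosine similarity of TF-IDF vectors. It is a dependency-free approximation with whitespace tokenisation and a smoothed IDF (both assumptions), and the ticket texts are invented; the paper's own implementation may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) and the IDF table for a corpus."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenised]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def most_similar(query, docs):
    """Return the index of the past ticket closest to `query`."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    query_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(query_vec, vec) for vec in vectors]
    return max(range(len(docs)), key=scores.__getitem__)

tickets = [
    "printer not printing after driver update",
    "cannot connect to vpn from home",
    "email password reset request",
]
best = most_similar("vpn connection fails from home", tickets)  # index 1
```

A sentence-embedding model such as the one the paper found best would replace `tfidf_vectors` with dense vectors, leaving the cosine ranking step unchanged.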

[AI-86] Modeling Interactive Narrative Systems: A Formal Approach

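【Quick read】: This paper addresses the fragmentation of research on Interactive Narrative Systems (INS): because research methods vary and system representations are not unified, it is hard to analyze, describe, and compare INS properties systematically. The key to the solution is a formal representation framework, built on the diversity of existing state-of-the-art approaches, that provides a consistent vocabulary and modeling structure and so supports standardized expression and evaluation of INS properties. Experiments on the "Little Red Riding Hood" scenario show that the formalism improves the quality of INS evaluation and can help the INS research field move toward greater collaboration and coherence.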

Link: https://arxiv.org/abs/2508.05653
Authors: Jules Clerc,Domitile Lourdeaux,Mohamed Sallak,Johann Barbier,Marc Ravaine
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Interactive Narrative Systems (INS) have revolutionized digital experiences by empowering users to actively shape their stories, diverging from traditional passive storytelling. However, the field faces challenges due to fragmented research efforts and diverse system representations. This paper introduces a formal representation framework for INS, inspired by diverse approaches from the state of the art. By providing a consistent vocabulary and modeling structure, the framework facilitates the analysis, the description and comparison of INS properties. Experimental validations on the “Little Red Riding Hood” scenario highlight the usefulness of the proposed formalism and its impact on improving the evaluation of INS. This work aims to foster collaboration and coherence within the INS research community by proposing a methodology for formally representing these systems.

[AI-87] Lessons from A Large Language Model-based Outdoor Trail Recommendation Chatbot with Retrieval Augmented Generation

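【Quick read】: This paper addresses two core questions: how to provide accurate outdoor trail information through conversational AI, and how to deliver usable and efficient recommendation services. The key to the solution is Judy, a chatbot built on a Large Language Model (LLM) combined with Retrieval-Augmented Generation (RAG). By combining structured outdoor trail data with the LLM's semantic understanding, the system keeps recommendations accurate while improving interaction efficiency and user experience. Experimental results confirm its effectiveness and usability for outdoor trails in Connecticut.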

Link: https://arxiv.org/abs/2508.05652
Authors: Julia Ann Mathew,Suining He
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 4 pages, UrbComp 2025

Abstract:The increasing popularity of outdoor recreational activities (such as hiking and biking) has boosted the demand for a conversational AI system to provide informative and personalized suggestions on outdoor trails. Challenges arise in response to (1) how to provide accurate outdoor trail information via conversational AI; and (2) how to enable usable and efficient recommendation services. To address the above, this paper discusses the preliminary and practical lessons learned from developing Judy, an outdoor trail recommendation chatbot based on the large language model (LLM) with retrieval augmented generation (RAG). To gain concrete system insights, we have performed case studies with the outdoor trails in Connecticut (CT), US. We have conducted web-based data collection, outdoor trail data management, and LLM model performance studies on the RAG-based recommendation. Our experimental results have demonstrated the accuracy, effectiveness, and usability of Judy in recommending outdoor trails based on the LLM with RAG.

[AI-88] OmniBench-RAG: A Multi-Domain Evaluation Platform for Retrieval-Augmented Generation Tools

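【Quick read】: This paper addresses the lack of a reproducible, interpretable, and standardized framework for comparing Retrieval Augmented Generation (RAG) systems across models and domains. Existing methods commonly suffer from insufficient domain coverage, coarse metrics that miss sub-document precision differences, and neglect of computational efficiency trade-offs. The core of the solution is OmniBench RAG, an automated multi-domain RAG evaluation platform. It introduces two standardized metrics, Improvements (accuracy gains) and Transformation (efficiency differences between pre-RAG and post-RAG models), to quantify RAG performance along accuracy and efficiency dimensions. Combined with dynamic test generation, modular evaluation pipelines, and automated knowledge base construction, it supports reproducible comparisons across models and tasks.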

Link: https://arxiv.org/abs/2508.05650
Authors: Jiaxuan Liang,Shide Zhou,Kailong Wang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Retrieval Augmented Generation (RAG) is now widely adopted to enhance LLMs, evaluating its true performance benefits in a reproducible and interpretable way remains a major hurdle. Existing methods often fall short: they lack domain coverage, employ coarse metrics that miss sub document precision, and fail to capture computational trade offs. Most critically, they provide no standardized framework for comparing RAG effectiveness across different models and domains. We introduce OmniBench RAG, a novel automated platform for multi domain evaluation of RAG systems. The platform quantifies performance gains across accuracy and efficiency dimensions, spanning nine knowledge fields including culture, geography, and health. We introduce two standardized metrics: Improvements (accuracy gains) and Transformation (efficiency differences between pre RAG and post RAG models), enabling reproducible comparisons across models and tasks. The platform features dynamic test generation, modular evaluation pipelines, and automated knowledge base construction. Our evaluation reveals striking variability in RAG effectiveness, from significant gains in culture to declines in mathematics, highlighting the critical importance of systematic, domain aware assessment. A demonstration video is available at: this https URL. Code and datasets: this https URL.

[AI-89] AquiLLM: a RAG Tool for Capturing Tacit Knowledge in Research Groups

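【Quick read】: This paper addresses the difficulty of capturing, storing, and retrieving a research team's collective knowledge, which is scattered, unstructured, and often tacit. It targets in particular the fact that private internal documents (such as emails, meeting notes, and training materials) cannot be fully used by existing Retrieval-Augmented Generation (RAG) systems because of privacy concerns. The key to the solution is AquiLLM, a lightweight, modular RAG system that supports varied document types and configurable privacy settings. This improves how efficiently research groups can access both formal and informal knowledge resources while keeping internal materials secure.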

Link: https://arxiv.org/abs/2508.05648
Authors: Chandler Campbell,Bernie Boscoe,Tuan Do
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to US Research Software Engineer Association (US-RSE) 2025

Abstract:Research groups face persistent challenges in capturing, storing, and retrieving knowledge that is distributed across team members. Although structured data intended for analysis and publication is often well managed, much of a group’s collective knowledge remains informal, fragmented, or undocumented–often passed down orally through meetings, mentoring, and day-to-day collaboration. This includes private resources such as emails, meeting notes, training materials, and ad hoc documentation. Together, these reflect the group’s tacit knowledge–the informal, experience-based expertise that underlies much of their work. Accessing this knowledge can be difficult, requiring significant time and insider understanding. Retrieval-augmented generation (RAG) systems offer promising solutions by enabling users to query and generate responses grounded in relevant source material. However, most current RAG-LLM systems are oriented toward public documents and overlook the privacy concerns of internal research materials. We introduce AquiLLM (pronounced ah-quill-em), a lightweight, modular RAG system designed to meet the needs of research groups. AquiLLM supports varied document types and configurable privacy settings, enabling more effective access to both formal and informal knowledge within scholarly groups.

[AI-90] Query-Aware Graph Neural Networks for Enhanced Retrieval-Augmented Generation

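【Quick read】: This paper addresses a weakness of traditional dense retrieval on complex multi-hop question answering: because it treats documents as independent entities, it struggles to capture semantic links across documents. The key to the solution is to build a graph-structured knowledge base (knowledge graph) per episode, and to model the sequential and semantic relationships between text chunks with query-aware attention and learned scoring heads. A query-guided pooling strategy then focuses dynamically on the regions of the graph relevant to the user query, which improves retrieval accuracy.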

Link: https://arxiv.org/abs/2508.05647
Authors: Vibhor Agrawal,Fay Wang,Rishi Puri
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel graph neural network (GNN) architecture for retrieval-augmented generation (RAG) that leverages query-aware attention mechanisms and learned scoring heads to improve retrieval accuracy on complex, multi-hop questions. Unlike traditional dense retrieval methods that treat documents as independent entities, our approach constructs per-episode knowledge graphs that capture both sequential and semantic relationships between text chunks. We introduce an Enhanced Graph Attention Network with query-guided pooling that dynamically focuses on relevant parts of the graph based on user queries. Experimental results demonstrate that our approach significantly outperforms standard dense retrievers on complex question answering tasks, particularly for questions requiring multi-document reasoning. Our implementation leverages PyTorch Geometric for efficient processing of graph-structured data, enabling scalable deployment in production retrieval systems.
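
One simple reading of "query-guided pooling" is attention pooling in which each node is weighted by a softmax over its dot product with the query. The sketch below shows that reading in plain Python; it is an assumption about the mechanism, not the paper's PyTorch Geometric implementation, and the names and vectors are hypothetical.

```python
import math

def query_guided_pool(node_embeddings, query):
    """Pool node embeddings into one vector, weighting nodes by query relevance."""
    scores = [sum(a * b for a, b in zip(node, query)) for node in node_embeddings]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * node[d] for w, node in zip(weights, node_embeddings))
        for d in range(len(query))
    ]
    return pooled, weights

# Two text-chunk nodes; the query aligns with the first one.
pooled, weights = query_guided_pool([[1.0, 0.0], [0.0, 1.0]], query=[2.0, 0.0])
```

The node aligned with the query dominates the pooled vector, which is the sense in which pooling "focuses on relevant parts of the graph".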

[AI-91] Request-Only Optimization for Recommendation Systems

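【Quick read】: This paper addresses the low storage efficiency, poor training efficiency, and difficulty of improving model quality that industrial Deep Learning Recommendation Models (DLRMs) face in very large-scale training. As user histories grow richer, DLRMs have scaled to as much as trillions of floating-point operations (TFLOPs) per example, putting heavy pressure on storage and compute. The key to the solution is the Request-Only Optimizations (ROO) paradigm, which co-designs the data layer (logging user requests rather than individual impressions), the infrastructure (a request-based data processing pipeline), and the model architecture (neural architectures that take request-level input only). This achieves native feature deduplication, which reduces storage, and removes duplicated computation and communication across the multiple impressions within a request, which lets the model capture user interest signals better. Storage efficiency, training efficiency, and model quality improve together.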

Link: https://arxiv.org/abs/2508.05640
Authors: Liang Guo,Wei Li,Lucy Liao,Huihui Cheng,Rui Zhang,Yu Shi,Yueming Wang,Yanzun Huang,Keke Zhai,Pengchao Wang,Timothy Shi,Xuan Cao,Shengzhi Wang,Renqin Cai,Zhaojie Gong,Omkar Vichare,Rui Jian,Leon Gao,Shiyan Deng,Xingyu Liu,Xiong Zhang,Fu Li,Wenlei Xie,Bin Wen,Rui Li,Xing Liu,Jiaqi Zhai
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Learning Recommendation Models (DLRMs) represent one of the largest machine learning applications on the planet. Industry-scale DLRMs are trained with petabytes of recommendation data to serve billions of users every day. To utilize the rich user signals in the long user history, DLRMs have been scaled up to unprecedented complexity, up to trillions of floating-point operations (TFLOPs) per example. This scale, coupled with the huge amount of training data, necessitates new storage and training algorithms to efficiently improve the quality of these complex recommendation systems. In this paper, we present a Request-Only Optimizations (ROO) training and modeling paradigm. ROO simultaneously improves the storage and training efficiency as well as the model quality of recommendation systems. We holistically approach this challenge through co-designing data (i.e., request-only data), infrastructure (i.e., request-only based data processing pipeline), and model architecture (i.e., request-only neural architectures). Our ROO training and modeling paradigm treats a user request as a unit of the training data. Compared with the established practice of treating a user impression as a unit, our new design achieves native feature deduplication in data logging, consequently saving data storage. Second, by de-duplicating computations and communications across multiple impressions in a request, this new paradigm enables highly scaled-up neural network architectures to better capture user interest signals, such as Generative Recommenders (GRs) and other request-only friendly architectures.
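
The storage saving from treating a request, rather than an impression, as the unit of training data can be sketched as a simple regrouping. The record layout and field names (`request_id`, `user_features`, `item_id`, `label`) are hypothetical; the abstract does not describe the actual logging schema.

```python
def to_request_rows(impressions):
    """Group impression-level rows by request so shared user features are stored once."""
    requests = {}
    for imp in impressions:
        row = requests.setdefault(
            imp["request_id"],
            {"user_features": imp["user_features"], "items": []},
        )
        # Only the per-impression fields are appended; user features are not repeated.
        row["items"].append((imp["item_id"], imp["label"]))
    return requests

impressions = [
    {"request_id": "r1", "user_features": [0.1, 0.7], "item_id": "a", "label": 1},
    {"request_id": "r1", "user_features": [0.1, 0.7], "item_id": "b", "label": 0},
    {"request_id": "r2", "user_features": [0.4, 0.2], "item_id": "c", "label": 0},
]
request_rows = to_request_rows(impressions)
```

With one row per request, a model can also encode the user history once and score all items of that request against it, which is the computation deduplication the abstract describes.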

[AI-92] Automated Visualization Makeovers with LLMs

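【Quick read】: This paper addresses the lack of systematic data visualization training in data science education, that is, how to help users improve existing charts so that they convey information more accurately and efficiently. The key to the solution is using multimodal Large Language Models (LLMs) with prompt engineering to produce semi-automated improvement suggestions for existing charts. Drawing on pre-trained knowledge and user-specified best-practice guidelines, the model analyzes a chart given as an image or as generating code and returns actionable suggestions, rather than generating new visualization scripts from raw data. The approach centers on educational feedback that builds the user's own visualization skills, and a quantitative evaluation measures the model's sensitivity to common issues across chart types.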

Link: https://arxiv.org/abs/2508.05637
Authors: Siddharth Gangwar,David A. Selby,Sebastian J. Vollmer
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Making a good graphic that accurately and efficiently conveys the desired message to the audience is both an art and a science, typically not taught in the data science curriculum. Visualisation makeovers are exercises where the community exchanges feedback to improve charts and data visualizations. Can multi-modal large language models (LLMs) emulate this task? Given a plot in the form of an image file, or the code used to generate it, an LLM, primed with a list of visualization best practices, is employed to semi-automatically generate constructive criticism to produce a better plot. Our system is centred around prompt engineering of a pre-trained model, relying on a combination of user-specified guidelines and any latent knowledge of data visualization practices that might lie within an LLM's training corpus. Unlike other works, the focus is not on generating valid visualization scripts from raw data or prompts, but on educating the user how to improve their existing data visualizations according to an interpretation of best practices. A quantitative evaluation is performed to measure the sensitivity of the LLM agent to various plotting issues across different chart types. We make the tool available as a simple self-hosted applet with an accessible Web interface.

[AI-93] Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation

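Quick read: This paper addresses the challenge of evaluating reinforcement learning (RL)-driven agent-based models (ABMs) and rational agent-based models (RABMs), in particular the inadequacy of traditional metrics given the complexity and stochasticity of the modeled systems. The key to the solution is to develop domain-driven rewards and metrics (Domain-driven-RL-metrics) that embed domain knowledge in the RL objective and to combine them with traditional and state-of-the-art evaluation metrics. Their effectiveness is validated in a rational ABM case study that simulates masking, vaccination, and lockdown behavior during a pandemic, improving interpretability and comparability across scenarios such as differing mask availability.

Link: https://arxiv.org/abs/2508.05154
Authors: Rishabh Gaur, Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: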

Abstract:For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMs is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we develop domain-driven metrics for RL, while building on state-of-the-art metrics. We demonstrate our "Domain-driven-RL-metrics" using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics for a few different simulation scenarios such as the differential availability of masks.

[AI-94] SHACL Validation in the Presence of Ontologies: Semantics and Rewriting Techniques

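Quick read: This paper addresses the semantic gap between SHACL (Shapes Constraint Language) and OWL (Web Ontology Language) in RDF data management, that is, how to validate SHACL constraints in the presence of ontologies. The core challenge is that SHACL relies on the closed-world assumption while OWL relies on the open-world assumption, and this semantic mismatch makes validation difficult. The key to the solution is a semantics for SHACL validation based on core universal models, together with a construction of such models for ontologies in the description logic Horn-ALCHIQ. Using a finite representation of this model, the authors design a rewriting technique that reduces SHACL validation in the presence of ontologies to standard SHACL validation, so that the two formalisms are handled uniformly in both theory and computation.

Link: https://arxiv.org/abs/2507.12286
Authors: Anouk Oudshoorn, Magdalena Ortiz, Mantas Simkus
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 36 pages, 6 figures, submitted to the journal of Artificial Intelligence (AIJ)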

Abstract:SHACL and OWL are two prominent W3C standards for managing RDF data. These languages share many features, but they have one fundamental difference: OWL, designed for inferring facts from incomplete data, makes the open-world assumption, whereas SHACL is a constraint language that treats the data as complete and must be validated under the closed-world assumption. The combination of both formalisms is very appealing and has been called for, but their semantic gap is a major challenge, semantically and computationally. In this paper, we advocate a semantics for SHACL validation in the presence of ontologies based on core universal models. We provide a technique for constructing these models for ontologies in the rich data-tractable description logic Horn-ALCHIQ. Furthermore, we use a finite representation of this model to develop a rewriting technique that reduces SHACL validation in the presence of ontologies to standard validation. Finally, we study the complexity of SHACL validation in the presence of ontologies, and show that even very simple ontologies make the problem EXPTIME-complete, and PTIME-complete in data complexity.

[AI-95] Epidemic Control on a Large-Scale-Agent-Based Epidemiology Model using Deep Deterministic Policy Gradient

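Quick read: This paper addresses how epidemic interventions such as lockdowns and vaccination can achieve an optimal balance between health and economic objectives in a large population. Existing work is limited by small simulation scale, model types unsuited to intervention analysis, and a restricted set of explorable strategies, which makes it hard to identify optimal intervention policies automatically. The key to the solution is a policy optimization framework based on Deep Deterministic Policy Gradient (DDPG), applied to a large-scale epidemiological agent-based simulation with 100,000 individuals. It performs multi-objective optimization over infection, hospitalization, and economic status, and finds that a policy with no lockdown and vaccination of only the mid-age and elderly groups balances health and economic outcomes.

Link: https://arxiv.org/abs/2304.04475
Authors: Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: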

Abstract:To mitigate the impact of the pandemic, several measures include lockdowns, rapid vaccination programs, school closures, and economic stimulus. These interventions can have positive or unintended negative consequences. Current research to model and determine an optimal intervention automatically through round-tripping is limited by the simulation objectives, scale (a few thousand individuals), model types that are not suited for intervention studies, and the number of intervention strategies they can explore (discrete vs continuous). We address these challenges using a Deep Deterministic Policy Gradient (DDPG) based policy optimization framework on a large-scale (100,000 individual) epidemiological agent-based simulation where we perform multi-objective optimization. We determine the optimal policy for lockdown and vaccination in a minimalist age-stratified multi-vaccine scenario with a basic simulation for economic activity. With no lockdown and vaccination (mid-age and elderly), results show optimal economy (individuals below the poverty line) with balanced health objectives (infection, and hospitalization). An in-depth simulation is needed to further validate our results and open-source our framework.

[AI-96] Intuition emerges in Maximum Caliber models at criticality

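Quick read: This paper asks whether large predictive models merely parrot their training data or can produce genuine insight, that is, whether there is a form of "intuition" that admits a physical explanation. The key to the solution is a minimal principle called "mind-tuning", which imposes Maximum Caliber on predictive models through a temperature-like parameter $\lambda$ and gives rise to a metastable phase of learning. The mechanism balances next-token prediction against future path-entropy; within a specific parameter window, models spontaneously discover novel goal-directed strategies and exhibit intuition-like behavior. The finding frames intuition as an emergent property at the critical balance between memorization and exploration.

Link: https://arxiv.org/abs/2508.06477
Authors: Lluís Arola-Fernández
Affiliation: Unknown
Categories: Physics and Society (physics.soc-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: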

Abstract:Whether large predictive models merely parrot their training data or produce genuine insight lacks a physical explanation. This work reports a primitive form of intuition that emerges as a metastable phase of learning that critically balances next-token prediction against future path-entropy. The intuition mechanism is discovered via mind-tuning, the minimal principle that imposes Maximum Caliber in predictive models with a control temperature-like parameter $\lambda$. Training on random walks in deterministic mazes reveals a rich phase diagram: imitation (low $\lambda$), rule-breaking hallucination (high $\lambda$), and a fragile in-between window exhibiting strong protocol-dependence (hysteresis) and multistability, where models spontaneously discover novel goal-directed strategies. These results are captured by an effective low-dimensional theory and frame intuition as an emergent property at the critical balance between memorizing what is and wondering what could be.

[AI-97] LLM Serving Optimization with Variable Prefill and Decode Lengths

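Quick read: This paper addresses request scheduling in large language model (LLM) serving, where each request has heterogeneous prefill and decode lengths. The prefill length determines the initial KV cache memory footprint, and each output token generated during decoding increases KV cache usage linearly, which intensifies resource contention and complicates scheduling. Because of batching, placement constraints, precedence relationships, and memory usage that grows linearly over time, the problem is shown to be NP-hard. Existing strategies such as First-Come-First-Serve (FCFS) and Shortest-First (SF) have competitive ratios that grow sublinearly with the memory limit and perform poorly in practical large-memory settings. The paper proposes a scheduling algorithm built on a new selection metric that forms batches dynamically and efficiently, with a proven constant competitive ratio. Variants based on dynamic programming, local search, and linear programming (LP) balance performance and efficiency and clearly outperform standard baselines in simulation.

Link: https://arxiv.org/abs/2508.06133
Authors: Meixuan Wang, Yinyu Ye, Zijie Zhou
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: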

Abstract:We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache. The decode length refers to the number of output tokens generated sequentially, with each additional token increasing the KV cache memory usage by one unit. Given a set of n requests, our goal is to schedule and process them to minimize the total completion time. We show that this problem is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze commonly used scheduling strategies in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios scale up sublinearly with the memory limit, a significant drawback in real-world settings where memory demand is large. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time. We prove that this algorithm achieves a constant competitive ratio. Finally, we develop and evaluate a few algorithm variants inspired by this approach, including dynamic programming variants, local search methods, and an LP-based scheduler, demonstrating through comprehensive simulations that they outperform standard baselines while maintaining computational efficiency.
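
The memory model described in this abstract (prefill sets the initial KV cache usage, and each decoded token adds one unit) can be sketched as follows. This is a simplified illustration and not the paper's scheduling algorithm: it assumes every request in a batch starts at the same step, decodes one token per step, and releases its KV cache as soon as it finishes. The names `batch_peak_memory` and `fits` are hypothetical.

```python
def batch_peak_memory(requests):
    """Peak KV cache usage of a batch under the simplified model.

    requests: list of (prefill_len, decode_len) pairs.
    Assumption: all requests start together and decode one token per step;
    a request occupies prefill_len + t units at step t while t <= decode_len.
    """
    horizon = max(decode_len for _, decode_len in requests)
    peak = 0
    for t in range(1, horizon + 1):
        used = sum(prefill + t for prefill, decode_len in requests if t <= decode_len)
        peak = max(peak, used)
    return peak


def fits(requests, memory_limit):
    """True if the batch never exceeds the memory limit under this model."""
    return batch_peak_memory(requests) <= memory_limit


batch = [(4, 2), (1, 5)]          # (prefill length, decode length)
print(batch_peak_memory(batch))   # 9, reached at step 2: (4 + 2) + (1 + 2)
print(fits(batch, memory_limit=8))  # False
```

The example shows why the peak is not necessarily at the last step: the long-prefill request finishes early and frees its memory, so a scheduler has to track usage over time rather than only at admission.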

[AI-98] CLAPP: The CLASS LLM Agent for Pair Programming

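Quick read: This paper addresses the efficiency bottleneck researchers face when using the Einstein-Boltzmann solver CLASS for computational cosmology, where the tool's complexity makes writing and debugging code difficult. The key to the solution is CLAPP (CLASS LLM Agent for Pair Programming), which combines a multi-agent architecture built on large language models (LLMs), semantic search over the CLASS documentation, and an integrated live Python execution environment. Together these support interactive code generation, error debugging, and plotting, improving human-AI collaboration and lowering the barrier to using AI tools.

Link: https://arxiv.org/abs/2508.05728
Authors: Santiago Casas, Christian Fidler, Boris Bolliet, Francisco Villaescusa-Navarro, Julien Lesgourgues
Affiliation: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Code: this https URL, Streamlit app: this https URL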

Abstract:We introduce CLAPP (CLASS LLM Agent for Pair Programming), an interactive AI assistant designed to support researchers working with the Einstein-Boltzmann solver CLASS. CLAPP leverages large language models (LLMs) and domain-specific retrieval to provide conversational coding support for CLASS: answering questions, generating code, debugging errors, and producing plots. Its architecture combines multi-agent LLM orchestration, semantic search across CLASS documentation, and a live Python execution environment. Deployed as a user-friendly web application, CLAPP lowers the entry barrier for scientists unfamiliar with AI tools and enables more productive human-AI collaboration in computational and numerical cosmology. The app is available at this https URL

[AI-99] A Physiologically-Constrained Neural Network Digital Twin Framework for Replicating Glucose Dynamics in Type 1 Diabetes

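Quick read: This paper addresses two common shortcomings of current models for simulating glucose dynamics in individuals with type 1 diabetes (T1D): missing physiological mechanisms and difficulty of personalization. Existing models often fail to reflect key physiological processes and cannot model individual differences effectively, which limits their use in precision medicine and clinical decision support. The key to the solution is a physiologically-constrained neural network (PCNN) digital twin framework. It first builds a population-level state-space neural network model constrained by ordinary differential equations (ODEs), ensuring consistency with known T1D physiological dynamics. It then adds individual-specific parameters, such as glucose management data and contextual information, to produce personalized digital twins that capture both inter- and intra-individual variability. Validated on real-world data from the T1D Exercise Initiative study, simulated and observed glucose metrics (time in range and time below or above range) are clinically equivalent, suggesting potential for personalized in silico treatment testing and insulin optimization.

Link: https://arxiv.org/abs/2508.05705
Authors: Valentina Roquemen-Echeverri, Taisa Kushner, Peter G. Jacobs, Clara Mosquera-Lopez
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: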

Abstract:Simulating glucose dynamics in individuals with type 1 diabetes (T1D) is critical for developing personalized treatments and supporting data-driven clinical decisions. Existing models often miss key physiological aspects and are difficult to individualize. Here, we introduce physiologically-constrained neural network (NN) digital twins to simulate glucose dynamics in T1D. To ensure interpretability and physiological consistency, we first build a population-level NN state-space model aligned with a set of ordinary differential equations (ODEs) describing glucose regulation. This model is formally verified to conform to known T1D dynamics. Digital twins are then created by augmenting the population model with individual-specific models, which include personal data, such as glucose management and contextual information, capturing both inter- and intra-individual variability. We validate our approach using real-world data from the T1D Exercise Initiative study. Two weeks of data per participant were split into 5-hour sequences and simulated glucose profiles were compared to observed ones. Clinically relevant outcomes were used to assess similarity via paired equivalence t-tests with predefined clinical equivalence margins. Across 394 digital twins, glucose outcomes were equivalent between simulated and observed data: time in range (70-180 mg/dL) was 75.1 ± 21.2% (simulated) vs. 74.4 ± 15.4% (real; P<0.001); time below range (<70 mg/dL) 2.5 ± 5.2% vs. 3.0 ± 3.3% (P=0.022); and time above range (>180 mg/dL) 22.4 ± 22.0% vs. 22.6 ± 15.9% (P<0.001). Our framework can incorporate unmodeled factors like sleep and activity while preserving key dynamics. This approach enables personalized in silico testing of treatments, supports insulin optimization, and integrates physics-based and data-driven modeling. Code: this https URL
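
The outcome metrics reported in this abstract (time in, below, and above the 70-180 mg/dL range) are simple to compute from a glucose trace. The following is a minimal sketch, assuming equally spaced readings so that the share of samples equals the share of time; the function name `glucose_time_fractions` and the sample values are hypothetical and are not taken from the paper or its code.

```python
def glucose_time_fractions(readings_mg_dl, low=70, high=180):
    """Percent of readings in range [low, high], below low, and above high.

    Assumes equally spaced samples, so sample share equals time share.
    """
    n = len(readings_mg_dl)
    if n == 0:
        raise ValueError("need at least one glucose reading")
    below = sum(1 for g in readings_mg_dl if g < low)
    above = sum(1 for g in readings_mg_dl if g > high)
    in_range = n - below - above
    return {
        "time_in_range": 100 * in_range / n,
        "time_below_range": 100 * below / n,
        "time_above_range": 100 * above / n,
    }


# Made-up trace: 2 readings below 70, 2 above 180, 6 within range.
trace = [65, 80, 120, 190, 200, 150, 70, 180, 60, 100]
print(glucose_time_fractions(trace))
```

Applying the same function to a simulated and an observed trace gives the paired values that an equivalence test like the one described above would compare.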

Machine Learning

[LG-0] LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection

Link: https://arxiv.org/abs/2508.06467
Authors: Ameya Anjarlekar, Sandeep Pombra
Categories: Machine Learning (cs.LG)
Comments: 14 Pages, 3 Figures, 11 Tables

Abstract:The growing legal and ethical scrutiny of large language models (LLMs) necessitates effective machine unlearning, particularly for sensitive or unauthorized data. Existing empirical methods often yield incomplete forgetting or unintended degradation of unrelated knowledge due to poor localization. In this work, we propose GRIN: a modular and targeted framework for LLM unlearning. GRIN introduces a novel gradient-ratio-based metric to identify parameters most responsible for memorizing forget data. We then perform selective noise injection into these parameters prior to fine-tuning, which improves unlearning performance while maintaining model utility. Finally, we propose new evaluation metrics tailored to the LLM setting and validate our approach on standard benchmarks such as TOFU, WMDP, and SafePKU.

[LG-1] Maximum Impact with Fewer Features: Efficient Feature Selection for Cold-Start Recommenders through Collaborative Importance Weighting

Link: https://arxiv.org/abs/2508.06455
Authors: Nikita Sukhorukov, Danil Gusak, Evgeny Frolov
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Cold-start challenges in recommender systems necessitate leveraging auxiliary features beyond user-item interactions. However, the presence of irrelevant or noisy features can degrade predictive performance, whereas an excessive number of features increases computational demands, leading to higher memory consumption and prolonged training times. To address this, we propose a feature selection strategy that prioritizes the user behavioral information. Our method enhances the feature representation by incorporating correlations from collaborative behavior data using a hybrid matrix factorization technique and then ranks features using a mechanism based on the maximum volume algorithm. This approach identifies the most influential features, striking a balance between recommendation accuracy and computational efficiency. We conduct an extensive evaluation across various datasets and hybrid recommendation models, demonstrating that our method excels in cold-start scenarios by selecting minimal yet highly effective feature subsets. Even under strict feature reduction, our approach surpasses existing feature selection techniques while maintaining superior efficiency.

[LG-2] eSASRec: Enhancing Transformer-based Recommendations in a Modular Fashion RECSYS2025

Link: https://arxiv.org/abs/2508.06450
Authors: Daria Tikhonovich, Nikita Zelinskiy, Aleksandr V. Petrov, Mayya Spirina, Andrei Semenov, Andrey V. Savchenko, Sergei Kuliev
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at ACM RecSys 2025

Abstract:Since their introduction, Transformer-based models, such as SASRec and BERT4Rec, have become common baselines for sequential recommendations, surpassing earlier neural and non-neural methods. A number of subsequent publications have shown that the effectiveness of these models can be improved by, for example, slightly updating the architecture of the Transformer layers, using better training objectives, and employing improved loss functions. However, the additivity of these modular improvements has not been systematically benchmarked; this is the gap we aim to close in this paper. Through our experiments, we identify a very strong model that uses SASRec's training objective, LiGR Transformer layers, and Sampled Softmax Loss. We call this combination eSASRec (Enhanced SASRec). While we primarily focus on realistic, production-like evaluation, in our preliminary study we find that common academic benchmarks show eSASRec to be 23% more effective compared to the most recent state-of-the-art models, such as ActionPiece. In our main production-like benchmark, eSASRec resides on the Pareto frontier in terms of the accuracy-coverage tradeoff (alongside the recent industrial models HSTU and FuXi). As the modifications compared to the original SASRec are relatively straightforward and no extra features are needed (such as timestamps in HSTU), we believe that eSASRec can be easily integrated into existing recommendation pipelines and can serve as a strong yet very simple baseline for emerging complicated algorithms. To facilitate this, we provide the open-source implementations for our models and benchmarks in the repository at this https URL

[LG-3] A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images

Link: https://arxiv.org/abs/2508.06409
Authors: Wooyong Jung, Sola Kim, Dongwook Kim, Maryam Tabar, Dongwon Lee
Categories: Machine Learning (cs.LG)
Comments: 10 pages, Accepted to SBP-BRiMS 2025

Abstract:Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.

[LG-4] Blockchain-Enabled Federated Learning

Link: https://arxiv.org/abs/2508.06406
Authors: Murtaza Rangwala, Venugopal K R, Rajkumar Buyya
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 32 pages, 6 figures, chapter for edited book (Federated Learning: Foundations and Applications)

Abstract:Blockchain-enabled federated learning (BCFL) addresses fundamental challenges of trust, privacy, and coordination in collaborative AI systems. This chapter provides comprehensive architectural analysis of BCFL systems through a systematic four-dimensional taxonomy examining coordination structures, consensus mechanisms, storage architectures, and trust models. We analyze design patterns from blockchain-verified centralized coordination to fully decentralized peer-to-peer networks, evaluating trade-offs in scalability, security, and performance. Through detailed examination of consensus mechanisms designed for federated learning contexts, including Proof of Quality and Proof of Federated Learning, we demonstrate how computational work can be repurposed from arbitrary cryptographic puzzles to productive machine learning tasks. The chapter addresses critical storage challenges by examining multi-tier architectures that balance blockchain’s transaction constraints with neural networks’ large parameter requirements while maintaining cryptographic integrity. A technical case study of the TrustMesh framework illustrates practical implementation considerations in BCFL systems through distributed image classification training, demonstrating effective collaborative learning across IoT devices with highly non-IID data distributions while maintaining complete transparency and fault tolerance. Analysis of real-world deployments across healthcare consortiums, financial services, and IoT security applications validates the practical viability of BCFL systems, achieving performance comparable to centralized approaches while providing enhanced security guarantees and enabling new models of trustless collaborative intelligence.

[LG-5] Tree-Based Deep Learning for Ranking Symbolic Integration Algorithms

Link: https://arxiv.org/abs/2508.06383
Authors: Rashid Barket, Matthew England, Jürgen Gerhard
Categories: Symbolic Computation (cs.SC); Machine Learning (cs.LG)
Comments: 29 pages, 13 figures, 5 tables, submitted to Transactions on Mathematical Software (TOMS)

Abstract:Symbolic indefinite integration in Computer Algebra Systems such as Maple involves selecting the most effective algorithm from multiple available methods. Not all methods will succeed for a given problem, and when several do, the results, though mathematically equivalent, can differ greatly in presentation complexity. Traditionally, this choice has been made with minimal consideration of the problem instance, leading to inefficiencies. We present a machine learning (ML) approach using tree-based deep learning models within a two-stage architecture: first identifying applicable methods for a given instance, then ranking them by predicted output complexity. Furthermore, we find representing mathematical expressions as tree structures significantly improves performance over sequence-based representations, and our two-stage framework outperforms alternative ML formulations. Using a diverse dataset generated by six distinct data generators, our models achieve nearly 90% accuracy in selecting the optimal method on a 70,000 example holdout test set. On an independent out-of-distribution benchmark from Maple's internal test suite, our tree transformer model maintains strong generalisation, outperforming Maple's built-in selector and prior ML approaches. These results highlight the critical role of data representation and problem framing in ML for symbolic computation, and we expect our methodology to generalise effectively to similar optimisation problems in mathematical software.

[LG-6] Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means

Link: https://arxiv.org/abs/2508.06353
Authors: Parichit Sharma, Marcin Stanislaw, Hasan Kurban, Oguzhan Kulekci, Mehmet Dalkilic
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces Geometric-k-means (or Gk-means for short), a novel approach that significantly enhances the efficiency and energy economy of the widely utilized k-means algorithm, which, despite its inception over five decades ago, remains a cornerstone in machine learning applications. The essence of Gk-means lies in its active utilization of geometric principles, specifically scalar projection, to significantly accelerate the algorithm without sacrificing solution quality. This geometric strategy enables a more discerning focus on data points that are most likely to influence cluster updates, which we call high expressive (HE) data. In contrast, low expressive (LE) data, which does not impact the clustering outcome, is effectively bypassed, leading to considerable reductions in computational overhead. Experiments spanning synthetic, real-world, and high-dimensional datasets demonstrate that Gk-means is significantly better than traditional and state-of-the-art (SOTA) k-means variants in runtime and distance computations (DC). Moreover, Gk-means exhibits better resource efficiency, as evidenced by its reduced energy footprint, placing it as a more sustainable alternative.

[LG-7] Introducing Fractional Classification Loss for Robust Learning with Noisy Labels

Link: https://arxiv.org/abs/2508.06346
Authors: Mert Can Kurucu, Tufan Kumbasar, İbrahim Eksin, Müjde Güzelkaya
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 6 figures, 2 tables. Submitted to Pattern Recognition

Abstract:Robust loss functions are crucial for training deep neural networks in the presence of label noise, yet existing approaches require extensive, dataset-specific hyperparameter tuning. In this work, we introduce Fractional Classification Loss (FCL), an adaptive robust loss that automatically calibrates its robustness to label noise during training. Built within the active-passive loss framework, FCL employs the fractional derivative of the Cross-Entropy (CE) loss as its active component and the Mean Absolute Error (MAE) as its passive loss component. With this formulation, we demonstrate that the fractional derivative order $\mu$ spans a family of loss functions that interpolate between MAE-like robustness and CE-like fast convergence. Furthermore, we integrate $\mu$ into the gradient-based optimization as a learnable parameter and automatically adjust it to optimize the trade-off between robustness and convergence speed. We reveal that FCL's unique property establishes a critical trade-off that enables the stable learning of $\mu$: lower log penalties on difficult or mislabeled examples improve robustness but impose higher penalties on easy or clean data, reducing model confidence in them. Consequently, FCL can dynamically reshape its loss landscape to achieve effective classification performance under label noise. Extensive experiments on benchmark datasets show that FCL achieves state-of-the-art results without the need for manual hyperparameter tuning.

[LG-8] EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition

Link: https://arxiv.org/abs/2508.06321
Authors: Durjoy Chandra Paul, Gaurob Saha, Md Amjad Hossain
Categories: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: To be published in ICCCNT 2025 (16th International Conference on Computing Communication and Networking Technologies)

Abstract:Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features that are taken from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used to combine both traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral Coefficient (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78% and unweighted accuracy of 92.52% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75% and unweighted accuracy of 91.28%. On the RAVDESS dataset, we get a weighted accuracy of 94.53% and 94.98% unweighted accuracy for ReLU activation and 93.72% weighted accuracy and 94.64% unweighted accuracy for ELU activation. These results highlight EmoAugNet's effectiveness in improving the robustness and performance of SER systems through integrated data augmentation and hybrid modeling.
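
Two of the features named in this abstract, root mean square energy and zero-crossing rate, and the simplest augmentation it lists, noise addition, can be sketched in plain Python. This is an illustrative sketch only, not the authors' pipeline: the function names are hypothetical, the zero-crossing rate is normalized by the number of adjacent sample pairs (audio libraries may normalize differently), and the noise level `sigma` is an arbitrary choice.

```python
import math
import random


def rms_energy(frame):
    """Root mean square energy of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def add_noise(frame, sigma, seed=0):
    """Noise-addition augmentation: add zero-mean Gaussian noise to each sample."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in frame]


frame = [1.0, -1.0, 1.0, -1.0]
print(rms_energy(frame))          # 1.0
print(zero_crossing_rate(frame))  # 1.0 (every adjacent pair changes sign)
augmented = add_noise(frame, sigma=0.01)
```

In a setup like the one described, features of this kind would be computed per frame for both the original and the augmented audio and then concatenated with MFCCs before being fed to the network.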

[LG-9] Low-Bit Data Processing Using Multiple-Output Spiking Neurons with Non-linear Reset Feedback

Link: https://arxiv.org/abs/2508.06292
Authors: Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 7 Tables, 6 Figures

Abstract:Neuromorphic computing is an emerging technology enabling low-latency and energy-efficient signal processing. A key algorithmic tool in neuromorphic computing is spiking neural networks (SNNs). SNNs are biologically inspired neural networks which utilize stateful neurons, and provide low-bit data processing by encoding and decoding information using spikes. Similar to SNNs, deep state-space models (SSMs) utilize stateful building blocks. However, deep SSMs, which recently achieved competitive performance in various temporal modeling tasks, are typically designed with high-precision activation functions and no reset mechanisms. To bridge the gains offered by SNNs and the recent deep SSM models, we propose a novel multiple-output spiking neuron model that combines a linear, general SSM state transition with a non-linear feedback mechanism through reset. Compared to the existing neuron models for SNNs, our proposed model clearly conceptualizes the differences between the spiking function, the reset condition and the reset action. The experimental results on various tasks, i.e., a keyword spotting task, an event-based vision task and a sequential pattern recognition task, show that our proposed model achieves performance comparable to existing benchmarks in the SNN literature. Our results illustrate how the proposed reset mechanism can overcome instability and enable learning even when the linear part of neuron dynamics is unstable, allowing us to go beyond the strictly enforced stability of linear dynamics in recent deep SSM models.

[LG-10] A Study on Regularization-Based Continual Learning Methods for Indic ASR

Link: https://arxiv.org/abs/2508.06280
Authors: Gokul Adethya T, S. Jaya Nirmala
Categories: Machine Learning (cs.LG)
Comments:

Abstract:India's linguistic diversity poses significant challenges for developing inclusive Automatic Speech Recognition (ASR) systems. Traditional multilingual models, which require simultaneous access to all language data, are impractical due to the sequential arrival of data and privacy constraints. Continual Learning (CL) offers a solution by enabling models to learn new languages sequentially without catastrophically forgetting previously learned knowledge. This paper investigates CL for ASR on Indian languages using a subset of the IndicSUPERB benchmark. We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is then incrementally trained on eight additional Indian languages, for a total sequence of nine languages. We evaluate three prominent regularization- and distillation-based CL strategies: Elastic Weight Consolidation (EWC), Memory Aware Synapses (MAS), and Learning without Forgetting (LwF), selected for their suitability in no-replay, privacy-conscious scenarios. Performance is analyzed using Word Error Rate (WER) for both RNN-T and CTC paths on clean and noisy data, as well as knowledge retention via Backward Transfer. We also explore the impact of varying the number of training epochs (1, 2, 5, and 10) per task. Results, compared against naive fine-tuning, demonstrate CL's effectiveness in mitigating forgetting, making it a promising approach for scalable ASR in diverse Indian languages under realistic constraints. The code is available at: this https URL
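
Of the three strategies listed in this abstract, Elastic Weight Consolidation has the most compact form: when training on a new task, it adds a quadratic penalty that pulls each parameter toward its value after the previous task, weighted by a diagonal Fisher information estimate. Below is a minimal sketch of that penalty in its standard textbook form, using flat Python lists; the function name `ewc_penalty` and the numbers are illustrative and are not taken from the paper's code.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Standard EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    params:        current parameter values while training the new task
    anchor_params: parameter values saved after the previous task
    fisher_diag:   diagonal Fisher information estimated on the previous task
    lam:           strength of the regularizer (a hyperparameter)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# The first parameter moved by 1.0 and has Fisher weight 4.0;
# the second did not move, so it contributes nothing.
penalty = ewc_penalty(
    params=[1.0, 2.0],
    anchor_params=[0.0, 2.0],
    fisher_diag=[4.0, 10.0],
    lam=0.5,
)
print(penalty)  # 1.0

# The total loss on a new language would then be: task_loss + penalty.
```

Because the penalty needs only the saved parameters and Fisher estimates, not the earlier data, it fits the no-replay, privacy-conscious setting the abstract describes.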

[LG-11] Multi-Omics Analysis for Cancer Subtype Inference via Unrolling Graph Smoothness Priors

Link: https://arxiv.org/abs/2508.06257
Authors: Jielong Lu,Zhihao Wu,Jiajun Yu,Jiajun Bu,Haishuai Wang
Categories: Machine Learning (cs.LG)

Abstract:Integrating multi-omics datasets through data-driven analysis offers a comprehensive understanding of the complex biological processes underlying various diseases, particularly cancer. Graph Neural Networks (GNNs) have recently demonstrated remarkable ability to exploit relational structures in biological data, enabling advances in multi-omics integration for cancer subtype classification. Existing approaches often neglect the intricate coupling between heterogeneous omics, limiting their capacity to resolve subtle cancer subtype heterogeneity critical for precision oncology. To address these limitations, we propose a framework named Graph Transformer for Multi-omics Cancer Subtype Classification (GTMancer). This framework builds upon the GNN optimization problem and extends its application to complex multi-omics data. Specifically, our method leverages contrastive learning to embed multi-omics data into a unified semantic space. We unroll the multiplex graph optimization problem in that unified space and introduce dual sets of attention coefficients to capture structural graph priors both within and among multi-omics data. This approach enables global omics information to guide the refining of the representations of individual omics. Empirical experiments on seven real-world cancer datasets demonstrate that GTMancer outperforms existing state-of-the-art algorithms.

[LG-12] Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

Link: https://arxiv.org/abs/2508.06247
Authors: Zichun Ye,Runqi Wang,Xutong Liu,Shuai Li
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Abstract:The combinatorial multi-armed bandit (CMAB) is a cornerstone framework for sequential decision-making, dominated by two algorithmic families: UCB-based methods and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from an additional regret factor of $\log T$ that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big((\log k)^2\sqrt{kmT}\big)$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established $\Omega\big(\sqrt{kmT}\big)$ lower bound up to a factor of $O\big((\log k)^2\big)$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

[LG-13] SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems

Link: https://arxiv.org/abs/2508.06243
Authors: Ioan-Sorin Comsa,Purav Shah,Karthik Vaidhyanathan,Deepak Gangadharan,Christof Imhof,Per Bergamin,Aryan Kaushik,Gabriel-Miro Muntean,Ramona Trestian
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)

Abstract:The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14% and reduces unfair scheduling time by 15% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10%, confirming its efficiency. These results demonstrate SCAR’s scalability and fairness benefits for dynamic vehicular networks.

[LG-14] Recurrent Deep Differentiable Logic Gate Networks

Link: https://arxiv.org/abs/2508.06097
Authors: Simon Bührer,Andreas Plesner,Till Aczel,Roger Wattenhofer
Categories: Machine Learning (cs.LG)

Abstract:While differentiable logic gates have shown promise in feedforward networks, their application to sequential modeling remains unexplored. This paper presents the first implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN), combining Boolean operations with recurrent architectures for sequence-to-sequence learning. Evaluated on WMT’14 English-German translation, RDDLGN achieves 5.00 BLEU and 30.9% accuracy during training, approaching GRU performance (5.41 BLEU), and degrades gracefully (4.39 BLEU) during inference. This work establishes recurrent logic-based neural computation as viable, opening research directions for FPGA acceleration in sequential modeling and other recursive network architectures.

[LG-15] Adaptive Backtracking for Privacy Protection in Large Language Models

Link: https://arxiv.org/abs/2508.06087
Authors: Zhihao Yao,Yuxuan Gu,Xiachong Feng,Weitao Ma,Bo Li,Xiaocheng Feng
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The preservation of privacy has emerged as a critical topic in the era of artificial intelligence. However, current work focuses on user-oriented privacy, overlooking severe enterprise data leakage risks exacerbated by the Retrieval-Augmented Generation paradigm. To address this gap, our paper introduces a novel objective: enterprise-oriented privacy concerns. Achieving this objective requires overcoming two fundamental challenges: existing methods such as data sanitization severely degrade model performance, and the field lacks public datasets for evaluation. We address these challenges with several solutions. (1) To prevent performance degradation, we propose ABack, a training-free mechanism that leverages a Hidden State Model to pinpoint the origin of a leakage intention and rewrite the output safely. (2) To solve the lack of datasets, we construct PriGenQA, a new benchmark for enterprise privacy scenarios in healthcare and finance. To ensure a rigorous evaluation, we move beyond simple static attacks by developing a powerful adaptive attacker with Group Relative Policy Optimization. Experiments show that against this superior adversary, ABack improves the overall privacy utility score by up to 15% over strong baselines, avoiding the performance trade-offs of prior methods.

[LG-16] Stepwise Fine and Gray: Subject-Specific Variable Selection Shows When Hemodynamic Data Improves Prognostication of Comatose Post-Cardiac Arrest Patients

Link: https://arxiv.org/abs/2508.06023
Authors: Xiaobin Shen,Jonathan Elmer,George H. Chen
Categories: Machine Learning (cs.LG)

Abstract:Prognostication for comatose post-cardiac arrest patients is a critical challenge that directly impacts clinical decision-making in the ICU. Clinical information that informs prognostication is collected serially over time. Shortly after cardiac arrest, various time-invariant baseline features are collected (e.g., demographics, cardiac arrest characteristics). After ICU admission, additional features are gathered, including time-varying hemodynamic data (e.g., blood pressure, doses of vasopressor medications). We view these as two phases in which we collect new features. In this study, we propose a novel stepwise dynamic competing risks model that improves the prediction of neurological outcomes by automatically determining when to take advantage of time-invariant features (first phase) and time-varying features (second phase). Notably, our model finds patients for whom this second phase (time-varying hemodynamic) information is beneficial for prognostication and also when this information is beneficial (as we collect more hemodynamic data for a patient over time, how important these data are for prognostication varies). Our approach extends the standard Fine and Gray model to explicitly model the two phases and to incorporate neural networks to flexibly capture complex nonlinear feature relationships. Evaluated on a retrospective cohort of 2,278 comatose post-arrest patients, our model demonstrates robust discriminative performance for the competing outcomes of awakening, withdrawal of life-sustaining therapy, and death despite maximal support. Our approach generalizes to more than two phases in which new features are collected and could be used in other dynamic prediction tasks, where it may be helpful to know when and for whom newly collected features significantly improve prediction.

[LG-17] Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization

Link: https://arxiv.org/abs/2508.05995
Authors: Fei Xu Yu,Gina Adam,Nathaniel D. Bastian,Tian Lan
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in code generation and structured reasoning; however, their performance often degrades on complex tasks that require consistent multi-step planning. Recent work has explored combining LLMs with Monte Carlo Tree Search (MCTS), yet existing approaches primarily focus on generating heuristic-based code for optimization or target simpler tasks where correctness alone is sufficient. In this work, we propose MCTS-OPS, a novel neural-symbolic framework that formulates prompt selection as a sequential decision process guided by MCTS. Our method explores and refines multi-step prompt sequences with the goal of improving code generation quality and enhancing the problem-solving capabilities of LLMs in general optimization. Experiments on network optimization show significant improvement over the baselines, both in the success rate of executing the generated code and in the optimization results with the specified objective and constraints ($2\sim4\times$ higher reward and $3\times$ lower standard deviation). Moreover, it attains the optimal solution in about 10% more cases than baseline methods on hard problems. These results highlight the promise of combining symbolic planning with LLMs for robust, high-quality code generation in complex domains.

[LG-18] Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Link: https://arxiv.org/abs/2508.05988
Authors: Wenhao Zeng,Yaoning Wang,Chao Hu,Yuling Shi,Chengcheng Wan,Hongyu Zhang,Xiaodong Gu
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Code and model available at this https URL

Abstract:Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.

[LG-19] Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to Q-Learning

Link: https://arxiv.org/abs/2508.05984
Authors: Ankur Naskar,Gugan Thoppe,Vijay Gupta
Categories: Machine Learning (cs.LG)

Abstract:Algorithms for solving nonlinear fixed-point equations, such as average-reward $Q$-learning and TD-learning, often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak–Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii) taming the nonlinearity by coupling the semi-norm’s contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.

[LG-20] LinguaFluid: Language Guided Fluid Control via Semantic Rewards in Reinforcement Learning

Link: https://arxiv.org/abs/2508.05977
Authors: Aoming Liang,Chi Cheng,Dashuai Chen,Boai Sun,Dixia Fan
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:In the domain of scientific machine learning, designing effective reward functions remains a challenge in reinforcement learning (RL), particularly in environments where task goals are difficult to specify numerically. Reward functions in existing work are predominantly based on heuristics, manual engineering, or task-specific tuning. In this work, we introduce a semantically aligned reinforcement learning method where rewards are computed by aligning the current state with a target semantic instruction using a Sentence-Bidirectional Encoder Representations from Transformers (SBERT). Instead of relying on manually defined reward functions, the policy receives feedback based on the reward, which is a cosine similarity between the goal textual description and the statement description in the episode. We evaluated our approach in several environments and showed that semantic reward can guide learning to achieve competitive control behavior, even in the absence of hand-crafted reward functions. Our study demonstrates a correlation between the language embedding space and the conventional Euclidean space. This framework opens new horizons for aligning agent behavior with natural language goals and lays the groundwork for a more seamless integration of larger language models (LLMs) and fluid control applications.
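
The reward described above, a cosine similarity between the embedding of the goal text and the embedding of the current state description, can be sketched as follows. The short vectors stand in for SBERT sentence embeddings and are made up for illustration; `semantic_reward` is a hypothetical helper name, and the way the paper turns a flow state into text is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate embeddings
    return dot / (norm_u * norm_v)


def semantic_reward(goal_embedding, state_embedding):
    """Reward in [-1, 1]: higher when the state text matches the goal text."""
    return cosine_similarity(goal_embedding, state_embedding)


# Toy stand-ins for sentence embeddings (assumed values, not SBERT output).
goal = [0.9, 0.1, 0.0]
close_state = [0.8, 0.2, 0.1]
far_state = [0.0, 0.1, 0.9]
print(semantic_reward(goal, close_state), semantic_reward(goal, far_state))
```

In a real pipeline the two vectors would come from a sentence encoder applied to the goal instruction and to a textual description of the simulator state at each step.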

[LG-21] Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Link: https://arxiv.org/abs/2508.05928
Authors: Si Shen,Peijun Shen,Wenhua Zhao,Danhao Zhu
Categories: Machine Learning (cs.LG)

Abstract:Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO’s effectiveness and robustness. On various models, S-GRPO significantly outperforms DR. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO’s potential for more robust and effective training of large-scale reasoning models. Code and data are available at: this https URL
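
For context, the sketch below shows the commonly used group-relative advantage from standard GRPO: each response's reward is centered and scaled by the statistics of its own group. It does not implement S-GRPO's noise-aware weights, which the abstract does not specify. The use of the population standard deviation and the `eps` constant are assumptions of this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Binary correctness rewards for two hypothetical groups of sampled answers.
balanced = group_relative_advantages([1, 1, 0, 0])
unbalanced = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
print(balanced)
print(unbalanced)
```

In the unbalanced group the single rewarded response receives a much larger advantage than any response in the balanced group, which hints at why one mislabeled reward could distort the update more strongly there.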

[LG-22] Fast Convex and Conditioned Network for Multi-Fidelity Vectors and Stiff Univariate Differential Equations

Link: https://arxiv.org/abs/2508.05921
Authors: Siddharth Rout
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Representation Theory (math.RT); Computational Physics (physics.comp-ph)

Abstract:Accuracy in neural PDE solvers often breaks down not because of limited expressivity, but due to poor optimisation caused by ill-conditioning, especially in multi-fidelity and stiff problems. We study this issue in Physics-Informed Extreme Learning Machines (PIELMs), a convex variant of neural PDE solvers, and show that asymptotic components in governing equations can produce highly ill-conditioned activation matrices, severely limiting convergence. We introduce Shifted Gaussian Encoding, a simple yet effective activation filtering step that increases matrix rank and expressivity while preserving convexity. Our method extends the solvable range of Peclet numbers in steady advection-diffusion equations by over two orders of magnitude, achieves up to six orders lower error on multi-frequency function learning, and fits high-fidelity image vectors more accurately and faster than deep networks with over a million parameters. This work highlights that conditioning, not depth, is often the bottleneck in scientific neural solvers and that simple architectural changes can unlock substantial gains.

[LG-23] Dual Signal Decomposition of Stochastic Time Series

Link: https://arxiv.org/abs/2508.05915
Authors: Alex Glushkovsky
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 9 figures, 1 table

Abstract:The research paper addresses the decomposition of a stochastic time series into three time series representing a dual signal, i.e., the mean and the dispersion, with noise isolated. Decomposition is done by applying machine learning to fit a dual signal. Machine learning minimizes a loss function that compromises between fitting the original time series and penalizing irregularities of the dual signal. The latter includes terms based on the first- and second-order derivatives along time. To preserve special patterns, weighting of the regularization components of the loss function has been introduced based on Statistical Process Control methodology. The proposed decomposition can be applied as a smoothing algorithm against the mean and dispersion of the time series. By isolating noise, the proposed decomposition can be seen as a denoising algorithm. Two approaches to the learning process have been considered: sequential and joint. The former approach learns the mean signal first and then the dispersion. The latter approach fits the dual signal jointly. Joint learning can uncover complex relationships for time series with heteroskedasticity. Learning has been set up by solving the direct non-linear unconstrained optimization problem or by applying neural networks that have sequential or twin output architectures. Tuning of the loss function hyperparameters aims for the isolated noise to be a stationary stochastic process without autocorrelation. Depending on the application, the hyperparameters of the learning can be tuned towards either discrete states (a stepped signal) or a smoothed series. The decomposed dual signal can be represented in 2D space and used to learn inherent structures, to forecast both mean and dispersion, or to analyze cross effects in the case of multiple time series.
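
A minimal sketch of the kind of loss the abstract describes, restricted to the mean signal: a squared fitting term plus penalties on first and second differences along time. The exact functional form, the weights `lam1` and `lam2`, and the omission of the dispersion signal and of the Statistical Process Control weighting are all simplifying assumptions of this example.

```python
def smoothing_loss(y, m, lam1=1.0, lam2=1.0):
    """Fit term plus penalties on first and second differences of m."""
    fit = sum((yt - mt) ** 2 for yt, mt in zip(y, m))
    d1 = [m[t + 1] - m[t] for t in range(len(m) - 1)]     # discrete slope
    d2 = [d1[t + 1] - d1[t] for t in range(len(d1) - 1)]  # discrete curvature
    return fit + lam1 * sum(d * d for d in d1) + lam2 * sum(d * d for d in d2)


# A candidate mean that copies the data fits perfectly but pays for roughness;
# a flat candidate is perfectly smooth but pays for the misfit.
series = [1, 2, 4]
print(smoothing_loss(series, [1, 2, 4]))
print(smoothing_loss(series, [2, 2, 2]))
```

Raising `lam1` and `lam2` pushes the minimizer toward smoother mean signals, while lowering them lets it follow the raw series more closely.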

[LG-24] The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More)

Link: https://arxiv.org/abs/2508.05905
Authors: Jeffrey Uhlmann
Categories: Machine Learning (cs.LG)

Abstract:Quantization is usually regarded as a means to trade quality of performance for reduced compute requirements, i.e., as a suboptimal approximation. However, if examined in terms of a fixed overall resource budget, a very different perspective arises. We introduce Signed-Zero Ternary (SZT), a 2-bit quantization that deterministically provides gradient information with no forward-path penalty. Our analysis provides evidence that it may improve information density compared to non-quantized alternatives.

[LG-25] Training chord recognition models on artificially generated audio

Link: https://arxiv.org/abs/2508.05878
Authors: Martyna Majchrzak,Jacek Mańdziuk
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:One of the challenging problems in Music Information Retrieval is the acquisition of enough non-copyrighted audio recordings for model training and evaluation. This study compares two Transformer-based neural network models for chord sequence recognition in audio recordings and examines the effectiveness of using an artificially generated dataset for this purpose. The models are trained on various combinations of Artificial Audio Multitracks (AAM), Schubert’s Winterreise Dataset, and the McGill Billboard Dataset and evaluated with three metrics: Root, MajMin and Chord Content Metric (CCM). The experiments prove that even though there are certainly differences in complexity and structure between artificially generated and human-composed music, the former can be useful in certain scenarios. Specifically, AAM can enrich a smaller training dataset of music composed by a human or can even be used as a standalone training set for a model that predicts chord sequences in pop music, if no other data is available.

[LG-26] A Markov Decision Process Framework for Early Maneuver Decisions in Satellite Collision Avoidance

Link: https://arxiv.org/abs/2508.05876
Authors: Francesca Ferrara,Lander W. Schillinger Arana,Florian Dörfler,Sarah H. Q. Li
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Emerging Technologies (cs.ET)
Comments: 16 pages, 13 figures, submitted to the 2025 Astrodynamics Specialist Conference

Abstract:This work presents a Markov decision process (MDP) framework to model decision-making for collision avoidance maneuvers (CAMs) and a reinforcement learning policy gradient (RL-PG) algorithm to train an autonomous guidance policy using historic CAM data. In addition to maintaining acceptable collision risks, this approach seeks to minimize the average fuel consumption of CAMs by making early maneuver decisions. We model CAM as a continuous-state, discrete-action, finite-horizon MDP, where the critical decision is determining when to initiate the maneuver. The MDP model also incorporates analytical models for conjunction risk, propellant consumption, and transit orbit geometry. The Markov policy effectively trades off maneuver delay (which improves the reliability of conjunction risk indicators) against propellant consumption (which increases with decreasing maneuver time). Using historical data of tracked conjunction events, we verify this framework and conduct an extensive ablation study on the hyper-parameters used within the MDP. On synthetic conjunction events, the trained policy significantly reduces both the overall and average propellant consumption per CAM when compared to a conventional cut-off policy that initiates maneuvers 24 hours before the time of closest approach (TCA). On historical conjunction events, the trained policy consumes more propellant overall but reduces the average propellant consumption per CAM. For both historical and synthetic conjunction events, the trained policy achieves equal if not higher overall collision risk guarantees.

[LG-27] Stochastic Bandits for Crowdsourcing and Multi-Platform Autobidding

Link: https://arxiv.org/abs/2508.05844
Authors: François Bachoc,Nicolò Cesa-Bianchi,Tommaso Cesari,Roberto Colomboni
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Motivated by applications in crowdsourcing, where a fixed sum of money is split among $K$ workers, and autobidding, where a fixed budget is used to bid in $K$ simultaneous auctions, we define a stochastic bandit model where arms belong to the $K$-dimensional probability simplex and represent the fraction of budget allocated to each task/auction. The reward in each round is the sum of $K$ stochastic rewards, where each of these rewards is unlocked with a probability that varies with the fraction of the budget allocated to that task/auction. We design an algorithm whose expected regret after $T$ steps is of order $K\sqrt{T}$ (up to log factors) and prove a matching lower bound. Improved bounds of order $K(\log T)^2$ are shown when the function mapping budget to probability of unlocking the reward (i.e., terminating the task or winning the auction) satisfies additional diminishing-returns conditions.

[LG-28] An Effective Approach for Node Classification in Textual Graphs

Link: https://arxiv.org/abs/2508.05836
Authors: Rituparna Datta,Nibir Chandra Mandal
Categories: Machine Learning (cs.LG)

Abstract:Textual Attribute Graphs (TAGs) are critical for modeling complex networks like citation networks, but effective node classification remains challenging due to difficulties in integrating rich semantics from text with structural graph information. Existing methods often struggle with capturing nuanced domain-specific terminology, modeling long-range dependencies, adapting to temporal evolution, and scaling to massive datasets. To address these issues, we propose a novel framework that integrates TAPE (Text-Attributed Graph Representation Enhancement) with Graphormer. Our approach leverages a large language model (LLM), specifically ChatGPT, within the TAPE framework to generate semantically rich explanations from paper content, which are then fused into enhanced node representations. These embeddings are combined with structural features using a novel integration layer with learned attention weights. Graphormer’s path-aware position encoding and multi-head attention mechanisms are employed to effectively capture long-range dependencies across the citation network. We demonstrate the efficacy of our framework on the challenging ogbn-arxiv dataset, achieving state-of-the-art performance with a classification accuracy of 0.772, significantly surpassing the best GCN baseline of 0.713. Our method also yields strong results in precision (0.671), recall (0.577), and F1-score (0.610). We validate our approach through comprehensive ablation studies that quantify the contribution of each component, demonstrating the synergy between semantic and structural information. Our framework provides a scalable and robust solution for node classification in dynamic TAGs, offering a promising direction for future research in knowledge systems and scientific discovery.

[LG-29] Optimal Linear Baseline Models for Scientific Machine Learning

Link: https://arxiv.org/abs/2508.05831
Authors: Alexander DeLise,Kyle Loh,Krish Patel,Meredith Teague,Andrea Arnold,Matthias Chung
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 40 pages, 10 Figures, 9 Tables

Abstract:Across scientific domains, a fundamental challenge is to characterize and compute the mappings from underlying physical processes to observed signals and measurements. While nonlinear neural networks have achieved considerable success, they remain theoretically opaque, which hinders adoption in contexts where interpretability is paramount. In contrast, linear neural networks serve as a simple yet effective foundation for gaining insight into these complex relationships. In this work, we develop a unified theoretical framework for analyzing linear encoder-decoder architectures through the lens of Bayes risk minimization for solving data-driven scientific machine learning problems. We derive closed-form, rank-constrained linear and affine linear optimal mappings for forward modeling and inverse recovery tasks. Our results generalize existing formulations by accommodating rank-deficiencies in data, forward operators, and measurement processes. We validate our theoretical results by conducting numerical experiments on datasets from simple biomedical imaging, financial factor analysis, and simulations involving nonlinear fluid dynamics via the shallow water equations. This work provides a robust baseline for understanding and benchmarking learned neural network models for scientific machine learning problems.

[LG-30] Machine Learning-Based Nonlinear Nudging for Chaotic Dynamical Systems

Link: https://arxiv.org/abs/2508.05778
Authors: Jaemin Oh,Jinsil Lee,Youngjoon Hong
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 21 pages, 5 figures, 6 tables

Abstract:Nudging is an empirical data assimilation technique that incorporates an observation-driven control term into the model dynamics. The trajectory of the nudged system approaches the true system trajectory over time, even when the initial conditions differ. For linear state space models, such control terms can be derived under mild assumptions. However, designing effective nudging terms becomes significantly more challenging in the nonlinear setting. In this work, we propose neural network nudging, a data-driven method for learning nudging terms in nonlinear state space models. We establish a theoretical existence result based on the Kazantzis–Kravaris–Luenberger observer theory. The proposed approach is evaluated on three benchmark problems that exhibit chaotic behavior: the Lorenz 96 model, the Kuramoto–Sivashinsky equation, and the Kolmogorov flow.
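
To make the basic mechanism concrete, the sketch below applies classical linear nudging, not the learned neural nudging term proposed in the paper, to the Lorenz 96 model. A copy `z` of the system is driven by the extra term `MU * (x - z)`, where `x` is the true state. Full-state, noise-free observations, the gain `MU = 20`, the forcing `F = 8`, the 40-variable setup, and the RK4 step size are all assumptions chosen for illustration.

```python
import math

F = 8.0    # forcing commonly used for chaotic Lorenz 96 (assumed here)
MU = 20.0  # nudging gain, an illustrative choice


def lorenz96(x):
    """Lorenz 96 right-hand side with periodic (cyclic) indexing."""
    n = len(x)
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + F for i in range(n)]


def joint_rhs(state):
    """Truth x and nudged copy z stacked as [x, z]; z is pulled toward x."""
    n = len(state) // 2
    x, z = state[:n], state[n:]
    dz = [f + MU * (xi - zi) for f, xi, zi in zip(lorenz96(z), x, z)]
    return lorenz96(x) + dz


def rk4_step(f, s, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f([a + 0.5 * dt * b for a, b in zip(s, k1)])
    k3 = f([a + 0.5 * dt * b for a, b in zip(s, k2)])
    k4 = f([a + dt * b for a, b in zip(s, k3)])
    return [a + dt / 6.0 * (p + 2 * q + 2 * r + w)
            for a, p, q, r, w in zip(s, k1, k2, k3, k4)]


def error(state):
    """Euclidean distance between the truth and the nudged copy."""
    n = len(state) // 2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(state[:n], state[n:])))


n = 40
truth = [F] * n
truth[0] += 0.01                   # small kick off the unstable fixed point
nudged = [v + 1.0 for v in truth]  # deliberately wrong initial condition
state = truth + nudged
initial_error = error(state)
for _ in range(500):               # integrate to t = 5 with dt = 0.01
    state = rk4_step(joint_rhs, state, 0.01)
final_error = error(state)
print(initial_error, final_error)
```

With partial or noisy observations a fixed linear gain like this is generally less effective, which is the setting where a learned nonlinear nudging term is meant to help.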

[LG-31] A Graph Neural Network Approach for Mapping the Conceptual Structure and Inter-Branch Connectivity of Physics

Link: https://arxiv.org/abs/2508.05724
Authors: Massimiliano Romiti
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 14 pages, 9 figures

Abstract:This work introduces a novel framework for representing and analyzing physical laws as a weighted knowledge graph. We constructed a database of 659 distinct physical equations, subjected to rigorous semantic cleaning to resolve notational ambiguities, resulting in a corpus of 400 advanced physics equations. We developed an enhanced graph representation where both physical concepts and equations are nodes, connected by weighted inter-equation bridges. These weights are objectively defined using normalized metrics for variable overlap, physics-informed importance scores, and bibliometric data. A Graph Attention Network (GAT) was trained for link prediction, achieving a test AUC of 0.9742 ± 0.0018 across five independent runs, significantly outperforming both classical heuristics (best baseline AUC: 0.9487) and established GNN architectures like GraphSAGE (AUC: 0.9504, p = 0.029). Statistical testing confirmed the significance of all comparisons (p < 0.05), with a 2.7% improvement over the best baseline. Our analysis reveals three key findings: (i) The model autonomously rediscovers the known macroscopic structure of physics, identifying strong conceptual axes between Electromagnetism and Statistical Mechanics. (ii) It identifies central hub equations that serve as critical bridges between multiple physical domains. (iii) The model generates stable, computationally derived hypotheses for cross-domain relationships, identifying both known principles and suggesting novel mathematical analogies for further theoretical investigation. The framework can generate hundreds of such hypotheses, enabling the creation of specialized datasets for targeted analysis of specific physics subfields. Code and data available at this https URL

[LG-32] G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation

Link: https://arxiv.org/abs/2508.05709
Authors: Boyu Chen, Siran Chen, Zhengrong Yue, Kainan Yan, Chenyun Yu, Beibei Kong, Cheng Lei, Chengxiang Zhuo, Zang Li, Yali Wang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract: User feedback is critical for refining recommendation systems, yet explicit feedback (e.g., likes or dislikes) remains scarce in practice. As a more feasible alternative, inferring user preferences from massive implicit feedback has shown great potential (e.g., a user quickly skipping a recommended video usually indicates disinterest). Unfortunately, implicit feedback is often noisy: a user might skip a video due to accidental clicks or other reasons, rather than disliking it. Such noise can easily misjudge user interests, thereby undermining recommendation performance. To address this issue, we propose a novel Group-aware User Behavior Simulation (G-UBS) paradigm, which leverages contextual guidance from relevant user groups, enabling robust and in-depth interpretation of implicit feedback for individual users. Specifically, G-UBS operates via two key agents. First, the User Group Manager (UGM) effectively clusters users to generate group profiles utilizing a "summarize-cluster-reflect" workflow based on LLMs. Second, the User Feedback Modeler (UFM) employs an innovative group-aware reinforcement learning approach, where each user is guided by the associated group profiles during the reinforcement learning process, allowing UFM to robustly and deeply examine the reasons behind implicit feedback. To assess our G-UBS paradigm, we have constructed a Video Recommendation benchmark with Implicit Feedback (IF-VR). To the best of our knowledge, this is the first multi-modal benchmark for implicit feedback evaluation in video recommendation, encompassing 15k users, 25k videos, and 933k interaction records with implicit feedback. Extensive experiments on IF-VR demonstrate that G-UBS significantly outperforms mainstream LLMs and MLLMs, with a 4.0% higher proportion of videos achieving a play rate > 30% and 14.9% higher reasoning accuracy on IF-VR.

[LG-33] MambaITD: An Efficient Cross-Modal Mamba Network for Insider Threat Detection ICDM

Link: https://arxiv.org/abs/2508.05695
Authors: Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Zhiying Li, Guanggang Geng, Jian Weng
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Submitted to the 2025 IEEE International Conference on Data Mining (ICDM)

Abstract:Enterprises are facing increasing risks of insider threats, while existing detection methods are unable to effectively address these challenges due to reasons such as insufficient temporal dynamic feature modeling, computational efficiency and real-time bottlenecks and cross-modal information island problem. This paper proposes a new insider threat detection framework MambaITD based on the Mamba state space model and cross-modal adaptive fusion. First, the multi-source log preprocessing module aligns heterogeneous data through behavioral sequence encoding, interval smoothing, and statistical feature extraction. Second, the Mamba encoder models long-range dependencies in behavioral and interval sequences, and combines the sequence and statistical information dynamically in combination with the gated feature fusion mechanism. Finally, we propose an adaptive threshold optimization method based on maximizing inter-class variance, which dynamically adjusts the decision threshold by analyzing the probability distribution, effectively identifies anomalies, and alleviates class imbalance and concept drift. Compared with traditional methods, MambaITD shows significant advantages in modeling efficiency and feature fusion capabilities, outperforming Transformer-based methods, and provides a more effective solution for insider threat detection.
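
The abstract above mentions choosing a decision threshold by maximizing inter-class variance. A minimal sketch of that general idea (the classical Otsu criterion applied to anomaly scores) is given below. It is not the authors' implementation: the function name `otsu_threshold`, the assumption that scores lie in [0, 1], and the bin count are all hypothetical choices made for illustration.

```python
def otsu_threshold(scores, bins=100):
    """Return a cut in [0, 1] that maximizes between-class variance.

    Hypothetical helper: assumes `scores` are probabilities in [0, 1].
    """
    # Histogram of the scores over [0, 1].
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1

    total = len(scores)
    centers = [(i + 0.5) / bins for i in range(bins)]
    total_sum = sum(c * m for c, m in zip(counts, centers))

    best_t, best_var = 0.0, -1.0
    w0 = 0      # number of samples in the low-score class
    sum0 = 0.0  # sum of bin centers in the low-score class
    for i in range(bins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        # Between-class variance for the split after bin i.
        between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if between > best_var:
            best_var = between
            best_t = (i + 1) / bins
    return best_t


# Usage: many low scores (normal) and a few high scores (suspicious).
scores = [0.055] * 90 + [0.905] * 10
threshold = otsu_threshold(scores)
flagged = [s for s in scores if s >= threshold]
```

Because the cut is derived from the score distribution itself rather than fixed in advance, it moves when the distribution shifts, which is the property the abstract appeals to for class imbalance and concept drift.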

[LG-34] Leveraging large language models for SQL behavior-based database intrusion detection

Link: https://arxiv.org/abs/2508.05690
Authors: Meital Shlezinger, Shay Akirav, Lei Zhou, Liang Guo, Avi Kessel, Guoliang Li
Categories: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Database systems are extensively used to store critical data across various domains. However, the frequency of abnormal database access behaviors, such as database intrusion by internal and external attacks, continues to rise. Internal masqueraders often have greater organizational knowledge, making it easier to mimic employee behavior effectively. In contrast, external masqueraders may behave differently due to their lack of familiarity with the organization. Current approaches lack the granularity needed to detect anomalies at the operational level, frequently misclassifying entire sequences of operations as anomalies, even though most operations are likely to represent normal behavior. On the other hand, some anomalous behaviors often resemble normal activities, making them difficult for existing detection methods to identify. This paper introduces a two-tiered anomaly detection approach for Structured Query Language (SQL) using the Bidirectional Encoder Representations from Transformers (BERT) model, specifically DistilBERT, a more efficient, pre-trained version. Our method combines both unsupervised and supervised machine learning techniques to accurately identify anomalous activities while minimizing the need for data labeling. First, the unsupervised method uses ensemble anomaly detectors that flag embedding vectors distant from learned normal patterns of typical user behavior across the database (out-of-scope queries). Second, the supervised method uses fine-tuned transformer-based models to detect internal attacks with high precision (in-scope queries), using role-labeled classification, even on limited labeled SQL data. Our findings make a significant contribution by providing an effective solution for safeguarding critical database systems from sophisticated threats.

[LG-35] MM-FusionNet: Context-Aware Dynamic Fusion for Multi-modal Fake News Detection with Large Vision-Language Models

Link: https://arxiv.org/abs/2508.05684
Authors: Junhao He, Tianyu Liu, Jingyuan Zhao, Benjamin Turner
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The proliferation of multi-modal fake news on social media poses a significant threat to public trust and social stability. Traditional detection methods, primarily text-based, often fall short due to the deceptive interplay between misleading text and images. While Large Vision-Language Models (LVLMs) offer promising avenues for multi-modal understanding, effectively fusing diverse modal information, especially when their importance is imbalanced or contradictory, remains a critical challenge. This paper introduces MM-FusionNet, an innovative framework leveraging LVLMs for robust multi-modal fake news detection. Our core contribution is the Context-Aware Dynamic Fusion Module (CADFM), which employs bi-directional cross-modal attention and a novel dynamic modal gating network. This mechanism adaptively learns and assigns importance weights to textual and visual features based on their contextual relevance, enabling intelligent prioritization of information. Evaluated on the large-scale Multi-modal Fake News Dataset (LMFND) comprising 80,000 samples, MM-FusionNet achieves a state-of-the-art F1-score of 0.938, surpassing existing multi-modal baselines by approximately 0.5% and significantly outperforming single-modal approaches. Further analysis demonstrates the model’s dynamic weighting capabilities, its robustness to modality perturbations, and performance remarkably close to human-level, underscoring its practical efficacy and interpretability for real-world fake news detection.

[LG-36] Domain-Specific Fine-Tuning and Prompt-Based Learning: A Comparative Study for developing Natural Language-Based BIM Information Retrieval Systems

Link: https://arxiv.org/abs/2508.05676
Authors: Han Gao, Timo Hartmann, Botao Zhong, Kai Lia, Hanbin Luo
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract: Building Information Modeling (BIM) is essential for managing building data across the entire lifecycle, supporting tasks from design to maintenance. Natural Language Interface (NLI) systems are increasingly explored as user-friendly tools for information retrieval in BIM environments. Despite their potential, accurately extracting BIM-related data through natural language queries remains a persistent challenge due to the complexity of user queries and the specificity of domain knowledge. This study presents a comparative analysis of two prominent approaches for developing NLI-based BIM information retrieval systems: domain-specific fine-tuning and prompt-based learning using large language models (LLMs). A two-stage framework consisting of intent recognition and table-based question answering is implemented to evaluate the effectiveness of both approaches. To support this evaluation, a BIM-specific dataset of 1,740 annotated queries of varying types across 69 models is constructed. Experimental results show that domain-specific fine-tuning delivers superior performance in intent recognition tasks, while prompt-based learning, particularly with GPT-4o, shows strength in table-based question answering. Based on these findings, this study identifies a hybrid configuration that combines fine-tuning for intent recognition with prompt-based learning for question answering, achieving more balanced and robust performance across tasks. This integrated approach is further tested through case studies involving BIM models of varying complexity. This study provides a systematic analysis of the strengths and limitations of each approach and discusses the applicability of the NLI to real-world BIM scenarios. The findings offer insights for researchers and practitioners in designing intelligent, language-driven BIM systems.

[LG-37] Diagrams-to-Dynamics (D2D): Exploring Causal Loop Diagram Leverage Points under Uncertainty

Link: https://arxiv.org/abs/2508.05659
Authors: Jeroen F. Uleman, Loes Crielaard, Leonie K. Elsenburg, Guido A. Veldhuis, Karien Stronks, Naja Hulvej Rod, Rick Quax, Vítor V. Vasconcelos
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 21 pages, 4 figures, 4 tables

Abstract:Causal loop diagrams (CLDs) are widely used in health and environmental research to represent hypothesized causal structures underlying complex problems. However, as qualitative and static representations, CLDs are limited in their ability to support dynamic analysis and inform intervention strategies. Additionally, quantitative CLD analysis methods like network centrality analysis often lead to false inference. We propose Diagrams-to-Dynamics (D2D), a method for converting CLDs into exploratory system dynamics models (SDMs) in the absence of empirical data. With minimal user input - following a protocol to label variables as stocks, flows/auxiliaries, or constants - D2D leverages the structural information already encoded in CLDs, namely, link existence and polarity, to simulate hypothetical interventions and explore potential leverage points under uncertainty. Results suggest that D2D helps distinguish between high- and low-ranked leverage points. We compare D2D to a data-driven SDM constructed from the same CLD and variable labeling. D2D showed greater consistency with the data-driven model than network centrality analysis, while providing uncertainty estimates and guidance for future data collection. The method is implemented in an open-source Python package and a web-based application to support further testing and lower the barrier to dynamic modeling for researchers working with CLDs. We expect additional validation will further establish the approach’s utility across a broad range of cases and domains.

[LG-38] AI Guided Accelerator For Search Experience SIGIR

Link: https://arxiv.org/abs/2508.05649
Authors: Jayanth Yetukuri, Mehran Elyasi, Samarth Agrawal, Aritra Mandal, Rui Kong, Harish Vempati, Ishita Khan
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at SIGIR eCom’25. this https URL

Abstract:Effective query reformulation is pivotal in narrowing the gap between a user’s exploratory search behavior and the identification of relevant products in e-commerce environments. While traditional approaches predominantly model query rewrites as isolated pairs, they often fail to capture the sequential and transitional dynamics inherent in real-world user behavior. In this work, we propose a novel framework that explicitly models transitional queries–intermediate reformulations occurring during the user’s journey toward their final purchase intent. By mining structured query trajectories from eBay’s large-scale user interaction logs, we reconstruct query sequences that reflect shifts in intent while preserving semantic coherence. This approach allows us to model a user’s shopping funnel, where mid-journey transitions reflect exploratory behavior and intent refinement. Furthermore, we incorporate generative Large Language Models (LLMs) to produce semantically diverse and intent-preserving alternative queries, extending beyond what can be derived through collaborative filtering alone. These reformulations can be leveraged to populate Related Searches or to power intent-clustered carousels on the search results page, enhancing both discovery and engagement. Our contributions include (i) the formal identification and modeling of transitional queries, (ii) the introduction of a structured query sequence mining pipeline for intent flow understanding, and (iii) the application of LLMs for scalable, intent-aware query expansion. Empirical evaluation demonstrates measurable gains in conversion and engagement metrics compared to the existing Related Searches module, validating the effectiveness of our approach in real-world e-commerce settings.

[LG-39] DP-SPRT: Differentially Private Sequential Probability Ratio Tests

Link: https://arxiv.org/abs/2508.06377
Authors: Thomas Michel, Debabrota Basu, Emilie Kaufmann
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract: We revisit Wald’s celebrated Sequential Probability Ratio Test for sequential tests of two simple hypotheses, under privacy constraints. We propose DP-SPRT, a wrapper that can be calibrated to achieve desired error probabilities and privacy constraints, addressing a significant gap in previous work. DP-SPRT relies on a private mechanism that processes a sequence of queries and stops after privately determining when the query results fall outside a predefined interval. This OutsideInterval mechanism improves upon naive composition of existing techniques like AboveThreshold, potentially benefiting other sequential algorithms. We prove generic upper bounds on the error and sample complexity of DP-SPRT that can accommodate various noise distributions based on the practitioner’s privacy needs. We exemplify them in two settings: Laplace noise (pure Differential Privacy) and Gaussian noise (Rényi differential privacy). In the former setting, by providing a lower bound on the sample complexity of any $\epsilon$-DP test with prescribed type I and type II errors, we show that DP-SPRT is near optimal when both errors are small and the two hypotheses are close. Moreover, we conduct an experimental study revealing its good practical performance.
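
For readers unfamiliar with the test this abstract builds on, here is a minimal sketch of the classical, non-private SPRT for two Bernoulli hypotheses, using Wald's approximate thresholds. It does not implement DP-SPRT or the OutsideInterval mechanism, and it adds no privacy noise. The function name `sprt_bernoulli` and its parameters are hypothetical.

```python
import math


def sprt_bernoulli(samples, p0, p1, alpha=0.05, beta=0.05):
    """Classical (non-private) SPRT for H0: p = p0 versus H1: p = p1.

    Returns a (decision, number_of_samples_used) pair.
    """
    # Wald's approximate stopping thresholds on the log-likelihood ratio.
    upper = math.log((1 - beta) / alpha)  # at or above: decide H1
    lower = math.log(beta / (1 - alpha))  # at or below: decide H0

    llr = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        # Add the log-likelihood ratio of one Bernoulli observation.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n


# Usage: a run of successes favours the hypothesis with the larger p.
decision, used = sprt_bernoulli([1, 1, 1, 1, 1, 1], p0=0.3, p1=0.7)
```

The test stops as soon as the running statistic leaves the interval between the two thresholds. That exit condition is the step the abstract describes making private.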

[LG-40] Decorrelated feature importance from local sample weighting

Link: https://arxiv.org/abs/2508.06337
Authors: Benedikt Fröhlich, Alison Durst, Merle Behr
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI often tends to be distributed among all features which are in correlation with the response-generating signal features. Even worse, if multiple signal features are in strong correlation with a noise feature, while being only modestly correlated with one another, this can result in a noise feature having a distinctly larger FI score than any signal feature. Here we propose local sample weighting (losaw) which can flexibly be integrated into many ML algorithms to improve FI scores in the presence of feature correlation in the training data. Our approach is motivated from inverse probability weighting in causal inference and locally, within the ML model, uses a sample weighting scheme to decorrelate a target feature from the remaining features. This reduces model bias locally, whenever the effect of a potential signal feature is evaluated and compared to others. Moreover, losaw comes with a natural tuning parameter, the minimum effective sample size of the weighted population, which corresponds to an interpretation-prediction-tradeoff, analog to a bias-variance-tradeoff as for classical ML tuning parameters. We demonstrate how losaw can be integrated within decision tree-based ML methods and within mini-batch training of neural networks. We investigate losaw for random forest and convolutional neural networks in a simulation study on settings showing diverse correlation patterns. We found that losaw improves FI consistently. Moreover, it often improves prediction accuracy for out-of-distribution, while maintaining a similar accuracy for in-distribution test data.

[LG-41] Enhancing the Scalability of Classical Surrogates for Real-World Quantum Machine Learning Applications

Link: https://arxiv.org/abs/2508.06131
Authors: Philip Anton Hernicht, Alona Sakhnenko, Corey O’Meara, Giorgio Cortiana, Jeanette Miriam Lorenz
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 9 pages, 8 figures

Abstract: Quantum machine learning (QML) presents potential for early industrial adoption, yet limited access to quantum hardware remains a significant bottleneck for deployment of QML solutions. This work explores the use of classical surrogates to bypass this restriction, a technique that builds a lightweight classical representation of a (trained) quantum model so that inference can be performed on entirely classical devices. We reveal the prohibitively high computational demand associated with previously proposed methods for generating classical surrogates from quantum models, and propose an alternative pipeline enabling generation of classical surrogates at a larger scale than was previously possible. Previous methods required at least a high-performance computing (HPC) system for quantum models of below industrial scale (ca. 20 qubits), which raises questions about their practicality. We greatly minimize the redundancies of the previous approach, utilizing only a minute fraction of the resources previously needed. We demonstrate the effectiveness of our method on a real-world energy demand forecasting problem, conducting rigorous testing of performance and computation demand in both simulations and on quantum hardware. Our results indicate that our method achieves high accuracy on the testing dataset while its computational resource requirements scale linearly rather than exponentially. This work presents a lightweight approach to transform quantum solutions into classically deployable versions, facilitating faster integration of quantum technology in industrial settings. Furthermore, it can serve as a powerful research tool in the search for practical quantum advantage in an empirical setup.

[LG-42] IOCC: Aligning Semantic and Cluster Centers for Few-shot Short Text Clustering

Link: https://arxiv.org/abs/2508.06126
Authors: Jixuan Yin, Zhihao Yao, Wenshuai Huo, Xinmiao Yu, Xiaocheng Feng, Bo Li
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: In clustering tasks, it is essential to structure the feature space into clear, well-separated distributions. However, because short text representations have limited expressiveness, conventional methods struggle to identify cluster centers that truly capture each category’s underlying semantics, causing the representations to be optimized in suboptimal directions. To address this issue, we propose IOCC, a novel few-shot contrastive learning method that achieves alignment between the cluster centers and the semantic centers. IOCC consists of two key modules: Interaction-enhanced Optimal Transport (IEOT) and Center-aware Contrastive Learning (CACL). Specifically, IEOT incorporates semantic interactions between individual samples into the conventional optimal transport problem, and generates pseudo-labels. Based on these pseudo-labels, we aggregate high-confidence samples to construct pseudo-centers that approximate the semantic centers. Next, CACL optimizes text representations toward their corresponding pseudo-centers. As training progresses, the collaboration between the two modules gradually reduces the gap between cluster centers and semantic centers. Therefore, the model will learn a high-quality distribution, improving clustering performance. Extensive experiments on eight benchmark datasets show that IOCC outperforms previous methods, achieving up to 7.34% improvement on the challenging Biomedical dataset and also excelling in clustering stability and efficiency. The code is available at: this https URL.

[LG-43] Ensemble-Based Graph Representation of fMRI Data for Cognitive Brain State Classification

Link: https://arxiv.org/abs/2508.06118
Authors: Daniil Vlasenko, Vadim Ushakov, Alexey Zaikin, Denis Zakharov
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Understanding and classifying human cognitive brain states based on neuroimaging data remains one of the foremost and most challenging problems in neuroscience, owing to the high dimensionality and intrinsic noise of the signals. In this work, we propose an ensemble-based graph representation method of functional magnetic resonance imaging (fMRI) data for the task of binary brain-state classification. Our method builds the graph by leveraging multiple base machine-learning models: each edge weight reflects the difference in posterior probabilities between two cognitive states, yielding values in the range [-1, 1] that encode confidence in a given state. We applied this approach to seven cognitive tasks from the Human Connectome Project (HCP 1200 Subject Release), including working memory, gambling, motor activity, language, social cognition, relational processing, and emotion processing. Using only the mean incident edge weights of the graphs as features, a simple logistic-regression classifier achieved average accuracies from 97.07% to 99.74%. We also compared our ensemble graphs with classical correlation-based graphs in a classification task with a graph neural network (GNN). In all experiments, the highest classification accuracy was obtained with ensemble graphs. These results demonstrate that ensemble graphs convey richer topological information and enhance brain-state discrimination. Our approach preserves edge-level interpretability of the fMRI graph representation, is adaptable to multiclass and regression tasks, and can be extended to other neuroimaging modalities and pathological-state classification.

[LG-44] Lightweight Auto-bidding based on Traffic Prediction in Live Advertising

Link: https://arxiv.org/abs/2508.06069
Authors: Bo Yang, Ruixuan Luo, Junqi Jin, Han Zhu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign hopes to maximize the effect (such as conversions) under constraints (such as budget and cost-per-click). The mainstream control of campaigns is auto-bidding, where the performance depends on the decision of the bidding algorithm in each request. The most widely used auto-bidding algorithms include Proportional-Integral-Derivative (PID) control, linear programming (LP), reinforcement learning (RL), etc. Existing methods either do not consider the entire time traffic, or have too high computational complexity. In this paper, the live advertising has high requirements for real-time bidding (second-level control) and faces the difficulty of unknown future traffic. Therefore, we propose a lightweight bidding algorithm Binary Constrained Bidding (BiCB), which neatly combines the optimal bidding formula given by mathematical analysis and the statistical method of future traffic estimation, and obtains good approximation to the optimal result through a low complexity solution. In addition, we complement the form of upper and lower bound constraints for traditional auto-bidding modeling and give theoretical analysis of BiCB. Sufficient offline and online experiments prove BiCB’s good performance and low engineering cost.

[LG-45] Data-Driven Density Steering via the Gromov-Wasserstein Optimal Transport Distance

Link: https://arxiv.org/abs/2508.06052
Authors: Haruto Nakashima, Siddhartha Ganguly, Kenji Kashima
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: To be presented at the IEEE CDC, Rio de Janeiro, 2025

Abstract:We tackle the data-driven chance-constrained density steering problem using the Gromov-Wasserstein metric. The underlying dynamical system is an unknown linear controlled recursion, with the assumption that sufficiently rich input-output data from pre-operational experiments are available. The initial state is modeled as a Gaussian mixture, while the terminal state is required to match a specified Gaussian distribution. We reformulate the resulting optimal control problem as a difference-of-convex program and show that it can be efficiently and tractably solved using the DC algorithm. Numerical results validate our approach through various data-driven schemes.

[LG-46] Hybrid Physics-Machine Learning Models for Quantitative Electron Diffraction Refinements

Link: https://arxiv.org/abs/2508.05908
Authors: Shreshth A. Malik, Tiarnan A.S. Doherty, Benjamin Colmey, Stephen J. Roberts, Yarin Gal, Paul A. Midgley
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)

Abstract:High-fidelity electron microscopy simulations required for quantitative crystal structure refinements face a fundamental challenge: while physical interactions are well-described theoretically, real-world experimental effects are challenging to model analytically. To address this gap, we present a novel hybrid physics-machine learning framework that integrates differentiable physical simulations with neural networks. By leveraging automatic differentiation throughout the simulation pipeline, our method enables gradient-based joint optimization of physical parameters and neural network components representing experimental variables, offering superior scalability compared to traditional second-order methods. We demonstrate this framework through application to three-dimensional electron diffraction (3D-ED) structure refinement, where our approach learns complex thickness distributions directly from diffraction data rather than relying on simplified geometric models. This method achieves state-of-the-art refinement performance across synthetic and experimental datasets, recovering atomic positions, thermal displacements, and thickness profiles with high fidelity. The modular architecture proposed can naturally be extended to accommodate additional physical phenomena and extended to other electron microscopy techniques. This establishes differentiable hybrid modeling as a powerful new paradigm for quantitative electron microscopy, where experimental complexities have historically limited analysis.

[LG-47] Stochastic Trace Optimization of Parameter Dependent Matrices Based on Statistical Learning Theory

Link: https://arxiv.org/abs/2508.05764
Authors: Arvind K. Saibaba, Ilse C.F. Ipsen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 3 figures

Abstract: We consider matrices $\boldsymbol{A}(\boldsymbol{\theta})\in\mathbb{R}^{m\times m}$ that depend, possibly nonlinearly, on a parameter $\boldsymbol{\theta}$ from a compact parameter space $\Theta$. We present a Monte Carlo estimator for minimizing $\text{trace}(\boldsymbol{A}(\boldsymbol{\theta}))$ over all $\boldsymbol{\theta}\in\Theta$, and determine the sampling amount so that the backward error of the estimator is bounded with high probability. We derive two types of bounds, based on epsilon nets and on generic chaining. Both types predict a small sampling amount for matrices $\boldsymbol{A}(\boldsymbol{\theta})$ with small offdiagonal mass, and parameter spaces $\Theta$ of small "size." Dependence on the matrix dimension $m$ is only weak or not explicit. The bounds based on epsilon nets are easier to evaluate and come with fully specified constants. In contrast, the bounds based on chaining depend on the Talagrand functionals which are difficult to evaluate, except in very special cases. Comparisons between the two types of bounds are difficult, although the literature suggests that chaining bounds can be superior.
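
To make the setting concrete, the sketch below estimates the trace with random sign (Rademacher) probe vectors, a Hutchinson-style estimator, and minimizes the estimate over a finite grid of parameter values. The abstract does not state which Monte Carlo estimator or search strategy the paper uses, so both are assumptions here. The names `hutchinson_trace` and `argmin_trace` are hypothetical.

```python
import random


def hutchinson_trace(matrix, num_probes, rng):
    """Estimate trace(matrix) as the average of z^T A z over random sign vectors z."""
    m = len(matrix)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        # Matrix-vector product A z, then the quadratic form z^T (A z).
        az = [sum(matrix[i][j] * z[j] for j in range(m)) for i in range(m)]
        total += sum(z[i] * az[i] for i in range(m))
    return total / num_probes


def argmin_trace(matrix_fn, thetas, num_probes=64, seed=0):
    """Pick the theta in a finite grid with the smallest estimated trace."""
    best_theta, best_val = None, float("inf")
    for theta in thetas:
        # Re-seeding reuses the same probe vectors for every theta
        # (an illustrative choice, not taken from the paper).
        rng = random.Random(seed)
        est = hutchinson_trace(matrix_fn(theta), num_probes, rng)
        if est < best_val:
            best_theta, best_val = theta, est
    return best_theta, best_val


# Usage: a diagonal family whose trace is (theta - 1)^2 + 3.
def diag_family(theta):
    return [[(theta - 1) ** 2 + 1, 0], [0, 2]]


best_theta, best_val = argmin_trace(diag_family, [0, 1, 2])
```

For a diagonal matrix every probe returns the exact trace, since each squared sign equals one. Off-diagonal entries are the only source of variance in this estimator, which is consistent with the abstract's remark that small offdiagonal mass requires few samples.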

[LG-48] Evaluating Universal Machine Learning Force Fields Against Experimental Measurements

Link: https://arxiv.org/abs/2508.05762
Authors: Sajid Mannan, Vaibhav Bihani, Carmelo Gonzales, Kin Long Kelvin Lee, Nitya Nand Gosvami, Sayan Ranu, Santiago Miret, N M Anoop Krishnan
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Universal machine learning force fields (UMLFFs) promise to revolutionize materials science by enabling rapid atomistic simulations across the periodic table. However, their evaluation has been limited to computational benchmarks that may not reflect real-world performance. Here, we present UniFFBench, a comprehensive framework for evaluating UMLFFs against experimental measurements of ~1,500 carefully curated mineral structures spanning diverse chemical environments, bonding types, structural complexity, and elastic properties. Our systematic evaluation of six state-of-the-art UMLFFs reveals a substantial reality gap: models achieving impressive performance on computational benchmarks often fail when confronted with experimental complexity. Even the best-performing models exhibit higher density prediction error than the threshold required for practical applications. Most strikingly, we observe disconnects between simulation stability and mechanical property accuracy, with prediction errors correlating with training data representation rather than the modeling method. These findings demonstrate that while current computational benchmarks provide valuable controlled comparisons, they may overestimate model reliability when extrapolated to experimentally complex chemical spaces. Altogether, UniFFBench establishes essential experimental validation standards and reveals systematic limitations that must be addressed to achieve truly universal force field capabilities.

[LG-49] Detecting Model Misspecification in Cosmology with Scale-Dependent Normalizing Flows

Link: https://arxiv.org/abs/2508.05744
Authors: Aizhan Akhmetzhanova, Carolina Cuesta-Lazaro, Siddharth Mishra-Sharma
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 14 + 5 pages, 6 + 4 figures

Abstract: Current and upcoming cosmological surveys will produce unprecedented amounts of high-dimensional data, which require complex high-fidelity forward simulations to accurately model both the physical processes and the systematic effects that describe the data generation process. However, validating whether our theoretical models accurately describe the observed datasets remains a fundamental challenge. An additional complexity to this task comes from choosing appropriate representations of the data which retain all the relevant cosmological information, while reducing the dimensionality of the original dataset. In this work we present a novel framework combining scale-dependent neural summary statistics with normalizing flows to detect model misspecification in cosmological simulations through Bayesian evidence estimation. By conditioning our neural network models for data compression and evidence estimation on the smoothing scale, we systematically identify where theoretical models break down in a data-driven manner. We demonstrate a first application of our approach using matter and gas density fields from three CAMELS simulation suites with different subgrid physics implementations.

[LG-50] Reduction Techniques for Survival Analysis

Link: https://arxiv.org/abs/2508.05715
Authors: Johannes Piller,Léa Orsini,Simon Wiegrebe,John Zobolas,Lukas Burk,Sophie Hanna Langbein,Philip Studener,Markus Goeswein,Andreas Bender
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:In this work, we discuss what we refer to as reduction techniques for survival analysis, that is, techniques that “reduce” a survival task to a more common regression or classification task, without ignoring the specifics of survival data. Such techniques particularly facilitate machine learning-based survival analysis, as they allow for applying standard tools from machine and deep learning to many survival tasks without requiring custom learners. We provide an overview of different reduction techniques and discuss their respective strengths and weaknesses. We also provide a principled implementation of some of these reductions, such that they are directly available within standard machine learning workflows. We illustrate each reduction using dedicated examples and perform a benchmark analysis that compares their predictive performance to established machine learning methods for survival analysis.
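
To make the idea of a reduction concrete, here is a minimal sketch of one well-known example, the discrete-time (person-period) reduction, which turns right-censored survival data into a binary classification table. The abstract does not say which reductions the paper covers, so this is only an illustration and not the authors' implementation. The function names (`person_period`, `interval_hazards`, `survival_curve`), the interval cut points, and the toy data are all assumptions made for this example, and the empirical event fraction stands in for a real classifier.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int, int]  # (subject id, interval index, event label)


def person_period(times: Sequence[float], events: Sequence[int],
                  cuts: Sequence[float]) -> List[Row]:
    """Expand (time, event) pairs into one row per subject per interval at risk.

    Interval j is (cuts[j], cuts[j + 1]]. The label is 1 only in the interval
    where the event is observed; censored subjects contribute only 0 labels.
    Assumed convention: a subject censored inside an interval still counts as
    at risk for that interval.
    """
    rows: List[Row] = []
    for sid, (t, e) in enumerate(zip(times, events)):
        for j in range(len(cuts) - 1):
            lo, hi = cuts[j], cuts[j + 1]
            if t <= lo:
                break  # subject left the risk set before this interval
            ends_here = t <= hi
            rows.append((sid, j, int(ends_here and e == 1)))
            if ends_here:
                break  # no rows after the event or censoring time
    return rows


def interval_hazards(rows: Sequence[Row], n_intervals: int) -> List[float]:
    """Estimate the discrete hazard per interval as the fraction of 1 labels.

    A real workflow would fit any binary classifier on the rows instead.
    """
    at_risk = [0] * n_intervals
    failed = [0] * n_intervals
    for _, j, label in rows:
        at_risk[j] += 1
        failed[j] += label
    return [f / n if n else 0.0 for f, n in zip(failed, at_risk)]


def survival_curve(hazards: Sequence[float]) -> List[float]:
    """Survival at the end of each interval: running product of (1 - hazard)."""
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        surv.append(s)
    return surv


# Toy data: event indicator 1 = event observed, 0 = right-censored.
times = [2.0, 5.0, 5.0, 8.0]
events = [1, 0, 1, 1]
cuts = [0.0, 3.0, 6.0, 9.0]

rows = person_period(times, events, cuts)
hazards = interval_hazards(rows, len(cuts) - 1)
print(survival_curve(hazards))
```

Once the data is in this long format, the interval index and any covariates become ordinary features, so a standard classifier can predict the per-interval hazard without a survival-specific learner.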

[LG-51] Random Walk Learning and the Pac-Man Attack

Link: https://arxiv.org/abs/2508.05663
Authors: Xingran Chen,Parimal Parag,Rohit Bhagat,Zonghong Liu,Salim El Rouayheb
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the “Pac-Man” attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the Average Crossing (AC) algorithm, a fully decentralized mechanism for duplicating RWs to prevent RW extinction in the presence of Pac-Man. Our theoretical analysis establishes that (i) the RW population remains almost surely bounded under AC and (ii) RW-based stochastic gradient descent remains convergent under AC, even in the presence of Pac-Man, with a quantifiable deviation from the true optimum. Our extensive empirical results on both synthetic and real-world datasets corroborate our theoretical findings. Furthermore, they uncover a phase transition in the extinction probability as a function of the duplication threshold. We offer theoretical insights by analyzing a simplified variant of the AC, which sheds light on the observed phase transition.

[LG-52] Moment Estimate and Variational Approach for Learning Generalized Diffusion with Non-gradient Structures

Link: https://arxiv.org/abs/2508.01854
Authors: Fanze Kong,Chen-Chih Lai,Yubin Lu
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Abstract:This paper proposes a data-driven learning framework for identifying governing laws of generalized diffusions with non-gradient components. By combining energy dissipation laws with a physically consistent penalty and first-moment evolution, we design a two-stage method to recover the pseudo-potential and rotation in the pointwise orthogonal decomposition of a class of non-gradient drifts in generalized diffusions. Our two-stage method is applied to complex generalized diffusion processes including dissipation-rotation dynamics, rough pseudo-potentials and noisy data. Representative numerical experiments demonstrate the effectiveness of our approach for learning physical laws in non-gradient generalized diffusions.

Information Retrieval

[IR-0] M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation

Link: https://arxiv.org/abs/2508.06328
Authors: Zhiyou Xiao,Qinhan Yu,Binghui Li,Geng Chen,Chong Chen,Wentao Zhang
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Current research on Multimodal Retrieval-Augmented Generation (MRAG) enables diverse multimodal inputs but remains limited to single-modality outputs, restricting expressive capacity and practical utility. In contrast, real-world applications often demand both multimodal inputs and multimodal outputs for effective communication and grounded reasoning. Motivated by the recent success of Reinforcement Learning (RL) in complex reasoning tasks for Large Language Models (LLMs), we adopt RL as a principled and effective paradigm to address the multi-step, outcome-driven challenges inherent in multimodal output generation. Here, we introduce M2IO-R1, a novel framework for Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) that supports both multimodal inputs and outputs. Central to our framework is an RL-based inserter, Inserter-R1-3B, trained with Group Relative Policy Optimization to guide image selection and placement in a controllable and semantically aligned manner. Empirical results show that our lightweight 3B inserter achieves strong reasoning capabilities with significantly reduced latency, outperforming baselines in both quality and efficiency.

[IR-1] Improving Table Retrieval with Question Generation from Partial Tables ACL2025

Link: https://arxiv.org/abs/2508.06168
Authors: Hsing-Ping Liang,Che-Wei Chang,Yao-Chung Fan
Categories: Information Retrieval (cs.IR)
Comments: TRL@ACL2025

Abstract:Recent advances in open-domain question answering over tables have widely adopted large language models (LLMs) under the Retriever-Reader architecture. Prior works have effectively leveraged LLMs to tackle the complex reasoning demands of the Reader component, such as text-to-text, text-to-SQL, and multi-hop reasoning. In contrast, work on the Retriever component has primarily focused on optimizing the query representation: training retrievers to retrieve relevant tables based on questions, or to select keywords from questions for matching table segments. However, little attention has been given to enhancing how tables themselves are represented in embedding space to better align with questions. To address this, we propose QGpT (Question Generation from Partial Tables), a simple yet effective method that uses an LLM to generate synthetic questions based on small portions of a table. These questions are generated to simulate how a user might query the content of the table currently under consideration. The generated questions are then jointly embedded with the partial table segments used for generation, enhancing semantic alignment with user queries. Without the need to embed entire tables, our method significantly improves retrieval performance across multiple benchmarks for both dense and late-interaction retrievers.

[IR-2] When a Paper Has 1000 Authors: Rethinking Citation Metrics in the Era of LLMs

Link: https://arxiv.org/abs/2508.06004
Authors: Weihang Guo,Zhao Song,Jiahao Zhang
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:

Abstract:Author-level citation metrics provide a practical, interpretable, and scalable signal of scholarly influence in a complex research ecosystem. They have been widely used as a proxy in hiring decisions. However, the past five years have seen the rapid emergence of large-scale publications in the field of large language models and foundation models, with papers featuring hundreds to thousands of co-authors and receiving tens of thousands of citations within months. For example, Gemini has 1361 authors and has been cited around 4600 times in 19 months. In such cases, traditional metrics, such as total citation count and the h-index, fail to meaningfully distinguish individual contributions. Therefore, we propose the following research question: How can one identify standout researchers among thousands of co-authors in large-scale LLM papers? This question is particularly important in scenarios such as academic hiring and funding decisions. In this paper, we introduce a novel citation metric designed to address this challenge by balancing contributions across large-scale and small-scale publications. We propose the SBCI index, analyze its theoretical properties, and evaluate its behavior on synthetic publication datasets. Our results demonstrate that the proposed metric provides a more robust and discriminative assessment of individual scholarly impact in the era of large-scale collaborations.
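
As a small illustration of why the traditional metrics named in the abstract struggle here, the sketch below computes total citations and the standard h-index (the largest h such that an author has h papers with at least h citations each). The helper name `h_index` and the sample citation lists are assumptions for this example. The 4600 figure is borrowed from the Gemini example in the abstract. The paper's proposed SBCI index is not reproduced, since the abstract does not define it.

```python
from typing import Sequence


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author A: five small-team papers.
author_a = [10, 8, 5, 4, 3]
# Hypothetical author B: the same papers plus one mega-collaboration paper.
author_b = author_a + [4600]

for name, papers in (("A", author_a), ("B", author_b)):
    print(name, "total citations:", sum(papers), "h-index:", h_index(papers))
```

Neither number depends on the author count, so every co-author of the mega-collaboration paper receives exactly the same credit for it, which is the gap the abstract points to.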

[IR-3] Efficient Multimodal Streaming Recommendation via Expandable Side Mixture-of-Experts CIKM2025

Link: https://arxiv.org/abs/2508.05993
Authors: Yunke Qu,Liang Qu,Tong Chen,Quoc Viet Hung Nguyen,Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Comments: Accepted to CIKM 2025

Abstract:Streaming recommender systems (SRSs) are widely deployed in real-world applications, where user interests shift and new items arrive over time. As a result, effectively capturing users’ latest preferences is challenging, as interactions reflecting recent interests are limited and new items often lack sufficient feedback. A common solution is to enrich item representations using multimodal encoders (e.g., BERT or ViT) to extract visual and textual features. However, these encoders are pretrained on general-purpose tasks: they are not tailored to user preference modeling, and they overlook the fact that user tastes toward modality-specific features such as visual styles and textual tones can also drift over time. This presents two key challenges in streaming scenarios: the high cost of fine-tuning large multimodal encoders, and the risk of forgetting long-term user preferences due to continuous model updates. To tackle these challenges, we propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules consisting of expandable expert networks to frozen pretrained encoders and incrementally expands them in response to evolving user feedback. A gating router dynamically combines expert and backbone outputs, while a utilization-based pruning strategy maintains model compactness. By learning new patterns through expandable experts without overwriting previously acquired knowledge, XSMoE effectively captures both cold start and shifting preferences in multimodal features. Experiments on three real-world datasets demonstrate that XSMoE outperforms state-of-the-art baselines in both recommendation quality and computational efficiency. 

[IR-4] Dual prototype attentive graph network for cross-market recommendation ICONIP2025

Link: https://arxiv.org/abs/2508.05969
Authors: Li Fan,Menglin Kong,Yang Xiang,Chong Zhang,Chengtao Ji
Categories: Information Retrieval (cs.IR)
Comments: Accepted by ICONIP 2025 (Oral)

Abstract:Cross-market recommender systems (CMRS) aim to utilize historical data from mature markets to promote multinational products in emerging markets. However, existing CMRS approaches often overlook the potential for shared preferences among users in different markets, focusing primarily on modeling specific preferences within each market. In this paper, we argue that incorporating both market-specific and market-shared insights can enhance the generalizability and robustness of CMRS. We propose a novel approach called Dual Prototype Attentive Graph Network for Cross-Market Recommendation (DGRE) to address this. DGRE leverages prototypes based on graph representation learning from both items and users to capture market-specific and market-shared insights. Specifically, DGRE incorporates market-shared prototypes by clustering users from various markets to identify behavioural similarities and create market-shared user profiles. Additionally, it constructs item-side prototypes by aggregating item features within each market, providing valuable market-specific insights. We conduct extensive experiments to validate the effectiveness of DGRE on a real-world cross-market dataset, and the results show that considering both market-specific and market-shared aspects in modelling can improve the generalization and robustness of CMRS.

[IR-5] WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent

Link: https://arxiv.org/abs/2508.05748
Authors: Xinyu Geng,Peng Xia,Zhen Zhang,Xinyu Wang,Qiuchen Wang,Ruixue Ding,Chenxi Wang,Jialong Wu,Yida Zhao,Kuan Li,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced visual-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, which paves the way for solving complex multimodal information-seeking tasks.

[IR-6] LLM4ES: Learning User Embeddings from Event Sequences via Large Language Models

Link: https://arxiv.org/abs/2508.05688
Authors: Aleksei Shestov,Omar Zoloev,Maksim Makarenko,Mikhail Orlov,Egor Fadeev,Ivan Kireev,Andrey Savchenko
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:This paper presents LLM4ES, a novel framework that exploits large pre-trained language models (LLMs) to derive user embeddings from event sequences. Event sequences are transformed into a textual representation, which is subsequently used to fine-tune an LLM through next-token prediction to generate high-quality embeddings. We introduce a text enrichment technique that enhances LLM adaptation to event sequence data, improving representation quality for low-variability domains. Experimental results demonstrate that LLM4ES achieves state-of-the-art performance in user classification tasks in financial and other domains, outperforming existing embedding methods. The resulting user embeddings can be incorporated into a wide range of applications, from user segmentation in finance to patient outcome prediction in healthcare.

Attachment Download

Click to download today's full paper list