本篇博文主要内容为 2025-08-18 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-08-18)

今日共更新415篇论文,其中:

  • 自然语言处理63篇(Computation and Language (cs.CL))
  • 人工智能103篇(Artificial Intelligence (cs.AI))
  • 计算机视觉118篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习105篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Controlling Multimodal LLM s via Reward-guided Decoding ICCV2025

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉理解任务中存在对象幻觉(object hallucination)的问题,即模型生成的文本描述与图像内容不一致,导致视觉定位准确性不足。其解决方案的关键在于提出一种基于奖励引导的解码方法(reward-guided decoding),通过构建两个独立的奖励模型(reward models)分别控制输出结果中的对象精度(precision)和召回率(recall),从而实现对推理过程的动态调控:一方面允许用户在解码过程中调整两个奖励函数的相对权重,以灵活权衡精度与召回;另一方面通过调节搜索广度(breadth of the search)来平衡测试时计算资源与视觉定位质量之间的关系。实验表明,该方法在标准对象幻觉基准上显著提升了可控性,并优于现有幻觉缓解技术。

链接: https://arxiv.org/abs/2508.11616
作者: Oscar Mañas,Pierluca D’Oro,Koustuv Sinha,Adriana Romero-Soriano,Michal Drozdzal,Aishwarya Agrawal
机构: Mila - Quebec AI Institute (蒙特利尔魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学); Meta FAIR (Meta人工智能研究院); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published at ICCV 2025

点击查看摘要

Abstract:As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM’s decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model’s output. Our approach enables on-the-fly controllability of an MLLM’s inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
zh

[NLP-1] nyTim: A Family of Language Models for Divergent Generation NEURIPS

【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)在高度复杂且非线性文本中挖掘潜在的创造性知识表示问题。解决方案的关键在于通过微调(fine-tuning)构建名为TinyTim的专用语言模型家族,使其专注于詹姆斯·乔伊斯(James Joyce)的《芬尼根的守灵夜》(Finnegans Wake),从而生成具有高词汇多样性(lexical diversity)和低语义连贯性(semantic coherence)的文本模式。这种独特生成特征被解释为一种“发散性知识源”(divergent knowledge source),可在更复杂的创意架构中用于自动化发现机制,推动跨场景下的创造性问题求解能力。

链接: https://arxiv.org/abs/2508.11607
作者: Christopher J. Agostino
机构: NPC Worldwide
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures, submitted to NeurIPS Creative AI track, code and model available at this https URL

点击查看摘要

Abstract:This work introduces TinyTim, a family of large language models fine-tuned on James Joyce’s `Finnegans Wake’. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.
zh

[NLP-2] Dataset Creation for Visual Entailment using Generative AI

【速读】: 该论文旨在解决视觉蕴含(Visual Entailment, VE)模型训练中数据稀缺的问题。现有视觉蕴含数据集相较于文本蕴含数据集规模小且稀疏,而人工构建此类数据集成本高昂。其解决方案的关键在于利用已有的大规模文本蕴含数据集SNLI,将其中的假设文本作为输入提示(prompt)送入生成式图像模型Stable Diffusion,从而自动生成与文本语义一致的图像,构建合成视觉蕴含数据集。实验表明,使用该合成数据训练的模型在SNLI-VE和SICK-VTE两个基准上仅出现轻微性能下降(F-score分别从0.703降至0.686、从0.400降至0.384),验证了合成数据在缓解数据稀缺问题上的有效性与可行性。

链接: https://arxiv.org/abs/2508.11605
作者: Rob Reijtenbach,Suzan Verberne,Gijs Wijnholds
机构: Leiden University (莱顿大学)
类目: Computation and Language (cs.CL)
备注: NALOMA: Natural Logic meets Machine Learning workshop @ ESSLLI 2025

点击查看摘要

Abstract:In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
zh

[NLP-3] Representing Speech Through Autoregressive Prediction of Cochlear Tokens

【速读】: 该论文旨在解决当前语音表示学习模型在模拟人类听觉处理机制和提升跨任务泛化能力方面的不足。其核心问题是现有方法往往缺乏对生物启发性声学特征提取与高层语义建模的协同设计,导致模型难以同时实现高质量的语音表征与下游任务性能。解决方案的关键在于提出AuriStream框架,该框架采用两阶段架构:第一阶段基于人耳耳蜗(cochlea)的时频转换机制生成离散的耳蜗标记(cochlear tokens),模拟初级听觉皮层的感知编码;第二阶段使用自回归序列模型对这些标记进行建模,从而学习到具有明确音素和词汇语义信息的高级表示。此结构不仅提升了在SUPERB等多样语音任务上的表现,还具备音频续写与可视化能力,为理解模型决策提供了可解释性支持。

链接: https://arxiv.org/abs/2508.11598
作者: Greta Tuckute,Klemen Kotar,Evelina Fedorenko,Daniel L.K. Yamins
机构: Massachusetts Institute of Technology (麻省理工学院); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete \textbfcochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream’s strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model’s predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
zh

[NLP-4] Aware First Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因长链式思维(Long Chain-of-Thought, CoT)导致的冗余计算问题,该问题显著降低了计算效率并延缓了实时应用响应速度。现有方法依赖人工定义的难度先验,与模型自身对任务难度的认知不一致,从而限制了优化效果。解决方案的关键在于提出动态推理边界自感知框架(Dynamic Reasoning-Boundary Self-Awareness Framework, DR. SAF),其核心机制包括:边界自感知对齐(Boundary Self-Awareness Alignment)、自适应奖励管理(Adaptive Reward Management)和边界保持机制(Boundary Preservation Mechanism),使模型能够根据任务复杂度动态调整推理深度,在保证准确率的同时显著提升token效率与训练效率。

链接: https://arxiv.org/abs/2508.11582
作者: Qiguang Chen,Dengyun Peng,Jinhao Liu,HuiKang Su,Jiannan Guan,Libo Qin,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve the efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM’s self-awared difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.
zh

[NLP-5] Agent Mental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

【速读】: 该论文旨在解决传统心理评估依赖临床医生、资源匮乏,且现有自动化方法多基于静态文本分析、难以捕捉动态交互中深层信息的问题。其解决方案的关键在于提出一种多智能体框架,模拟医患对话流程,通过专用代理分别负责提问、响应适配性评估、评分与记忆更新;其中核心创新为引入自适应提问机制,由评估代理判断用户回答的充分性并生成针对性追问以消除歧义和填补信息缺口,同时采用树状结构记忆系统动态存储和更新用户信息(根节点为基本信息,子节点按症状类别和交互轮次组织),从而提升信息提取效率与上下文感知能力。

链接: https://arxiv.org/abs/2508.11567
作者: Jinpeng Hu,Ao Wang,Qianqian Xie,Hui Ma,Zhuo Li,Dan Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user’s basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
zh

[NLP-6] Language models align with brain regions that represent concepts across modalities

【速读】: 该论文旨在解决语言模型(Language Models, LMs)中语言表征与概念意义表征难以分离的问题,这一问题同样存在于认知科学和神经科学领域。其解决方案的关键在于通过两个神经度量指标来评估语言模型与大脑激活模式的对齐程度:一是句子处理过程中脑区的激活水平,用于聚焦语言加工;二是跨模态意义一致性(meaning consistency across input modalities),即利用fMRI数据量化同一脑区在不同输入范式(句子、词云、图像)下对相同概念响应的一致性。实验结果表明,无论是否包含视觉信息的语言模型,均在具有更高跨模态意义一致性的脑区表现出更强的信号预测能力,即使这些区域对语言处理不敏感,这暗示语言模型可能内部表征了跨模态的概念意义。

链接: https://arxiv.org/abs/2508.11536
作者: Maria Ryskina,Greta Tuckute,Alexander Fung,Ashley Malkin,Evelina Fedorenko
机构: Vector Institute for AI (向量研究所); MIT (麻省理工学院)
类目: Computation and Language (cs.CL)
备注: Accepted to COLM 2025. Code and data can be found at this https URL

点击查看摘要

Abstract:Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today’s language models (LMs), we investigate the relationship between LM–brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.
zh

[NLP-7] Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)是否存在物种歧视(speciesist bias)的问题,即模型是否在道德判断和价值评估中表现出对非人类动物的系统性偏见,并探讨其如何体现或强化人类社会中关于动物剥削的文化规范。解决方案的关键在于通过三个维度进行系统性实证分析:一是构建SpeciesismBench基准测试以量化模型对物种主义陈述的认知与道德评价;二是引入心理学测量方法比较模型与人类在物种态度上的差异;三是设计开放式文本生成任务考察模型对物种主义合理化言论的响应模式。研究发现,尽管LLMs能识别物种主义言论,但往往不加批判地接受其道德正当性,且在权衡人类与动物利益时更倾向于优先保护人类个体,尤其当动物被描述为具备更高认知能力时则可能优先于低能力人类——这暗示模型可能基于认知能力而非物种身份进行价值排序。因此,论文主张将非人类道德主体纳入AI公平性与对齐框架,是减少此类偏见、防止物种主义观念在AI系统及社会中固化的核心路径。

链接: https://arxiv.org/abs/2508.11534
作者: Monika Jotautaitė,Lucius Caviola,David A. Brewster,Thilo Hagendorff
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias – discrimination based on species membership – and how they value non-human animals. We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.
zh

[NLP-8] Reference Points in LLM Sentiment Analysis: The Role of Structured Context

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的情感分析方法在营销研究中受限于仅依赖评论文本本身,而忽视了消费者评价中关键的参考点信息(如预期与实际体验的差异),从而导致情感判断不够准确的问题。其解决方案的关键在于引入结构化提示(structured prompting),具体通过JSON格式的提示模板嵌入额外的参考信息(如顾客预期、产品属性等),使轻量级3B参数模型在无需微调的情况下显著提升情感分析性能——实验表明,在Yelp餐厅和夜生活两个类别上,Macro-F1分别提升1.6%和4%,均方根误差(RMSE)分别降低16%和9.1%,且性能提升源于模型对上下文的真实推理能力,而非标签代理效应,从而为资源受限边缘设备上的高效部署提供了可行路径。

链接: https://arxiv.org/abs/2508.11454
作者: Junichiro Niimi
机构: Meijo University (明治大学); RIKEN AIP (理化学研究所先进智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation–disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
zh

[NLP-9] Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和多模态大语言模型(Multimodal Large Language Models, MLLMs)评估体系中存在的重要问题:现有基准测试(如MMLU)和排行榜(如Chatbot Arena)主要依赖静态数据集或众包的通用领域提示,难以真实反映模型在实际应用场景中的性能表现。为填补这一差距,作者提出Inclusion Arena,一个基于AI驱动应用中直接收集的人类反馈进行实时排名的平台。其解决方案的关键在于:将成对模型比较嵌入自然用户交互流程中以确保评估贴近实际使用场景,并采用改进的Bradley-Terry模型——引入两项核心创新:(1) Placement Matches机制,用于新模型冷启动时快速估计初始评分;(2) Proximity Sampling策略,优先选择能力相近模型进行比较,从而最大化信息增益并提升评分稳定性。实证分析表明,该方法可生成更可靠、稳定的排名,且显著降低恶意操纵风险。

链接: https://arxiv.org/abs/2508.11452
作者: Kangyu Wang,Hongliang He,Lin Liu,Ruiqi Liang,Zhenzhong Lan,Jianguo Li
机构: Inclusion AI; Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Westlake University (西湖大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Our platform is publicly accessible at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at this https URL.
zh

[NLP-10] CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

【速读】: 该论文旨在解决在联合训练单一编码器以同时优化信息检索(Information Retrieval, IR)和语义文本相似度(Semantic Textual Similarity, STS)任务时,因两任务目标本质差异导致的负迁移问题,即 naive co-training 常引发显著性能折损。解决方案的关键在于系统性地解耦训练流程中的任务特异性学习信号,提出 CoDiEmb 框架:其核心创新包括(1)采用任务专用的目标函数与动态采样器构建单任务批次并平衡更新,避免梯度干扰——IR 使用多正样本和难负样本的对比损失,STS 采用顺序感知目标直接优化相关性和排序一致性;(2)引入 delta-guided 模型融合策略,基于参数偏离预训练初始化的程度计算细粒度合并权重,优于传统 Model Soups;(3)设计高效单阶段训练流程,实现稳定收敛。实验证明该框架有效缓解跨任务权衡,并提升嵌入空间的几何特性。

链接: https://arxiv.org/abs/2508.11442
作者: Bowen Zhang,Zixin Song,Chunquan Chen,Qian-Wen Zhang,Di Yin,Xing Sun
机构: Tsinghua University (清华大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter’s deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
zh

[NLP-11] Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在内容审核中对反性别歧视言论(anti-sexist speech)的误判问题,即模型常将挑战性别歧视的言论错误分类为有害内容,从而可能压制女性政治人物等边缘群体的声音。其解决方案的关键在于:首先,摒弃简单的“有害/非有害”二元分类框架;其次,在高敏感事件期间引入人工审核机制(human-in-the-loop review);最后,在训练数据中明确纳入反性别歧视言论(counter-speech),以提升模型对抵抗性话语的识别能力。这一方法融合了女性主义理论、事件驱动分析与模型评估,揭示了数字政治空间中保障抗争性言论安全的复杂社会技术挑战。

链接: https://arxiv.org/abs/2508.11434
作者: Aditi Dutta,Susan Banducci
机构: University of Exeter (埃克塞特大学); University of Birmingham (伯明翰大学); The Alan Turing Institute (艾伦图灵研究所)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis show that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.
zh

[NLP-12] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor

【速读】: 该论文旨在解决生成式 AI 在自动化幽默生成中普遍存在的问题,即生成的笑话往往缺乏多样性、重复性高或与听众的文化背景和情境脱节(context sensitivity),从而导致幽默效果不佳。其核心解决方案是提出 HumorPlanSearch 模块化流水线,关键在于将上下文建模贯穿于整个生成流程:通过 Plan-Search 实现多样化且主题定制化的策略选择,利用 Humor Chain-of-Thought (HuCoT) 模板捕捉文化与风格相关的推理逻辑,借助知识图谱(Knowledge Graph)检索并适配历史高效策略,结合语义嵌入进行新颖性过滤,并引入迭代式判别驱动的修订循环以优化输出质量。这一设计使模型在策略规划到多信号评估的每个阶段都显式考虑上下文因素,显著提升了幽默内容的连贯性、适应性和文化贴合度。

链接: https://arxiv.org/abs/2508.11429
作者: Shivam Dubey
机构: Indian Institute of Technology Madras (印度理工学院马德拉斯分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener’s cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.
zh

[NLP-13] Survey-to-Behavior: Downstream Alignment of Human Values in LLM s via Survey Questions

【速读】: 该论文试图解决的问题是如何在不依赖大规模训练数据的前提下,有效调整大语言模型(Large Language Models, LLMs)的价值观倾向,使其在下游任务中表现出与人类价值观更一致的行为。解决方案的关键在于通过微调(fine-tuning)模型来响应价值调查问卷(value survey questions),即让模型学习对一系列涵盖20种不同人类价值观的描述进行评分,从而显式地引导其价值系统。实验表明,这种基于价值问卷的微调不仅能显著改变模型在域内问卷上的回答,还能在域外情境(如Reddit帖子中的道德判断和文本冒险游戏行为)中引发显著的价值对齐(value alignment)变化,证明了该方法的有效性和泛化能力。

链接: https://arxiv.org/abs/2508.11414
作者: Shangrui Nie,Florian Mai,David Kaczér,Charles Welch,Zhixue Zhao,Lucie Flek
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages 1 figure

点击查看摘要

Abstract:Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model’s value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model’s behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model’s behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model’s behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model’s answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.
zh

[NLP-14] Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training

【速读】: 该论文旨在解决理性化Transformer分类器在训练过程中存在的不稳定性问题,以及现有方法难以同时实现高精度分类与可解释性(即生成人类标注对齐的输入token重要性评分)的挑战。其解决方案的关键在于提出一种端到端可微分的训练范式,通过构建一个单一模型同时承担三个角色——理由选择器(rationale selector)、分类器(classifier)和补集分类器(complement classifier),从而避免传统三玩家博弈框架中因多模型协同训练导致的训练不稳定问题,并显著提升模型在无显式监督条件下生成类特定理由(class-wise rationales)的能力,实现与人类标注的高度一致。

链接: https://arxiv.org/abs/2508.11393
作者: Marc Brinner,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely-used three-player-game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.
zh

[NLP-15] Model Interpretability and Rationale Extraction by Input Mask Optimization

【速读】: 该论文旨在解决神经网络模型预测结果缺乏可解释性的问题,即如何生成高质量的提取式解释(extractive explanations),以揭示模型在做出决策时所依赖的关键输入特征。其解决方案的关键在于提出一种基于梯度优化的掩码方法,通过在输入中屏蔽对分类不具指示性的部分,并引入一种新的正则化策略来同时保证解释的充分性(sufficiency)、全面性(comprehensiveness)和紧凑性(compactness)。该方法无需训练专用解释模型,仅依赖已训练好的分类器即可实现对文本和图像等多种输入类型的高质解释,从而打通了模型可解释性与自然语言处理中理由提取(rationale extraction)之间的壁垒。

链接: https://arxiv.org/abs/2508.11388
作者: Marc Brinner,Sina Zarriess
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Concurrent to the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby proving that the latter of which can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.
zh

[NLP-16] Retrieval-augmented reasoning with lean language models

【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统依赖大规模模型和外部API所带来的性能瓶颈与隐私风险,尤其是在资源受限或安全要求较高的环境中难以部署的问题。其解决方案的关键在于构建一个轻量级语言模型架构,通过集成密集检索器与微调后的Qwen2.5-Instruct模型,并利用前沿模型(如DeepSeek-R1)生成合成查询和推理轨迹对特定领域语料(NHS A-to-Z条件页面)进行训练,从而实现高效、准确的复杂领域问答能力。该方法在不牺牲性能的前提下显著提升了本地部署的可行性,同时通过摘要式文档压缩和推理感知微调进一步优化了答案准确性与一致性。

链接: https://arxiv.org/abs/2508.11386
作者: Ryan Sze-Yin Chan,Federico Nanni,Tomas Lazauskas,Rosie Wood,Penelope Yong,Lionel Tarassenko,Mark Girolami,James Geddes,Andrew Duncan
机构: The Alan Turing Institute (艾伦·图灵研究所); University of Oxford (牛津大学); University of Cambridge (剑桥大学); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
zh

[NLP-17] When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)对提示(prompt)中细微但非语义层面的措辞和格式变化高度敏感的问题,即提示鲁棒性(prompt robustness)不足,这会严重影响其在现实应用中的稳定性和可靠性。解决方案的关键在于构建一个统一的实验框架,系统评估五种提升提示鲁棒性的方法——涵盖微调(fine-tuned)与上下文学习(in-context learning)两类范式,并在来自Llama、Qwen和Gemma系列的8个模型上,针对Natural Instructions数据集中的52项任务进行基准测试,同时考察这些方法在多种分布偏移(distribution shifts)下的泛化能力。最终还扩展至GPT-4.1和DeepSeek V3等前沿模型,以揭示当前先进模型对格式扰动的鲁棒性现状。

链接: https://arxiv.org/abs/2508.11383
作者: Mikhail Seleznyov,Mikhail Chaichuk,Gleb Ershov,Alexander Panchenko,Elena Tutubalina,Oleg Somov
机构: AIRI; Skoltech; Yandex; MIPT; HSE University; Sber AI; ISP RAS Research Center for Trusted AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models’ current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: this https URL.
zh

[NLP-18] Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

【速读】: 该论文旨在解决如何自动化生成高质量、信息丰富的形成性反馈(formative feedback)以支持学生学习和教师教学效率的问题。其解决方案的关键在于利用大语言模型(LLM)如Llama 3.1从学生作业中自动提取与反馈标准相关的指标(indicators),并通过实证分析验证这些由模型生成的指标与人工评分之间存在显著强相关性,即使在未预见的指标-标准组合下亦然。这一方法为未来实现可解释、透明的自动化反馈生成提供了可靠的基础。

链接: https://arxiv.org/abs/2508.11364
作者: Sylvio Rüdian,Yassin Elsir,Marvin Kretschmer,Sabine Cayrou,Niels Pinkwart
机构: Humboldt-Universität zu Berlin (柏林洪堡大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心)
类目: Computation and Language (cs.CL)
备注: 11 pages, one table

点击查看摘要

Abstract:Automated feedback generation has the potential to enhance students’ learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students’ submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students’ submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.
zh

[NLP-19] SpecDetect: Simple Fast and Training-Free Detection of LLM -Generated Text via Spectral Analysis

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成的高质量文本日益增多所带来的检测难题,现有无训练方法多依赖表面统计特征,忽视了文本生成过程中深层次的信号特性。其解决方案的关键在于将检测任务重新建模为信号处理问题,通过分析token对数概率序列在频域中的谱特性——利用全局离散傅里叶变换(Discrete Fourier Transform, DFT)和局部短时傅里叶变换(Short-Time Fourier Transform, STFT)发现:人类写作文本具有显著更高的频域能量,这反映了人类写作中更大幅度的波动性,而LLM生成文本则表现出被抑制的动力学特征。基于此关键洞察,作者构建了仅依赖DFT总能量这一单一鲁棒特征的SpecDetect检测器,并进一步提出SpecDetect++,引入采样差异机制以增强鲁棒性,从而在效率和性能上均优于当前最优方法。

链接: https://arxiv.org/abs/2508.11343
作者: Haitong Luo,Weiyao Zhang,Suhang Wang,Wenji Zou,Chungang Lin,Xuying Meng,Yujun Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal’s spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
zh

[NLP-20] Generalize across Homophily and Heterophily: Hybrid Spectral Graph Pre-Training and Prompt Tuning

【速读】: 该论文旨在解决图预训练与提示调优(Graph Pre-training and Prompt-tuning)在实际应用中因谱特性不匹配而导致的知识迁移效率低下问题,尤其是在不同同质性(homophily)和异质性(heterophily)的图结构中。现有方法依赖于基于同质性的低频知识,无法适应真实世界图数据中多样的谱分布,导致在有限监督下难以有效迁移。解决方案的关键在于提出HS-GPPT模型,其核心是通过理论揭示的“谱特异性原则”——即最优知识迁移要求预训练谱滤波器与下游图的内在谱结构对齐。为此,该模型采用混合谱滤波骨干网络和局部-全局对比学习以获取丰富的谱知识,并设计提示图(prompt graphs)来调整预训练阶段的谱分布,从而实现跨同质性与异质性场景下的谱对齐与高效知识迁移。

链接: https://arxiv.org/abs/2508.11328
作者: Haitong Luo,Suhang Wang,Weiyao Zhang,Ruiqi Meng,Xuying Meng,Yujun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Graph ``pre-training and prompt-tuning’’ aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, existing methods rely on homophily-based low-frequency knowledge, failing to handle diverse spectral distributions in real-world graphs with varying homophily. Our theoretical analysis reveals a spectral specificity principle: optimal knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. Under limited supervision, large spectral gaps between pre-training and downstream tasks impede effective adaptation. To bridge this gap, we propose the HS-GPPT model, a novel framework that ensures spectral alignment throughout both pre-training and prompt-tuning. We utilize a hybrid spectral filter backbone and local-global contrastive learning to acquire abundant spectral knowledge. Then we design prompt graphs to align the spectral distribution with pretexts, facilitating spectral knowledge transfer across homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings. Our code is available at this https URL.
zh

[NLP-21] LLM Compression: How Far Can We Go in Balancing Size and Performance?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中因高内存占用和计算成本导致的可访问性受限问题。为实现模型压缩与性能之间的平衡,研究提出采用4-bit分组缩放量化(Group Scaling Quantization, GSQ)和生成式预训练变换器量化(Generative Pretrained Transformer Quantization, GPTQ)技术对LLaMA 1B、Qwen 0.5B和PHI 1.5B等小型模型进行低比特量化。关键解决方案在于通过量化降低模型参数精度,在保持任务性能的前提下显著减少推理延迟和提升吞吐量(每秒生成的总输出token数),从而为资源受限环境下的部署提供可行方案,并为未来相关实验提供基准参考。

链接: https://arxiv.org/abs/2508.11318
作者: Sahil Sk,Debasish Dhal,Sonal Khosla,Sk Shahid,Sambit Shekhar,Akash Dhaka,Shantipriya Parida,Dilip K. Prasad,Ondřej Bojar
机构: Odia Generative AI, India; AMD Silo AI, Finland; The Arctic University of Norway, Norway; Charles University, MFF, ÚFAL, Czech Republic
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at the RANLP 2025 conference

点击查看摘要

Abstract:Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
zh

[NLP-22] SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

【速读】: 该论文旨在解决自动文献综述生成(Automatic Survey Generation, ASG)领域中评估方法的不足问题,包括现有指标存在偏倚、缺乏人类偏好数据以及过度依赖大语言模型作为评判者等局限。其解决方案的关键在于提出SGSimEval——一个基于相似性增强评估的综合性基准,通过融合大纲、内容与参考文献三个维度的评估,并结合LLM评分与量化指标,构建多角度评价框架;同时引入人类偏好度量以强化对生成质量与人类相似性的关注,从而实现更客观、可靠且贴近实际需求的ASG系统评估。

链接: https://arxiv.org/abs/2508.11310
作者: Beichen Guo,Zhiyuan Wen,Yu Yang,Peng Gao,Ruosong Yang,Jiaxing Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted to The 21st International Conference on Advanced Data Mining and Applications (ADMA2025)

点击查看摘要

Abstract:The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.
zh

[NLP-23] SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中日益显著的“过度拒绝”(over-refusal)问题,即安全机制导致模型错误地拒绝本应合法且无害的指令,尤其在使用常见提示模板或特定任务(如情感分析、语言翻译)时严重影响实用性。解决方案的关键在于提出SafeConstellations方法,该方法通过在推理阶段追踪任务特异的嵌入空间轨迹模式(constellation patterns),识别并引导表示走向非拒绝路径,从而仅对易发生过度拒绝的任务进行行为修正,同时保持模型整体性能不变,实现最高达73%的过度拒绝率降低。

链接: https://arxiv.org/abs/2508.11290
作者: Utsav Maskey,Sumit Yadav,Mark Dras,Usman Naseem
机构: Macquarie University (麦考瑞大学); IOE, Pulchowk Campus (IOE,普尔科克校区)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomena diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse responses to harmful instructions when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveal that LLMs follow distinct “constellation” patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior, our method reduces over-refusal rates by up to 73% with minimal impact on utility-offering a principled approach to mitigating over-refusals.
zh

[NLP-24] AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models Responses to Depression Anxiety and Stress Queries

【速读】: 该论文试图解决的问题是:不同大型语言模型(Large Language Models, LLMs)在回应关于抑郁、焦虑和压力等心理健康的实际问题时,其情感表达模式是否存在显著差异,以及这些差异是否受用户人口统计特征(如性别、年龄、教育背景)的影响。解决方案的关键在于系统性地比较八种主流LLM对二十个情境化提问的响应,并采用先进的情感分析工具对2880条回答进行量化评估,从而揭示模型类型与心理状况类别对情绪输出的主导作用,而用户人口统计特征的影响则相对微弱。这一方法使研究能够识别各模型的情感签名(emotional signature),为心理健康应用中LLM的选择提供实证依据。

链接: https://arxiv.org/abs/2508.11285
作者: Arya VarastehNezhad,Reza Tavasoli,Soroush Elyasi,MohammadHossein LotfiNia,Hamed Farbeh
机构: University of Tehran(德黑兰大学); University of South Carolina(南卡罗来纳大学); University of West London(西伦敦大学); Azad University(伊斯兰阿扎德大学); Amirkabir University of Technology(阿米尔·卡比尔理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.
zh

[NLP-25] oxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

【速读】: 该论文旨在解决法语(French)中毒性内容检测(toxicity detection)任务的性能瓶颈问题,其核心挑战在于缺乏文化相关且规模较大的标注数据集。为应对这一问题,作者提出了TOXIFRENCH——一个包含53,622条法语在线评论的公开基准数据集,通过半自动化标注流程将人工标注比例降至10%,显著提升了数据构建效率。解决方案的关键在于:首先发现小语言模型(Small Language Models, SLMs)在鲁棒性和泛化能力上优于大型语言模型(Large Language Models, LLMs),进而提出一种基于动态加权损失函数的链式思维(Chain-of-Thought, CoT)微调策略,强化模型最终决策的忠实性(faithfulness)。该方法使4B参数规模的模型在F1分数上较基线提升13%,超越GPT-40和Gemini-2.5等主流LLMs,并展现出良好的跨语言迁移能力,验证了该范式在多语言安全分类任务中的普适性。

链接: https://arxiv.org/abs/2508.11281
作者: Axel Delaval,Shujian Yang,Haicheng Wang,Han Qiu,Jialiang Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages, 5 figures, 8 tables. This paper introduces TOXIFRENCH, a new large-scale benchmark for French toxicity detection, and proposes a Chain-of-Thought (CoT) fine-tuning method with a dynamic weighted loss. The resulting fine-tuned 4B parameter model, ToxiFrench, achieves state-of-the-art performance, outperforming larger models like GPT-4o

点击查看摘要

Abstract:Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model’s final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-40 and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.
zh

[NLP-26] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

【速读】: 该论文旨在解决在旅游领域(tourism)中对大语言模型(Large Language Models, LLMs)进行评估时面临的两大挑战:一是标注基准数据集的高昂成本,二是模型常见的幻觉(hallucination)问题。其解决方案的关键在于提出一种无需标签数据的评估框架——基于专家思维树(Expert Tree-of-Thought, ToT)的LETToT方法,该方法通过专家构建的推理结构替代人工标注,从而系统性地量化LLM在特定领域的输出质量。具体而言,研究首先通过与通用质量维度及专家反馈对齐迭代优化ToT组件,实现高质量的专家引导式评估;其次将优化后的专家ToT应用于不同规模模型(32B–671B参数),揭示了专业领域中仍存在缩放定律(scaling laws),但增强推理能力的小模型可显著缩小性能差距,并且在子72B模型中,显式推理架构在准确性和简洁性上优于传统架构(p < 0.05)。此工作建立了可扩展、无标签的领域专用LLM评估范式,为替代传统依赖标注数据的基准提供了可靠路径。

链接: https://arxiv.org/abs/2508.11280
作者: Ruiyan Qi,Congding Wen,Weibo Zhou,Shangsong Liang,Lingbo Li
机构: University of Warwick (华威大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating large language models (LLMs) in specific domain like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose \textbfL able-Free \textbfE valuation of LLM on \textbfT ourism using Expert \textbfT ree- \textbfo f- \textbfT hought (LETToT), a framework that leverages expert-derived reasoning structures-instead of labeled data-to access LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT’s optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ( p0.05 ). Our work established a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
zh

[NLP-27] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLM s?

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言(low-resource languages)中进行语言学推理任务时表现不佳的问题,尤其聚焦于国际语言学奥林匹克竞赛(Linguistics Olympiad, LO)中的谜题。这些问题提供了一个污染极少的环境,用于评估LLMs在跨语言场景下的推理能力。解决方案的关键在于通过标注每道题目所涉及的语言学特征(linguistically informed features),系统性分析LLMs的弱点,并发现其在处理高形态复杂度(morphological complexity)的语言任务时表现较差,而在涉及英语中也存在的语言特征时表现较好;此外,将词切分为词素(morpheme)作为预处理步骤能显著提升解题成功率,表明当前通用tokeniser存在不足,亟需更符合语言特性的分词机制。这一发现为改进LLMs在低资源语言中的语言学推理能力提供了关键方向。

链接: https://arxiv.org/abs/2508.11260
作者: Mukund Choudhary,KV Aditya Srivatsa,Gaurja Aeron,Antara Raaghavi Bhattacharya,Dang Khoa Dang Dinh,Ikhlasul Akmal Hanif,Daria Kotova,Ekaterina Kochmar,Monojit Choudhury
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); IIT Gandhinagar (印度理工学院甘地纳格尔分校); Harvard University (哈佛大学); VinUniversity (越南大学); Universitas Indonesia (印度尼西亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted to COLM 2025

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs’ linguistic reasoning abilities across low-resource languages. This work analyses LLMs’ performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.
zh

[NLP-28] Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLM s via Post-Processing

【速读】: 该论文旨在解决在闭源权重大语言模型(closed-weight LLMs)的上下文学习(in-context learning)设置下,如何实现群体公平性(group fairness)的问题。传统公平算法通常依赖于对模型进行微调或头层嵌入(head-tuning)来约束预测结果,但在无法访问模型参数的闭源场景中不再适用。其解决方案的关键在于:将LLM视为特征提取器,通过设计针对特定公平准则的提示(prompt),从模型的概率输出(如token log probabilities)中提取足够统计量(sufficient statistics),进而利用这些特征训练一个轻量级的后处理公平分类器(post-hoc fair classifier)。该方法在多个数据集上验证了其在准确率与公平性之间具有优越的权衡能力,尤其在数据效率方面显著优于基于LLM嵌入或原始表格特征训练的公平分类器。

链接: https://arxiv.org/abs/2508.11258
作者: Ruicheng Xian,Yuxuan Wan,Han Zhao
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness – preventing disparate impacts across demographic groups – is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features.
zh

[NLP-29] Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

【速读】: 该论文旨在解决当前大型推理模型(Large Reasoning Models, LRM)在数学问题求解评估中仅关注已定义明确的问题,而忽视了真实智能体应具备的主动获取缺失信息的能力这一关键缺陷。其解决方案的核心在于构建一个包含两类不完整问题的新数据集,这些问题是基于多样化的上下文设计的,从而能够系统性地评估LRM是否能在信息不足时主动提问。通过该数据集的实证分析,研究揭示了现有LRM在主动性行为上的显著不足,并进一步识别出与过度思考(overthinking)和幻觉(hallucination)相关的行为模式,为未来通过监督微调(supervised fine-tuning)提升模型的主动交互能力提供了潜在方向与挑战洞察。

链接: https://arxiv.org/abs/2508.11252
作者: Youcheng Huang,Bowen Qin,Chen Huang,Duanyu Feng,Xi Yang,Wenqiang Lei
机构: Sichuan University (四川大学); Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education (教育部机器学习与产业智能工程研究中心); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Institute of Data Science, National University of Singapore (新加坡国立大学数据科学研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable problem-solving abilities in mathematics, as evaluated by existing benchmarks exclusively on well-defined problems. However, such evaluation setup constitutes a critical gap, since a genuine intelligent agent should not only solve problems (as a math quiz solver), but also be able~to ask for information when the problems lack sufficient information, enabling proactivity in responding users’ requests. To bridge such gap, we proposes a new dataset consisting of two types of incomplete problems with diverse contexts. Based on the dataset, our systematical evaluation of LRMs reveals their inability in proactively asking for information. In addition, we uncover the behaviors related to overthinking and hallucination of LRMs, and highlight the potential and challenges of supervised fine-tuning in learning such ability. We hope to provide new insights in developing LRMs with genuine intelligence, rather than just solving problems.
zh

[NLP-30] Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering

【速读】: 该论文旨在解决多跳问答(Multi-hop Question Answering, MHQA)任务中传统检索增强生成(Retrieval-Augmented Generation, RAG)方法因仅依赖粗粒度文本语义相似性而忽视分散知识间结构关联的问题,以及现有图RAG(GraphRAG)方法过度依赖结构信息、忽略细粒度语义特征导致的性能瓶颈。其解决方案的关键在于提出一种基于超图(Hypergraph)的新型RAG框架——HGRAG,通过构建以细粒度实体为节点、粗粒度段落为超边的实体超图来显式建模知识间的结构关联,并设计超图扩散机制融合实体级与段落级语义相似性,实现跨粒度的信息整合;同时引入检索增强模块对检索结果进行语义与结构双重优化,显著提升问答准确率与检索效率。

链接: https://arxiv.org/abs/2508.11247
作者: Changjian Wang,Weihong Deng,Weili Guan,Quan Lu,Ning Jiang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6 \times speedup in retrieval efficiency.
zh

[NLP-31] Benchmarking Prosody Encoding in Discrete Speech Tokens

【速读】: 该论文旨在解决当前基于自监督学习(Self-Supervised Learning, SSL)模型生成的离散标记(discrete tokens)在语音语言模型中对韵律特征(prosodic features)建模能力不足的问题。现有方法通常在语言模型训练前独立预训练离散标记,导致其设计依赖于启发式选择(如SSL模型类型或聚类数量),且缺乏对韵律信息编码能力的系统评估。解决方案的关键在于通过人工修改韵律特征并分析离散标记对此类扰动的敏感性,从而提供一套可操作的指南,以优化离散标记的设计,使其更好地捕捉和保留语音中的韵律信息,进而提升语音语言模型在语义与韵律双重维度上的理解与生成能力。

链接: https://arxiv.org/abs/2508.11224
作者: Kentaro Onda,Satoru Fukayama,Daisuke Saito,Nobuaki Minematsu
机构: The University of Tokyo (东京大学); National Institute of Advanced Industrial Science and Technology (AIST) (日本产业技术综合研究所)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by ASRU2025

点击查看摘要

Abstract:Recently, discrete tokens derived from self-supervised learning (SSL) models via k-means clustering have been actively studied as pseudo-text in speech language models and as efficient intermediate representations for various tasks. However, these discrete tokens are typically learned in advance, separately from the training of language models or downstream tasks. As a result, choices related to discretization, such as the SSL model used or the number of clusters, must be made heuristically. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features. Yet, there has been limited research on the ability of discrete tokens to capture prosodic information. To address this gap, this study conducts a comprehensive analysis focusing on prosodic encoding based on their sensitivity to the artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens.
zh

[NLP-32] ORFuzz: Fuzzing the “Other Side” of LLM Safety – Testing Over-Refusal

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的“过度拒绝”(over-refusal)问题,即模型因过于保守的安全机制而错误地拒绝良性查询,从而严重影响其可靠性与可用性。现有测试方法存在基准 flawed 和测试生成能力有限等缺陷,难以系统性识别此类行为。论文提出首个进化测试框架 ORFuzz,其核心创新在于三方面:(1) 基于安全类别感知的种子选择策略以实现全面测试覆盖;(2) 利用推理型语言模型自适应优化变异器,生成高有效性的测试用例;(3) 引入 OR-Judge 模型作为人类对毒性与拒绝感知对齐的评判标准,确保测试结果的真实性。该方案显著提升了过拒绝实例的发现效率(平均达6.98%,超过主流基线两倍以上),并构建了可迁移性强的新基准 ORFuzzSet(含1855个测试用例),在10种不同LLM上平均过拒绝率达63.56%,为开发更可靠、可信的LLM系统提供了自动化测试工具和高质量数据资源。

链接: https://arxiv.org/abs/2508.11222
作者: Haonan Zhang,Dongxia Wang,Yi Liu,Kexin Chen,Jiashui Wang,Xinlei Ying,Long Liu,Wenhai Wang
机构: Zhejiang University (浙江大学); Huzhou Institute of Industrial Control Technology (湖州工业控制技术研究院); Quantstamp; Ant Group (蚂蚁集团)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz’s outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.
zh

[NLP-33] How Causal Abstraction Underpins Computational Explanation

【速读】: 该论文试图解决的问题是:如何确定一个系统在其内部表征载体(representational vehicles)上实现了特定计算(computation)的条件。这一问题在认知科学与计算哲学中具有核心意义,尤其涉及对计算实现(computational implementation)的本质理解。论文提出的关键解决方案是:基于因果抽象(causal abstraction)理论来构建计算实现的框架。该方法强调通过因果关系识别和抽象出高阶结构,从而解释系统如何在复杂动态过程中稳定地执行计算,同时阐明表征(representation)在其中的作用——即表征并非静态符号,而是由因果结构所定义的功能性实体。此视角不仅连接了传统计算哲学中的议题,也适用于现代深度学习模型的泛化与预测能力分析。

链接: https://arxiv.org/abs/2508.11214
作者: Atticus Geiger,Jacqueline Harding,Thomas Icard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Explanations of cognitive behavior often appeal to computations over representations. What does it take for a system to implement a given computation over suitable representational vehicles within that system? We argue that the language of causality – and specifically the theory of causal abstraction – provides a fruitful lens on this topic. Drawing on current discussions in deep learning with artificial neural networks, we illustrate how classical themes in the philosophy of computation and cognition resurface in contemporary machine learning. We offer an account of computational implementation grounded in causal abstraction, and examine the role for representation in the resulting picture. We argue that these issues are most profitably explored in connection with generalization and prediction.
zh

[NLP-34] E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection

【速读】: 该论文旨在解决社交媒体中多模态虚假信息(multimodal misinformation)检测的挑战,具体包括模态间不一致性、时间模式变化以及类别严重不平衡等问题。现有方法通常独立处理每条帖子,忽略了跨时间和模态的事件级结构。其解决方案的关键在于提出E-CaTCH框架,通过基于文本相似性和时间邻近性的聚类将帖子组织为伪事件(pseudo-events),并在每个事件内利用预训练的BERT和ResNet提取文本与视觉特征,结合 intra-modal 自注意力机制进行特征精炼,并通过双向跨模态注意力实现模态对齐;进一步采用软门控机制融合表示以生成上下文感知的内容嵌入;同时引入趋势感知的LSTM(增强语义漂移和动量信号)建模叙事随时间演进,最终在事件层面进行分类,从而更好地匹配现实世界中虚假信息的传播动态。此外,通过自适应类别权重、时间一致性正则化和困难样本挖掘缓解类别不平衡并提升训练稳定性。

链接: https://arxiv.org/abs/2508.11197
作者: Ahmad Mousavi,Yeganeh Abdollahinejad,Roberto Corizzo,Nathalie Japkowicz,Zois Boukouvalas
机构: American University (美国大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios.
zh

[NLP-35] Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation INTERSPEECH2025

【速读】: 该论文旨在解决多语言语音到文本翻译模型在本地部署时因参数量过大而导致的推理效率与性能难以平衡的问题。其解决方案的关键在于提出了一种创新的寄生式双尺度方法(Parasitic Dual-Scale Approach),该方法融合了增强的推测采样(speculative sampling)技术、模型压缩与知识蒸馏(knowledge distillation)策略,并在此基础上构建了新的KVSPN模块,实现了在不损失BLEU得分的前提下提升40%的推理速度,同时通过蒸馏进一步实现2.6倍的加速并保持更优性能。

链接: https://arxiv.org/abs/2508.11189
作者: Chenyang Le,Yinfeng Xia,Huiyan Li,Manhong Wang,Yutao Sun,Xingyang Ma,Yanmin Qian
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Interspeech 2025

点击查看摘要

Abstract:Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we enhance it for multilingual speech translation into whisperM2M, and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation. Combined with distillation methods, it represents a 2.6 \times speedup over the original Whisper Medium with superior performance.
zh

[NLP-36] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

【速读】: 该论文旨在解决传统多选题(Multiple-Choice Questions, MCQs)中生成的干扰项(distractors)难以捕捉个体学生特定认知错误的问题。现有基于大语言模型(Large Language Models, LLMs)的方法通常生成的是群体层面的干扰项,虽能反映共性错误模式,但无法针对每个学生的独特推理偏差进行个性化诊断。为应对这一挑战,作者提出“个性化干扰项生成”任务,并设计了一种无需训练的两阶段框架:第一阶段利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)从学生历史错误作答中重构其推理轨迹,构建个性化的概念误解原型(student-specific misconception prototype);第二阶段以此原型引导对新问题的推理模拟,从而生成与学生重复性认知错误高度一致的个性化干扰项。该方案的核心创新在于通过无监督推理重建机制实现个体认知偏差的精准建模,克服了单个学生数据稀疏导致的传统训练方法失效的问题。

链接: https://arxiv.org/abs/2508.11184
作者: Tao Wu,Jingyuan Chen,Wang Lin,Jian Zhan,Mengze Li,Kun Kuang,Fei Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student’s past question-answering (QA) records, ensuring every student receives options that effectively exposes their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student’s underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student’s reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student’s reasoning on new questions, enabling the generation of personalized distractors that align with the student’s recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.
zh

[NLP-37] Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for OffensiveLanguage Identification

【速读】: 该论文旨在解决低资源德拉威语(Tulu)在代码混杂社交媒体文本中仇恨语言识别(Offensive Language Identification, OLI)任务缺乏高质量标注数据与有效模型的问题。其关键解决方案是构建首个针对代码混杂Tulu语社交评论的基准数据集,包含3,845条高一致性标注(Krippendorff’s alpha = 0.984)的评论,分为四类:非攻击性、非Tulu内容、无目标攻击性和有目标攻击性;并通过系统评估多种深度学习模型发现,结合自注意力机制的双向门控循环单元(BiGRU)模型在该任务上表现最佳(准确率82%,宏F1分数0.81),而主流多语言预训练模型(如mBERT和XLM-RoBERTa)性能较差,凸显了在低资源、代码混杂场景下迁移学习的局限性。

链接: https://arxiv.org/abs/2508.11166
作者: Anusha M D,Deepthi Vikram,Bharathi Raja Chakravarthi,Parameshwar R Hegde
机构: University of Galway (加利福尼亚大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 3 tables, 3 figures. Submitted to Language Resources and Evaluation (Springer)

点击查看摘要

Abstract:Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff’s alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
zh

[NLP-38] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在人类移动数据语义理解能力评估方面的不足,即现有模型虽能较好预测人类移动模式,但其对这些模式背后原因或语义含义的解释能力尚不明确。为此,作者提出了MobQA这一基准数据集,其关键在于构建了一个涵盖三种互补类型问题的综合评估框架:事实检索(factual retrieval,精确提取数据)、多选推理(multiple-choice reasoning,语义推断)和自由形式解释(free-form explanation, interpretive description),所有问题均需结合空间、时间和语义推理能力进行解答。该设计使得研究者能够系统性地评估LLMs在不同层次语义理解上的表现,揭示了当前主流模型在事实层面表现良好但在语义推理与解释任务中存在显著局限,且轨迹长度对模型效果有显著影响。

链接: https://arxiv.org/abs/2508.11163
作者: Hikaru Asano,Hiroki Ouchi,Akira Kasuga,Ryo Yonetani
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所先进智能研究中心); Nara Institute of Science and Technology (奈良先端科学技术大学院大学); CyberAgentJapan (CyberAgent日本); CyberAgentJapan (CyberAgent日本)
类目: Computation and Language (cs.CL)
备注: 23 pages, 12 figures

点击查看摘要

Abstract:This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unobvious how much they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding.\footnoteMobQA dataset is available at this https URL. Comments: 23 pages, 12 figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2508.11163 [cs.CL] (or arXiv:2508.11163v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2508.11163 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-39] A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations

【速读】: 该论文旨在解决现有谣言检测方法中忽视图像内容及其与文本在不同视觉尺度下内在关联的问题,从而导致关键谣言识别信息丢失。其解决方案的关键在于提出一种基于对比学习的跨模态谣言检测框架——多尺度图像与上下文相关性探索算法(Multi-scale Image and Context Correlation exploration algorithm, MICC),通过设计SCLIP编码器实现文本与多尺度图像块的统一语义嵌入,并利用交叉模态相关矩阵进行Top-K选择策略以定位最相关的图像区域;进一步引入尺度感知融合网络,根据语义重要性和跨模态相关性自适应加权融合多尺度图像特征与全局文本特征,显著提升了谣言检测性能。

链接: https://arxiv.org/abs/2508.11141
作者: Bin Ma,Yifei Zhang,Yongjin Xian,Qi Li,Linna Zhou,Gongxun Miao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing rumor detection methods often neglect the content within images as well as the inherent relationships between contexts and images across different visual scales, thereby resulting in the loss of critical information pertinent to rumor identification. To address these issues, this paper presents a novel cross-modal rumor detection scheme based on contrastive learning, namely the Multi-scale Image and Context Correlation exploration algorithm (MICC). Specifically, we design an SCLIP encoder to generate unified semantic embeddings for text and multi-scale image patches through contrastive pretraining, enabling their relevance to be measured via dot-product similarity. Building upon this, a Cross-Modal Multi-Scale Alignment module is introduced to identify image regions most relevant to the textual semantics, guided by mutual information maximization and the information bottleneck principle, through a Top-K selection strategy based on a cross-modal relevance matrix constructed between the text and multi-scale image patches. Moreover, a scale-aware fusion network is designed to integrate the highly correlated multi-scale image features with global text features by assigning adaptive weights to image regions based on their semantic importance and cross-modal relevance. The proposed methodology has been extensively evaluated on two real-world datasets. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection, highlighting its effectiveness and potential for practical applications.
zh

[NLP-40] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents ACL

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估基准中缺乏自然、复杂且耗时的信息查询类问题的问题。现有问答(QA)基准通常不包含需要大量中间推理步骤才能解答的真实世界问题,导致对模型实际信息检索与推理能力的评估不足。为应对这一挑战,作者提出了MoNaCo基准,其核心创新在于设计了一个分解式标注流程(decomposed annotation pipeline),用于大规模收集并人工解答自然生成的、需数十甚至上百步推理的复杂问题(共1,315个)。实验表明,前沿LLMs在MoNaCo上的表现受限于低召回率和幻觉现象,F1得分最高仅达61.2%,凸显了当前模型在处理真实世界复杂信息需求时的局限性,同时也验证了MoNaCo作为衡量推理能力进步的有效工具的价值。

链接: https://arxiv.org/abs/2508.11133
作者: Tomer Wolfson,Harsh Trivedi,Mor Geva,Yoav Goldberg,Dan Roth,Tushar Khot,Ashish Sabharwal,Reut Tsarfaty
机构: University of Pennsylvania (宾夕法尼亚大学); Allen Institute for AI (艾伦人工智能研究所); Tel Aviv University (特拉维夫大学); Bar-Ilan University (巴伊兰大学); Oracle AI (甲骨文人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2025. Authors pre-print

点击查看摘要

Abstract:Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking as well as genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve – far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions – with MoNaCo providing an effective resource for tracking such progress. The MONACO benchmark, codebase, prompts and models predictions are publicly available at: this https URL
zh

[NLP-41] VeriRel: Verification Feedback to Enhance Document Retrieval for Scientific Fact Checking CIKM’25

【速读】: 该论文旨在解决科学事实核查中证据检索的准确性问题,即现有方法依赖通用信息检索算法,仅根据文档相关性排序,而未考虑其对核查主张的支持或反驳能力。解决方案的关键在于提出+VeriRel框架,将验证成功率(verification success)纳入文档排序机制,从而在检索阶段就优先选择能有效支持或反驳主张的文档。实验表明,该方法在SciFact、SciFact-Open和Check-Covid三个数据集上均实现了领先的文档证据检索性能,并显著提升了下游事实核查效果,证明了引入验证反馈以优化文档相关性评估的有效性。

链接: https://arxiv.org/abs/2508.11122
作者: Xingyu Deng,Xi Wang,Mark Stevenson
机构: University of Sheffield(谢菲尔德大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accpeted for the 34th ACM International Conference on Information and Knowledge Management (CIKM’25)

点击查看摘要

Abstract:Identification of appropriate supporting evidence is critical to the success of scientific fact checking. However, existing approaches rely on off-the-shelf Information Retrieval algorithms that rank documents based on relevance rather than the evidence they provide to support or refute the claim being checked. This paper proposes +VeriRel which includes verification success in the document ranking. Experimental results on three scientific fact checking datasets (SciFact, SciFact-Open and Check-Covid) demonstrate consistently leading performance by +VeriRel for document evidence retrieval and a positive impact on downstream verification. This study highlights the potential of integrating verification feedback to document relevance assessment for effective scientific fact checking systems. It shows promising future work to evaluate fine-grained relevance when examining complex documents for advanced scientific fact checking.
zh

[NLP-42] owards Reliable Multi-Agent Systems for Marketing Applications via Reflection Memory and Planning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在真实应用场景中可靠性不足的问题,特别是在营销领域的受众筛选(audience curation)任务中。其解决方案的关键在于提出一个名为RAMP的多智能体框架,该框架通过迭代式规划、工具调用、输出验证与改进建议生成来提升结果质量,并引入长期记忆存储机制(long-term memory store),用于保存客户特定事实和历史查询信息。实验表明,该方法在88个评估查询上将准确率提升28个百分点,在更具模糊性的挑战集上,随着验证与反思迭代次数增加,召回率提升约20个百分点,同时显著提高用户满意度,从而为部署可靠LLM系统提供了实用指导。

链接: https://arxiv.org/abs/2508.11120
作者: Lorenzo Jaime Yu Flores,Junyi Shen,Xiaoyuan Gu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.
zh

[NLP-43] PaperRegister: Boosting Flexible-grained Paper Search via Hierarchical Register Indexing

【速读】: 该论文旨在解决传统文献检索系统在面对细粒度查询需求时的不足问题,例如研究人员希望基于模块配置等具体细节而非粗粒度主题进行论文搜索,而现有系统主要依赖论文摘要构建索引,缺乏支持细粒度检索所需的信息。其解决方案的关键在于提出PaperRegister框架,通过离线的分层索引(hierarchical indexing)与在线自适应检索(adaptive retrieval)相结合的方式,将传统的基于摘要的索引转换为分层索引树结构,从而实现对不同粒度查询的有效支持。实验表明,PaperRegister在多种粒度的文献检索任务中达到当前最优性能,尤其在细粒度场景下表现突出,展现出在实际应用中的强大潜力。

链接: https://arxiv.org/abs/2508.11116
作者: Zhuoqun Li,Xuanang Chen,Hongyu Lin,Yaojie Lu,Xianpei Han,Le Sun
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Paper search is an important activity for researchers, typically involving using a query with description of a topic to find relevant papers. As research deepens, paper search requirements may become more flexible, sometimes involving specific details such as module configuration rather than being limited to coarse-grained topics. However, previous paper search systems are unable to meet these flexible-grained requirements, as these systems mainly collect paper abstracts to construct index of corpus, which lack detailed information to support retrieval by finer-grained queries. In this work, we propose PaperRegister, consisted of offline hierarchical indexing and online adaptive retrieval, transforming traditional abstract-based index into hierarchical index tree for paper search, thereby supporting queries at flexible granularity. Experiments on paper search tasks across a range of granularity demonstrate that PaperRegister achieves the state-of-the-art performance, and particularly excels in fine-grained scenarios, highlighting the good potential as an effective solution for flexible-grained paper search in real-world applications. Code for this work is in this https URL.
zh

[NLP-44] Diffusion is a code repair operator and generator

【速读】: 该论文旨在解决代码生成过程中“最后一公里修复”(last-mile repair)问题,即如何高效地修复不完整或存在错误的代码片段。其核心解决方案在于利用预训练的代码扩散模型(code diffusion model)的特性:在扩散过程的后期,当代码表示接近收敛时,不同代码片段之间的差异可视为对破损或不完整代码的微调修复。关键创新点在于两点:一是通过向破损代码添加噪声并重新启动扩散过程,实现针对性修复;二是利用扩散模型采样中间状态和最终状态的程序对,高效生成用于训练最后一公里修复任务的数据,从而降低计算成本并提升修复效果。实验在Python、Excel和PowerShell三个领域验证了该方法的有效性。

链接: https://arxiv.org/abs/2508.11110
作者: Mukul Singh,Gust Verbruggen,Vu Le,Sumit Gulwani
机构: Microsoft(微软); Microsoft(微软); Microsoft(微软); Microsoft(微软)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Code diffusion models generate code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, differences between discrete representations of these snippets look like last-mile repairs applied to broken or incomplete code. We evaluate the extent to which this resemblance can be exploited to leverage pre-trained code diffusion models for the problem of last-mile repair by considering two applications with significant potential. First, we can leverage the diffusion model for last-mile repair by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate arbitrary amount of training data for last-mile repair tasks (that are computationally more efficient) by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments on 3 domains (Python, Excel and PowerShell) to evaluate applications, as well as analyze properties.
zh

[NLP-45] Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs

【速读】: 该论文旨在解决语义表示中符号接地(symbol grounding)问题,即如何将自然语言中的符号与其所指称的现实世界意义有效关联。解决方案的关键在于将真实数字词典嵌入抽象语义表示(Abstract Meaning Representation, AMR)的有向无环图(digraph)结构中,并利用先进的预训练大语言模型对这些图进行共形化约简(confluent reduction),从而在保持其电路空间(circuit space)不变的前提下压缩图结构,进而分析约简后图的性质以揭示符号与语义之间的深层关联。

链接: https://arxiv.org/abs/2508.11068
作者: Nicolas Goulet,Alexandre Blondin Massé,Moussa Abdendi
机构: Université du Québec à Montréal (魁北克大学蒙特利尔分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e. with transformations that preserve their circuit space. Finally, the properties of these reduces digraphs are analyzed and discussed in relation to the symbol grounding problem.
zh

[NLP-46] BIPOLAR: Polarization-based granular framework for LLM bias evaluation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理敏感话题时存在的极化相关偏见(polarisation-related biases)问题,尤其是在政治话语、性别认同、种族关系或国家刻板印象等场景下,模型可能表现出系统性的情感倾向或不均衡响应。其解决方案的关键在于提出一个可复用、细粒度且主题无关的评估框架,该框架结合极化敏感的情感度量指标与合成生成的平衡冲突语句数据集,并基于预定义的语义类别进行分析。通过案例研究(俄罗斯-乌克兰战争),该方法不仅揭示了不同模型在整体上对乌克兰更积极的情感倾向,还发现了语义类别层面的显著差异和模型行为模式的多样性,同时验证了提示词修改会加剧预设语言和国籍相关的偏见。此框架支持自动化数据生成与细粒度偏见评估,适用于多种由极化驱动的情境,且与现有偏见评估策略正交。

链接: https://arxiv.org/abs/2508.11061
作者: Martin Pavlíček,Tomáš Filip,Petr Sosík
机构: Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in LLM (both open-source and closed-source). Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend for more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarisation-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2508.11061 [cs.CL] (or arXiv:2508.11061v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2508.11061 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-47] Hell or High Water: Evaluating Agent ic Recovery from External Failures

【速读】: 该论文旨在解决语言模型代理(language model agents)在面对外部环境失败(如函数调用突然不可用)时,能否有效制定并执行备选计划以达成目标的问题。其核心挑战在于评估代理在复杂搜索空间中适应环境反馈、灵活调整策略的能力,而非单纯依赖正确函数的选择。解决方案的关键是设计了一个专门的代理规划基准(agentic planning benchmark),该基准通过引入可控的外部故障(如函数失效),同时确保任务仍可解,从而系统性地测试代理的鲁棒性和适应性。实验表明,尽管先进模型能识别正确的函数使用场景,但在应对环境反馈和探索替代路径方面表现不足,揭示了当前生成式 AI 在动态环境中自主规划与纠错能力的局限性。

链接: https://arxiv.org/abs/2508.11027
作者: Andrew Wang,Sophia Hager,Adi Asija,Daniel Khashabi,Nicholas Andrews
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: Accepted to COLM 2025

点击查看摘要

Abstract:As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent’s performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.
zh

[NLP-48] Can Multi-modal (reasoning ) LLM s detect document manipulation?

【速读】: 该论文旨在解决文档欺诈(document fraud)检测难题,即如何有效识别伪造或篡改的交易文档,以保障依赖安全可验证文件的行业(如金融、法律等)的安全性。解决方案的关键在于利用多模态大语言模型(multi-modal large language models, MLLMs)对文档内容进行端到端分析,通过优化提示(prompt optimization)和深入解析模型推理过程,识别诸如文本篡改、格式错位及交易金额不一致等细微欺诈特征。研究发现,顶尖的多模态LLM在零样本泛化能力上显著优于传统方法,尤其在分布外数据集上表现突出,而模型规模与推理能力并非决定性能的核心因素,凸显了任务特定微调(task-specific fine-tuning)的重要性。

链接: https://arxiv.org/abs/2508.11021
作者: Zisheng Liang,Kidus Zewde,Rudra Pratap Singh,Disha Patil,Zexi Chen,Jiayu Xue,Yao Yao,Yifei Chen,Qinzhe Liu,Simiao Ren
机构: Duke University (杜克大学); Scam.ai; Indian Institute of Technology, Roorkee (印度理工学院,鲁尔基分校); New York University (纽约大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); University of Wisconsin Madison (威斯康星大学麦迪逊分校); Columnbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2503.20084

点击查看摘要

Abstract:Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs)-including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet-in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models’ reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies.
zh

[NLP-49] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨语言知识迁移中出现的幻觉问题,即当模型以一种语言提问而相关事实存在于另一种语言的训练数据中时,模型容易生成错误信息。其解决方案的关键在于通过在合成多语言数据集上从头训练小型Transformer模型,构建受控实验环境以探究该现象的成因与动态机制。研究发现,模型在训练过程中会经历一个关键学习阶段,在此阶段内模型要么为不同语言中的相同事实建立独立表征,要么实现统一表征;实验证明,跨语言知识迁移能力依赖于这种统一表征的形成。进一步地,作者揭示了统一程度受事实与训练语言间互信息(mutual information)以及语言提取难易度的影响,并据此提出通过调控数据分布和分词策略来调节跨语言迁移水平的方法,同时引入量化指标与可视化工具用于系统评估统一效果。

链接: https://arxiv.org/abs/2508.11017
作者: Carter Blum,Katja Filipova,Ann Yuan,Asma Ghandeharioun,Julian Zimmert,Fred Zhang,Jessica Hoffmann,Tal Linzen,Martin Wattenberg,Lucas Dixon,Mor Geva
机构: Google DeepMind(谷歌深度思维); Tel Aviv University (特拉维夫大学); Harvard University (哈佛大学); New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.
zh

[NLP-50] SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)安全评估框架在儿童与青少年群体中适用性不足的问题,现有框架多基于成人用户设计,未能充分覆盖不同发展阶段(早期儿童0–6岁、中期儿童7–12岁、青少年13–18岁)所特有的认知、情感和社会风险。其解决方案的关键在于提出SproutBench——一个包含1,283个基于发展心理学原理的对抗性提示(adversarial prompts)的评测套件,能够系统性探测如情感依赖、隐私泄露及危险行为模仿等关键风险;通过实证评估47种LLMs,研究揭示了多个维度间的强相关性(如安全与风险预防)以及互动性与适龄性之间的显著负相关关系,从而为构建以儿童为中心的AI设计与部署提供可操作的安全指南。

链接: https://arxiv.org/abs/2508.11009
作者: Wenpeng Xing,Lanyi Wei,Haixiao Hu,Rongchang Li,Mohan Li,Changting Lin,Meng Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0–6), middle childhood (7–12), and adolescence (13–18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.
zh

[NLP-51] Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling ECAI2025

【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDMs)在生成质量上的优化问题,尤其是在推理阶段如何通过更有效的策略提升生成文本的质量。其关键解决方案是提出一种基于验证器(verifier)的推理时缩放方法,该方法利用预训练嵌入模型构建简单但高效的软值(soft-value)验证机制,在每个去噪步骤后对候选生成进行筛选与引导,从而在不改变模型结构的前提下显著提升生成质量,尤其在无分类器引导(classifier-free guidance)等现有设置基础上实现了明显改进。

链接: https://arxiv.org/abs/2508.10995
作者: Tejomay Kishor Padole,Suyash P Awate,Pushpak Bhattacharyya
机构: Indian Institute of Technology, Bombay (印度理工学院,孟买分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted as a main conference submission in the European Conference on Artificial Intelligence (ECAI 2025)

点击查看摘要

Abstract:Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to its scalability and ease of training compared to other diffusion model paradigms for discrete data, establishing itself as the state-of-the-art non-autoregressive generator for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.
zh

[NLP-52] Match Choose: Model Selection Framework for Fine-tuning Text-to-Image Diffusion Models

【速读】: 该论文旨在解决预训练文本到图像(Text-to-Image, T2I)模型在特定目标数据域上微调时的模型选择难题,即如何高效地从模型平台(如HuggingFace)中挑选出最适合目标域微调的模型,而无需对所有候选模型进行耗时的全量微调。其解决方案的关键在于提出首个模型选择框架MC,其核心是一个匹配图(matching graph),该图包含模型与数据集节点,以及刻画微调性能(model-data边)和数据相似性(data-data边)的边结构;通过结合输入模型/数据特征与从匹配图中提取的图嵌入特征,构建预测模型以准确估计哪个预训练T2I模型在目标域微调后能获得最佳生成质量。

链接: https://arxiv.org/abs/2508.10993
作者: Basile Lewandowski,Robert Birke,Lydia Y. Chen
机构: University of Neuchâtel (纳沙泰尔大学); University of Turin (都灵大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) models based on diffusion and transformer architectures advance rapidly. They are often pretrained on large corpora, and openly shared on a model platform, such as HuggingFace. Users can then build up AI applications, e.g., generating media contents, by adopting pretrained T2I models and fine-tuning them on the target dataset. While public pretrained T2I models facilitate the democratization of the models, users face a new challenge: which model can be best fine-tuned based on the target data domain? Model selection is well addressed in classification tasks, but little is known in (pretrained) T2I models and their performance indication on the target domain. In this paper, we propose the first model selection framework, MC, which enables users to efficiently choose a pretrained T2I model from a model platform without exhaustively fine-tuning them all on the target dataset. The core of MC is a matching graph, which consists of: (i) nodes of available models and profiled datasets, and (ii) edges of model-data and data-data pairs capturing the fine-tuning performance and data similarity, respectively. We then build a model that, based on the inputs of model/data feature, and, critically, the graph embedding feature, extracted from the matching graph, predicts the model achieving the best quality after fine-tuning for the target domain. We evaluate MC on choosing across ten T2I models for 32 datasets against three baselines. Our results show that MC successfully predicts the best model for fine-tuning in 61.3% of the cases and a closely performing model for the rest.
zh

[NLP-53] BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)预训练中因数据量增长趋于饱和而遇到的“数据墙”问题,即单纯增加数据规模难以持续提升模型性能。为此,作者提出了一种名为BeyondWeb的合成数据生成框架,其核心解决方案在于通过系统性优化多个关键因素——包括数据重写策略、模型规模与架构选择、以及数据质量评估机制——来生成高质量的合成预训练数据。不同于以往依赖简单数据扩充的方法,BeyondWeb在14项基准测试中显著优于现有最优合成数据集(如Cosmopedia和Nemotron-Synth),并实现高达7.7倍的训练速度提升,证明了高质量合成数据并非单一技术突破的结果,而是多维度协同优化的产物。

链接: https://arxiv.org/abs/2508.10975
作者: Pratyush Maini,Vineeth Dorna,Parth Doshi,Aldo Carranza,Fan Pan,Jack Urbanek,Paul Burstein,Alex Fang,Alvin Deng,Amro Abbas,Brett Larsen,Cody Blakeney,Charvi Bannur,Christina Baek,Darren Teh,David Schwab,Haakon Mongstad,Haoli Yin,Josh Wills,Kaleigh Mentzer,Luke Merrick,Ricardo Monti,Rishabh Adiga,Siddharth Joshi,Spandan Das,Zhengping Wang,Bogdan Gaza,Ari Morcos,Matthew Leavitt
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC’s high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there’s no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
zh

[NLP-54] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

【速读】: 该论文旨在解决知识图谱(Knowledge Graph, KG)中挖掘出的逻辑规则难以被人类理解的问题,其根源在于规则本身的复杂性以及不同KG在实体和关系标注上的异构性。为提升知识图谱的可访问性和可用性,作者提出了一种名为Rule2Text的综合性框架,其核心解决方案是利用大语言模型(Large Language Models, LLMs)将挖掘出的逻辑规则自动转化为自然语言解释。该框架通过系统性地评估多种LLM及提示策略(包括零样本、少样本、变量类型融合与思维链推理),结合人工评估与LLM作为裁判(LLM-as-a-judge)机制,构建高质量标注数据集并用于微调开源模型(如Zephyr),从而显著提升生成解释的准确性和清晰度,尤其在领域特定数据集上表现突出。此外,还引入类型推断模块以支持缺乏显式类型信息的知识图谱。

链接: https://arxiv.org/abs/2508.10971
作者: Nasim Shirvani-Mahdavi,Chengkai Li
机构: University of Texas at Arlington(德克萨斯大学阿灵顿分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2507.23740

点击查看摘要

Abstract:Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models’ performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at this https URL.
zh

[NLP-55] Empowering Multimodal LLM s with External Tools: A Comprehensive Survey

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在高质量多模态数据获取、复杂下游任务表现不足以及评估协议不完善等方面的局限性,从而提升其可靠性与跨领域适用性。解决方案的关键在于引入外部工具(如APIs、专家模型和知识库)来增强MLLM的能力,具体体现在四个维度:(1) 利用外部工具辅助高质量多模态数据的采集与标注;(2) 通过工具调用提升MLLM在挑战性下游任务中的性能;(3) 实现更全面、准确的MLLM评估;(4) 探讨当前局限与未来发展方向。这一策略借鉴人类借助外部工具进行推理与问题解决的能力,为突破MLLM瓶颈提供了系统性路径。

链接: https://arxiv.org/abs/2508.10955
作者: Wenbin An,Jiahao Nie,Yaqiang Wu,Feng Tian,Shijian Lu,Qinghua Zheng
机构: Xi’an Jiaotong University (西安交通大学); Nanyang Technological University (南洋理工大学); Lenovo Research (联想研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 21 pages, 361 references

点击查看摘要

Abstract:By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available athttps://github.com/Lackel/Awesome-Tools-for-MLLMs.
zh

[NLP-56] Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News

【速读】: 该论文旨在解决如何自动从新闻文章中提取公司风险因素的问题,以辅助投资者和金融市场监管。其解决方案的关键在于构建一个包含七类风险维度(如供应链、监管和竞争)的计算框架,并通过标注744篇新闻文章对多种机器学习模型进行基准测试。实验表明,尽管大语言模型(LLMs)在多数自然语言处理任务中表现优异,但在零样本或少样本提示下识别风险因素时性能仅达到中等至较低水平;相比之下,微调后的预训练语言模型在多数风险类别上表现更优,从而为大规模分析金融新闻中的风险信息提供了有效工具。

链接: https://arxiv.org/abs/2508.10927
作者: Jiaxin Pei,Soumya Vadlamannati,Liang-Kang Huang,Daniel Preotiuc-Pietro,Xinyu Hua
机构: University of Michigan (密歇根大学); Bloomberg (彭博)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competitions. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting state-of-the-art LLMs (e.g. LLaMA-2) can only achieve moderate to low performances in identifying risk factors. And fine-tuned pre-trained language models are performing better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries.
zh

[NLP-57] gpt -oss-120b gpt -oss-20b Model Card

【速读】: 该论文旨在解决当前大型语言模型在推理准确性与推理成本之间的权衡问题,以及如何有效提升模型的智能体(agentic)能力以支持复杂任务执行。其解决方案的关键在于采用高效的专家混合(Mixture-of-Experts, MoE)Transformer架构,并结合大规模知识蒸馏(distillation)与强化学习(reinforcement learning)进行训练,从而在保持较低推理开销的同时显著增强模型在数学、编程和安全性等多类基准测试中的表现。此外,通过渲染对话格式(rendered chat format)实现清晰的指令遵循与角色划分,进一步提升了模型对工具调用(如Python工具使用、开发者自定义函数支持)和深度研究浏览等高级功能的整合能力。

链接: https://arxiv.org/abs/2508.10925
作者: OpenAI:Sandhini Agarwal,Lama Ahmad,Jason Ai,Sam Altman,Andy Applebaum,Edwin Arbus,Rahul K. Arora,Yu Bai,Bowen Baker,Haiming Bao,Boaz Barak,Ally Bennett,Tyler Bertao,Nivedita Brett,Eugene Brevdo,Greg Brockman,Sebastien Bubeck,Che Chang,Kai Chen,Mark Chen,Enoch Cheung,Aidan Clark,Dan Cook,Marat Dukhan,Casey Dvorak,Kevin Fives,Vlad Fomenko,Timur Garipov,Kristian Georgiev,Mia Glaese,Tarun Gogineni,Adam Goucher,Lukas Gross,Katia Gil Guzman,John Hallman,Jackie Hehir,Johannes Heidecke,Alec Helyar,Haitang Hu,Romain Huet,Jacob Huh,Saachi Jain,Zach Johnson,Chris Koch,Irina Kofman,Dominik Kundel,Jason Kwon,Volodymyr Kyrylov,Elaine Ya Le,Guillaume Leclerc,James Park Lennon,Scott Lessans,Mario Lezcano-Casado,Yuanzhi Li,Zhuohan Li,Ji Lin,Jordan Liss,Lily(Xiaoxuan)Liu,Jiancheng Liu,Kevin Lu,Chris Lu,Zoran Martinovic,Lindsay McCallum,Josh McGrath,Scott McKinney,Aidan McLaughlin,Song Mei,Steve Mostovoy,Tong Mu,Gideon Myles,Alexander Neitz,Alex Nichol,Jakub Pachocki,Alex Paino,Dana Palmie,Ashley Pantuliano,Giambattista Parascandolo,Jongsoo Park,Leher Pathak,Carolina Paz,Ludovic Peran,Dmitry Pimenov,Michelle Pokrass,Elizabeth Proehl,Huida Qiu,Gaby Raila,Filippo Raso,Hongyu Ren,Kimmy Richardson,David Robinson,Bob Rotsted,Hadi Salman,Suvansh Sanjeev,Max Schwarzer,D. Sculley,Harshit Sikchi,Kendal Simon,Karan Singhal,Yang Song
机构: OpenAI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
zh

[NLP-58] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins ACL2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在用户建模中难以捕捉个体多维特征(如人口统计学、行为和心理测量数据)的问题,从而导致生成内容缺乏个性化与情感细腻度。解决方案的关键在于提出PersonaTwin框架——一种多层提示条件化机制,通过整合多源异构数据构建自适应数字孪生体(digital twins),并在医疗健康场景下基于超过8500名个体的综合数据集进行系统评估。该框架不仅在文本相似性指标上达到与理想基准相当的仿真保真度,且下游模型使用persona-twins训练后,在预测性能与公平性指标上均逼近直接基于真实个体训练的效果,验证了其在生成真实且具情绪维度的用户模拟中的潜力。

链接: https://arxiv.org/abs/2508.10906
作者: Sihan Chen,John P. Lalor,Yi Yang,Ahmed Abbasi
机构: Viterbi School of Engineering, University of Southern California(南加州大学维特比工程学院); Human-centered Analytics Lab, University of Notre Dame(圣母大学人类中心分析实验室); Department of IT, Analytics, and Operations, University of Notre Dame(圣母大学信息技术、分析与运营系); Department of Information Systems, Business Statistics and Operations Management, HKUST(香港科技大学信息系统、商业统计与运营管理系)
类目: Computation and Language (cs.CL)
备注: Presented at the Generation, Evaluation Metrics (GEM) Workshop at ACL 2025

点击查看摘要

Abstract:While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.
zh

[NLP-59] A2HCoder: An LLM -Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation

【速读】: 该论文旨在解决无线通信系统中算法设计与硬件实现之间存在的显著鸿沟问题,这一鸿沟导致从高级编程语言(如MATLAB)到硬件描述语言(HDL,如Verilog)的部署效率低下且易出错。传统方法依赖大量领域知识和手工开发,难以应对超低延迟和低功耗等严苛需求。解决方案的关键在于提出A2HCoder:一个基于大语言模型(LLM)的分层算法到HDL编码代理,其核心创新为双维度结构设计——水平方向将复杂算法分解为模块化功能块以提升代码一致性;垂直方向采用逐步、细粒度的翻译策略,结合MATLAB与Vitis HLS等外部工具链进行调试与电路级综合,从而有效抑制LLM生成代码中的幻觉问题并保障硬件正确性。

链接: https://arxiv.org/abs/2508.10904
作者: Jie Lei,Ruofan Jia,J. Andrew Zhang,Hao Zhang
机构: University of Technology Sydney (悉尼科技大学); Xidian University (西安电子科技大学)
类目: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog-in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.
zh

[NLP-60] he Next Phase of Scientific Fact-Checking: Advanced Evidence Retrieval from Complex Structured Academic Papers ICTIR’25 SIGIR

【速读】: 该论文旨在解决科学事实核查(scientific fact-checking)中因学术文献的结构性复杂性、长篇幅多模态表达以及科学知识动态演变所导致的挑战,这些问题在现有基于摘要的小规模数据集方法中未被充分考虑。其解决方案的关键在于构建一个专门针对真实应用场景的检索系统(IR system),重点突破五大核心研究挑战:(1)基于证据驱动的检索以克服语义局限性和主题不平衡问题;(2)引入时间感知的证据检索与引用追踪机制以减少过时信息的影响;(3)结构化文档解析以利用长距离上下文;(4)处理包含表格、图表及领域术语在内的复杂科学表达;(5)评估科学文献的可信度。通过初步实验验证这些挑战并探索潜在优化路径,论文为提升科学事实核查系统的性能提供了系统性方向。

链接: https://arxiv.org/abs/2506.20844
作者: Xingyu Deng,Xi Wang,Mark Stevenson
机构: University of Sheffield(谢菲尔德大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted for ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR’25)

点击查看摘要

Abstract:Scientific fact-checking aims to determine the veracity of scientific claims by retrieving and analysing evidence from research literature. The problem is inherently more complex than general fact-checking since it must accommodate the evolving nature of scientific knowledge, the structural complexity of academic literature and the challenges posed by long-form, multimodal scientific expression. However, existing approaches focus on simplified versions of the problem based on small-scale datasets consisting of abstracts rather than full papers, thereby avoiding the distinct challenges associated with processing complete documents. This paper examines the limitations of current scientific fact-checking systems and reveals the many potential features and resources that could be exploited to advance their performance. It identifies key research challenges within evidence retrieval, including (1) evidence-driven retrieval that addresses semantic limitations and topic imbalance (2) time-aware evidence retrieval with citation tracking to mitigate outdated information, (3) structured document parsing to leverage long-range context, (4) handling complex scientific expressions, including tables, figures, and domain-specific terminology and (5) assessing the credibility of scientific literature. Preliminary experiments were conducted to substantiate these challenges and identify potential solutions. This perspective paper aims to advance scientific fact-checking with a specialised IR system tailored for real-world applications.
zh

[NLP-61] Emphasis Sensitivity in Speech Representations

【速读】: 该论文旨在解决现代语音模型是否对语调强调(prosodic emphasis)敏感的问题,即它们能否以系统化的方式区分强调词与非强调词的表征。此前研究多依赖孤立的声学特征(如音高、时长)或标签预测,忽视了强调关系的结构特性。论文提出一种基于残差的框架,将强调定义为成对的中性词与强调词表征之间的差异,从而捕捉其相对性本质;关键创新在于利用这种残差表示来揭示语音模型中强调信息的编码机制——实验表明,自监督模型中的残差与时长变化强相关但难以用于词身份预测,说明其编码具有结构性和关系性;而在ASR微调后的模型中,残差占据的子空间更紧凑(比预训练模型小50%),表明强调被编码为一种任务相关的低维变换,且随着下游任务学习变得更加结构化。

链接: https://arxiv.org/abs/2508.11566
作者: Shaun Cassini,Thomas Hain,Anton Ragni
机构: University of Sheffield (谢菲尔德大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to IEEE ASRU 2025

点击查看摘要

Abstract:This work investigates whether modern speech models are sensitive to prosodic emphasis - whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework, defining emphasis as the difference between paired neutral and emphasized word representations. Analysis on self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.
zh

[NLP-62] Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style

【速读】: 该论文试图解决的是**表达式语音检索(expressive speech retrieval)**问题,即根据自然语言描述的说话风格(如情绪、语调等)从语音库中检索出相应风格的语音片段,而非仅基于语音内容本身进行检索。其解决方案的关键在于训练一个联合嵌入空间,将语音和文本风格描述分别编码为共享的潜在表示,从而实现通过自由形式的文本提示(如“愤怒地说话”)来检索匹配的表达式语音段。该方法的核心创新包括:设计合适的语音与文本编码器架构、优化跨模态对齐的训练目标,以及采用提示增强策略以提升对任意文本查询的泛化能力。

链接: https://arxiv.org/abs/2508.11187
作者: Wonjune Kang,Deb Roy
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to ASRU 2025

点击查看摘要

Abstract:We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
zh

计算机视觉

[CV-0] hyme: Think Beyond Images

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉推理任务中对图像信息利用不足的问题,尤其是缺乏自主执行多样化图像处理与逻辑计算的能力。现有方法多依赖于“思考图像”(think with images)范式,难以实现动态、自主的图像操作和代码驱动的推理增强。解决方案的关键在于提出 Thyme(Think Beyond Images)——一种通过可执行代码自动生成并执行图像处理操作(如裁剪、旋转、对比度增强)及数学计算的新范式,从而实现高自主性的图像理解与推理协同优化。其核心创新包括两阶段训练策略:首先在50万样本数据集上进行监督微调(SFT)以习得代码生成能力,随后采用GRPO-ATS(Group Relative Policy Optimization with Adaptive Temperature Sampling)算法进行强化学习(RL)训练,通过差异化温度采样平衡文本推理探索与代码执行精度,显著提升模型在高分辨率感知与复杂推理任务中的性能表现。

链接: https://arxiv.org/abs/2508.11630
作者: Yi-Fan Zhang,Xingyu Lu,Shukang Yin,Chaoyou Fu,Wei Chen,Xiao Hu,Bin Wen,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Haonan Fan,Kaibing Chen,Jiankang Chen,Haojie Ding,Kaiyu Tang,Zhang Zhang,Liang Wang,Fan Yang,Tingting Gao,Guorui Zhou
机构: Kwai Keye; CASIA (中国科学院自动化研究所); NJU (南京大学); THU (清华大学); USTC (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Following OpenAI’s introduction of the thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing think with images’’ approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by a RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
zh

[CV-1] Is ChatGPT -5 Ready for Mammogram VQA?

【速读】:该论文旨在解决乳腺X线摄影视觉问答(mammogram visual question answering, VQA)中如何有效利用大语言模型(large language models, LLMs)进行BI-RADS评估、异常检测与恶性肿瘤分类的问题。其关键解决方案在于系统性评估GPT-5系列模型及GPT-4o在多个公开乳腺影像数据集(EMBED、InBreast、CMMD、CBIS-DDSM)上的表现,发现GPT-5在各项任务中均优于其他GPT变体,尽管其性能仍低于人类专家和领域微调模型,但相较于GPT-4o的显著提升表明通用LLMs通过针对性领域适配与优化后,具备辅助乳腺癌筛查的潜力。

链接: https://arxiv.org/abs/2508.11628
作者: Qiang Li,Shansong Wang,Mingzhe Hu,Mojtaba Safari,Zachary Eidex,Xiaofeng Yang
机构: Emory University School of Medicine (埃默里大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 consistently was the best performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization. However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.
zh

[CV-2] LoRAtorio: An intrinsic approach to LoRA Skill Composition

【速读】:该论文旨在解决多LoRA(Low-Rank Adaptation)适配器在文本到图像扩散模型中难以有效组合的问题,尤其是在开放场景下,当所需技能数量和类型未知时,现有方法性能显著下降。解决方案的关键在于提出LoRAtorio框架,其核心创新是利用模型内在行为特性:首先观察到LoRA适配器在窄域训练后会产生偏离基础模型的去噪输出,而当输入分布外时,其行为更接近基础模型;基于此,LoRAtorio在潜在空间中将图像划分为空间块,计算每个块预测噪声与基础模型噪声的余弦相似度,构建空间感知权重矩阵以加权聚合多个LoRA输出;同时引入改进的无分类器引导机制,融合基础模型的无条件得分以缓解领域漂移问题,并进一步扩展至推理时动态选择相关LoRA适配器的模块,从而实现高效、准确的多LoRA组合。

链接: https://arxiv.org/abs/2508.11624
作者: Niki Foteinopoulou,Ignas Budvytis,Stephan Liwicki
机构: Toshiba Europe (东芝欧洲); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 17 figures

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch’s predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model’s unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.
zh

[CV-3] CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion

【速读】:该论文旨在解决文本驱动的3D场景编辑中跨视角一致性不足的问题,现有方法通常将预训练的2D图像编辑器适配到多视图输入,但由于缺乏对多视图信息交互的显式控制,常导致编辑效果不充分且细节模糊。其解决方案的关键在于提出一种对应约束注意力机制(correspondence-constrained attention mechanism),该机制在扩散去噪过程中强制像素间保持预期的一致性交互,同时结合去噪过程中估计的语义相似性,提升对应关系建模的可靠性,从而实现更鲁棒的多视图编辑;此外,还设计了选择性编辑管道以增强用户可控性。

链接: https://arxiv.org/abs/2508.11603
作者: Zhe Zhu,Honghua Chen,Peng Li,Mingqiang Wei
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.
zh

[CV-4] DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring

【速读】:该论文旨在解决城市环境中对路边植被和基础设施进行低成本、实时、高精度的物体级结构评估与地理定位问题,尤其针对传统遥感(Remote Sensing, RS)方法如LiDAR或图像分析在成本高、部署慢、难以实现高频监测等方面的局限性。其解决方案的关键在于构建一个端到端的框架,融合单目深度估计(monocular depth estimation)、深度误差校正和基于GPS的几何三角测量技术:首先利用先进的单目深度模型生成初始深度图,再通过梯度提升回归模型显著修正远距离物体的深度低估问题(R² = 0.92,MAE = 0.31),进而结合GPS数据进行物体位置 triangulation,并采用针孔相机模型计算物体高度,从而实现在普通车载摄像头视频流中实时提取准确的空间与结构信息。

链接: https://arxiv.org/abs/2508.11591
作者: Durga Joshi(1),Chandi Witharana(1),Robert Fahey(1),Thomas Worthley(1),Zhe Zhu(1),Diego Cerrai(2) ((1) Department of Natural Resources and the Environment, Eversource Energy Center, University of Connecticut, Storrs, CT, USA (2) Department of Civil and Environmental Engineering, Eversource Energy Center, University of Connecticut, Storrs, CT, USA)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 35 Pages, 15 figures

点击查看摘要

Abstract:Our study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure with commonly available but underutilized dashboard camera (dashcam) video data. We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from street-level video streams from vehicle-mounted dashcams. Depth maps were first estimated using a state-of-the-art monocular depth model, then refined via a gradient-boosted regression framework to correct underestimations, particularly for distant objects. The depth correction model achieved strong predictive performance (R2 = 0.92, MAE = 0.31 on transformed scale), significantly reducing bias beyond 15 m. Further, object locations were estimated using GPS-based triangulation, while object heights were calculated using pin hole camera geometry. Our method was evaluated under varying conditions of camera placement and vehicle speed. Low-speed vehicle with inside camera gave the highest accuracy, with mean geolocation error of 2.83 m, and mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles. To the best of our knowledge, it is the first framework to combine monocular depth modeling, triangulated GPS-based geolocation, and real-time structural assessment for urban vegetation and infrastructure using consumer-grade video data. Our approach complements conventional RS methods, such as LiDAR and image by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure, making it especially valuable for utility companies, and urban planners aiming for scalable and frequent assessments in dynamic urban environments.
zh

[CV-5] Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

【速读】:该论文旨在解决在资源受限的机器人平台部署多个机器学习模型进行不同感知任务时,因重复计算、内存占用大及集成复杂而导致的效率低下问题。解决方案的关键在于提出一种名为Visual Perception Engine (VPEngine) 的模块化框架,其核心设计是基于共享的基础模型骨干网络(如DINOv2)提取图像表征,并在多个专用任务头(如深度估计、目标检测和语义分割)之间高效共享这些特征,避免GPU与CPU间的冗余内存传输;同时借助CUDA Multi-Process Service (MPS) 实现动态任务优先级调度和运行时可调的每任务推理频率,从而在保持恒定内存占用的前提下显著提升GPU利用率,实现在NVIDIA Jetson Orin AGX平台上≥50 Hz的端到端实时性能。

链接: https://arxiv.org/abs/2508.11584
作者: Jakub Łucki,Jonathan Becktor,Georgios Georgakis,Robert Royce,Shehryar Khattak
机构: Jet Propulsion Laboratory, California Institute of Technology (加州理工学院); Swiss Federal Institute of Technology (ETH Zürich) (苏黎世联邦理工学院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework’s capabilities through an example implementation using DINOv2 as the foundation model with multiple task (depth, object detection and semantic segmentation) heads, achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at \geq 50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.
zh

[CV-6] Causality Matters: How Temporal Information Emerges in Video Language Models

【速读】:该论文旨在解决视频语言模型(VideoLMs)在时间理解(temporal understanding)方面的核心挑战,即如何有效建模事件顺序、持续时间和跨时间的关系。以往研究普遍依赖位置编码(positional encodings, PEs)来捕捉时间结构,但本文发现:移除或修改PEs对性能影响甚微,而反转帧序列却导致显著性能下降,表明时间信息并非由显式PE主导,而是通过因果注意力机制下帧间视觉token的交互隐式编码。关键洞察在于识别出一条因果信息路径:时间线索通过帧间注意力逐步合成,在最后一帧聚合后融入查询token,从而实现时间推理。基于此,作者提出两种效率优化策略——分阶段跨模态注意力和时间退出机制(temporal exit mechanism),用于早期token截断,实验验证了其有效性。

链接: https://arxiv.org/abs/2508.11576
作者: Yumeng Shi,Quanyu Long,Yin Wu,Wenya Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.
zh

[CV-7] rajSV: A Trajectory-based Model for Sports Video Representations and Applications

【速读】:该论文旨在解决体育视频分析领域中三个关键问题:(1)数据不可获取性,(2)缺乏基于轨迹的统一框架,以及(3)对大量监督标签的依赖。解决方案的核心是提出TrajSV,一个端到端的轨迹驱动框架,包含三个模块:数据预处理模块用于从广播视频中提取球员与球的轨迹;Clip Representation Network (CRNet) 利用轨迹增强的Transformer模块学习片段表示;Video Representation Network (VRNet) 通过编码器-解码器结构聚合片段表示与视觉特征以生成视频表示。此外,引入三重对比损失实现无监督优化,从而在不依赖大量标注数据的情况下提升模型性能。实验表明,该方法在体育视频检索、动作定位和视频字幕三个下游任务中均达到领先水平。

链接: https://arxiv.org/abs/2508.11569
作者: Zheng Wang,Shihao Xu,Wei Shi
机构: Huawei Technologies, Co., Ltd. (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: This paper has been accepted by TCSVT

点击查看摘要

Abstract:Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.
zh

[CV-8] raining-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection, IAD)中因异常数据稀缺而导致模型训练困难的问题。现有异常生成方法普遍存在生成质量不高或需额外训练数据的局限性。为此,作者提出了一种无需训练的异常生成框架AAG(Anomaly-Aware Generation),其核心在于利用Stable Diffusion(SD)强大的图像生成能力,在给定正常图像、掩码(mask)和简单文本提示(text prompt)条件下,精准生成特定区域内的自然且合理的异常图像,同时保持其他区域内容不变。关键创新点包括:1)Cross-Attention Enhancement(CAE),通过重构SD中的交叉注意力机制,增强指定区域视觉token与文本嵌入之间的相似性,使生成异常符合文本描述;2)Self-Attention Enhancement(SAE),提升正常视觉token与异常视觉token间的相似性,确保生成异常与原图结构一致、语义合理。实验表明,AAG在MVTec AD和VisA数据集上均能有效生成高质量异常图像,并显著提升下游异常检测任务性能。

链接: https://arxiv.org/abs/2508.11550
作者: Zuo Zuo,Jiahao Dong,Yanyun Qu,Zongze Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial anomaly detection (AD) plays a significant role in manufacturing where a long-standing challenge is data scarcity. A growing body of works have emerged to address insufficient anomaly data via anomaly generation. However, these anomaly generation methods suffer from lack of fidelity or need to be trained with extra data. To this end, we propose a training-free anomaly generation framework dubbed AAG, which is based on Stable Diffusion (SD)'s strong generation ability for effective anomaly image generation. Given a normal image, mask and a simple text prompt, AAG can generate realistic and natural anomalies in the specific regions and simultaneously keep contents in other regions unchanged. In particular, we propose Cross-Attention Enhancement (CAE) to re-engineer the cross-attention mechanism within Stable Diffusion based on the given mask. CAE increases the similarity between visual tokens in specific regions and text embeddings, which guides these generated visual tokens in accordance with the text description. Besides, generated anomalies need to be more natural and plausible with object in given image. We propose Self-Attention Enhancement (SAE) which improves similarity between each normal visual token and anomaly visual tokens. SAE ensures that generated anomalies are coherent with original pattern. Extensive experiments on MVTec AD and VisA datasets demonstrate effectiveness of AAG in anomaly generation and its utility. Furthermore, anomaly images generated by AAG can bolster performance of various downstream anomaly inspection tasks.
zh

[CV-9] Reinforcing Video Reasoning Segmentation to Think Before It Segments

【速读】:该论文旨在解决视频推理分割(Video Reasoning Segmentation, VRS)中因依赖大型视觉语言模型(Large Vision Language Models, LVLMs)进行掩码预测而导致的推理过程可解释性差与时空推理能力不足的问题。其解决方案的关键在于提出一个专门针对VRS任务设计的LVLM——Veason-R1,该模型通过两种核心机制实现优化:一是基于链式思维(Chain-of-Thought, CoT)初始化的高质量训练数据构建结构化推理轨迹,从而提升视频级语义与帧级空间定位之间的对齐;二是采用分组相对策略优化(Group Relative Policy Optimization, GRPO)进行强化学习微调,引入综合奖励机制以协同增强空间对齐性和时间一致性,从而显著改善关键帧定位和细粒度对象接地性能。

链接: https://arxiv.org/abs/2508.11538
作者: Sitong Gong,Lu Zhang,Yunzhi Zhuge,Xu Jia,Pingping Zhang,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into SEG tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J F in ReVOS and +10.0 J F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.
zh

[CV-10] An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture

【速读】:该论文旨在解决在资源受限的计算环境中实现高效且高精度的医学图像分类问题。其核心解决方案在于对ConvNeXt-Tiny架构进行结构优化与损失函数设计:首先引入双全局池化(Global Average Pooling 和 Global Max Pooling)特征融合策略,以同时保留全局统计特征和显著响应信息;其次设计轻量级通道注意力模块——挤压与激励向量(Squeeze-and-Excitation Vector, SEVector),在最小化参数开销的前提下提升通道权重的自适应分配能力;最后,在损失函数中加入特征平滑损失(Feature Smoothing Loss),增强类内特征一致性并抑制类内方差。该方法在仅使用CPU(8线程)条件下,于10个训练周期内达到89.10%的测试准确率,验证了其在资源受限场景下的有效性与稳定性。

链接: https://arxiv.org/abs/2508.11532
作者: Jingsong Xia,Yue Yin,Xiuhan Li
机构: Nanjing Medical University (南京医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis. However, achieving efficient and high-accuracy image classification in resource-constrained computational environments remains challenging. This study proposes a medical image classification method based on an improved ConvNeXt-Tiny architecture. Through structural optimization and loss function design, the proposed method enhances feature extraction capability and classification performance while reducing computational complexity. Specifically, the method introduces a dual global pooling (Global Average Pooling and Global Max Pooling) feature fusion strategy into the ConvNeXt-Tiny backbone to simultaneously preserve global statistical features and salient response information. A lightweight channel attention module, termed Squeeze-and-Excitation Vector (SEVector), is designed to improve the adaptive allocation of channel weights while minimizing parameter overhead. Additionally, a Feature Smoothing Loss is incorporated into the loss function to enhance intra-class feature consistency and suppress intra-class variance. Under CPU-only conditions (8 threads), the method achieves a maximum classification accuracy of 89.10% on the test set within 10 training epochs, exhibiting a stable convergence trend in loss values. Experimental results demonstrate that the proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for the deployment and promotion of medical imaging analysis models.
zh

[CV-11] Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction

【速读】:该论文旨在解决高效跟踪器(efficient tracker)在追求计算效率时因模型参数和计算复杂度降低而导致特征表示能力弱化的问题,从而限制了其在复杂环境下准确捕捉目标状态的能力。解决方案的关键在于提出多状态跟踪器(Multi-State Tracker, MST),其核心创新包括:1)多状态生成(Multi-State Generation, MSG)模块,在特征提取过程中生成多阶段的状态表示;2)状态特定增强(State-Specific Enhancement, SSE)模块,对各状态特征进行针对性优化以突出目标相关特征;3)跨状态交互(Cross-State Interaction, CSI)模块,实现不同状态间的信息交互与自适应融合,从而增强特征表达能力。上述模块采用轻量级的隐状态自适应状态空间二元性(Hidden State Adaptation-based State Space Duality, HSA-SSD)设计,仅引入0.1 GFLOPs计算量和0.66 M参数,显著提升了跟踪精度与鲁棒性,优于现有最先进高效跟踪方法。

链接: https://arxiv.org/abs/2508.11531
作者: Shilei Wang,Gong Cheng,Pujian Lai,Dong Gao,Junwei Han
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often compromises the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at this https URL.
zh

[CV-12] A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11

【速读】:该论文旨在解决快速发展的长江三角洲地区交通基础设施加速老化背景下,混凝土裂缝检测效率低下的问题,尤其是传统人工巡检效率不足以及现有深度学习模型在复杂背景中对小目标裂缝检测性能不佳的局限性。其解决方案的关键在于提出一种基于YOLOv11n架构的多任务混凝土裂缝检测与分割模型——YOLOv11-KW-TA-FP,通过三个核心模块实现性能提升:(1)在骨干网络中嵌入动态KernelWarehouse卷积(KWConv),利用动态核共享机制增强特征表示能力;(2)在特征金字塔中引入三重注意力机制(Triple Attention, TA),强化通道-空间交互建模;(3)设计FP-IoU损失函数以实现自适应边界框回归惩罚。实验表明,该模型在精度、召回率和mAP@50上分别达到91.3%、76.6%和86.4%,且在数据稀缺和噪声干扰下仍保持稳定性能,展现出显著的工程应用价值。

链接: https://arxiv.org/abs/2508.11517
作者: Shaoze Huang,Qi Liu,Chao Chen,Yuhang Chen
机构: MSE, AHUT, China; IoT, AHUT, China; IST, SDUST, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accelerated aging of transportation infrastructure in the rapidly developing Yangtze River Delta region necessitates efficient concrete crack detection, as crack deterioration critically compromises structural integrity and regional economic growth. To overcome the limitations of inefficient manual inspection and the suboptimal performance of existing deep learning models, particularly for small-target crack detection within complex backgrounds, this paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture. The proposed model integrates a three-stage optimization framework: (1) Embedding dynamic KernelWarehouse convolution (KWConv) within the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) Incorporating a triple attention mechanism (TA) into the feature pyramid to strengthen channel-spatial interaction modeling; and (3) Designing an FP-IoU loss function to facilitate adaptive bounding box regression penalization. Experimental validation demonstrates that the enhanced model achieves significant performance improvements over the baseline, attaining 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergistic efficacy of the proposed modules. Furthermore, robustness tests indicate stable performance under conditions of data scarcity and noise interference. This research delivers an efficient computer vision solution for automated infrastructure inspection, exhibiting substantial practical engineering value.
zh

[CV-13] AIM: Amending Inherent Interpretability via Self-Supervised Masking ICCV

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在训练过程中倾向于依赖虚假特征(spurious features)而非真实有意义的特征(genuine features)的问题,从而导致模型可解释性差且泛化能力受限。解决方案的关键在于提出“通过自监督掩码修正内在可解释性”(Amending Inherent Interpretability via Self-Supervised Masking, AIM),该方法利用多阶段编码特征引导一种样本特定的自监督特征掩码机制,促使网络更倾向于使用真实特征进行决策,而无需额外标注信息。实验表明,AIM 在多个具有挑战性的数据集上同时提升了模型的准确率和可解释性(以Energy Pointing Game, EPG分数衡量),验证了其促进真实特征学习的有效性与普适性。

链接: https://arxiv.org/abs/2508.11502
作者: Eyad Alshami,Shashank Agnihotri,Bernt Schiele,Margret Keuper
机构: Max-Planck-Institute for Informatics (马普研究所信息学); RTG Neuroexplicit Models of Language, Vision, and Action (神经显式语言、视觉和动作研究训练项目); Data and Web Science Group, University of Mannheim (数据与网络科学组,曼海姆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at International Conference on Computer Vision (ICCV) 2025

点击查看摘要

Abstract:It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose “Amending Inherent Interpretability via Self-Supervised Masking” (AIM), a simple yet interestingly effective method that promotes the network’s utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.
zh

[CV-14] Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models

【速读】:该论文旨在解决历史手写文本识别(Historical Handwritten Text Recognition, HTR)在档案文献数字化过程中面临的三大挑战:标注数据稀缺、语言变异以及书写风格高度多样化。其解决方案的关键在于引入针对历史手写特征定制的图像预处理与四种新型数据增强方法,并结合集成学习策略以融合多个增强训练模型的优势。实验表明,采用弹性变换(Elastic)增强的单模型在Gwalther拉丁文手稿数据集上达到1.86%的字符错误率(Character Error Rate, CER),而基于前五名投票的集成模型进一步将CER降至1.60%,相较先前最优结果提升42%,验证了领域特定增强与集成策略对提升HTR性能的有效性。

链接: https://arxiv.org/abs/2508.11499
作者: Erez Meoded
机构: Mississippi State University (密西西比州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
zh

[CV-15] Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)在视觉识别任务中因依赖规则网格结构而导致的建模复杂拓扑关系和非局部语义能力受限的问题。其核心解决方案是提出一种层次化图特征增强(Hierarchical Graph Feature Enhancement, HGFE)框架,通过构建两个互补的图结构层次——窗口内图卷积用于捕捉局部空间依赖性,窗口间超节点交互用于建模全局语义关系;同时引入自适应频率调制模块,动态平衡低频与高频信号传播,在保留边缘和纹理信息的同时缓解过平滑问题。HGFE模块轻量、端到端可训练,并能无缝集成至标准CNN主干网络中,实验证明其显著提升了结构感知能力和整体识别性能。

链接: https://arxiv.org/abs/2508.11497
作者: Feiyue Zhao,Zhichao Zhang
机构: Nanjing University of Information Science and Technology (南京信息工程大学); Hubei University (湖北大学); Shanghai Jiao Tong University (上海交通大学); Hainan Normal University (海南师范大学); Neijiang Normal University (内江师范学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limita tion, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based rea soning into CNNs to enhance both structural awareness and feature representation. HGFE builds two complementary levels of graph structures: intra-window graph convolution to cap ture local spatial dependencies and inter-window supernode interactions to model global semantic relationships. Moreover, we introduce an adaptive frequency modulation module that dynamically balances low-frequency and high-frequency signal propagation, preserving critical edge and texture information while mitigating over-smoothing. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks. Extensive experiments on CIFAR-100 (classification), PASCAL VOC, and VisDrone (detection), as well as CrackSeg and CarParts (segmentation), validated the effectiveness of the HGFE in improving structural representation and enhancing overall recognition performance. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.11497 [cs.CV] (or arXiv:2508.11497v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.11497 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Feiyue Zhao [view email] [v1] Fri, 15 Aug 2025 14:19:50 UTC (36,503 KB) Full-text links: Access Paper: View a PDF of the paper titled Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition, by Feiyue Zhao and Zhichao ZhangView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CV prev | next new | recent | 2025-08 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[CV-16] Relative Position Matters: Trajectory Prediction and Planning with Polar Representation

【速读】:该论文旨在解决自动驾驶中轨迹预测与规划任务的挑战,即在动态环境中准确建模周围交通参与者(surrounding agents)的运动行为以及决策主体车辆(ego agent)的动作规划问题。传统方法通常在笛卡尔坐标系(Cartesian coordinates)中编码地图和车辆位置,并解码未来轨迹,但这种表示方式难以自然捕捉不同交通元素对决策主体车辆的影响差异,尤其是在相对距离和方向变化下的影响机制。其解决方案的关键在于提出一种全新的基于极坐标系(Polar coordinate system)的方法——Polaris,该方法将位置表示为半径和角度,从而更直观地建模空间变化和相对关系;通过专门设计的编码与精修模块,显式地建模距离与方向的变化特性,实现更具结构化和空间感知能力的轨迹预测与规划,在Argoverse 2和nuPlan等基准上均达到当前最优性能。

链接: https://arxiv.org/abs/2508.11492
作者: Bozhou Zhang,Nan Song,Bingzhao Gao,Li Zhang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trajectory prediction and planning in autonomous driving are highly challenging due to the complexity of predicting surrounding agents’ movements and planning the ego agent’s actions in dynamic environments. Existing methods encode map and agent positions and decode future trajectories in Cartesian coordinates. However, modeling the relationships between the ego vehicle and surrounding traffic elements in Cartesian space can be suboptimal, as it does not naturally capture the varying influence of different elements based on their relative distances and directions. To address this limitation, we adopt the Polar coordinate system, where positions are represented by radius and angle. This representation provides a more intuitive and effective way to model spatial changes and relative relationships, especially in terms of distance and directional influence. Based on this insight, we propose Polaris, a novel method that operates entirely in Polar coordinates, distinguishing itself from conventional Cartesian-based approaches. By leveraging the Polar representation, this method explicitly models distance and direction variations and captures relative relationships through dedicated encoding and refinement modules, enabling more structured and spatially aware trajectory prediction and planning. Extensive experiments on the challenging prediction (Argoverse 2) and planning benchmarks (nuPlan) demonstrate that Polaris achieves state-of-the-art performance.
zh

[CV-17] Perception in Plan: Coupled Perception and Planning for End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶中感知与规划模块分离导致的优化效率低下问题,即传统“感知-规划”范式难以实现规划导向的精准感知。其解决方案的关键在于提出一种“感知融入规划”(perception-in-plan)的框架设计,通过将感知模块嵌入到规划过程中,使感知能够根据动态演进的规划目标进行定向引导;具体而言,VeteranAD利用多模态锚定轨迹作为规划先验,驱动感知模块聚焦于这些轨迹上的交通要素,从而实现更全面且有针对性的环境理解,并结合自回归策略逐步预测未来轨迹,确保感知始终服务于规划需求,最终显著提升驾驶行为的准确性与可靠性。

链接: https://arxiv.org/abs/2508.11488
作者: Bozhou Zhang,Jingyu Li,Nan Song,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving has achieved remarkable advancements in recent years. Existing methods primarily follow a perception-planning paradigm, where perception and planning are executed sequentially within a fully differentiable framework for planning-oriented optimization. We further advance this paradigm through a perception-in-plan framework design, which integrates perception into the planning process. This design facilitates targeted perception guided by evolving planning objectives over time, ultimately enhancing planning performance. Building on this insight, we introduce VeteranAD, a coupled perception and planning framework for end-to-end autonomous driving. By incorporating multi-mode anchored trajectories as planning priors, the perception module is specifically designed to gather traffic elements along these trajectories, enabling comprehensive and targeted perception. Planning trajectories are then generated based on both the perception results and the planning priors. To make perception fully serve planning, we adopt an autoregressive strategy that progressively predicts future trajectories while focusing on relevant regions for targeted perception at each step. With this simple yet effective design, VeteranAD fully unleashes the potential of planning-oriented end-to-end methods, leading to more accurate and reliable driving behavior. Extensive experiments on the NAVSIM and Bench2Drive datasets demonstrate that our VeteranAD achieves state-of-the-art performance.
zh

[CV-18] Automated Building Heritage Assessment Using Street-Level Imagery

【速读】:该论文旨在解决在不损害文化遗产价值的前提下,量化建筑节能措施(如围护结构改造)所面临的难题,尤其是在大规模能源改造场景中如何兼顾遗产保护与能效提升。其解决方案的关键在于利用生成式AI(Generative AI)中的大语言模型GPT从立面图像中提取文化价值特征,并结合建筑登记数据,通过机器学习模型实现对斯德哥尔摩多户及非住宅建筑的文化遗产价值分类。实验表明,融合GPT提取特征与登记数据的模型可达到0.71的宏观F1分数,显著优于仅依赖GPT数据的0.60得分,从而为高质量数据库构建提供支持,助力精细化节能改造决策与遗产价值的集成考量。

链接: https://arxiv.org/abs/2508.11486
作者: Kristina Dabrock,Tim Johansson,Anna Donarelli,Mikael Mangold,Noah Pflugradt,Jann Michael Weinand,Jochen Linßen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detailed data is required to quantify energy conservation measures in buildings, such as envelop retrofits, without compromising cultural heritage. Novel artificial intelligence tools may improve efficiency in identifying heritage values in buildings compared to costly and time-consuming traditional inventories. In this study, the large language model GPT was used to detect various aspects of cultural heritage value in façade images. Using this data and building register data as features, machine learning models were trained to classify multi-family and non-residential buildings in Stockholm, Sweden. Validation against an expert-created inventory shows a macro F1-score of 0.71 using a combination of register data and features retrieved from GPT, and a score of 0.60 using only GPT-derived data. The presented methodology can contribute to a higher-quality database and thus support careful energy efficiency measures and integrated consideration of heritage value in large-scale energetic refurbishment scenarios.
zh

[CV-19] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

【速读】:该论文旨在解决当前视频生成模型在多镜头(multi-shot)视频合成中的关键瓶颈问题,即现有方法在镜头切换(shot transition)方面能力薄弱且不稳定,导致生成结果通常局限于单镜头序列,难以实现符合电影剪辑风格的连贯多镜头视频。解决方案的关键在于提出CineTrans框架,其核心创新包括:构建包含详细镜头标注的多镜头视频-文本数据集Cine250K以揭示电影剪辑规律;发现扩散模型中的注意力图与镜头边界存在对应关系,并据此设计一种无需训练的基于掩码(mask-based)的控制机制,可精确引导过渡发生在任意位置;在此基础上对模型进行微调后,CineTrans能够生成符合电影编辑逻辑的多镜头序列,显著提升过渡控制精度、时间一致性及整体视频质量。

链接: https://arxiv.org/abs/2508.11484
作者: Xiaoxue Wu,Bingjie Gao,Yu Qiao,Yaohui Wang,Xinyuan Chen
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 20 figures

点击查看摘要

Abstract:Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.
zh

[CV-20] OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

【速读】:该论文旨在解决当前建筑领域视觉数据集存在多样性与不一致性的问题,具体表现为数据规模、模态、标注质量及真实场景代表性差异较大,导致研究社区难以全面理解数据资源现状、识别关键空白并指导未来AI应用的发展方向。解决方案的关键在于系统性地收集和分类公开可用的视觉数据集(共51个,时间跨度为2005–2024),构建一个结构化的数据特征框架,涵盖数据基础信息、模态类型、标注方式和下游应用场景,并发布开源目录OpenConstruction以支持数据驱动方法开发;同时提出基于FAIR(可发现性、可访问性、互操作性和可重用性)原则的数据基础设施建设路线图,推动建筑领域AI研究向更有效、可靠和可扩展的方向发展。

链接: https://arxiv.org/abs/2508.11482
作者: Ruoxin Xiong,Yanyu Wang,Jiannan Cai,Kaijian Liu,Yuansheng Zhu,Pingbo Tang,Nora El-Gohary
机构: Kent State University (肯特州立大学); Louisiana State University (路易斯安那州立大学); The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校); Stevens Institute of Technology (史蒂文斯理工学院); Rochester Institute of Technology (罗切斯特理工学院); Carnegie Mellon University (卡内基梅隆大学); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community’s ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.
zh

[CV-21] ACR-YOLO: A Real-time Detection Framework for Abnormal Human Behaviors Enhanced with Coordinate and Task-Aware Representations IJCNN2025

【速读】:该论文旨在解决特殊场景下异常人类行为检测(Abnormal Human Behavior Detection, AHBD)中存在的小目标检测困难、分类与回归任务冲突以及多尺度特征融合不充分等问题。其解决方案的关键在于提出一种名为TACR-YOLO的实时检测框架,包含三个核心改进:引入坐标注意力模块(Coordinate Attention Module)以增强小目标检测能力;设计任务感知注意力模块(Task-Aware Attention Module)缓解分类与回归之间的冲突;构建强化颈部网络(Strengthen Neck Network)实现更精细的多尺度特征融合;此外,通过K-means聚类优化锚框(Anchor Box)尺寸并采用DIoU-Loss提升边界框回归精度。这些改进共同提升了模型在复杂场景下的检测性能与鲁棒性。

链接: https://arxiv.org/abs/2508.11478
作者: Xinyi Yin,Wenbo Yuan,Xuecheng Wu,Liangyu Fu,Danlei Huang
机构: Zhengzhou University (郑州大学); Xi’an Jiaotong University (西安交通大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures, accepted by IJCNN 2025

点击查看摘要

Abstract:Abnormal Human Behavior Detection (AHBD) under special scenarios is becoming increasingly crucial. While YOLO-based detection methods excel in real-time tasks, they remain hindered by challenges including small objects, task conflicts, and multi-scale fusion in AHBD. To tackle them, we propose TACR-YOLO, a new real-time framework for AHBD. We introduce a Coordinate Attention Module to enhance small object detection, a Task-Aware Attention Module to deal with classification-regression conflicts, and a Strengthen Neck Network for refined multi-scale fusion, respectively. In addition, we optimize Anchor Box sizes using K-means clustering and deploy DIoU-Loss to improve bounding box regression. The Personnel Anomalous Behavior Detection (PABD) dataset, which includes 8,529 samples across four behavior categories, is also presented. Extensive experimental results indicate that TACR-YOLO achieves 91.92% mAP on PABD, with competitive speed and robustness. Ablation studies highlight the contribution of each improvement. This work provides new insights for abnormal behavior detection under special scenarios, advancing its progress.
zh

[CV-22] SPG: Style-Prompting Guidance for Style-Specific Content Creation

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在生成图像时难以精确控制视觉风格的问题。现有方法虽能较好地对齐文本语义,但在风格一致性方面表现不足。解决方案的关键在于提出一种名为Style-Prompting Guidance (SPG) 的新型采样策略:通过构建一个风格噪声向量(style noise vector),并利用其相对于无条件噪声的方向偏移来引导扩散过程,使其趋向目标风格分布。该方法与无分类器指导(Classifier-Free Guidance, CFG)结合后,能够在保持语义准确性的同时实现风格一致性,且具备良好的鲁棒性与兼容性,可无缝集成至ControlNet和IPAdapter等可控生成框架中。

链接: https://arxiv.org/abs/2508.11476
作者: Qian Liang,Zichong Chen,Yang Zhou,Hui Huang
机构: Shenzhen University (深圳大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the Journal track of Pacific Graphics 2025

点击查看摘要

Abstract:Although recent text-to-image (T2I) diffusion models excel at aligning generated images with textual prompts, controlling the visual style of the output remains a challenging task. In this work, we propose Style-Prompting Guidance (SPG), a novel sampling strategy for style-specific image generation. SPG constructs a style noise vector and leverages its directional deviation from unconditional noise to guide the diffusion process toward the target style distribution. By integrating SPG with Classifier-Free Guidance (CFG), our method achieves both semantic fidelity and style consistency. SPG is simple, robust, and compatible with controllable frameworks like ControlNet and IPAdapter, making it practical and widely applicable. Extensive experiments demonstrate the effectiveness and generality of our approach compared to state-of-the-art methods. Code is available at this https URL.
zh

[CV-23] CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中肾小球基底膜(Glomerular Basement Membrane, GBM)分割的精度与标注效率之间的矛盾问题。传统监督深度学习方法虽能实现高精度分割,但依赖大量像素级标注,难以应用于临床场景;而少样本学习方法虽可减少标注需求,却常因无法捕捉GBM细微结构特征而导致性能不足。解决方案的关键在于提出一种快速高效的“粗到精”少样本分割流程CoFi:首先利用仅三张标注图像训练轻量神经网络生成初始粗分割掩膜,再通过形态学感知的剪枝策略自动生成高质量点提示(point prompts),并以此引导Segment Anything Model (SAM) 进行精细化分割。该方法在保持极低标注成本的同时,实现了Dice系数达74.54%的高精度分割,且推理速度为1.9 FPS,显著优于传统方法,在科研与临床应用中均具备良好潜力。

链接: https://arxiv.org/abs/2508.11469
作者: Hongjin Fang,Daniel Reisenbüchler,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
机构: Cornell University (康奈尔大学); University of Regensburg (雷根斯堡大学); Weill Cornell Medicine (威尔康奈尔医学院); Cornell Tech (康奈尔技术学院); Northwell Health (诺斯韦尔健康)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline’s speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: this https URL.
zh

[CV-24] Data-Driven Deepfake Image Detection Method – The 2024 Global Deepfake Image Detection Challenge

【速读】:该论文旨在解决深度伪造图像(Deepfake image)检测问题,即准确判断一张人脸图像是否为生成式 AI (Generative AI) 生成的伪造图像,并输出其为伪造的概率得分。解决方案的关键在于采用基于 Swin Transformer V2-B 的分类网络架构,结合在线数据增强与离线样本生成方法,有效提升训练样本的多样性并增强模型的泛化能力,最终在竞赛中获得优异成绩。

链接: https://arxiv.org/abs/2508.11464
作者: Xiaoya Zhu,Yibing Nan,Shiguo Lian
机构: AI Innovation Center, China Unicom (中国联通人工智能创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image track competition, our approach is based on the Swin Transformer V2-B classification network. And online data augmentation and offline sample generation methods are employed to enrich the diversity of training samples and increase the generalization ability of the model. Finally, we got the award of excellence in Deepfake image detection.
zh

[CV-25] Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation

【速读】:该论文旨在解决室内导航中因GPS信号弱或不可用而导致的定位与路径规划难题,传统方法往往依赖特殊传感器、预先布置的标记、场景地图信息或网络连接,限制了其在实际场景中的部署灵活性和普适性。解决方案的关键在于提出一种仅基于视觉输入的深度学习方法,利用图结构生成路径并结合可解释的数据增强与课程学习策略,实现了数据采集、标注与训练过程的高度自动化、高效性和鲁棒性;同时构建了一个大规模室内视频数据集(包含商场内多目标方向标注),并通过Android应用验证了其实时性和易部署特性,从而为无额外硬件依赖的室内导航提供了可行方案。

链接: https://arxiv.org/abs/2508.11446
作者: Daniel Airinei,Elena Burceanu,Marius Leordeanu
机构: National University of Science and Technology POLITEHNICA Bucharest (布加勒斯特理工大学); Bitdefender; Institute of Mathematics “Simion Stoilow” of the Romanian Academy (罗马尼亚科学院西米翁·斯托伊洛夫数学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the International Conference on Computer Vision Workshops 2025

点击查看摘要

Abstract:Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training, as automatic as possible, efficient and robust. On the practical side, we introduce a novel largescale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. Different from current methods, ours relies solely on vision, avoiding the need of special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy to use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site
zh

[CV-26] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

【速读】:该论文旨在解决统一架构的多模态大语言模型(Multimodal Large Language Models, MLLs)在个性化图像生成任务中难以高效适配新主体的问题。现有方法通常依赖于针对每个新主体的数据密集型微调,限制了模型的可扩展性。解决方案的关键在于提出MM-R1框架,其核心创新是引入跨模态思维链(Cross-modal Chain-of-Thought, X-CoT)推理策略,将个性化过程建模为一个融合视觉理解与图像生成的端到端流程:首先通过解析用户提供的图像和上下文线索来定位主体概念,随后基于提取的主体表征与用户提示生成高保真度的个性化图像。此外,采用分组奖励近端策略(Grouped Reward Proximal Policy Optimization, GRPO)显式对齐生成结果,实现在零样本(zero-shot)条件下高质量的主体保真度与文本一致性。

链接: https://arxiv.org/abs/2508.11433
作者: Qian Liang,Yujia Wu,Kuncheng Li,Jiwei Wei,Shiyuan He,Jinyu Guo,Ning Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.
zh

[CV-27] Robust Convolution Neural ODEs via Contractivity-promoting regularization

【速读】:该论文旨在解决神经网络在输入噪声和对抗攻击下易受干扰的问题,尤其是卷积神经网络(Convolutional Neural Networks, CNNs)的脆弱性。其核心解决方案是引入收缩理论(contraction theory),通过设计具有收缩特性的卷积神经常微分方程(Convolutional Neural Ordinary Differential Equations, NODEs),使得系统轨迹从不同初始条件出发时能以指数速度收敛,从而提升模型对特征扰动的鲁棒性。关键创新在于:一方面,在训练过程中加入依赖于系统动力学雅可比矩阵(Jacobian)的正则化项以诱导收缩性;另一方面,对于具有斜率受限激活函数(slope-restricted activation functions)的一类NODEs,可通过精心设计的权重正则化项来降低计算负担并实现收缩性促进。实验表明,该方法在MNIST和FashionMNIST数据集上的图像分类任务中,面对多种噪声和对抗攻击时均表现出更强的鲁棒性能。

链接: https://arxiv.org/abs/2508.11432
作者: Muhammad Zakwan,Liang Xu,Giancarlo Ferrari-Trecate
机构: Inspire AG(控制与自动化组); ETH Zürich(苏黎世联邦理工学院); Shanghai University(上海大学); Beihang University(北京航空航天大学); EPFL(洛桑联邦理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: Accepted in IEEE CDC2025, Rio de Janeiro, Brazil

点击查看摘要

Abstract:Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Convolutional Neural Ordinary Differential Equations (NODEs), a family of continuous-depth neural networks represented by dynamical systems, and propose to use contraction theory to improve their robustness. For a contractive dynamical system two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive Convolutional NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions. The performance of the proposed regularizers is illustrated through benchmark image classification tasks on MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks. Comments: Accepted in IEEE CDC2025, Rio de Janeiro, Brazil Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY) Cite as: arXiv:2508.11432 [cs.LG] (or arXiv:2508.11432v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.11432 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Muhammad Zakwan [view email] [v1] Fri, 15 Aug 2025 12:18:44 UTC (1,520 KB) Full-text links: Access Paper: View a PDF of the paper titled Robust Convolution Neural ODEs via Contractivity-promoting regularization, by Muhammad Zakwan and 2 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2025-08 Change to browse by: cs cs.CV cs.SY eess eess.SY References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[CV-28] Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting

【速读】:该论文旨在解决3D场景重建中对象移除后的语义残留(semantic residuals)问题,即在移除特定物体后,尽管其几何结构被清除,但场景中仍可能残留可被下游模型识别的语义信息,从而对隐私保护构成威胁。其解决方案的关键在于构建了一个新的基准测试框架和评估体系,用于量化分析当前3D高斯泼溅(3D Gaussian Splatting)方法在对象移除后是否真正消除了语义痕迹,并通过发布Remove360数据集——包含真实环境中移除前后RGB图像与物体级掩膜——支持对全场景语义完整性进行系统评估。该框架揭示了现有技术在复杂现实场景下的显著局限性,强调需发展更鲁棒的语义擦除方法以实现真正的隐私保护。

链接: https://arxiv.org/abs/2508.11431
作者: Simona Kocour,Assia Benbihi,Torsten Sattler
机构: CTU in Prague, Czech Republic (布拉格捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2503.17574

点击查看摘要

Abstract:Understanding what semantic information persists after object removal is critical for privacy-preserving 3D reconstruction and editable scene representations. In this work, we introduce a novel benchmark and evaluation framework to measure semantic residuals, the unintended semantic traces left behind, after object removal in 3D Gaussian Splatting. We conduct experiments across a diverse set of indoor and outdoor scenes, showing that current methods can preserve semantic information despite the absence of visual geometry. We also release Remove360, a dataset of pre/post-removal RGB images and object-level masks captured in real-world environments. While prior datasets have focused on isolated object instances, Remove360 covers a broader and more complex range of indoor and outdoor scenes, enabling evaluation of object removal in the context of full-scene representations. Given ground truth images of a scene before and after object removal, we assess whether we can truly eliminate semantic presence, and if downstream models can still infer what was removed. Our findings reveal critical limitations in current 3D object removal techniques and underscore the need for more robust solutions capable of handling real-world complexity. The evaluation framework is available at this http URL. Data are available at this http URL.
zh

[CV-29] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中复杂动态环境下的行为预测与场景生成难以协同的问题,即如何有效融合视觉-语言模型(Vision-Language Model, VLM)在动作层面的精准决策能力与驾驶世界模型(Driving World Model, DWM)在像素级未来场景生成上的高保真特性。解决方案的关键在于提出ImagiDrive框架,该框架构建了一个统一的“想象-规划”闭环:VLM驱动的驾驶代理基于多模态输入预测初始轨迹,并引导DWM驱动的场景想象器生成对应未来场景;这些生成的场景再用于迭代优化驾驶代理的规划决策。为提升效率与准确性,论文进一步引入早期停止机制和轨迹选择策略,从而实现动作级决策与像素级预测之间的高效对齐,同时保障计算资源的有效利用。

链接: https://arxiv.org/abs/2508.11428
作者: Jingyu Li,Bozhou Zhang,Xin Jin,Jiankang Deng,Xiatian Zhu,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent’s planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.
zh

[CV-30] raining-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems

【速读】:该论文旨在解决生物特征模板保护(Biometric Template Protection)中因同态加密(Homomorphic Encryption, HE)引入的高计算开销问题,同时保持多模态生物特征识别的准确性与安全性。其解决方案的关键在于通过维度缩减(Dimensionality Reduction)对来自人脸、指纹和虹膜等多模态的深度神经网络(DNN)提取特征向量进行压缩,在保证识别性能不下降的前提下显著减少加密域内的运算量,从而实现更高效的加密处理;实验表明,融合多模态特征可使模板大小降低67%而Equal Error Rate(EER)无损,且方案具备无需训练、易于在加密环境下实现及良好泛化能力等优势。

链接: https://arxiv.org/abs/2508.11419
作者: Florian Bayer,Maximilian Russo,Christian Rathgeb
机构: Hochschule Darmstadt(达姆施塔特应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Biometric recognition is widely used, making the privacy and security of extracted templates a critical concern. Biometric Template Protection schemes, especially those utilizing Homomorphic Encryption, introduce significant computational challenges due to increased workload. Recent advances in deep neural networks have enabled state-of-the-art feature extraction for face, fingerprint, and iris modalities. The ubiquity and affordability of biometric sensors further facilitate multi-modal fusion, which can enhance security by combining features from different modalities. This work investigates the biometric performance of reduced multi-biometric template sizes. Experiments are conducted on an in-house virtual multi-biometric database, derived from DNN-extracted features for face, fingerprint, and iris, using the FRGC, MCYT, and CASIA databases. The evaluated approaches are (i) explainable and straightforward to implement under encryption, (ii) training-free, and (iii) capable of generalization. Dimensionality reduction of feature vectors leads to fewer operations in the Homomorphic Encryption (HE) domain, enabling more efficient encrypted processing while maintaining biometric accuracy and security at a level equivalent to or exceeding single-biometric recognition. Our results demonstrate that, by fusing feature vectors from multiple modalities, template size can be reduced by 67 % with no loss in Equal Error Rate (EER) compared to the best-performing single modality.
zh

[CV-31] SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models ICCV

【速读】:该论文旨在解决预训练细胞分割模型在域偏移(domain shift)场景下性能下降的问题,即当测试数据与训练数据分布不一致时,模型泛化能力受限。其核心挑战在于缺乏标注数据时无法进行有效的监督微调。解决方案的关键在于提出SelfAdapt方法,该方法基于学生-教师增强一致性训练框架,引入L2-SP正则化以稳定训练过程,并设计标签无关的停止准则实现无监督自适应。实验表明,SelfAdapt可在无需标签的情况下显著提升Cellpose等通用模型在LiveCell和TissueNet数据集上的性能,相对AP0.5指标最高提升达29.64%,且能进一步优化已有监督微调的结果。

链接: https://arxiv.org/abs/2508.11411
作者: Fabian H. Reith,Jannik Franzen,Dinesh R. Palli,J. Lorenz Rumberger,Dagmar Kainmueller
机构: Charité - Universitätsmedizin, Berlin, Germany(柏林夏里特-大学医学中心,德国); Humboldt-Universität zu Berlin, Berlin, Germany(柏林洪堡大学,德国); Universität Potsdam, Digital Engineering Faculty, Potsdam, Germany(波茨坦大学数字工程学院,德国); Helmholtz Imaging(亥姆霍兹成像); Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany(亥姆霍兹联合会马克斯·德尔布吕克分子医学中心,德国); Ludwig-Maximilians-University, Munich, Germany(慕尼黑路德维希-马克西米利安大学,德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 3 figures. To appear in the proceedings of the BioImage Computing (BIC) Workshop @ ICCVW 2025. This is the accepted author manuscript (camera-ready version)

点击查看摘要

Abstract:Deep neural networks have become the go-to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state-of-the-art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine-tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre-trained cell segmentation models without the need for labels. Our approach builds upon student-teacher augmentation consistency training, introducing L2-SP regularization and label-free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine-tuned with supervision. We release SelfAdapt as an easy-to-use extension of the Cellpose framework. The code for our method is publicly available at https: //github.com/Kainmueller-Lab/self_adapt.
zh

[CV-32] RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator

【速读】:该论文旨在解决大气湍流(Atmospheric Turbulence, AT)导致视频质量严重退化的问题,包括几何畸变、模糊和时间闪烁等现象,这些问题严重影响视觉清晰度与时间一致性。现有基于Transformer和3D架构的先进方法虽能有效恢复视频质量,但因需多帧输入而计算复杂度高、内存消耗大,难以在资源受限场景下实现实时部署。解决方案的关键在于提出一种轻量级递归多尺度特征框架——RMFAT(Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator),其通过仅使用两帧输入进行逐帧恢复,显著缩小时间窗口并降低计算负担;同时在编码器和解码器阶段引入多尺度特征编码/解码与时间形变模块,增强空间细节保留与时间一致性,从而在保持高恢复质量的同时实现超过四倍的推理速度提升。

链接: https://arxiv.org/abs/2508.11409
作者: Zhiming Liu,Nantheera Anantrasirichai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator, designed for efficient and temporally consistent video restoration under AT conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. Extensive experiments on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9% improvement in SSIM) but also achieves significantly improved inference speed (more than a fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.
zh

[CV-33] G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

【速读】:该论文旨在解决现有前馈式三维场景重建方法仅依赖输入图像、未能有效利用现实场景中常见的辅助信息(如深度图、相机标定或相机位姿)的问题。解决方案的关键在于对CUT3R模型进行轻量级改进,引入针对不同模态(如RGB图像、深度、相机参数等)的专用编码器,并通过零卷积(zero convolution)将提取的多模态特征与RGB图像token进行融合,从而实现任意组合的先验信息在推理阶段的无缝集成,显著提升重建性能并保持对多种输入模态的兼容性。

链接: https://arxiv.org/abs/2508.11379
作者: Ramil Khafizov,Artem Komarichev,Ruslan Rakhimov,Peter Wonka,Evgeny Burnaev
机构: 1. Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究所); 2. National University of Science and Technology MISiS (俄罗斯国立科技大学MISiS); 3. King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); 4. Russian Academy of Sciences (俄罗斯科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.
zh

[CV-34] Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition

【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)在人脸识别模型部署于计算资源受限设备时,传统方法如原始L2特征蒸馏或特征一致性损失难以同时捕捉细粒度实例级细节与复杂关系结构的问题。其解决方案的关键在于提出一个统一框架,集成两种新型损失函数:实例级嵌入蒸馏(Instance-Level Embedding Distillation)和基于关系的成对相似性蒸馏(Relation-Based Pairwise Similarity Distillation)。前者通过动态硬挖掘策略增强对困难样本的学习能力,后者利用记忆库机制和采样策略捕获样本间的成对相似性关系,从而实现实例级对齐与几何关系保留的协同优化,显著提升蒸馏效果,并在多个基准数据集上超越现有最优方法,甚至使学生模型超越教师模型性能。

链接: https://arxiv.org/abs/2508.11376
作者: Durgesh Mishra,Rishabh Uikey
机构: Indian Institute of Science Education and Research, Bhopal, India(印度科学教育与研究学院,博帕尔,印度)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The paper spans a total of 14 pages, 10 pages for the main content (including references) and 4 pages for the appendix. The main paper contains 3 figures and 1 table, while the appendix includes 1 pseudo-code algorithm and 4 tables. The work was recently accepted for publication at IJCB 2025

点击查看摘要

Abstract:Knowledge Distillation is crucial for optimizing face recognition models for deployment in computationally limited settings, such as edge devices. Traditional KD methods, such as Raw L2 Feature Distillation or Feature Consistency loss, often fail to capture both fine-grained instance-level details and complex relational structures, leading to suboptimal performance. We propose a unified approach that integrates two novel loss functions, Instance-Level Embedding Distillation and Relation-Based Pairwise Similarity Distillation. Instance-Level Embedding Distillation focuses on aligning individual feature embeddings by leveraging a dynamic hard mining strategy, thereby enhancing learning from challenging examples. Relation-Based Pairwise Similarity Distillation captures relational information through pairwise similarity relationships, employing a memory bank mechanism and a sample mining strategy. This unified framework ensures both effective instance-level alignment and preservation of geometric relationships between samples, leading to a more comprehensive distillation process. Our unified framework outperforms state-of-the-art distillation methods across multiple benchmark face recognition datasets, as demonstrated by extensive experimental evaluations. Interestingly, when using strong teacher networks compared to the student, our unified KD enables the student to even surpass the teacher’s accuracy.
zh

[CV-35] Does the Skeleton-Recall Loss Really Work?

【速读】:该论文试图解决的问题是:在复杂管状结构(thin tubular structures)的图像分割任务中,基于拓扑保持(topology preservation)的损失函数(如Skeleton Recall Loss, SRL)是否真的能够显著提升分割性能。其解决方案的关键在于通过理论分析和实证比较,揭示SRL等拓扑相关损失函数在梯度特性上的局限性,并验证其在多个基准数据集上并未超越传统基线模型(traditional baseline models),从而为后续研究提供关于如何设计更有效的管状结构分割模型的洞见。

链接: https://arxiv.org/abs/2508.11374
作者: Devansh Arora,Nitin Kumar,Sukrit Gupta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. A set of models that specialize in segmenting thin tubular structures are topology preservation-based loss functions. These models often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such model, Skeleton Recall Loss (SRL) proposed by Kirchhoff et al.~\cite kirchhoff2024srl, was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients for the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets (used in the original work, along with some additional datasets), we found that the performance of SRL-based segmentation models did not exceed traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.
zh

[CV-36] Leverag ing the RETFound foundation model for optic disc segmentation in retinal images

【速读】:该论文旨在解决光学视盘(optic disc)分割任务中对大量标注数据依赖的问题,尤其是在医学图像分析领域,此类任务通常需要复杂且耗时的标注流程。研究者首次将已有的视网膜基础模型RETFound迁移至该任务,其关键在于利用RETFound强大的预训练表征能力,在仅使用少量特定任务样本的情况下,通过微调一个轻量级分类头(head),即可实现优于现有专用分割网络的性能。实验表明,该方法在多个公共和私有数据集上均保持约96%的Dice系数,展现出优异的内部验证、域泛化与域适应能力,为基础模型在医学影像分割中的应用提供了新的范式。

链接: https://arxiv.org/abs/2508.11354
作者: Zhenyi Zhao,Muthu Rama Krishnan Mookiah,Emanuele Trucco
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:RETFound is a well-known foundation model (FM) developed for fundus camera and optical coherence tomography images. It has shown promising performance across multiple datasets in diagnosing diseases, both eye-specific and systemic, from retinal images. However, to our best knowledge, it has not been used for other tasks. We present the first adaptation of RETFound for optic disc segmentation, a ubiquitous and foundational task in retinal image analysis. The resulting segmentation system outperforms state-of-the-art, segmentation-specific baseline networks after training a head with only a very modest number of task-specific examples. We report and discuss results with four public datasets, IDRID, Drishti-GS, RIM-ONE-r3, and REFUGE, and a private dataset, GoDARTS, achieving about 96% Dice consistently across all datasets. Overall, our method obtains excellent performance in internal verification, domain generalization and domain adaptation, and exceeds most of the state-of-the-art baseline results. We discuss the results in the framework of the debate about FMs as alternatives to task-specific architectures. The code is available at: [link to be added after the paper is accepted]
zh

[CV-37] HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model

【速读】:该论文旨在解决当前开放词汇人类-物体交互(Human-Object Interaction, HOI)检测方法过度依赖大语言模型(Large Language Models, LLMs)生成文本提示,而忽视其固有的三维空间理解能力的问题。解决方案的关键在于提出首个融合链式思维(Chain-of-Thought, CoT)引导的监督微调(Supervised Fine-Tuning, SFT)与组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习(Reinforcement Learning, RL)框架——HOID-R1。该框架首先通过SFT赋予模型推理能力并强制输出思考过程,再利用GRPO整合多奖励信号进行策略优化以增强跨模态对齐,并引入“多模态大模型作为裁判”(MLLM-as-a-judge)机制抑制CoT推理中的幻觉,从而显著提升模型在开放世界场景下的泛化性能。

链接: https://arxiv.org/abs/2508.11350
作者: Zhenhao Zhang,Hanqing Wang,Xiangyu Zeng,Ziyu Cheng,Jiaxin Liu,Haoyu Yan,Zhirui Liu,Kaiyang Ji,Tianxiang Gui,Ke Hu,Kangyi Chen,Yahao Fan,Mokai Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding and recognizing human-object interaction (HOI) is a pivotal application in AR/VR and robotics. Recent open-vocabulary HOI detection approaches depend exclusively on large language models for richer textual prompts, neglecting their inherent 3D spatial understanding capabilities. To address this shortcoming, we introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided supervised fine-tuning (SFT) with group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we initially apply SFT to imbue the model with essential reasoning capabilities, forcing the model to articulate its thought process in the output. Subsequently, we integrate GRPO to leverage multi-reward signals for policy optimization, thereby enhancing alignment across diverse modalities. To mitigate hallucinations in the CoT reasoning, we introduce an “MLLM-as-a-judge” mechanism that supervises the CoT outputs, further improving generalization. Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
zh

[CV-38] Semantically Guided Adversarial Testing of Vision Models Using Language Models

【速读】:该论文旨在解决目标攻击(targeted adversarial attacks)中目标标签(target label)选择的不确定性问题,即现有方法依赖随机性、模型预测或静态语义资源,导致攻击结果缺乏可解释性、可复现性和灵活性。其解决方案的关键在于提出一种基于跨模态知识迁移(cross-modal knowledge transfer)的语义引导框架,利用预训练语言模型(如BERT、TinyLLAMA)和视觉-语言模型(如CLIP)提取类别间的语义相似度,从而系统性地筛选与真实标签最相关或最不相关的类别作为攻击目标,构建最优和最差场景下的对抗样本。实验表明,该方法显著优于传统词典数据库(如WordNet),尤其在远距离类别关系建模上表现突出,并能为相似度源提供事前测试(a priori testing),推动构建可解释、标准化且可扩展的对抗基准。

链接: https://arxiv.org/abs/2508.11341
作者: Katarzyna Filus,Jorge M. Cruz-Duarte
机构: Institute of Theoretical and Applied Informatics, Polish Academy of Sciences (波兰科学院理论与应用信息研究所); University of Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL (法国里尔大学, 国家科学研究中心, Inria, 里尔中央理工学院, CRIStAL 9189联合研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 12 pages, 4 figures, 3 tables. Submitted for peer review

点击查看摘要

Abstract:In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Now, existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper then proposes a semantics-guided framework for adversarial target selection using the cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. Our experiments on three vision models and five attack methods reveal that these models consistently render practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary assessment of the effectiveness of similarity sources, \textita priori testing. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.
zh

[CV-39] Cost-Effective Active Labeling for Data-Efficient Cervical Cell Classification

【速读】:该论文旨在解决宫颈细胞分类中训练数据集构建成本高昂的问题,尤其是传统自动分类方法依赖于代表性训练数据集,而获取此类数据集需耗费大量人力成本。解决方案的关键在于提出一种主动标注(active labeling)策略,通过高效估计未标注宫颈细胞图像的分类器不确定性,精准筛选出对提升模型代表性最具价值的样本进行标注,从而以显著降低的人力投入构建高质量训练集,实现数据高效的宫颈细胞分类。

链接: https://arxiv.org/abs/2508.11340
作者: Yuanlin Liu,Zhihan Zhou,Mingqiang Wei,Youyi Song
机构: China Pharmaceutical University (中国药科大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
备注: accepted by CW2025

点击查看摘要

Abstract:Information on the number and category of cervical cells is crucial for the diagnosis of cervical cancer. However, existing classification methods capable of automatically measuring this information require the training dataset to be representative, which consumes an expensive or even unaffordable human cost. We herein propose active labeling that enables us to construct a representative training dataset using a much smaller human cost for data-efficient cervical cell classification. This cost-effective method efficiently leverages the classifier’s uncertainty on the unlabeled cervical cell images to accurately select images that are most beneficial to label. With a fast estimation of the uncertainty, this new algorithm exhibits its validity and effectiveness in enhancing the representative ability of the constructed training dataset. The extensive empirical results confirm its efficacy again in navigating the usage of human cost, opening the avenue for data-efficient cervical cell classification.
zh

[CV-40] Index-Aligned Query Distillation for Transformer-based Incremental Object Detection

【速读】:该论文旨在解决基于Transformer的增量目标检测(Incremental Object Detection, IOD)中因灾难性知识遗忘(catastrophic knowledge forgetting)导致模型在旧类别上性能显著下降的问题。现有方法主要依赖知识蒸馏(Knowledge Distillation, KD),通过匈牙利匹配(Hungarian Matching)建立前后阶段检测模型查询(query)之间的对应关系,并对匹配查询的分类器和回归器输出进行对齐以缓解遗忘。然而,作者发现匈牙利匹配在IOD任务中并不理想:由于每次迭代中当前阶段查询可能与前一阶段不同查询匹配,导致原有语义和空间编码被重新塑形,从而引发旧类别知识丢失。为此,论文提出一种新的蒸馏方法——索引对齐查询蒸馏(Index-Aligned Query Distillation, IAQD),其关键在于:不使用匈牙利匹配,而是基于相同索引建立查询间的固定对应关系,并仅对关键于旧类别检测的查询子集执行蒸馏,从而有效保留历史语义与空间编码能力,同时避免干扰新类别的学习。实验表明,IAQD显著减轻了知识遗忘,达到了当前最优性能。

链接: https://arxiv.org/abs/2508.11339
作者: Mingxiao Ma,Shunyao Zhu,Guoliang Kang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Incremental object detection (IOD) aims to continuously expand the capability of a model to detect novel categories while preserving its performance on previously learned ones. When adopting a transformer-based detection model to perform IOD, catastrophic knowledge forgetting may inevitably occur, meaning the detection performance on previously learned categories may severely degenerate. Previous typical methods mainly rely on knowledge distillation (KD) to mitigate the catastrophic knowledge forgetting of transformer-based detection models. Specifically, they utilize Hungarian Matching to build a correspondence between the queries of the last-phase and current-phase detection models and align the classifier and regressor outputs between matched queries to avoid knowledge forgetting. However, we observe that in IOD task, Hungarian Matching is not a good choice. With Hungarian Matching, the query of the current-phase model may match different queries of the last-phase model at different iterations during KD. As a result, the knowledge encoded in each query may be reshaped towards new categories, leading to the forgetting of previously encoded knowledge of old categories. Based on our observations, we propose a new distillation approach named Index-Aligned Query Distillation (IAQD) for transformer-based IOD. Beyond using Hungarian Matching, IAQD establishes a correspondence between queries of the previous and current phase models that have the same index. Moreover, we perform index-aligned distillation only on partial queries which are critical for the detection of previous categories. In this way, IAQD largely preserves the previous semantic and spatial encoding capabilities without interfering with the learning of new categories. Extensive experiments on representative benchmarks demonstrate that IAQD effectively mitigates knowledge forgetting, achieving new state-of-the-art performance.
zh

[CV-41] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition ICCV

【速读】:该论文旨在解决人脸识别模型中因人口统计学(demographic)和环境因素(environmental)差异导致的公平性问题,即如何精确测量、解释并减少偏见,同时确保结果的可复现性。其解决方案的关键在于提出GANDiff FR框架,该框架首次将StyleGAN3用于保持身份不变的生成与基于扩散模型的属性控制相结合,实现对姿态(±30°)、光照(四个方向)和表情(五级强度)等关键变量的细粒度调控,在其他条件一致(ceteris paribus)的前提下合成10,000张人口统计学平衡的真实人脸图像。通过这一机制,研究者能够隔离并量化偏见驱动因素,例如发现光照贡献了42%的残余偏见,并验证AdaFace在降低组间真阳性率(TPR)差异方面效果显著(提升60%,从6.3%降至2.5%),且合成数据到真实数据的迁移性能优异(相关系数r=0.85)。该方法为公平性审计提供了可复现、符合欧盟人工智能法案(EU AI Act)要求的标准。

链接: https://arxiv.org/abs/2508.11334
作者: Md Asgor Hossain Reaj,Rajan Das Gupta,Md Yeasin Rahat,Nafiz Fahad,Md Jawadul Hasan,Tze Hui Liew
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICCVDM '25

点击查看摘要

Abstract:We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose around 30 degrees, illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r 0.85). Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.
zh

[CV-42] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

【速读】:该论文旨在解决当前基于扩散模型的分类器(Diffusion Classifier, DC)中存在的噪声不稳定性问题,即随机采样的噪声会导致分类性能显著波动,从而迫使现有方法通过集成数百次噪声样本以获得稳定结果,严重降低了分类速度。解决方案的关键在于提出一种名为NoOp的新颖噪声优化方法,其核心思想是识别并学习“优质噪声”(good noises),这些噪声需满足频率匹配(Frequency Matching)和空间匹配(Spatial Matching)两个原则:前者通过优化一个与数据集相关的参数化噪声来实现,后者则通过训练一个元网络(Meta-Network)根据输入图像生成特定噪声偏移量,最终将优化噪声与偏移量之和用于DC中替代随机噪声,从而在保持高分类精度的同时大幅提升稳定性与效率。

链接: https://arxiv.org/abs/2508.11330
作者: Yanghao Wang,Long Chen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although today’s pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different random sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we firstly explore the role of noise in DC, and conclude that: there are some ``good noises’’ that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For frequency matching, NoOp first optimizes a dataset-specific noise: Given a dataset and a timestep t, optimize one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs image-specific noise offset. The sum of optimized noise and noise offset will be used in DC to replace random noise. Extensive ablations on various datasets demonstrated the effectiveness of NoOp.
zh

[CV-43] Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking

【速读】:该论文旨在解决3D多目标跟踪(3D Multi-Object Tracking, MOT)在复杂交通场景中因忽略物体间几何关系而导致的关联错误问题,尤其是在拥挤环境或检测不准确时,传统基于个体运动建模的方法(如卡尔曼滤波)难以维持稳定跟踪。其解决方案的关键在于提出“线索一致性”(cue-consistency)原则:通过识别和匹配随时间保持稳定的时空模式来提升轨迹关联的鲁棒性。具体实现上,设计了统一的时空编码器(利用Point Pair Features,PPF)以学习判别性轨迹嵌入并抑制干扰;引入线索一致性Transformer模块显式对齐历史轨迹与当前检测间的稳定特征表示;并通过动态更新机制保留关键时空信息,从而实现高效、稳定的在线跟踪。

链接: https://arxiv.org/abs/2508.11323
作者: Haonan Zhang,Xinyao Wang,Boxi Wu,Tu Zheng,Wang Yunhua,Zheng Yang
机构: Zhejiang University (浙江大学); Fabu; ShanDong Land-Sea-Nexus Digital Technology (山东陆海 nexus 数字科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
zh

[CV-44] Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在逻辑理解能力上的显著不足,即存在“逻辑盲区”,导致其在实际应用中可靠性受限的问题。现有VLMs(如CLIP)在复杂逻辑任务(如因果关系和条件关系)上表现较差,主要依赖表面语义而非深层逻辑结构。为系统诊断并提升此类能力,作者提出LogicBench基准与LogicCLIP训练框架:关键在于通过逻辑感知的数据生成策略和多粒度对比学习目标(包括粗粒度对齐、细粒度多选目标及新颖的逻辑结构感知目标),有效增强VLMs对逻辑结构的敏感性,同时保持甚至超越其在通用视觉-语言任务上的性能,从而实现逻辑理解能力的显著提升而不损害整体对齐效果。

链接: https://arxiv.org/abs/2508.11317
作者: Yuchen Zhou,Jiayu Tang,Shuo Yang,Xiaoyan Xiao,Yuqin Dai,Wenhao Yang,Chao Gou,Xiaobo Xia,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical ‘‘logical blindspots’’ that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall at over 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs’ logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP’s substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.
zh

[CV-45] Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval IJCAI2025

【速读】:该论文旨在解决当前文本驱动的视频片段检索(Video Moment Retrieval, VMR)方法中存在的多模态对齐失真问题,即现有方法通常对视频中所有片段进行编码,包括与文本无关的冗余片段,从而干扰了跨模态对齐并阻碍模型优化。其解决方案的关键在于提出“去噪-检索”(denoise-then-retrieve)范式:首先通过Text-Conditioned Denoising (TCD)模块利用交叉注意力和结构化状态空间块动态识别并过滤掉文本无关的视频片段,生成噪声掩码以净化多模态表示;随后引入Text-Reconstruction Feedback (TRF)模块,从净化后的视频表示中提取单一查询嵌入并与文本嵌入对齐,作为训练阶段的辅助监督信号强化去噪过程;最终基于净化后的多模态表示进行条件检索,实现更精准的VMR。该范式显著提升了性能,并具备良好的可扩展性,可无缝集成至先进VMR模型中以进一步增强效果。

链接: https://arxiv.org/abs/2508.11313
作者: Weijia Liu,Jiuxin Cao,Bo Miao,Zhiheng Fu,Xuelin Zhu,Jiawei Ge,Bo Liu,Mehwish Nasim,Ajmal Mian
机构: Southeast University (东南大学); The University of Adelaide (阿德莱德大学); The Hong Kong Polytechnic University (香港理工大学); The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
zh

[CV-46] Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study

【速读】:该论文旨在解决自动驾驶感知系统中因RGB成像的异色同感(metamerism)导致行人分割不准确的问题,即在视觉上行人与背景难以区分,从而引发安全隐患。其解决方案的关键在于利用高光谱成像(Hyperspectral Imaging, HSI)技术,并通过最优波段选择方法——基于对比信噪比与联合互信息最大化(CSNR-JMIM)对128通道HSI数据进行降维,生成三通道表示,从而提升语义分割模型对行人的判别能力。实验表明,该方法相较传统RGB图像在行人分割上的IoU和F1-score分别平均提升1.44%和2.18%,显著减少了误检,验证了优化后HSI波段在安全关键型汽车应用中的有效性。

链接: https://arxiv.org/abs/2508.11301
作者: Jiarong Li,Imad Ali Shah,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
机构: 1. National University of Ireland Galway (爱尔兰国立大学戈尔韦分校); 2. Insight Centre for Data Analytics (数据解析中心); 3. University College Dublin (都柏林大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE ICVES, July, 2025

点击查看摘要

Abstract:Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable… This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H-City) dataset. We compared standard RGB against two dimensionality-reduction approaches by converting 128-channel HSI data into three-channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal-to-Noise Ratio with Joint Mutual Information Maximization (CSNR-JMIM). Three semantic segmentation models were evaluated: U-Net, DeepLabV3+, and SegFormer. CSNR-JMIM consistently outperformed RGB with an average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1-score for pedestrian segmentation. Rider segmentation showed similar gains with 1.43% IoU and 2.25% F1-score improvements. These improved performance results from enhanced spectral discrimination of optimally selected HSI bands effectively reducing false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety-critical automotive applications.
zh

[CV-47] Allen: Rethinking MAS Design through Step-Level Policy Autonomy

【速读】:该论文旨在解决当前多智能体系统(Multi-Agent System, MAS)设计中的两个核心问题:一是提升系统的策略自主性(policy autonomy),使智能体能够动态调整行为策略;二是实现在复杂网络拓扑结构下协作效率、任务监督与人类监管之间的平衡。解决方案的关键在于重新定义多智能体系统的基本执行单元,允许智能体通过组合这些单元自主形成不同的协作模式,并构建了一个四层状态架构(任务层、阶段层、智能体层、步骤层),从任务导向和执行导向两个维度约束系统行为,从而实现拓扑优化与可控进展的统一。这一设计显著增强了策略自主性,同时在协作结构的可控性上做出合理权衡。

链接: https://arxiv.org/abs/2508.11294
作者: Qiangong Zhou,Zhiting Wang,Mingyou Yao,Zongyang Liu
机构: Shenzhen Motern Technology Co., Ltd(深圳莫腾科技有限公司)
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a new Multi-Agent System (MAS) - Allen, designed to address two core challenges in current MAS design: (1) improve system’s policy autonomy, empowering agents to dynamically adapt their behavioral strategies, and (2) achieving the trade-off between collaborative efficiency, task supervision, and human oversight in complex network topologies. Our core insight is to redefine the basic execution unit in the MAS, allowing agents to autonomously form different patterns by combining these units. We have constructed a four-tier state architecture (Task, Stage, Agent, Step) to constrain system behavior from both task-oriented and execution-oriented perspectives. This achieves a unification of topological optimization and controllable progress. Allen grants unprecedented Policy Autonomy, while making a trade-off for the controllability of the collaborative structure. The project code has been open source at: this https URL Subjects: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.11294 [cs.MA] (or arXiv:2508.11294v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2508.11294 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-48] Scene Graph-Guided Proactive Replanning for Failure-Resilient Embodied Agent

【速读】:该论文旨在解决自主机器人在执行任务时缺乏适应环境变化能力的问题,即当场景发生细微但关键的变化(如抽屉关闭)时,机器人仍按预设计划执行动作,导致因假设过时而失败。现有方法多依赖事后 replanning(重规划),效率低且可能无法恢复;而主动 replanning(主动重规划)虽具潜力,却常需人工规则或大量监督数据。其解决方案的关键在于提出一种基于场景图对比的主动重规划框架:通过比较当前RGB-D观测构建的场景图与成功示范中提取的参考场景图,在子任务边界处识别语义和空间不匹配,并触发轻量级推理模块诊断差异、调整原计划,从而在执行失败前实现预防性修正,显著提升任务成功率和鲁棒性。

链接: https://arxiv.org/abs/2508.11286
作者: Che Rin Yu,Daewon Chae,Dabin Seo,Sangwon Lee,Hyeongwoo Im,Jinkyu Kim
机构: Korea University (韩国大学); KT (韩国电信) R&D Center
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:When humans perform everyday tasks, we naturally adjust our actions based on the current state of the environment. For instance, if we intend to put something into a drawer but notice it is closed, we open it first. However, many autonomous robots lack this adaptive awareness. They often follow pre-planned actions that may overlook subtle yet critical changes in the scene, which can result in actions being executed under outdated assumptions and eventual failure. While replanning is critical for robust autonomy, most existing methods respond only after failures occur, when recovery may be inefficient or infeasible. While proactive replanning holds promise for preventing failures in advance, current solutions often rely on manually designed rules and extensive supervision. In this work, we present a proactive replanning framework that detects and corrects failures at subtask boundaries by comparing scene graphs constructed from current RGB-D observations against reference graphs extracted from successful demonstrations. When the current scene fails to align with reference trajectories, a lightweight reasoning module is activated to diagnose the mismatch and adjust the plan. Experiments in the AI2-THOR simulator demonstrate that our approach detects semantic and spatial mismatches before execution failures occur, significantly improving task success and robustness.
zh

[CV-49] meMachine: Fine-Grained Facial Age Editing with Identity Preservation

【速读】:该论文旨在解决生成式AI(Generative AI)在人脸图像编辑中实现细粒度年龄编辑的同时保持个体身份一致性的难题。其解决方案的关键在于提出了一种基于扩散模型(diffusion-based)的新框架TimeMachine,通过在多交叉注意力模块(multi-cross attention module)中注入高精度年龄信息,显式分离与年龄相关和身份相关的特征,从而实现更精确的属性解耦;此外,引入了潜空间年龄分类引导模块(Age Classifier Guidance, ACG),直接在潜在空间预测年龄而非训练过程中进行去噪图像重建,以轻量级方式提升年龄编辑准确性,同时控制训练成本增加。

链接: https://arxiv.org/abs/2508.11284
作者: Yilin Mi,Qixin Yan,Zheng-Peng Duan,Chunle Guo,Hubery Yin,Hao Liu,Chen Li,Chongyi Li
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the advancement of generative models, facial image editing has made significant progress. However, achieving fine-grained age editing while preserving personal identity remains a challenging this http URL this paper, we propose TimeMachine, a novel diffusion-based framework that achieves accurate age editing while keeping identity features unchanged. To enable fine-grained age editing, we inject high-precision age information into the multi-cross attention module, which explicitly separates age-related and identity-related features. This design facilitates more accurate disentanglement of age attributes, thereby allowing precise and controllable manipulation of facial this http URL, we propose an Age Classifier Guidance (ACG) module that predicts age directly in the latent space, instead of performing denoising image reconstruction during training. By employing a lightweight module to incorporate age constraints, this design enhances age editing accuracy by modest increasing training cost. Additionally, to address the lack of large-scale, high-quality facial age datasets, we construct a HFFA dataset (High-quality Fine-grained Facial-Age dataset) which contains one million high-resolution images labeled with identity and facial attributes. Experimental results demonstrate that TimeMachine achieves state-of-the-art performance in fine-grained age editing while preserving identity consistency.
zh

[CV-50] Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction

【速读】:该论文旨在解决单目内窥镜位姿估计与组织表面三维重建中的关键挑战,包括深度模糊、生理组织形变、内窥镜运动不一致、纹理保真度低以及视场受限等问题。其解决方案的核心在于提出一个统一框架,通过融合尺度感知的深度预测与时间约束的感知优化机制实现高精度重建:首先引入MAPIS-Depth模块,结合Depth Pro进行鲁棒初始化和Depth Anything进行高效逐帧深度预测,并利用L-BFGS-B优化生成伪度量深度;随后通过RAFT计算像素对应关系并基于LPIPS感知相似性自适应融合光流扭曲帧,有效抑制因组织形变和运动带来的伪影;最后采用WEMA-RTDL模块优化旋转与平移参数以精确配准合成的伪RGBD帧,并通过截断符号距离函数(Truncated Signed Distance Function, TSDF)体素融合与Marching Cubes算法提取完整三维表面网格。

链接: https://arxiv.org/abs/2508.11282
作者: Muzammil Khan,Enzo Kerkhof,Matteo Fusaglia,Koert Kuhlmann,Theo Ruers,Françoise J. Siepel
机构: University of Twente (特温特大学); Netherlands Cancer Institute-Antoni van Leeuwenhoek (荷兰癌症研究所-安东尼·范·列文虎克)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8 figures, 3 Tables, submitted to IEEE Access for review

点击查看摘要

Abstract:Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhances monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally-constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. These estimates are temporally refined by computing pixel correspondences using RAFT and adaptively blending flow-warped frames based on LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the synthesised pseudo-RGBD frames from MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework’s robustness and superiority over state-of-the-art methods.
zh

[CV-51] Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在对抗扰动下的鲁棒性不足问题,尤其是其时间维度上的脆弱性和对抗攻击在不同时间步之间的迁移特性。解决方案的关键在于提出一种名为鲁棒时间自集成(Robust Temporal self-Ensemble, RTE)的训练框架,该框架通过统一优化目标同时提升每个时间子网络的鲁棒性并抑制对抗扰动在时间维度上的传播,结合随机采样策略实现高效训练。实验表明,RTE 在多个基准测试中显著改善了 SNNs 的鲁棒性与准确率权衡,并重塑了网络内部的鲁棒性分布,形成更稳定且时序多样化的决策边界。

链接: https://arxiv.org/abs/2508.11279
作者: Jihang Wang,Dongcheng Zhao,Ruolin Chen,Qian Zhang,Yi Zeng
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges-the fragility of individual temporal sub-networks and the tendency for adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in robust-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.
zh

[CV-52] Probing the Representational Power of Sparse Autoencoders in Vision Models ICCV2025

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在视觉模型中应用研究不足的问题,尤其是其在图像理解、泛化能力和可控生成方面的潜力尚未被充分探索。解决方案的关键在于系统性地评估SAE在三类主流视觉模型(视觉嵌入模型、多模态大语言模型和扩散模型)中的表现,发现其学习到的特征具有语义可解释性,并能提升分布外(Out-of-Distribution, OOD)泛化能力,同时支持通过文本编码器操控实现语义可控生成。研究表明,SAE不仅能够揭示视觉模型内部的本体结构,还能挖掘人类可理解的属性特征,从而为视觉领域提供一个强大的可解释性分析工具。

链接: https://arxiv.org/abs/2508.11277
作者: Matthew Lyle Olson,Musashi Hinck,Neale Ratzlaff,Changbai Li,Phillip Howard,Vasudev Lal,Shao-Yen Tseng
机构: Intel Labs(英特尔实验室); Oregon State University(俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICCV 2025 Findings

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LMMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential improving interpretability, generalization, and steerability in the visual domain.
zh

[CV-53] Enhancing Supervised Composed Image Retrieval via Reasoning -Augmented Representation Engineering

【速读】:该论文旨在解决**组合图像检索(Composed Image Retrieval, CIR)**任务中现有方法在监督训练场景下性能不足的问题,尤其是传统两阶段方法依赖额外排序模型训练、Chain-of-Thought(CoT)技术难以有效融入视觉理解以及在有监督CIR中难以取得满意结果的局限性。其解决方案的关键在于提出一种无需训练的精炼框架——金字塔匹配模型与训练自由精炼(Pyramid Matching Model with Training-Free Refinement, PMTFR),通过引入一个简单但有效的模块“金字塔分块器(Pyramid Patcher)”增强模型对多粒度视觉信息的理解能力,并受表征工程启发,从CoT数据中提取表征并注入到大规模视觉语言模型(LVLMs)中,从而在不依赖显式文本推理的前提下实现细粒度的检索分数优化,显著提升了监督CIR任务的性能。

链接: https://arxiv.org/abs/2508.11272
作者: Jun Li,Kai Li,Shaoguo Liu,Tingting Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited – compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we proposed a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhanced the Pyramid Matching Model’s understanding of visual information at different granularities. Inspired by representation engineering, we extracted representations from COT data and injected them into the LVLMs. This approach allowed us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
zh

[CV-54] Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds ICCV2025

【速读】:该论文旨在解决3D语义分割中因域偏移(domain shift)导致模型在未见环境部署时性能下降的问题,尤其关注现有方法在点云数据增强过程中仅学习全局几何模式而忽略类别级分布与对齐的局限性。解决方案的关键在于提出一种类别级几何学习框架,其核心包括两个组成部分:一是类别级几何嵌入(Category-level Geometry Embedding, CGE),用于感知点云特征的细粒度几何属性,构建每类别的几何特性并将几何嵌入耦合至语义学习;二是几何一致性学习(Geometric Consistent Learning, GCL),通过模拟潜在的3D分布并对齐类别级几何嵌入,使模型聚焦于几何不变信息以提升泛化能力。

链接: https://arxiv.org/abs/2508.11265
作者: Pei He,Lingling Li,Licheng Jiao,Ronghua Shang,Fang Liu,Shuang Wang,Xu Liu,Wenping Ma
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: to be published in International Conference on Computer Vision, ICCV 2025

点击查看摘要

Abstract:Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods.
zh

[CV-55] Vision-Language Models display a strong gender bias

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Model, VLM)在对齐图像与文本表示空间时可能隐式编码并放大社会刻板印象的问题,尤其是性别相关的偏见。其解决方案的关键在于构建一个系统性的评估框架:首先采集220张按感知性别划分的人脸图像和150条涵盖六类劳动(如情感劳动、认知劳动、家务劳动等)的短语陈述;接着计算图像与文本嵌入的单位范数表示,并通过对比男性组与女性组的平均余弦相似度差值定义“陈述级关联得分”;最后结合自助抽样法(bootstrap)估计置信区间、类别聚合及标签交换零模型(label-swap null model)以量化无性别结构下预期的平均绝对关联水平,从而实现对VLM中性别关联的细粒度、可解释且鲁棒的偏差检测。

链接: https://arxiv.org/abs/2508.11262
作者: Aiswarya Konavoor,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
机构: Togo AI Labs(托戈人工智能实验室); Vizuara AI Labs(维祖阿拉人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
zh

[CV-56] Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在密集视觉感知任务中因局部特征表示能力不足而导致性能受限的问题,尤其是在开放词汇场景下,传统方法难以有效处理未定义类别的细粒度空间信息。其解决方案的关键在于提出DeCLIP框架,通过解耦CLIP的自注意力模块,分别提取“内容”(content)和“上下文”(context)特征:其中上下文特征通过联合蒸馏视觉基础模型(Vision Foundation Models, VFMs)的语义相关性和扩散模型提供的对象完整性线索来增强空间一致性;内容特征则通过对图像裁片表示进行对齐并受VFMs区域相关性约束,以提升局部判别能力。这一设计显著改善了开放词汇密集感知任务中的特征表达质量,从而在2D/3D检测与分割、视频实例分割及6D目标姿态估计等多类任务上实现SOTA性能。

链接: https://arxiv.org/abs/2508.11256
作者: Junjie Wang,Keyu Chen,Yulin Li,Bin Chen,Hengshuang Zhao,Xiaojuan Qi,Zhuotao Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2505.04410

点击查看摘要

Abstract:Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context’’ features respectively. \reviseThe context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at this https URL
zh

[CV-57] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

【速读】:该论文旨在解决音频驱动人脸动画(audio-driven portrait animation)中难以对齐人类细粒度偏好(fine-grained human preferences)的问题,尤其是在运动自然性、唇形同步准确性和视觉质量等多个维度之间存在冲突时的优化难题。其关键解决方案是提出一种多模态奖励模型 Talking-Critic,用于学习量化生成视频在多个维度上满足人类期望的奖励函数;并进一步构建大规模多维人类偏好数据集 Talking-NSQ(含41万组偏好对),在此基础上设计 Timestep-Layer adaptive multi-expert Preference Optimization (TLPO) 框架,通过将偏好解耦为专用专家模块,并在时间步和网络层间自适应融合,实现各维度无干扰的精细化提升。

链接: https://arxiv.org/abs/2508.11255
作者: MengChao Wang,Qiang Wang,Fan Jiang,Mu Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Ours project page: this https URL
zh

[CV-58] A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving

【速读】:该论文旨在解决自动驾驶场景中行人重识别(Pedestrian Re-Identification, ReID)因输入模态不确定或缺失(如RGB图像、红外图像、草图或文本描述等)而导致的性能下降问题。传统ReID方法在多模态信息不完整时鲁棒性差,而大规模预训练模型虽具备强大的跨模态语义建模能力,但计算开销大,难以部署于资源受限环境。解决方案的关键在于提出一种轻量级不确定性模态建模框架(Uncertainty Modal Modeling, UMM),其核心包括:1)多模态token映射器实现统一特征表示;2)合成模态增强策略缓解模态缺失影响;3)跨模态线索交互学习器挖掘不同模态间的互补信息;同时利用CLIP的视觉-语言对齐能力高效融合多模态输入,无需大量微调。该设计在保证高鲁棒性和泛化能力的同时显著提升计算效率,为自动驾驶中的行人ReID提供了可扩展且实用的解决方案。

链接: https://arxiv.org/abs/2508.11218
作者: Jialin Li,Shuqi Wu,Ning Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real-time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities–such as RGB, infrared, sketches, or textual descriptions–poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, synthetic modality augmentation strategy, and cross-modal cue interactive learner. Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP’s vision-language alignment ability to fuse multimodal inputs efficiently without extensive finetuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.
zh

[CV-59] Fluid Dynamics and Domain Reconstruction from Noisy Flow Images Using Physics-Informed Neural Networks and Quasi-Conformal Mapping

【速读】:该论文旨在解决流场图像去噪问题,尤其针对因采集时间短或设备误差导致的噪声干扰,从而影响血流动力学分析与临床诊断精度。解决方案的关键在于将问题建模为一个优化问题,其目标是最小化满足纳维-斯托克斯(Navier-Stokes)方程约束的模型速度场与观测到的噪声速度数据之间的差异。为此,作者提出一种交替求解框架:一是利用物理信息神经网络(Physics-Informed Neural Network, PINN)在固定域内重构速度场;二是通过优化拟共形映射(quasi-conformal mapping)推断潜在的流动区域。两子问题交替迭代优化,最终实现高质量流场图像重建,且在合成与真实类流场数据上均表现出鲁棒性与有效性。

链接: https://arxiv.org/abs/2508.11216
作者: Han Zhang,Xue-Cheng Tai,Jean-Michel Morel,Raymond H. Chan
机构: 未知
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blood flow imaging provides important information for hemodynamic behavior within the vascular system and plays an essential role in medical diagnosis and treatment planning. However, obtaining high-quality flow images remains a significant challenge. In this work, we address the problem of denoising flow images that may suffer from artifacts due to short acquisition times or device-induced errors. We formulate this task as an optimization problem, where the objective is to minimize the discrepancy between the modeled velocity field, constrained to satisfy the Navier-Stokes equations, and the observed noisy velocity data. To solve this problem, we decompose it into two subproblems: a fluid subproblem and a geometry subproblem. The fluid subproblem leverages a Physics-Informed Neural Network to reconstruct the velocity field from noisy observations, assuming a fixed domain. The geometry subproblem aims to infer the underlying flow region by optimizing a quasi-conformal mapping that deforms a reference domain. These two subproblems are solved in an alternating Gauss-Seidel fashion, iteratively refining both the velocity field and the domain. Upon convergence, the framework yields a high-quality reconstruction of the flow image. We validate the proposed method through experiments on synthetic flow data in a converging channel geometry under varying levels of Gaussian noise, and on real-like flow data in an aortic geometry with signal-dependent noise. The results demonstrate the effectiveness and robustness of the approach. Additionally, ablation studies are conducted to assess the influence of key hyperparameters.
zh

[CV-60] A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network

【速读】:该论文旨在解决现有高精度人体姿态估计方法计算资源消耗大、难以部署于轻量化场景的问题,同时提升模型在复杂场景下的鲁棒性与准确性。其解决方案的关键在于提出一种新颖的粗到精两阶段知识蒸馏框架:第一阶段引入人体关节结构损失(human joints structure loss),以挖掘关节间的结构信息,实现从教师模型向学生模型的高层语义知识迁移;第二阶段则利用图像引导的渐进式图卷积网络(Image-Guided Progressive Graph Convolutional Network, IGP-GCN)对初始姿态进行精细化优化,并通过教师模型最终输出的姿态逐步监督IGP-GCN训练过程,从而显著提升学生模型在复杂人群场景(如CrowdPose数据集)中的性能表现。

链接: https://arxiv.org/abs/2508.11212
作者: Zhangjian Ji,Wenjin Zhang,Shaotong Qiao,Kai Feng,Yuhua Qian
机构: Shanxi University (山西大学); Institute of Big Data Science and Industry (大数据科学与产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human pose estimation has been widely applied in the human-centric understanding and generation, but most existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. In order to obtain an accurate, robust yet lightweight human pose estimator, one feasible way is to transfer pose knowledge from a powerful teacher model to a less-parameterized student model by knowledge distillation. However, the traditional knowledge distillation framework does not fully explore the contextual information among human joints. Thus, in this paper, we propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce the human joints structure loss to mine the structural information among human joints so as to transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we utilize an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first-stage distillation and supervise the training of the IGP-GCN in the progressive way by the final output pose of teacher model. The extensive experiments on the benchmark dataset: COCO keypoint and CrowdPose datasets, show that our proposed method performs favorably against lots of the existing state-of-the-art human pose estimation methods, especially for the more complex CrowdPose dataset, the performance improvement of our model is more significant.
zh

[CV-61] StyleMM: Stylized 3D Morphable Face Model via Text-Driven Aligned Image Translation

【速读】:该论文旨在解决如何基于用户指定的文本描述,实现对三维人脸形态模型(3D Morphable Model, 3DMM)的风格化重建问题,即在保持原始身份、面部对齐和表情不变的前提下,将目标风格注入到3D人脸网格中。解决方案的关键在于提出StyleMM框架:首先利用预训练的网格变形网络与纹理生成器,并通过扩散模型驱动的文本引导图像到图像(text-guided image-to-image, i2i)翻译生成风格化人脸图像作为训练目标;其次引入一种显式保留源图像面部属性的风格化方法,避免在i2i过程中发生身份或表情的意外改变;最终通过图像监督训练,确保在3DMM参数空间中实现一致且可控的风格迁移,从而支持前向生成具有固定顶点连接性和可动画性的风格化人脸网格。

链接: https://arxiv.org/abs/2508.11203
作者: Seungmi Lee,Kwan Yun,Junyong Noh
机构: KAIST(韩国科学技术院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Pacific graphics 2025, CGF, 15 pages

点击查看摘要

Abstract:We introduce StyleMM, a novel framework that can construct a stylized 3D Morphable Model (3DMM) based on user-defined text descriptions specifying a target style. Building upon a pre-trained mesh deformation network and a texture generator for original 3DMM-based realistic human faces, our approach fine-tunes these models using stylized facial images generated via text-guided image-to-image (i2i) translation with a diffusion model, which serve as stylization targets for the rendered mesh. To prevent undesired changes in identity, facial alignment, or expressions during i2i translation, we introduce a stylization method that explicitly preserves the facial attributes of the source image. By maintaining these critical attributes during image stylization, the proposed approach ensures consistent 3D style transfer across the 3DMM parameter space through image-based training. Once trained, StyleMM enables feed-forward generation of stylized face meshes with explicit control over shape, expression, and texture parameters, producing meshes with consistent vertex connectivity and animatability. Quantitative and qualitative evaluations demonstrate that our approach outperforms state-of-the-art methods in terms of identity-level facial diversity and stylization capability. The code and videos are available at [this http URL](this http URL).
zh

[CV-62] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

【速读】:该论文旨在解决通用视觉语言模型(Vision-Language Models, VLMs)在无人机(Unmanned Aerial Vehicle, UAV)航拍图像上性能下降的问题,这些问题包括高分辨率、复杂空间语义以及严格的实时性约束,导致现有VLMs难以胜任结构化空中推理任务。解决方案的关键在于提出一种轻量级VLM——UAV-VL-R1,其采用监督微调(Supervised Fine-Tuning, SFT)与多阶段强化学习(Multi-stage Reinforcement Learning, RL)相结合的混合训练策略,并引入组相对策略优化(Group Relative Policy Optimization, GRPO)算法,通过规则引导奖励和组内策略对齐机制提升推理的结构性与可解释性。此外,研究构建了高分辨率视觉问答数据集HRVQA-VL以支持训练与评估,实验证明该方法在零样本准确率上较基线模型提升48.17%,且在资源受限的UAV平台上具备高效部署能力(FP16下仅需3.9GB内存,INT8量化后降至2.5GB)。

链接: https://arxiv.org/abs/2508.11196
作者: Jiajin Guan(1),Haibo Mei(2),Bonan Zhang(1),Dan Liu(1),Yuanshuang Fu(1),Yue Zhang(2) ((1) Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China, (2) School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China)
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
zh

[CV-63] Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset Method and Benchmark

【速读】:该论文旨在解决现实世界任务辅助中缺乏高质量对话-视频数据集的问题,尤其是针对复杂多步骤任务(如烹饪、机械维修和种植)的指导性对话数据稀缺问题。现有AI代理在任务协助场景中的应用受限于缺乏真实场景下专家与新手之间的细粒度交互数据。解决方案的关键在于提出一种全自动的方法,利用大语言模型(Large Language Models, LLMs)将单人教学视频自动转换为对齐细粒度步骤与视频片段的两人对话形式,从而高效构建大规模、结构化的对话-视频数据集HowToDIV,包含507段会话、6636个问答对及24小时视频内容,为后续基于对话的程序性任务辅助研究提供基准和数据支持。

链接: https://arxiv.org/abs/2508.11192
作者: Lavisha Aggarwal,Vikas Bahirwani,Lin Li,Andrea Colaco
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many everyday tasks ranging from fixing appliances, cooking recipes to car maintenance require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded for real world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine grained steps and video-clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of videoclips across diverse tasks in cooking, mechanics, and planting. Each session includes multi-turn conversation where an expert teaches a novice user how to perform a task step by step, while observing user’s surrounding through a camera and microphone equipped wearable device. We establish the baseline benchmark performance on HowToDIV dataset through Gemma-3 model for future research on this new task of dialogues for procedural-task assistance.
zh

[CV-64] CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector ICCV2025

【速读】:该论文旨在解决单目3D目标检测模型在不同相机高度(camera height)下性能下降的问题,即模型在训练时所用的相机高度与测试时存在分布差异(out-of-distribution)时,其检测精度显著降低。现有方法多依赖Plucker嵌入、图像变换或数据增强策略,但未能从根本上缓解因深度估计误差随相机高度变化而产生的系统性偏差。论文通过系统分析扩展版CARLA数据集上的多相机高度场景,发现深度估计是影响模型鲁棒性的关键因素,并首次从数学上证明和实证观察到:基于回归的深度模型在相机高度升高时表现出一致的负向平均深度误差趋势,而基于地面的深度模型则呈现正向趋势。解决方案的关键在于提出Camera Height Robust Monocular 3D Detector (CHARM3R),其核心思想是在模型内部对两种深度估计方式(回归式与地面约束式)进行加权平均,从而有效抑制相机高度变化带来的系统性深度误差,显著提升模型在未见相机高度下的泛化能力,相较基线模型提升超过45%,并在CARLA数据集上达到当前最优(SoTA)性能。

链接: https://arxiv.org/abs/2508.11185
作者: Abhinav Kumar,Yuliang Guo,Zhihao Zhang,Xinyu Huang,Liu Ren,Xiaoming Liu
机构: Michigan State University (密歇根州立大学); Bosch Research North America, Bosch Center for AI (博世北美研究中心,博世人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICCV 2025

点击查看摘要

Abstract:Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than 45% , achieving SoTA performance on the CARLA dataset. Codes and Models at this https URL
zh

[CV-65] Versatile Video Tokenization with Generative 2D Gaussian Splatting

【速读】:该论文旨在解决现有视频令牌化(video tokenization)方法在空间和时间维度上的局限性:空间上,固定网格划分导致低信息区域冗余编码;时间上,难以有效区分静态与动态内容以减少冗余。其解决方案的关键在于提出高斯视频变换器(Gaussian Video Transformer, GVT),该方法基于生成式二维高斯点绘(Generative 2D Gaussian Splatting, 2DGS)策略,通过时空高斯嵌入(Spatio-Temporal Gaussian Embedding, STGE)机制以前馈方式生成一组2D高斯表示,实现空间自适应分配权重(高信息区域赋予更高渲染权重),并通过高斯集分区(Gaussian Set Partitioning, GSP)将高斯分为静态与动态集合,显式建模跨时间步的共享静态内容和每帧特异的动态内容,从而提升视频重建质量、动作识别性能及压缩效率。

链接: https://arxiv.org/abs/2508.11183
作者: Zhenghao Chen,Zicong Chen,Lei Liu,Yiming Wu,Dong Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video this http URL enhance the temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact this http URL primarily evaluate GVT on the video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves a state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.
zh

[CV-66] Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning

【速读】:该论文旨在解决现有基于适配器(Adapter)的视觉-语言模型(VLMs)微调方法在少样本分类任务中,因依赖显式空间邻近性对齐文本与视觉模态而导致的两个核心问题:一是无法捕捉类别与图像样本之间固有的“一对多”关联关系;二是难以建立未知类别与图像之间的准确关联。其解决方案的关键在于提出一种新颖的潜在层次适配器(Latent Hierarchical Adapter, LatHAdapter),通过引入可学习的“属性”提示作为类别与图像间的桥梁,并将类别、属性提示和图像样本投影到双曲空间(hyperbolic space)中,结合层次正则化机制学习数据的潜在语义层次结构,从而充分建模类别、属性与图像样本之间的细粒度一多关系,显著提升已知类别的适应能力与未知类别的泛化性能。

链接: https://arxiv.org/abs/2508.11176
作者: Yumiao Zhao,Bo Jiang,Yuhe Ding,Xiao Wang,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); Information Materials and Intelligent Sensing Laboratory of Anhui Province (安徽省信息材料与智能感知重点实验室); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods strive to develop a lightweight module that better aligns visual and (category) textual representations, thereby enhancing performance on downstream few-shot learning tasks. However, existing adapters generally learn/align (category) textual-visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples and ii) struggles to establish accurate associations between the unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data and employ it to provide richer, fine-grained guidance for the adapter learning process. Specifically, LatHAdapter first introduces some learnable `attribute’ prompts as the bridge to align categories and images. Then, it projects the categories, attribute prompts, and images within each batch in a hyperbolic space, and employs hierarchical regularization to learn the latent semantic hierarchy of them, thereby fully modeling the inherent one-to-many associations among categories, learnable attributes, and image samples. Extensive experiments on four challenging few-shot tasks show that the proposed LatHAdapter consistently outperforms many other fine-tuning approaches, particularly in adapting known classes and generalizing to unknown classes.
zh

[CV-67] Exploring the Tradeoff Between Diversity and Discrimination for Continuous Category Discovery CIKM2025

【速读】:该论文旨在解决连续类别发现(Continuous Category Discovery, CCD)中的两大核心挑战:一是如何在无标签的新数据流中有效识别并区分新类别,同时避免灾难性遗忘(catastrophic forgetting);二是现有方法普遍存在的误差累积问题以及知识蒸馏和数据回放导致的存储开销过大。解决方案的关键在于提出一种名为基于独立多样性与正交判别(Independence-based Diversity and Orthogonality-based Discrimination, IDOD)的新框架,其核心创新包括三个模块:独立多样性增强模块通过对比损失分离特征学习以提升多样性,联合新颖性发现模块将多阶段发现过程统一为单阶段以减少误差传播,以及基于正交性的连续增量模块通过生成互正交原型进行分类,并利用代表性表示回放实现低存储开销的遗忘抑制。

链接: https://arxiv.org/abs/2508.11173
作者: Ruobing Jiang,Yang Liu,Haobing Liu,Yanwei Yu,Chunyang Wang
机构: Ocean University of China (中国海洋大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CIKM 2025. 10 pages, 5 figures,

点击查看摘要

Abstract:Continuous category discovery (CCD) aims to automatically discover novel categories in continuously arriving unlabeled data. This is a challenging problem considering that there is no number of categories and labels in the newly arrived data, while also needing to mitigate catastrophic forgetting. Most CCD methods cannot handle the contradiction between novel class discovery and classification well. They are also prone to accumulate errors in the process of gradually discovering novel classes. Moreover, most of them use knowledge distillation and data replay to prevent forgetting, occupying more storage space. To address these limitations, we propose Independence-based Diversity and Orthogonality-based Discrimination (IDOD). IDOD mainly includes independent enrichment of diversity module, joint discovery of novelty module, and continuous increment by orthogonality module. In independent enrichment, the backbone is trained separately using contrastive loss to avoid it focusing only on features for classification. Joint discovery transforms multi-stage novel class discovery into single-stage, reducing error accumulation impact. Continuous increment by orthogonality module generates mutually orthogonal prototypes for classification and prevents forgetting with lower space overhead via representative representation replay. Experimental results show that on challenging fine-grained datasets, our method outperforms the state-of-the-art methods.
zh

[CV-68] Better Supervised Fine-tuning for VQA: Integer-Only Loss

【速读】:该论文旨在解决现有视觉语言模型(Vision Language Models, VLM)在视频质量评估(Video Quality Assessment, VQA)任务中因标签处理不当和损失计算方式不精准导致的评估结果不稳定、关键指标学习不足的问题。解决方案的关键在于提出一种名为IOVQA(Integer-only VQA)的微调方法,其核心创新包括:1)在数据构建阶段将连续型的Overall_MOS分数离散化为区间[10,50]内的整数标签,提升数值稳定性;2)设计目标掩码(target-mask)策略,在损失计算时仅对标签的前两位整数部分进行未掩码处理,迫使模型聚焦于数值评估中的关键维度。该方法显著提升了模型在定量评估场景下的准确性与一致性,验证了整数标签微调的有效性。

链接: https://arxiv.org/abs/2508.11170
作者: Baihong Qian,Haotian Fan,Wenjie Liao,Yunqiu Wang,Tao Li,Junhui Cui
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid advancement of vision language models(VLM), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the focus of the model on key evaluation indicators. To address this, we propose IOVQA(Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model’s output to integers within the range of [10,50], ensuring numerical stability, and convert decimal Overall_MOS to integer before using them as labels. We also introduce a target-mask strategy: when computing the loss, only the first two-digit-integer of the label is unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model’s accuracy and consistency in the VQA task, ranking 3rd in VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge – Track I. Our work highlights the effectiveness of merely leaving integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios.
zh

[CV-69] VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images

【速读】:该论文旨在解决源域数据不可获取条件下,遥感图像中目标检测任务因伪标签噪声导致的训练崩溃问题(即Source-Free Object Detection, SFOD中的性能瓶颈)。其解决方案的关键在于引入视觉基础模型(Vision Foundation Model, VFM),通过“免费午餐”式整合策略,在仅需少量目标域标注数据的前提下,提升伪标签质量并增强特征表示的鲁棒性。具体而言,提出了一种VFM引导的伪标签挖掘策略,利用VFM的语义先验对低置信度预测进行修正以提高伪标签可靠性;同时设计了双层级VFM引导对齐方法,分别在实例级和图像级上对齐检测器特征与VFM嵌入,结合对比学习与特征图相似性匹配,有效缓解域间差异带来的干扰,从而显著提升源-free遥感目标检测性能。

链接: https://arxiv.org/abs/2508.11167
作者: Jianhong Han,Yupei Wang,Liang Chen
机构: Beijing Institute of Technology (北京理工大学); Beijing Institute of Technology Chongqing Innovation Center (北京理工大学重庆创新中心); National Key Laboratory for Space-Born Intelligent Information Processing (空间智能信息处理全国重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Manuscript submitted to IEEE TGRS

点击查看摘要

Abstract:Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a “free lunch” manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector’s feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM’s semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.
zh

[CV-70] Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models

【速读】:该论文旨在解决现有去雾方法在处理真实世界厚雾图像时性能受限的问题,其核心挑战在于缺乏成对的雾霾与清晰图像数据以及鲁棒先验知识的不足。解决方案的关键在于提出一种基于期望最大化(Expectation-Maximization, EM)与双向布朗桥扩散模型(Bidirectional Brownian Bridge Diffusion Models, B3DM)的高效半监督图像去雾方法(EM-B3DM),采用两阶段学习策略:第一阶段利用EM算法将成对图像的联合分布分解为两个条件分布,并通过统一的布朗桥扩散模型直接建模雾霾与清晰图像间的结构和内容相关性;第二阶段则借助预训练模型和大规模未配对图像进一步提升去雾性能,同时引入细节增强型残差差异卷积块(Residual Difference Convolution block, RDC)以捕捉梯度级信息,显著增强模型表征能力。

链接: https://arxiv.org/abs/2508.11165
作者: Bing Liu,Le Wang,Mingming Liu,Hao Liu,Rui Yao,Yong Zhou,Peng Liu,Tongqiang Xia
机构: China University of Mining and Technology (中国矿业大学); Mine Digitization Engineering Research Center of the Ministry of Education (教育部矿井数字化工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Existing dehazing methods deal with real-world haze images with difficulty, especially scenes with thick haze. One of the main reasons is the lack of real-world paired data and robust priors. To avoid the costly collection of paired hazy and clear images, we propose an efficient semi-supervised image dehazing method via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models (EM-B3DM) with a two-stage learning scheme. In the first stage, we employ the EM algorithm to decouple the joint distribution of paired hazy and clear images into two conditional distributions, which are then modeled using a unified Brownian Bridge diffusion model to directly capture the structural and content-related correlations between hazy and clear images. In the second stage, we leverage the pre-trained model and large-scale unpaired hazy and clear images to further improve the performance of image dehazing. Additionally, we introduce a detail-enhanced Residual Difference Convolution block (RDC) to capture gradient-level information, significantly enhancing the model’s representation capability. Extensive experiments demonstrate that our EM-B3DM achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.
zh

[CV-71] LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction ICONIP

【速读】:该论文旨在解决当前生成式AI在STEM教育中难以生成具有语义一致性、叙事连贯性和认知支持性的插图问题,尤其是在应对抽象和序列性科学概念时缺乏结构化布局引导与认知负荷控制。解决方案的关键在于提出LEARN框架,其核心创新包括:基于书封面(BookCover)数据集的叙事布局建模、条件化布局生成、对比视觉-语义训练以及提示调制机制,从而实现与布卢姆认知分类法(Bloom’s taxonomy)对齐的视觉序列生成,有效降低外在认知负荷(extraneous cognitive load),并通过空间组织与故事驱动的方式增强学习者的概念聚焦能力。

链接: https://arxiv.org/abs/2508.11153
作者: Maoquan Zhang,Bisser Raytchev,Xiujuan Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The International Conference on Neural Information Processing (ICONIP) 2025

点击查看摘要

Abstract:LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom’s taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.
zh

[CV-72] Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation ICME

【速读】:该论文旨在解决当前深度去雾方法仅能单向去除雾霾、缺乏在有雾图像与无雾图像之间双向转换能力的问题。其核心解决方案是提出一种基于残差的高效双向扩散模型(Residual-based Efficient Bidirectional Diffusion Model, RBDM),关键在于构建双马尔可夫链以有效传递残差并实现双向平滑过渡,同时通过在每个时间步扰动有雾和无雾图像并预测噪声,联合学习两个方向的条件分布;此外,引入统一的分数函数(score function)在图像块级别而非整图上训练,从而提升小数据集下的性能并降低计算开销,最终实现仅需15次采样步即可完成尺寸无关的双向图像转换。

链接: https://arxiv.org/abs/2508.11134
作者: Bing Liu,Le Wang,Hao Liu,Mingming Liu
机构: China University of Mining and Technology (中国矿业大学); Mine Digitization Engineering Research Center of the Ministry of Education (教育部矿井数字化工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, 2025 ICME Accepted

点击查看摘要

Abstract:Current deep dehazing methods only focus on removing haze from hazy images, lacking the capability to translate between hazy and haze-free images. To address this issue, we propose a residual-based efficient bidirectional diffusion model (RBDM) that can model the conditional distributions for both dehazing and haze generation. Firstly, we devise dual Markov chains that can effectively shift the residuals and facilitate bidirectional smooth transitions between them. Secondly, the RBDM perturbs the hazy and haze-free images at individual timesteps and predicts the noise in the perturbed data to simultaneously learn the conditional distributions. Finally, to enhance performance on relatively small datasets and reduce computational costs, our method introduces a unified score function learned on image patches instead of entire images. Our RBDM successfully implements size-agnostic bidirectional transitions between haze-free and hazy images with only 15 sampling steps. Extensive experiments demonstrate that the proposed method achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.
zh

[CV-73] UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring

【速读】:该论文旨在解决长时间使用计算机时因不良坐姿引发的公共健康问题,传统姿势监测方案存在隐私泄露风险(如摄像头系统)或用户不适(如可穿戴传感器)。其解决方案的关键在于提出UWB-PostureGuard系统,利用商用超宽带(UWB)传感技术实现无接触、持续的姿势监测,并通过特征工程提取多维人体工学坐姿特征;同时设计PoseGBDT模型有效捕捉姿势变化的时间依赖性,克服了传统逐帧分类方法的局限性,从而在真实环境中实现高精度(99.11%准确率)且对衣物厚度、设备干扰和家具配置等变量具有鲁棒性的姿势识别。

链接: https://arxiv.org/abs/2508.11115
作者: Haotang Li,Zhenyu Qi,Sen He,Kebin Peng,Sheng Tan,Yili Ren,Tomas Cerny,Jiyue Zhao,Zi Wang
机构: University of Arizona (亚利桑那大学); East Carolina University (东卡罗来纳大学); Trinity University (三一大学); University of South Florida (南佛罗里达大学); University of Georgia (佐治亚大学); Augusta University (奥古斯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low costs.
zh

[CV-74] HierOctFusion: Multi-scale Octree-based 3D Shape Generation via Part-Whole-Hierarchy Message Passing

【速读】:该论文旨在解决3D内容生成中因结构复杂性导致的挑战,特别是现有基于八叉树(octree)的扩散模型在建模时忽略语义部件层次结构以及高分辨率整体建模计算成本过高的问题。其解决方案的关键在于提出HierOctFusion,一种具备部件感知能力的多尺度八叉树扩散模型,通过增强层次特征交互来生成细粒度且稀疏的对象结构;同时引入交叉注意力条件机制,将部件级信息注入生成过程,实现从部件到整体的语义特征有效传播,从而提升生成质量与效率。

链接: https://arxiv.org/abs/2508.11106
作者: Xinjie Gao,Bi’an Du,Wei Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D content generation remains a fundamental yet challenging task due to the inherent structural complexity of 3D data. While recent octree-based diffusion models offer a promising balance between efficiency and quality through hierarchical generation, they often overlook two key insights: 1) existing methods typically model 3D objects as holistic entities, ignoring their semantic part hierarchies and limiting generalization; and 2) holistic high-resolution modeling is computationally expensive, whereas real-world objects are inherently sparse and hierarchical, making them well-suited for layered generation. Motivated by these observations, we propose HierOctFusion, a part-aware multi-scale octree diffusion model that enhances hierarchical feature interaction for generating fine-grained and sparse object structures. Furthermore, we introduce a cross-attention conditioning mechanism that injects part-level information into the generation process, enabling semantic features to propagate effectively across hierarchical levels from parts to the whole. Additionally, we construct a 3D dataset with part category annotations using a pre-trained segmentation model to facilitate training and evaluation. Experiments demonstrate that HierOctFusion achieves superior shape quality and efficiency compared to prior methods.
zh

[CV-75] LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters ICCV

【速读】:该论文旨在解决长时视频到音频生成(long-form video-to-audio generation)中普遍存在的问题,即现有方法通常仅适用于短片段(<10秒)或依赖噪声污染的数据集,导致生成音频在时间上不一致且存在拼接伪影。为应对这一挑战,作者提出LD-LAudio-V1模型,其核心创新在于引入双轻量级适配器(dual lightweight adapters),在保持计算效率的同时显著提升音频与视频的时序同步性和质量。此外,研究团队发布了首个干净、人工标注的视频-音频数据集,其中不含噪声或伪影,从而为长时音频生成提供了高质量训练基础。实验表明,该方案在多项指标上均实现显著改进,如音视频语义一致性(Sem.,Rel.)提升20.15%,能量差异(EnergyΔ10ms)降低55.23%,有效缓解了传统方法中的时空不一致性问题。

链接: https://arxiv.org/abs/2508.11074
作者: Haomin Zhang,Kristin Qi,Shuxin Yang,Zihao Chen,Chaofan Ding,Xinhan Di
机构: Giant Network (中国); University of Massachusetts Boston (美国马萨诸塞大学波士顿分校); Trine University (美国特赖恩大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: Gen4AVC@ICCV: 1st Workshop on Generative AI for Audio-Visual Content Creation

点击查看摘要

Abstract:Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio zsynthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models and it incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: FD_\textpasst 450.00 \rightarrow 327.29 (+27.27%), FD_\textpanns 34.88 \rightarrow 22.68 (+34.98%), FD_\textvgg 3.75 \rightarrow 1.28 (+65.87%), KL_\textpanns 2.49 \rightarrow 2.07 (+16.87%), KL_\textpasst 1.78 \rightarrow 1.53 (+14.04%), IS_\textpanns 4.17 \rightarrow 4.30 (+3.12%), IB_\textscore 0.25 \rightarrow 0.28 (+12.00%), Energy\Delta10\textms 0.3013 \rightarrow 0.1349 (+55.23%), Energy\Delta10\textms(this http URL) 0.0531 \rightarrow 0.0288 (+45.76%), and Sem.,Rel. 2.73 \rightarrow 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at this https URL.
zh

[CV-76] Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean Overweight and Obese Cohorts

【速读】:该论文旨在解决传统BMI指标在预测2型糖尿病(Type 2 Diabetes, T2D)风险时的局限性问题,即为何部分瘦体型个体仍罹患T2D而部分肥胖者却未发病。其核心假设是:更精细的腹部体成分特征(如脂肪分布、肌肉含量及器官形态)可能揭示跨体重类别的一致性T2D表型。解决方案的关键在于利用人工智能(AI)从大规模临床CT影像中自动提取三维解剖结构的尺寸、形状与脂肪含量等可解释性测量,并结合随机森林分类模型与SHAP(Shapley Additive Explanations)分析,识别出不同BMI亚组(瘦型、超重、肥胖)中共同的T2D风险/保护特征,从而构建具有生物学意义的体成分签名(body composition signatures)。

链接: https://arxiv.org/abs/2508.11063
作者: Lucas W. Remedios,Chloe Choe,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Alvin C. Powers,Bennett A. Landman,John Virostko
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease’s presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale. This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once on lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups separately. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP) and links back to anatomical differences (classification). Results: The random-forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group; fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.
zh

[CV-77] Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset ACM-MM25

【速读】:该论文旨在解决当前3D视觉语言(3D VL)学习中数据集存在的两大局限:一是现有数据集通常仅要求模型在单视角下对近距离物体进行浅层推理,缺乏对远距离物体的多视图理解能力;二是标注常将指令与单一对象关联,忽视了多个物体之间的丰富上下文对齐关系。为应对这些问题,作者提出两个核心解决方案:其一,构建MV-ScanQA数据集,其中68%的问题明确需要整合多视角信息(显著高于现有数据集<7%),从而严格测试模型的多视图组合推理能力;其二,设计TripAlign预训练语料库,包含100万组2D视图、3D物体集合与文本三元组,通过显式对齐具有语义关联的多物体组与文本,提供比传统单对象标注更丰富的、基于视图的多物体跨模态对齐信号。关键创新在于利用TripAlign实现从预训练2D视觉语言模型(LVLM)到3D域的知识迁移,进而提升模型在多视图场景理解任务中的性能表现。

链接: https://arxiv.org/abs/2508.11058
作者: Wentao Mo,Qingchao Chen,Yuxin Peng,Siyuan Huang,Yang Liu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所); National Institute of Health Data Science, Peking University (北京大学健康数据科学研究所); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室); Wangxuan Institute of Computer Technology, State Key Laboratory of General Artificial Intelligence, Peking University (北京大学计算机技术研究所; 通用人工智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepeted to ACM MM 25

点击查看摘要

Abstract:The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M 2D view, set of 3D objects, text triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at this https URL.
zh

[CV-78] GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning ICCV2025

【速读】:该论文旨在解决视频生成模型在机器人学习中应用时面临的两大问题:一是生成数据质量对策略性能的依赖性过高,二是缺乏环境反馈导致精细操作(fine-grained manipulation)能力不足;同时,现有基于视频的强化学习方法受限于视频生成的不确定性以及大规模机器人数据集收集困难。解决方案的关键在于提出GenFlowRL框架,其通过从多样化跨体感(cross-embodiment)数据集中训练的生成流(generated flow)中提取形状奖励(shaped rewards),并利用低维、以物体为中心(object-centric)的特征来学习可泛化且鲁棒的策略,从而有效克服上述局限性。

链接: https://arxiv.org/abs/2508.11049
作者: Kelin Yu,Sheng Zhang,Harshit Soora,Furong Huang,Heng Huang,Pratap Tokekar,Ruohan Gao
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICCV 2025

点击查看摘要

Abstract:Recent advances have shown that video generation models can enhance robot learning by deriving effective robot actions through inverse dynamics. However, these methods heavily depend on the quality of generated data and struggle with fine-grained manipulation due to the lack of environment feedback. While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GenFlowRL, which derives shaped rewards from generated flow trained from diverse cross-embodiment datasets. This enables learning generalizable and robust policies from diverse demonstrations using low-dimensional, object-centric features. Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow, consistently achieving superior performance across diverse and challenging scenarios. Our Project Page: this https URL
zh

[CV-79] MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割模型在跨任务泛化能力不足的问题,尤其针对细调后的专用模型(如MedSAM)因训练数据有限、标注稀缺及分布偏移等因素导致的性能瓶颈。其解决方案的关键在于提出了一种无需训练的模型融合方法——MedSAMix,通过零阶优化策略自动发现层级最优融合方案,从而整合通用视觉模型(如SAM)与专业医学模型的优势;同时,为满足临床场景中对领域特异性与泛化能力的不同需求,设计了单任务优化和多目标优化两种实现范式,显著提升了模型在25项医学分割任务上的表现,平均提升达6.67%(专用任务)和4.37%(多任务评估)。

链接: https://arxiv.org/abs/2508.11032
作者: Yanwu Yang,Guinan Su,Jiesi Hu,Francesco Sammarco,Jonas Geiping,Thomas Wolfers
机构: University of Tübingen (图宾根大学); German Center for Mental Health (DZPG) (德国心理健康中心); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Universal medical image segmentation models have emerged as a promising paradigm due to their strong generalizability across diverse tasks, showing great potential for a wide range of clinical applications. This potential has been partly driven by the success of general-purpose vision models such as the Segment Anything Model (SAM), which has inspired the development of various fine-tuned variants for medical segmentation tasks. However, fine-tuned variants like MedSAM are trained on comparatively limited medical imaging data that often suffers from heterogeneity, scarce annotations, and distributional shifts. These challenges limit their ability to generalize across a wide range of medical segmentation tasks. In this regard, we propose MedSAMix, a training-free model merging method that integrates the strengths of both generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation. In contrast to traditional model merging approaches that rely on manual configuration and often result in suboptimal outcomes, we propose a zero-order optimization method to automatically discover optimal layer-wise merging solutions. Furthermore, for clinical applications, we develop two regimes to meet the demand of domain-specificity and generalizability in different scenarios by single-task optimization and multi-objective optimization respectively. Extensive evaluations on 25 medical segmentation tasks demonstrate that MedSAMix effectively mitigates model bias and consistently improves performance in both domain-specific accuracy and generalization, achieving improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations.
zh

[CV-80] Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

【速读】:该论文旨在解决当前建筑安全检查中缺乏大规模、开放标注数据集的问题,从而限制了视觉语言模型(Vision Language Models, VLMs)在实际施工场景中的泛化能力与应用潜力。其解决方案的关键在于构建了一个名为ConstructionSite 10k的新数据集,包含10,000张施工现场图像,并为三个相互关联的任务提供标注:图像描述生成、安全规则违规视觉问答(VQA)以及施工元素视觉定位(visual grounding)。该数据集支持零样本(zero-shot)和少样本(few-shot)学习评估,验证了现有先进VLMs在未直接训练任务上的良好泛化能力,同时揭示了进一步微调的必要性,为后续研究提供了可扩展的基准和训练资源。

链接: https://arxiv.org/abs/2508.11011
作者: Xuezheng Chen,Zhengbo Zou
机构: University of British Columbia (不列颠哥伦比亚大学); Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
zh

[CV-81] Failures to Surface Harmful Contents in Video Large Language Models

【速读】:该论文旨在解决当前视频大语言模型(Video Large Language Models, VideoLLMs)在生成视频摘要时对有害内容的严重遗漏问题,即即使有害内容在视频中清晰可见,模型仍极少在输出中提及。其解决方案的关键在于识别并利用三个设计缺陷:(1)因稀疏且均匀的帧采样导致的时间覆盖不足;(2)采样帧内因激进的token下采样引发的空间信息丢失;(3)编码器与解码器之间的连接断层,使得视觉线索在文本生成阶段未被有效利用。基于这些洞察,作者提出了三种零查询黑盒攻击方法,精准针对上述处理流程中的漏洞,实验证明可使多数VideoLLMs的有害内容遗漏率超过90%,从而揭示了现有模型在语义覆盖上的根本性安全缺陷,并呼吁改进采样策略、token压缩机制和解码逻辑以保障内容理解完整性而非仅追求推理速度。

链接: https://arxiv.org/abs/2508.10974
作者: Yuxin Cao,Wei Song,Derui Wang,Jingling Xue,Jin Song Dong
机构: National University of Singapore (新加坡国立大学); University of New South Wales (新南威尔士大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) are increasingly deployed on numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs’ designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.
zh

[CV-82] Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在无障碍领域中模拟低视力个体视觉感知能力的问题,即评估VLMs能否准确复现低视力人群对图像的理解与识别行为。其解决方案的关键在于构建一个基于40名低视力参与者调研数据的基准测试集,并设计包含个体化视觉信息(如视力缺陷类型和程度)及示例图像响应(包括开放式和多项选择题回答)的提示模板,从而引导GPT-4o生成具有针对性的模拟代理(simulated agents)。实验表明,仅提供单一类型的输入信息(如仅视觉信息或仅示例响应)时,模型输出与真实参与者回答的一致性较低(约0.59),而结合两者信息后显著提升一致性至0.70(p < 0.0001),尤其当示例中同时包含开放式与多项选择题响应时效果最优,进一步增加示例数量则边际收益有限(p < 0.05)。这揭示了多模态、结构化提示设计对提升VLM在特定群体感知建模中的准确性至关重要。

链接: https://arxiv.org/abs/2508.10972
作者: Rosiana Natalie,Wenqian Xu,Ruei-Che Chang,Rada Mihalcea,Anhong Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the vision perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the included information on vision information and example image responses. We evaluate the agreement between VLM-generated responses and participants’ original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agent’ and participants’ responses remains low when only either the vision information (0.59) or example image responses (0.59) are provided, whereas a combination of both significantly increase the agreement (0.70, p 0.0001). Notably, a single example combining both open-ended and multiple-choice responses, offers significant performance improvements over either alone (p 0.0001), while additional examples provided minimal benefits (p 0.05).
zh

[CV-83] EVCtrl: Efficient Control Adapter for Visual Generation

【速读】:该论文旨在解决可控生成模型(如ControlNet)在图像和视频生成任务中因辅助分支引入显著延迟和冗余计算的问题,尤其是在视频生成场景下更为突出。其核心解决方案是提出EVCtrl——一种轻量级、即插即用的控制适配器,通过空间-时间双层缓存策略优化计算效率:空间上,基于对DiT-ControlNet各层响应特性的分析,将网络划分为全局与局部功能区域,并设计局部感知缓存机制仅在需控制信号的局部区域进行计算,避免全局冗余;时间上,通过选择性跳过不必要的去噪步骤进一步提升效率。该方法无需重新训练模型即可实现显著加速(如CogVideo-Controlnet和Wan2.1-Controlnet分别提速2.16倍和2.05倍),同时保持生成质量几乎不变。

链接: https://arxiv.org/abs/2508.10963
作者: Zixiang Yang,Yue Ma,Yinhan Zhang,Shanhui Mo,Dongrui Liu,Linfeng Zhang
机构: UESTC(电子科技大学); HKUST(香港科技大学); SJTU(上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. While ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective in image and video control generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation this http URL are available in the supplementary materials.
zh

[CV-84] CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving

【速读】:该论文旨在解决自动驾驶感知系统中对弱势道路使用者(Vulnerable Road Users, VRU)识别困难的问题,尤其针对由色觉异构性(metamerism)导致的视觉模糊问题——即不同材质在RGB图像中呈现相似外观,从而降低VRU与背景的可区分性。解决方案的关键在于利用高光谱成像(Hyperspectral Imaging, HSI)技术,通过捕捉可见光以外近红外(Near-Infrared, NIR)波段的独特材料光谱特征,提升VRU的可辨识度;进一步地,提出一种融合信息论方法(联合互信息最大化、相关性分析)与图像质量指标(对比信噪比,CSNR)的波段选择策略,筛选出最具光谱信息量的三个波段(497 nm、607 nm 和 895 nm),重建伪彩色图像并与RGB图像对比,显著增强VRU与背景之间的差异性和感知可分性,在欧氏距离、光谱角匹配(SAM)、马氏距离(T²)和CIE ΔE等指标上分别提升70.24%至246.62%,有效缓解了色觉异构混淆,为高级驾驶辅助系统(ADAS)和自动驾驶(AD)中的下游感知任务提供了更可靠的输入。

链接: https://arxiv.org/abs/2508.10962
作者: Jiarong Li,Imad Ali Shah,Diarmaid Geever,Fiachra Collins,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
机构: University of Galway (爱尔兰戈尔韦大学); Valeo Vision Systems (瓦莱奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review at IEEE OJITS, July, 2025

点击查看摘要

Abstract:Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, \pm 27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, T^2 ) and perception (CIE \Delta E ) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.
zh

[CV-85] ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视觉问答(Visual Question Answering, VQA)任务中对图像中对象属性的抽象与推理能力不足的问题。现有VQA基准测试通常仅涵盖有限的对象属性(如尺寸),且将感知与推理混杂在一起,缺乏对推理层级和图像类别的代表性覆盖。为应对这一挑战,作者提出了一种系统性评估框架——ORBIT(Object Property Reasoning Benchmark),其核心在于:通过三类典型图像类型、三个逐步复杂化的推理层级以及四个基于常识推理先验的对象属性维度构建多层推理型VQA数据集,共包含360张图像及1,080个计数型问题。实验表明,即使在零样本设置下,最先进的12个VLM模型平均准确率仅为40%,显著低于人类水平,尤其在真实摄影图像、物理与功能属性的反事实推理以及高数量级计数任务上表现较差。这揭示了当前VLM在对象属性推理方面的局限性,并指出了未来需发展可扩展的基准测试方法、通用标注规范及更强大的推理型VLM架构。

链接: https://arxiv.org/abs/2508.10956
作者: Abhishek Kolari,Mohammadhossein Khojasteh,Yifan Jiang,Floris den Hengst,Filip Ilievski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and recognising low-level details and higher-level abstractions. While current VQA benchmarks consider a limited set of object property attributes like size, they typically blend perception and reasoning, and lack representativeness in terms of reasoning and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop methods for scalable benchmarking, generalize annotation guidelines, and explore additional reasoning VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors.
zh

[CV-86] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement

【速读】:该论文旨在解决临床扩散磁共振成像(diffusion MRI)中纤维取向分布(Fiber Orientation Distribution, FOD)估计质量受限的问题,尤其是在单壳层、低角度分辨率的常规临床扫描协议下,难以获得可靠且准确的FOD,从而影响后续脑白质纤维束追踪(tractography)和连接组(connectome)分析的可靠性。其解决方案的关键在于提出并验证了一个优化的端到端深度学习增强框架FastFOD-Net,该方法在保持高精度的同时显著提升了训练与推理效率(比前代快60倍),并在健康对照及六种神经系统疾病患者群体中进行了迄今为止最全面的临床评估,证明其能够从真实世界临床数据中提取出接近高质量研究采集水平的FOD,从而推动深度学习方法在临床扩散MRI增强中的可信应用与普及。

链接: https://arxiv.org/abs/2508.10950
作者: Xinyi Wang,Michael Barnett,Frederique Boonstra,Yael Barnett,Mariano Cabezas,Arkiev D’Souza,Matthew C. Kiernan,Kain Kyle,Meng Law,Lynette Masters,Zihao Tang,Stephen Tisch,Sicong Tu,Anneke Van Der Walt,Dongang Wang,Fernando Calamante,Weidong Cai,Chenyu Wang
机构: University of Sydney (悉尼大学); Monash University (蒙纳士大学); St Vincent’s Hospital Sydney (悉尼圣文森特医院); Neura (神经科学研究所); i-MED Network (i-MED网络); Sydney Neurology (悉尼神经学中心); South Western Sydney Local Health District (南西悉尼地方卫生区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which have proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhancing FODs with superior performance and delivering training/inference efficiency for clinical use ( 60\times faster comparing to its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement. Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.
zh

[CV-87] MedAtlas: Evaluating LLM s for Multi-Round Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

【速读】:该论文旨在解决当前医疗人工智能(Artificial Intelligence, AI)模型在临床决策中适应多样现实场景及执行复杂诊断推理能力不足的问题。现有医学多模态基准通常局限于单图像、单轮任务,缺乏对多模态医学影像融合与临床实践中纵向交互特性的刻画。其解决方案的关键在于提出MedAtlas这一新型基准框架,具备四大核心特征:多轮对话、多模态医学影像交互、多任务集成和高临床保真度;支持开放与封闭式多轮问答、多图像联合推理以及综合疾病诊断四项真实诊疗流程驱动的任务,整合CT、MRI、PET、超声和X-ray等多种影像模态与文本病史信息,要求模型实现跨模态深度推理。此外,作者还引入Round Chain Accuracy与Error Propagation Resistance两个新评估指标,揭示了现有多模态模型在多阶段临床推理中的显著性能差距,为构建鲁棒且可信的医疗AI提供了挑战性评测平台。

链接: https://arxiv.org/abs/2508.10947
作者: Ronghao Xu,Zhen Huang,Yangbo Wei,Xiaoqian Zhou,Zikang Xu,Ting Liu,Zihang Jiang,S.Kevin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.
zh

[CV-88] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

【速读】:该论文旨在解决对抗性补丁(adversarial patches)对人工智能模型,尤其是计算机视觉任务中目标检测模型的鲁棒性威胁问题。传统对抗样本通常扰动整张图像,而对抗性补丁则针对图像特定区域进行攻击,导致模型误判且更难防御。其解决方案的关键在于提出增量式补丁生成方法(Incremental Patch Generation, IPG),该方法通过高效迭代优化策略,在生成效率上比现有方法提升最多11.1倍,同时保持相当的攻击效果;IPG能生成泛化能力强的补丁,有效覆盖多种模型漏洞,并可用于构建更具鲁棒性的AI系统,为安全生态系统提供结构化知识基础与主动防御能力。

链接: https://arxiv.org/abs/2508.10946
作者: Wonho Lee,Hyunsik Na,Jisu Lee,Daeseon Choi
机构: Soongsil University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO’s feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.
zh

[CV-89] WatchRoad: Scalable Detection and Geospatial Visualization of Potholes for Smart Cities

【速读】:该论文旨在解决印度复杂多样的道路环境中,坑洼(pothole)检测与实时地图标注的自动化难题,以提升道路安全和车辆使用寿命,并为政府提供可操作的道路维护决策支持。解决方案的关键在于构建了一个端到端的系统 iWatchRoad,其核心包括:利用车载摄像头采集的7000+帧多样化场景数据(涵盖不同路面类型、光照和天气条件)训练YOLO模型实现高精度实时坑洼检测;通过自研OCR模块提取视频帧时间戳并与GPS日志同步,实现坑洼的精准地理标记(geotagging);最终将结构化数据存储于数据库并通过OpenStreetMap(OSM)可视化呈现,形成面向政府使用的标准化输出。该方案具备成本低、硬件效率高、可扩展性强等特点,适用于发展中国家城乡道路管理场景。

链接: https://arxiv.org/abs/2508.10945
作者: Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra
机构: National Institute of Science Education and Research (NISER); Silicon University
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Potholes on the roads are a serious hazard and maintenance burden. This poses a significant threat to road safety and vehicle longevity, especially on the diverse and under-maintained roads of India. In this paper, we present a complete end-to-end system called iWatchRoad for automated pothole detection, Global Positioning System (GPS) tagging, and real time mapping using OpenStreetMap (OSM). We curated a large, self-annotated dataset of over 7,000 frames captured across various road types, lighting conditions, and weather scenarios unique to Indian environments, leveraging dashcam footage. This dataset is used to fine-tune, Ultralytics You Only Look Once (YOLO) model to perform real time pothole detection, while a custom Optical Character Recognition (OCR) module was employed to extract timestamps directly from video frames. The timestamps are synchronized with GPS logs to geotag each detected potholes accurately. The processed data includes the potholes’ details and frames as metadata is stored in a database and visualized via a user friendly web interface using OSM. iWatchRoad not only improves detection accuracy under challenging conditions but also provides government compatible outputs for road assessment and maintenance planning through the metadata visible on the website. Our solution is cost effective, hardware efficient, and scalable, offering a practical tool for urban and rural road management in developing regions, making the system automated. iWatchRoad is available at this https URL
zh

[CV-90] Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods

【速读】:该论文旨在解决纺织增强复合材料中“嵌套”(nesting)行为的量化难题,即在干态纺织预成型体受压过程中,相邻织物层因纱线局部穿插和错位导致的几何结构特征难以精确表征的问题。该问题直接影响材料刚度、渗透性和损伤容限等关键力学性能的预测精度。解决方案的关键在于提出了一种基于低分辨率X射线断层扫描(CT)与定制3D-UNet语义分割模型相结合的框架:通过在20.22 μm/体素分辨率下进行原位压缩实验,并利用深度学习模型对基体、纬纱和填充纱相进行高精度分割(最小平均交并比为0.822,F1分数达0.902),进而采用二阶相关函数(two-point correlation function S2S_2)提取平均层厚和嵌套程度等统计几何参数,实现了从工业级CT数据中可靠提取结构描述符的目标,为复合材料预成型体的逆向建模和基于结构描述符的分析奠定了基础。

链接: https://arxiv.org/abs/2508.10943
作者: Christian Düreth,Jan Condé-Wolter,Marek Danczak,Karsten Tittmann,Jörn Jaschinski,Andreas Hornig,Maik Gude
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
备注: submitted to Elsevier Composite Part C: Open Access (JCOMC-D-25-00212), 16 pages, 8 Figures, and 3 Tables

点击查看摘要

Abstract:A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile-reinforced composites. Nesting – characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns – plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low-resolution computed tomography (CT). In-situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 \mu m per voxel resolution. A tailored 3D-UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50–60 %. The model achieved a minimum mean Intersection-over-Union of 0.822 and an F1 score of 0.902. Spatial structure was subsequently analyzed using the two-point correlation function S_2 , allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph-based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor-based structural analysis of composite preforms.
zh

[CV-91] opological Structure Description for Artcode Detection Using the Shape of Orientation Histogram ACM-MM’17

【速读】:该论文旨在解决如何在复杂环境中有效识别一类特殊装饰性标记——Artcodes(一种人类可理解且机器可读的拓扑编码标记)的问题,其核心挑战在于将几何形态各异但拓扑结构相似的Artcodes统一归类为同一类别,从而实现对这类对象的存在性检测。解决方案的关键在于提出了一种新的特征描述子——方向直方图形状(shape of orientation histogram),该描述子能够捕捉Artcodes的通用拓扑结构信息,并基于此构建了Artcode提案检测系统,通过收集数据集并进行系统实验验证了该特征向量在表示拓扑结构方面的可行性及检测系统的有效性。

链接: https://arxiv.org/abs/2508.10942
作者: Liming Xu,Dave Towey,Andrew P. French,Steve Benford
机构: University of Cambridge (剑桥大学); University of Nottingham Ningbo China (诺丁汉大学宁波分校); University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: This work is an extension of an ACM MM’17 workshop paper (Xu et al, 2017), which was completed in late 2017 and early 2018 during the first author’s doctoral studies at the University of Nottingham. This paper includes 42 pages, 25 figures, 7 tables, and 13,536 words

点击查看摘要

Abstract:The increasing ubiquity of smartphones and resurgence of VR/AR techniques, it is expected that our everyday environment may soon be decorating with objects connecting with virtual elements. Alerting to the presence of these objects is therefore the first step for motivating follow-up further inspection and triggering digital material attached to the objects. This work studies a special kind of these objects – Artcodes – a human-meaningful and machine-readable decorative markers that camouflage themselves with freeform appearance by encoding information into their topology. We formulate this problem of recongising the presence of Artcodes as Artcode proposal detection, a distinct computer vision task that classifies topologically similar but geometrically and semantically different objects as a same class. To deal with this problem, we propose a new feature descriptor, called the shape of orientation histogram, to describe the generic topological structure of an Artcode. We collect datasets and conduct comprehensive experiments to evaluate the performance of the Artcode detection proposer built upon this new feature vector. Our experimental results show the feasibility of the proposed feature vector for representing topological structures and the effectiveness of the system for detecting Artcode proposals. Although this work is an initial attempt to develop a feature-based system for detecting topological objects like Artcodes, it would open up new interaction opportunities and spark potential applications of topological object detection.
zh

[CV-92] NIRMAL Pooling: An Adaptive Max Pooling Approach with Non-linear Activation for Enhanced Image Classification

【速读】:该论文旨在解决传统最大池化(Max Pooling)在图像分类任务中特征表达能力有限、鲁棒性不足的问题。其解决方案的关键在于提出一种新型池化层——NIRMAL Pooling,该方法融合自适应最大池化与非线性激活函数(如ReLU),通过动态调整池化参数以匹配目标输出维度,并在池化后引入ReLU激活机制,从而增强特征的表达能力和模型对复杂图像变化的鲁棒性。实验表明,NIRMAL Pooling在多个基准数据集上均优于标准Max Pooling,尤其在CIFAR-10等复杂场景下提升显著。

链接: https://arxiv.org/abs/2508.10940
作者: Nirmal Gaud,Krishna Kumar Jha,Jhimli Adhikari,Adhini Nasarin P S,Joydeep Das,Samarth S Deshpande,Nitasha Barara,Vaduguru Venkata Ramya,Santu Saha,Mehmet Tarik Baran,Sarangi Venkateshwarlu,Anusha M D,Surej Mouli,Preeti Katiyar,Vipin Kumar Chaudhary
机构: ThinkAI - A Machine Learning Community; Guru Nanak Institute of Technology, India; Narayan Zantye College, Goa, India; NIIT Foundation, India; Dayananda Sagar College of Engineering, Bangalore, India; IILM, Lodhi Road, India; Sri Balaji University, Pune, India; University of Calcutta, India; Mardin Artuklu University, Turkey; City University of Hong Kong, Hong Kong; Yenepoya (Deemed to be University), India; Aston University, Birmingham, UK; Delhi Technical Campus, Greater Noida, India; Lovely Professional University, Phagwara, Punjab, India
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:This paper presents NIRMAL Pooling, a novel pooling layer for Convolutional Neural Networks (CNNs) that integrates adaptive max pooling with non-linear activation function for image classification tasks. The acronym NIRMAL stands for Non-linear Activation, Intermediate Aggregation, Reduction, Maximum, Adaptive, and Localized. By dynamically adjusting pooling parameters based on desired output dimensions and applying a Rectified Linear Unit (ReLU) activation post-pooling, NIRMAL Pooling improves robustness and feature expressiveness. We evaluated its performance against standard Max Pooling on three benchmark datasets: MNIST Digits, MNIST Fashion, and CIFAR-10. NIRMAL Pooling achieves test accuracies of 99.25% (vs. 99.12% for Max Pooling) on MNIST Digits, 91.59% (vs. 91.44%) on MNIST Fashion, and 70.49% (vs. 68.87%) on CIFAR-10, demonstrating consistent improvements, particularly on complex datasets. This work highlights the potential of NIRMAL Pooling to enhance CNN performance in diverse image recognition tasks, offering a flexible and reliable alternative to traditional pooling methods.
zh

[CV-93] Deep Learning for Automated Identification of Vietnamese Timber Species: A Tool for Ecological Monitoring and Conservation

【速读】:该论文旨在解决传统木材物种识别方法依赖人工宏观与微观观察、耗时且需专家经验的问题,从而实现自动化、高精度的木材分类。其解决方案的关键在于构建了一个基于实地采集样本的定制图像数据集,并系统评估了五种先进的卷积神经网络架构(ResNet50、EfficientNet、MobileViT、MobileNetV3 和 ShuffleNetV2),最终发现 ShuffleNetV2 在分类性能与计算效率之间达到了最佳平衡,平均准确率达到 99.29%,F1 分数为 99.35%,证明轻量级深度学习模型在资源受限环境中具备实时、高精度识别木材物种的潜力。

链接: https://arxiv.org/abs/2508.10938
作者: Tianyu Song,Van-Doan Duong,Thi-Phuong Le,Ton Viet Ta
机构: Mathematical Modeling Laboratory, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University(九州大学生物资源与生物环境科学研究生院数学建模实验室); International Cooperation Center, Thai Nguyen University(太原大学国际合作中心); Centrer of Crop Research for Adaptation to Climate Change, Thai Nguyen University of Agriculture and Forestry(太原农业林业大学作物适应气候变化研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate identification of wood species plays a critical role in ecological monitoring, biodiversity conservation, and sustainable forest management. Traditional classification approaches relying on macroscopic and microscopic inspection are labor-intensive and require expert knowledge. In this study, we explore the application of deep learning to automate the classification of ten wood species commonly found in Vietnam. A custom image dataset was constructed from field-collected wood samples, and five state-of-the-art convolutional neural network architectures–ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2–were evaluated. Among these, ShuffleNetV2 achieved the best balance between classification performance and computational efficiency, with an average accuracy of 99.29% and F1-score of 99.35% over 20 independent runs. These results demonstrate the potential of lightweight deep learning models for real-time, high-accuracy species identification in resource-constrained environments. Our work contributes to the growing field of ecological informatics by providing scalable, image-based solutions for automated wood classification and forest biodiversity assessment.
zh

[CV-94] Personalized Face Super-Resolution with Identity Decoupling and Fitting

【速读】:该论文旨在解决极端退化场景下人脸超分辨率(Face Super-Resolution, FSR)中因图像信息严重丢失而导致的ID一致性差和幻觉问题,尤其是在8×等大尺度缩放条件下,传统方法难以重建真实且具有身份一致性的面部图像。解决方案的关键在于提出一种基于身份解耦与拟合(Identity Decoupling and Fitting, IDFSR)的新方法,其核心设计包括:1)对低分辨率(Low-Resolution, LR)图像中的面部区域进行掩码处理以去除不可靠的身份线索;2)通过将参考图像形变对齐至LR输入来提供风格引导;3)利用真实图像(Ground Truth, GT)提取的身份嵌入(ID embedding)实现细粒度身份建模与个性化适配。该方法首先预训练一个扩散模型以显式分离风格与身份信息,随后冻结大部分网络参数并仅对少量目标身份图像进行轻量级身份嵌入微调,从而显著提升重建图像的身份一致性与感知质量。

链接: https://arxiv.org/abs/2508.10937
作者: Jiarui Yang,Hang Guo,Wen Huang,Tao Dai,Shutao Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, face super-resolution (FSR) methods have achieved remarkable progress, generally maintaining high image fidelity and identity (ID) consistency under standard settings. However, in extreme degradation scenarios (e.g., scale 8\times ), critical attributes and ID information are often severely lost in the input image, making it difficult for conventional models to reconstruct realistic and ID-consistent faces. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. To address this challenge, we propose a novel FSR method with Identity Decoupling and Fitting (IDFSR), designed to enhance ID restoration under large scaling factors while mitigating hallucination effects. Our approach involves three key designs: 1) \textbfMasking the facial region in the low-resolution (LR) image to eliminate unreliable ID cues; 2) \textbfWarping a reference image to align with the LR input, providing style guidance; 3) Leveraging \textbfID embeddings extracted from ground truth (GT) images for fine-grained ID modeling and personalized adaptation. We first pretrain a diffusion-based model to explicitly decouple style and ID by forcing it to reconstruct masked LR face regions using both style and identity embeddings. Subsequently, we freeze most network parameters and perform lightweight fine-tuning of the ID embedding using a small set of target ID images. This embedding encodes fine-grained facial attributes and precise ID information, significantly improving both ID consistency and perceptual quality. Extensive quantitative evaluations and visual comparisons demonstrate that the proposed IDFSR substantially outperforms existing approaches under extreme degradation, particularly achieving superior performance on ID consistency.
zh

[CV-95] Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

【速读】:该论文旨在解决协同感知(collaborative perception)中3D语义占据预测(3D semantic occupancy prediction)的通信效率与感知精度之间的矛盾问题。现有基于视觉的方法通常依赖密集的3D体素(dense 3D voxels)导致高通信开销,或使用2D平面特征需精确深度估计或额外监督,限制了其在协同场景中的适用性。解决方案的关键在于提出首个利用稀疏3D语义高斯点绘(sparse 3D semantic Gaussian splatting)的协同方法:通过共享和融合中间高斯原语(Gaussian primitives),实现基于邻域的跨车辆融合以去重并抑制噪声;每个原语联合编码几何与语义信息,降低对深度监督的依赖并支持简单刚性对齐;同时传输稀疏、以物体为中心的消息,在保留结构信息的同时显著减少通信量。

链接: https://arxiv.org/abs/2508.10936
作者: Cheng Chen,Hao Huang,Saurabh Bagchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
zh

[CV-96] HQ-OV3D: A High Box Quality Open-World 3D Detection Framework based on Diffision Model

【速读】:该论文旨在解决开放词汇三维目标检测(open-vocabulary 3D detection)中伪标签(pseudo-label)质量不足的问题,尤其是现有方法在生成伪标签时几何精度(如边界框精度)较差,限制了模型对新类别(novel classes)的检测性能。解决方案的关键在于提出一种高边界框质量的开放词汇三维检测框架(HQ-OV3D),其核心创新包括:1)跨模态交叉验证的提案生成器(Intra-Modality Cross-Validated Proposal Generator, IMCV),利用多模态几何一致性生成高质量初始3D提案;2)基于标注类别几何先验的去噪器(Annotated-Class Assisted Denoiser, ACA),通过DDIM(Denoising Diffusion Implicit Models)引导的渐进式精修机制提升提案精度。实验证明,使用该框架生成的伪标签可使新类别的平均精度(mAP)提升7.37%,显著优于现有方法。

链接: https://arxiv.org/abs/2508.10935
作者: Qi Liu,Yabei Li,Hongsong Wang,Lei He
机构: Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. While vision-language models (VLMs) recently have dramatically improved the semantic accuracy of pseudo-labels, their geometric quality, particularly bounding box precision, remains commonly this http URL address this issue, we propose a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework, dedicated to generate and refine high-quality pseudo-labels for open-vocabulary classes. The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising this http URL to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.
zh

[CV-97] ViPE: Video Pose Engine for 3D Geometric Perception

【速读】:该论文旨在解决从真实世界视频中获取一致且精确的三维几何感知信息这一关键挑战,尤其是在缺乏大规模标注数据的情况下。现有方法依赖于大量训练数据,而实际场景中的视频往往存在复杂的相机运动、动态物体和多样的镜头类型(如广角或全景),导致传统方法难以准确估计相机内参、位姿及稠密近度量级深度图。解决方案的核心在于提出 ViPE(Video Processing Engine),其通过高效联合优化相机内参、外参与稠密深度图,实现对非约束条件下原始视频的鲁棒处理,支持多种相机模型(如针孔、广角、360°全景),并在多个基准测试上显著优于未校准位姿估计基线方法(TUM/KITTI 序列提升 18%/50%),同时保持实时性(单 GPU 下 3–5 FPS)。此外,作者利用 ViPE 构建了一个包含约 9600 万帧的大规模标注数据集,涵盖真实互联网视频、AI 生成视频和全景视频,为空间智能系统(Spatial AI)的发展提供高质量基础资源。

链接: https://arxiv.org/abs/2508.10934
作者: Jiahui Huang,Qunjie Zhou,Hesam Rabeti,Aleksandr Korovko,Huan Ling,Xuanchi Ren,Tianchang Shen,Jun Gao,Dmitry Slepichev,Chen-Hsuan Lin,Jiawei Ren,Kevin Xie,Joydeep Biswas,Laura Leal-Taixe,Sanja Fidler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Paper website: this https URL

点击查看摘要

Abstract:Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames – all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
zh

[CV-98] Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications ICCV

【速读】:该论文旨在解决零售环境中相机位姿估计(Camera Pose Estimation)的精度问题,尤其在数据稀缺条件下如何提升绝对位姿回归(Absolute Pose Regression, APR)的准确性。现有方法虽能通过单图预测位姿,但引入视觉与空间场景先验(visual and spatial scene priors)可显著提高精度。其解决方案的关键在于将相机位姿自编码器(Pose Auto-Encoder, PAE)从APR扩展至相对位姿回归(Relative Pose Regression, RPR),并设计一种基于PAE-RPR的重定位优化策略:利用PAE提取的场景结构信息对初始APR结果进行精细化修正,无需额外存储图像或位姿数据即可实现高精度位姿估计。实验表明,该方法在仅使用30%训练数据时仍具竞争力,大幅降低零售场景部署的数据采集成本。

链接: https://arxiv.org/abs/2508.10933
作者: Yoli Shavit,Yosi Keller
机构: Bar Ilan University (巴伊兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to ICCVW 2025

点击查看摘要

Abstract:Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: this https URL
zh

[CV-99] VSF: Simple Efficient and Effective Negative Guidance in Few-Step Image Generation Models By underlineValue underlineSign underlineFlip

【速读】:该论文旨在解决在少步数扩散模型(few-step diffusion models)和流匹配(flow-matching)图像生成模型中,如何高效且有效地引入负向提示词(negative prompt)引导的问题。现有方法如无分类器引导(Classifier-Free Guidance, CFG)、NASA 和 NAG 在负向提示词的抑制能力上存在不足,尤其是在少步数场景下效果受限。其解决方案的关键在于提出 Value Sign Flip (VSF) 方法:通过动态翻转来自负向提示词的注意力值(attention values)的符号,实现对不希望出现内容的抑制,该机制计算开销极小,并能无缝集成到 MMDiT 架构(如 Stable Diffusion 3.5 Turbo)和基于交叉注意力(cross-attention)的模型(如 Wan)中,从而在静态图像与视频生成任务中均显著优于先前方法,在保持图像质量的同时大幅提升负向提示词遵循度。

链接: https://arxiv.org/abs/2508.10931
作者: Wenqi Guo,Shan Du
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available in this https URL.
zh

[CV-100] A Survey on Video Temporal Grounding with Multimodal Large Language Model

【速读】:该论文旨在解决当前视频时序定位(Video Temporal Grounding, VTG)研究中缺乏对基于多模态大语言模型(Multimodal Large Language Models, MLLMs)方法的系统性综述问题。随着MLLM在多模态理解与推理能力上的显著提升,VTG-MLLM方法已展现出优于传统微调模型的性能,并在零样本、多任务和跨域场景下具备更强泛化能力,但现有文献尚未对其进行结构化梳理。解决方案的关键在于构建一个三维分类体系:首先从MLLM的功能角色出发,强调其架构设计的重要性;其次分析训练范式,包括时间推理机制与任务适配策略;最后聚焦视频特征处理技术,评估其对时空表征效果的影响。该框架为深入理解VTG-MLLMs提供了理论基础与实践指导,并指出了未来研究方向。

链接: https://arxiv.org/abs/2508.10922
作者: Jianlong Wu,Wei Liu,Ye Liu,Meng Liu,Liqiang Nie,Zhouchen Lin,Chang Wen Chen
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Shandong Jianzhu University (山东建筑大学); Peking University (北京大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages,6 figures,survey

点击查看摘要

Abstract:The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at this https URL.
zh

[CV-101] Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder

【速读】:该论文旨在解决眼动信号(gaze signals)在跨会话使用中存在用户隐私泄露风险的问题,即未授权情况下可能通过眼动数据实现用户再识别(re-identification)。其解决方案的关键在于提出一种基于潜在噪声自编码器(latent-noise autoencoder)的隐私增强机制,该机制能在显著降低生物特征可识别性的同时,最小化对下游任务(如眼动预测)的可用性影响,并保留符合生理学意义的眼动模式,从而实现更优的隐私-效用权衡(privacy-utility trade-off)。

链接: https://arxiv.org/abs/2508.10918
作者: Samantha Aziz,Oleg Komogortsev
机构: Texas State University (德克萨斯州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: IJCB 2025; 11 pages, 7 figures

点击查看摘要

Abstract:We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.
zh

[CV-102] Semi-Supervised Learning with Online Knowledge Distillation for Skin Lesion Classification

【速读】:该论文旨在解决皮肤病变分类中因依赖大量标注数据而导致的标注成本高、获取困难的问题。现有深度学习方法多采用全监督学习,而本研究提出了一种新颖的半监督深度学习方法,其关键在于将集成学习(ensemble learning)与在线知识蒸馏(online knowledge distillation)相结合:通过训练一个卷积神经网络(Convolutional Neural Network, CNN)集成模型,并利用在线知识蒸馏机制将集成模型的整体知识迁移至各成员模型,从而提升每个个体模型的性能,同时保证最终部署时可直接使用任一成员模型,显著降低对标注数据的依赖并提高资源效率。

链接: https://arxiv.org/abs/2508.11511
作者: Siyamalan Manivannan
机构: University of Jaffna (贾夫纳大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep Learning has emerged as a promising approach for skin lesion analysis. However, existing methods mostly rely on fully supervised learning, requiring extensive labeled data, which is challenging and costly to obtain. To alleviate this annotation burden, this study introduces a novel semi-supervised deep learning approach that integrates ensemble learning with online knowledge distillation for enhanced skin lesion classification. Our methodology involves training an ensemble of convolutional neural network models, using online knowledge distillation to transfer insights from the ensemble to its members. This process aims to enhance the performance of each model within the ensemble, thereby elevating the overall performance of the ensemble itself. Post-training, any individual model within the ensemble can be deployed at test time, as each member is trained to deliver comparable performance to the ensemble. This is particularly beneficial in resource-constrained environments. Experimental results demonstrate that the knowledge-distilled individual model performs better than independently trained models. Our approach demonstrates superior performance on both the \emphInternational Skin Imaging Collaboration 2018 and 2019 public benchmark datasets, surpassing current state-of-the-art results. By leveraging ensemble learning and online knowledge distillation, our method reduces the need for extensive labeled data while providing a more resource-efficient solution for skin lesion classification in real-world scenarios.
zh

[CV-103] Subcortical Masks Generation in CT Images via Ensemble-Based Cross-Domain Label Transfer

【速读】:该论文旨在解决CT影像中亚皮层结构(subcortical structures)自动分割缺乏高质量标注数据的问题。由于CT在临床中广泛应用,但其标注数据远少于MRI,导致基于深度学习的分割模型难以训练。解决方案的关键在于提出一种自动集成框架(automatic ensemble framework),通过融合已有的MRI分割模型来生成CT图像上的高质量标签,并利用配对的MRI-CT数据构建首个公开的CT亚皮层分割数据集。该方法有效缓解了标注数据稀缺问题,显著提升了CT分割性能,为后续相关研究提供了重要资源与基础。

链接: https://arxiv.org/abs/2508.11450
作者: Augustine X. W. Lee,Pak-Hei Yeung,Jagath C. Rajapakse
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Subcortical segmentation in neuroimages plays an important role in understanding brain anatomy and facilitating computer-aided diagnosis of traumatic brain injuries and neurodegenerative disorders. However, training accurate automatic models requires large amounts of labelled data. Despite the availability of publicly available subcortical segmentation datasets for Magnetic Resonance Imaging (MRI), a significant gap exists for Computed Tomography (CT). This paper proposes an automatic ensemble framework to generate high-quality subcortical segmentation labels for CT scans by leveraging existing MRI-based models. We introduce a robust ensembling pipeline to integrate them and apply it to unannotated paired MRI-CT data, resulting in a comprehensive CT subcortical segmentation dataset. Extensive experiments on multiple public datasets demonstrate the superior performance of our proposed framework. Furthermore, using our generated CT dataset, we train segmentation models that achieve improved performance on related segmentation tasks. To facilitate future research, we make our source code, generated dataset, and trained models publicly available at this https URL, marking the first open-source release for CT subcortical segmentation to the best of our knowledge.
zh

[CV-104] LKFMixer: Exploring Large Kernel Feature For Efficient Image Super-Resolution

【速读】:该论文旨在解决Transformer架构在图像超分辨率(Image Super-Resolution, ISR)任务中虽能有效捕捉非局部信息(non-local information)但计算开销过大、难以部署于轻量化场景的问题。其核心解决方案是提出一种纯卷积神经网络(CNN)模型LKFMixer,关键在于通过使用大感受野的卷积核(kernel size=31)模拟自注意力机制对非局部特征的建模能力,并借助坐标分解(coordinate decomposition)显著降低参数量与计算复杂度;同时引入空间特征调制块(SFMB)增强多维特征聚焦能力,并设计特征选择块(FSB)自适应融合局部与非局部特征权重,从而在保持高重建质量的同时实现高效推理。

链接: https://arxiv.org/abs/2508.11391
作者: Yinggan Tang,Quanwei Hu
机构: Yanshan University (燕山大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The success of self-attention (SA) in Transformer demonstrates the importance of non-local information to image super-resolution (SR), but the huge computing power required makes it difficult to implement lightweight models. To solve this problem, we propose a pure convolutional neural network (CNN) model, LKFMixer, which utilizes large convolutional kernel to simulate the ability of self-attention to capture non-local features. Specifically, we increase the kernel size to 31 to obtain the larger receptive field as possible, and reduce the parameters and computations by coordinate decomposition. Meanwhile, a spatial feature modulation block (SFMB) is designed to enhance the focus of feature information on both spatial and channel dimension. In addition, by introducing feature selection block (FSB), the model can adaptively adjust the weights between local features and non-local features. Extensive experiments show that the proposed LKFMixer family outperform other state-of-the-art (SOTA) methods in terms of SR performance and reconstruction quality. In particular, compared with SwinIR-light on Manga109 dataset, LKFMixer-L achieves 0.6dB PSNR improvement at \times 4 scale, while the inference speed is \times 5 times faster. The code is available at this https URL.
zh

[CV-105] AnatoMaskGAN: GNN-Driven Slice Feature Fusion and Noise Augmentation for Medical Semantic Image Synthesis

【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学图像语义掩码合成中存在的一对一图像映射和复杂扫描图像空间一致性不足的问题。其解决方案的关键在于提出 AnatoMaskGAN 框架,通过引入基于图神经网络(GNN)的强相关切片特征融合模块以建模切片间的空间关系并整合邻近切片的上下文信息,设计三维空间噪声注入策略以增强结构多样性建模,并结合灰度-纹理分类器优化生成过程中的灰度分布与纹理表征,从而显著提升重建精度和感知质量。

链接: https://arxiv.org/abs/2508.11375
作者: Zonglin Wu,Yule Xue,Qianxiang Hu,Yaoyao Feng,Yuqi Ma,Shanxiong Chen
机构: Southwest University (西南大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Medical semantic-mask synthesis boosts data augmentation and analysis, yet most GAN-based approaches still produce one-to-one images and lack spatial consistency in complex scans. To address this, we propose AnatoMaskGAN, a novel synthesis framework that embeds slice-related spatial features to precisely aggregate inter-slice contextual dependencies, introduces diverse image-augmentation strategies, and optimizes deep feature learning to improve performance on complex medical images. Specifically, we design a GNN-based strongly correlated slice-feature fusion module to model spatial relationships between slices and integrate contextual information from neighboring slices, thereby capturing anatomical details more comprehensively; we introduce a three-dimensional spatial noise-injection strategy that weights and fuses spatial features with noise to enhance modeling of structural diversity; and we incorporate a grayscale-texture classifier to optimize grayscale distribution and texture representation during generation. Extensive experiments on the public L2R-OASIS and L2R-Abdomen CT datasets show that AnatoMaskGAN raises PSNR on L2R-OASIS to 26.50 dB (0.43 dB higher than the current state of the art) and achieves an SSIM of 0.8602 on L2R-Abdomen CT–a 0.48 percentage-point gain over the best model, demonstrating its superiority in reconstruction accuracy and perceptual quality. Ablation studies that successively remove the slice-feature fusion module, spatial 3D noise-injection strategy, and grayscale-texture classifier reveal that each component contributes significantly to PSNR, SSIM, and LPIPS, further confirming the independent value of each core design in enhancing reconstruction accuracy and perceptual quality.
zh

[CV-106] Guiding WaveMamba with Frequency Maps for Image Debanding

【速读】:该论文旨在解决现代视频编码器在低比特率压缩下产生的带状伪影(banding artifacts)问题,尤其在天空等平滑区域尤为明显,此类伪影会显著降低视觉质量,并因多次转码而在用户生成内容中广泛存在。其解决方案的关键在于提出一种基于小波状态空间模型(Wavelet State Space Model)与频率掩蔽图(frequency masking map)相结合的后处理修复方法,能够在有效抑制带状伪影的同时保留图像的高频纹理细节,实验表明该方法在公开数据集上相比当前最优方法(DBI值为0.082)具有更优的去伪影效果且视觉质量更佳。

链接: https://arxiv.org/abs/2508.11331
作者: Xinyi Wang,Smaranda Tasmoc,Nantheera Anantrasirichai,Angeliki Katsenou
机构: University of Bristol (布里斯托大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: this https URL.
zh

[CV-107] mporally-Similar Structure-Aware Spatiotemporal Fusion of Satellite Images

【速读】:该论文旨在解决卫星图像时空融合(Spatiotemporal Fusion, STF)中因噪声干扰导致的细节丢失与过度平滑问题。现有噪声鲁棒型ST融合方法往往难以保留精细空间结构,从而产生伪影。其解决方案的关键在于提出一种名为Temporally-Similar Structure-Aware ST Fusion (TSSTF) 的新框架,包含两个核心机制:一是时序引导的总变差正则化(Temporally-Guided Total Variation, TGTV),通过邻近日期的高空间分辨率参考图像引导,实现空间分段平滑的同时保持结构细节;二是时序边缘约束(Temporally-Guided Edge Constraint, TGEC),强制相邻时相图像在边缘位置上保持一致性,同时允许光谱差异存在。该方法将融合任务建模为带TGTV和TGEC约束的优化问题,并采用预条件原始对偶分裂算法高效求解,在噪声环境下显著优于当前最优方法。

链接: https://arxiv.org/abs/2508.11259
作者: Ryosuke Isono,Shunsuke Ono
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Geoscience and Remote Sensing. arXiv admin note: text overlap with arXiv:2308.00500

点击查看摘要

Abstract:This paper proposes a novel spatiotemporal (ST) fusion framework for satellite images, named Temporally-Similar Structure-Aware ST fusion (TSSTF). ST fusion is a promising approach to address the trade-off between the spatial and temporal resolution of satellite images. In real-world scenarios, observed satellite images are severely degraded by noise due to measurement equipment and environmental conditions. Consequently, some recent studies have focused on enhancing the robustness of ST fusion methods against noise. However, existing noise-robust ST fusion approaches often fail to capture fine spatial structure, leading to oversmoothing and artifacts. To address this issue, TSSTF introduces two key mechanisms: Temporally-Guided Total Variation (TGTV) and Temporally-Guided Edge Constraint (TGEC). TGTV is a novel regularization function that promotes spatial piecewise smoothness while preserving structural details, guided by a reference high spatial resolution image acquired on a nearby date. TGEC enforces consistency in edge locations between two temporally adjacent images, while allowing for spectral variations. We formulate the ST fusion task as a constrained optimization problem incorporating TGTV and TGEC, and develop an efficient algorithm based on a preconditioned primal-dual splitting method. Experimental results demonstrate that TSSTF performs comparably to state-of-the-art methods under noise-free conditions and outperforms them under noisy conditions. Additionally, we provide a comprehensive set of recommended parameter values that consistently yield high performance across diverse target regions and noise conditions, aiming to enhance reproducibility and practical utility.
zh

[CV-108] Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

【速读】:该论文旨在解决医学计算机断层扫描(CT)中因扫描对象超出设备视野(FOV)而导致的投影数据截断问题,此问题会引发重建图像在FOV边界附近出现显著伪影,严重影响临床诊断可靠性。传统重建算法难以从截断数据中恢复准确解剖结构,而现有深度学习方法虽有所改进,但扩散生成模型存在推理速度慢的问题。解决方案的关键在于提出一种基于图像到图像薛定谔桥(I²SB)扩散模型的高效FOV扩展框架:该模型不依赖纯高斯噪声迭代采样,而是直接学习有限FOV与扩展FOV图像之间的随机映射关系,从而实现单步推理(每2D切片仅需0.19秒),同时保持优于当前最优扩散模型(如cDDPM和patch-based diffusion)的重建精度(RMSE低至49.8 HU,模拟噪声数据下;152.0 HU,真实数据下),兼顾了生成质量与计算效率,具备临床实时部署潜力。

链接: https://arxiv.org/abs/2508.11211
作者: Zhenhao Li,Long Yang,Xiaojie Yin,Haijun Yu,Jiazhou Wang,Hongbin Han,Weigang Hu,Yixing Huang
机构: Shanghai Cancer Center, Fudan University (复旦大学上海癌症中心); Institute of Medical Technology, Peking University Health Science Center (北京大学医学部医学技术研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner’s field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I ^2 SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I ^2 SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I ^2 SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8,HU on simulated noisy data and 152.0HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135s) and surpassing diffusionGAN (0.58s), the second fastest. This combination of accuracy and efficiency makes I ^2 SB highly suitable for real-time or clinical deployment.
zh

[CV-109] HistoViT: Vision Transformer for Accurate and Scalable Histopathological Cancer Diagnosis

【速读】:该论文旨在解决现代病理学中癌症精准且可扩展诊断的挑战,尤其针对乳腺癌、前列腺癌、骨肉瘤和宫颈癌等具有复杂组织学变异性的恶性肿瘤。其解决方案的关键在于提出一种基于Transformer的深度学习框架,利用微调后的视觉Transformer(Vision Transformer, ViT)架构替代传统卷积神经网络(Convolutional Neural Networks, CNNs),从而在减少预处理需求的同时提升模型性能与跨组织类型的可扩展性。通过构建简化的预处理流程将全切片图像转换为PyTorch张量并进行标准化,确保与ViT架构兼容,并显著增强收敛稳定性和分类准确性,最终在四个基准数据集上均实现超过96%的分类准确率和AUC值,验证了该方法在数字病理学中的鲁棒性、泛化能力及临床应用潜力。

链接: https://arxiv.org/abs/2508.11181
作者: Faisal Ahmed
机构: Embry-Riddle Aeronautical University (Embry-Riddle 航空大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 3 Figures

点击查看摘要

Abstract:Accurate and scalable cancer diagnosis remains a critical challenge in modern pathology, particularly for malignancies such as breast, prostate, bone, and cervical, which exhibit complex histological variability. In this study, we propose a transformer-based deep learning framework for multi-class tumor classification in histopathological images. Leveraging a fine-tuned Vision Transformer (ViT) architecture, our method addresses key limitations of conventional convolutional neural networks, offering improved performance, reduced preprocessing requirements, and enhanced scalability across tissue types. To adapt the model for histopathological cancer images, we implement a streamlined preprocessing pipeline that converts tiled whole-slide images into PyTorch tensors and standardizes them through data normalization. This ensures compatibility with the ViT architecture and enhances both convergence stability and overall classification performance. We evaluate our model on four benchmark datasets: ICIAR2018 (breast), SICAPv2 (prostate), UT-Osteosarcoma (bone), and SipakMed (cervical) dataset – demonstrating consistent outperformance over existing deep learning methods. Our approach achieves classification accuracies of 99.32%, 96.92%, 95.28%, and 96.94% for breast, prostate, bone, and cervical cancers respectively, with area under the ROC curve (AUC) scores exceeding 99% across all datasets. These results confirm the robustness, generalizability, and clinical potential of transformer-based architectures in digital pathology. Our work represents a significant advancement toward reliable, automated, and interpretable cancer diagnosis systems that can alleviate diagnostic burdens and improve healthcare outcomes.
zh

[CV-110] Deep Learning-Based Automated Segmentation of Uterine Myomas

【速读】:该论文旨在解决子宫肌瘤(uterine fibroids)在磁共振成像(MRI)图像中手动分割效率低、耗时长且存在个体差异的问题,从而影响临床治疗决策的准确性与一致性。其解决方案的关键在于利用公开的子宫肌瘤MRI数据集(Uterine Myoma MRI Dataset, UMD),建立自动化分割的基准方法,以实现标准化评估并推动该领域后续研究的发展。通过引入深度学习算法,该研究有望实现对肌瘤体积、形态及空间位置的精准自动分割,提升诊断效率和可重复性。

链接: https://arxiv.org/abs/2508.11010
作者: Tausifa Jan Saleem,Mohammad Yaqub
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Uterine fibroids (myomas) are the most common benign tumors of the female reproductive system, particularly among women of childbearing age. With a prevalence exceeding 70%, they pose a significant burden on female reproductive health. Clinical symptoms such as abnormal uterine bleeding, infertility, pelvic pain, and pressure-related discomfort play a crucial role in guiding treatment decisions, which are largely influenced by the size, number, and anatomical location of the fibroids. Magnetic Resonance Imaging (MRI) is a non-invasive and highly accurate imaging modality commonly used by clinicians for the diagnosis of uterine fibroids. Segmenting uterine fibroids requires a precise assessment of both the uterus and fibroids on MRI scans, including measurements of volume, shape, and spatial location. However, this process is labor intensive and time consuming and subjected to variability due to intra- and inter-expert differences at both pre- and post-treatment stages. As a result, there is a critical need for an accurate and automated segmentation method for uterine fibroids. In recent years, deep learning algorithms have shown re-markable improvements in medical image segmentation, outperforming traditional methods. These approaches offer the potential for fully automated segmentation. Several studies have explored the use of deep learning models to achieve automated segmentation of uterine fibroids. However, most of the previous work has been conducted using private datasets, which poses challenges for validation and comparison between studies. In this study, we leverage the publicly available Uterine Myoma MRI Dataset (UMD) to establish a baseline for automated segmentation of uterine fibroids, enabling standardized evaluation and facilitating future research in this domain.
zh

[CV-111] he Role of Radiographic Knee Alignment in Knee Replacement Outcomes and Opportunities for Artificial Intelligence-Driven Assessment

【速读】:该论文旨在解决膝关节骨性关节炎(Knee Osteoarthritis, OA)患者在接受全膝关节置换术(Total Knee Replacement, TKR)后,术后并发症和恢复情况难以提前预测的问题。其关键在于识别与TKR预后相关的膝关节对线(knee alignment)生物标志物,并借助人工智能(Artificial Intelligence, AI)技术从膝关节X光片中自动提取这些生物标志物,从而为术前风险评估和个体化治疗提供依据。

链接: https://arxiv.org/abs/2508.10941
作者: Zhisen Hu,David S. Johnson,Aleksei Tiulpin,Timothy F. Cootes,Claudia Lindner
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prevalent knee osteoarthritis (OA) imposes substantial burden on health systems with no cure available. Its ultimate treatment is total knee replacement (TKR). Complications from surgery and recovery are difficult to predict in advance, and numerous factors may affect them. Radiographic knee alignment is one of the key factors that impacts TKR outcomes, affecting outcomes such as postoperative pain or function. Recently, artificial intelligence (AI) has been introduced to the automatic analysis of knee radiographs, for example, to automate knee alignment measurements. Existing review articles tend to focus on knee OA diagnosis and segmentation of bones or cartilages in MRI rather than exploring knee alignment biomarkers for TKR outcomes and their assessment. In this review, we first examine the current scoring protocols for evaluating TKR outcomes and potential knee alignment biomarkers associated with these outcomes. We then discuss existing AI-based approaches for generating knee alignment biomarkers from knee radiographs, and explore future directions for knee alignment assessment and TKR outcome prediction.
zh

人工智能

[AI-0] Pretrained Conformers for Audio Fingerprinting and Retrieval

【速读】:该论文旨在解决音频检索任务中对短时音频片段(仅3秒)生成鲁棒且具有判别力的嵌入表示的问题,同时提升模型在时间错位、噪声、混响及极端时长拉伸等常见音频失真情况下的泛化能力。其解决方案的关键在于采用自监督对比学习框架训练基于Conformer结构的编码器,该架构能够有效捕捉音频信号的局部与全局依赖关系(local and global interactions),从而生成对微小扰动不敏感的高质量嵌入表示,显著优于传统方法在多种复杂场景下的性能表现。

链接: https://arxiv.org/abs/2508.11609
作者: Kemal Altwlkany,Elmedin Selmanovic,Sead Delalic
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.
zh

[AI-1] CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection

【速读】:该论文旨在解决密码算法实现中难以检测的细微逻辑缺陷问题,这些问题常导致严重的安全漏洞。解决方案的关键在于提出CryptoScope框架,该框架基于大型语言模型(Large Language Models, LLMs),融合了思维链(Chain-of-Thought, CoT)提示与检索增强生成(Retrieval-Augmented Generation, RAG)技术,并依托一个包含超过12,000条条目的专业密码学知识库进行引导,从而实现对加密代码中潜在漏洞的自动化识别与定位。

链接: https://arxiv.org/abs/2508.11599
作者: Zhihao Li,Zimo Ji,Tao Zheng,Hao Ren,Xiao Lan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cryptographic algorithms are fundamental to modern security, yet their implementations frequently harbor subtle logic flaws that are hard to detect. We introduce CryptoScope, a novel framework for automated cryptographic vulnerability detection powered by Large Language Models (LLMs). CryptoScope combines Chain-of-Thought (CoT) prompting with Retrieval-Augmented Generation (RAG), guided by a curated cryptographic knowledge base containing over 12,000 entries. We evaluate CryptoScope on LLM-CLVA, a benchmark of 92 cases primarily derived from real-world CVE vulnerabilities, complemented by cryptographic challenges from major Capture The Flag (CTF) competitions and synthetic examples across 11 programming languages. CryptoScope consistently improves performance over strong LLM baselines, boosting DeepSeek-V3 by 11.62%, GPT-4o-mini by 20.28%, and GLM-4-Flash by 28.69%. Additionally, it identifies 9 previously undisclosed flaws in widely used open-source cryptographic projects.
zh

[AI-2] A Comprehensive Perspective on Explainable AI across the Machine Learning Workflow

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)模型在科学与工业应用中因缺乏透明性而引发的信任问题,尤其是传统可解释人工智能(Explainable Artificial Intelligence, XAI)方法仅关注单个预测的解释,忽视了从数据输入到模型输出全流程中的决策与质量验证环节。解决方案的关键在于提出一种用户中心的端到端框架——整体式可解释人工智能(Holistic Explainable Artificial Intelligence, HXAI),其核心是将解释嵌入数据分析流程的六个关键组件(数据、分析设置、学习过程、模型输出、模型质量、沟通渠道),并基于领域专家、数据分析师和数据科学家的不同需求构建一个包含112项问题的问卷库,从而实现解释内容的精准匹配与认知可管理性;同时,HXAI通过整合人类解释理论、人机交互原则及实证用户研究,形成一套可操作的分类体系,推动现有工具链的系统性覆盖分析,并进一步利用大语言模型驱动的AI代理协调多样化解释技术,生成面向不同利益相关者的叙事,弥合AI开发者与领域专家之间的理解鸿沟。

链接: https://arxiv.org/abs/2508.11529
作者: George Paterakis,Andrea Castellani,George Papoutsoglou,Tobias Rodemann,Ioannis Tsamardinos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Currently under review at “Artificial Intelligence Review” journal

点击查看摘要

Abstract:Artificial intelligence is reshaping science and industry, yet many users still regard its models as opaque “black boxes”. Conventional explainable artificial-intelligence methods clarify individual predictions but overlook the upstream decisions and downstream quality checks that determine whether insights can be trusted. In this work, we present Holistic Explainable Artificial Intelligence (HXAI), a user-centric framework that embeds explanation into every stage of the data-analysis workflow and tailors those explanations to users. HXAI unifies six components (data, analysis set-up, learning process, model output, model quality, communication channel) into a single taxonomy and aligns each component with the needs of domain experts, data analysts and data scientists. A 112-item question bank covers these needs; our survey of contemporary tools highlights critical coverage gaps. Grounded in theories of human explanation, principles from human-computer interaction and findings from empirical user studies, HXAI identifies the characteristics that make explanations clear, actionable and cognitively manageable. A comprehensive taxonomy operationalises these insights, reducing terminological ambiguity and enabling rigorous coverage analysis of existing toolchains. We further demonstrate how AI agents that embed large-language models can orchestrate diverse explanation techniques, translating technical artifacts into stakeholder-specific narratives that bridge the gap between AI developers and domain experts. Departing from traditional surveys or perspective articles, this work melds concepts from multiple disciplines, lessons from real-world projects and a critical synthesis of the literature to advance a novel, end-to-end viewpoint on transparency, trustworthiness and responsible AI deployment.
zh

[AI-3] Inspire or Predict? Exploring New Paradigms in Assisting Classical Planners with Large Language Models

【速读】:该论文旨在解决大规模规划问题中因状态空间爆炸(state-space explosion)导致的求解效率低下问题,尤其在对象和动作数量增长时,传统规划方法难以有效搜索可行解。其解决方案的关键在于提出一种集成问题分解(problem decomposition)与大语言模型(Large Language Models, LLMs)的新型规划框架,通过将复杂任务拆分为多个子任务来降低难度,并引入两种新颖的LLM辅助范式:LLM4Inspire(基于通用知识提供启发式指导)和LLM4Predict(利用领域特定知识推断中间条件)。实验表明,融合领域知识的LLM4Predict策略相比仅依赖通用知识的LLM4Inspire,在剪枝搜索空间和定位可行解方面更具优势,体现了LLM与领域知识协同增强规划能力的有效性。

链接: https://arxiv.org/abs/2508.11524
作者: Wenkai Yu,Jianhang Tang,Yang Zhang,Shanjiang Tang,Kebing Jin,Hankz Hankui Zhuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Addressing large-scale planning problems has become one of the central challenges in the planning community, deriving from the state-space explosion caused by growing objects and actions. Recently, researchers have explored the effectiveness of leveraging Large Language Models (LLMs) to generate helpful actions and states to prune the search space. However, prior works have largely overlooked integrating LLMs with domain-specific knowledge to ensure valid plans. In this paper, we propose a novel LLM-assisted planner integrated with problem decomposition, which first decomposes large planning problems into multiple simpler sub-tasks. Then we explore two novel paradigms to utilize LLMs, i.e., LLM4Inspire and LLM4Predict, to assist problem decomposition, where LLM4Inspire provides heuristic guidance according to general knowledge and LLM4Predict employs domain-specific knowledge to infer intermediate conditions. We empirically validate the effectiveness of our planner across multiple domains, demonstrating the ability of search space partition when solving large-scale planning problems. The experimental results show that LLMs effectively locate feasible solutions when pruning the search space, where infusing domain-specific knowledge into LLMs, i.e., LLM4Predict, holds particular promise compared with LLM4Inspire, which offers general knowledge within LLMs.
zh

[AI-4] Weighted First Order Model Counting for Two-variable Logic with Axioms on Two Relations

【速读】:该论文旨在解决加权一阶模型计数问题(Weighted First-Order Model Counting, WFOMC)在扩展两变量片段(\textFO^2)时的复杂性边界问题,特别是当引入涉及多个关系的公理时,其计算难度是否仍保持在多项式时间内。研究发现,若在\textFO^2中添加两个线性序关系或两个无环关系,则WFOMC变为#P1\mathsf\#P_1-难;而另一方面,对于带有线性序关系、其后继关系以及另一个后继关系的\textC^2扩展形式,作者提出了一个在域大小上多项式时间内的算法,从而明确了多关系公理对WFOMC复杂性的影响边界。解决方案的关键在于区分不同类型的二元关系公理(如线性序与无环)对复杂性的提升效应,并设计出针对特定结构组合的高效计数算法。

链接: https://arxiv.org/abs/2508.11515
作者: Qipeng Kuang,Václav Kůla,Ondřej Kuželka,Yuanhong Wang,Yuyi Wang
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:The Weighted First-Order Model Counting Problem (WFOMC) asks to compute the weighted sum of models of a given first-order logic sentence over a given domain. The boundary between fragments for which WFOMC can be computed in polynomial time relative to the domain size lies between the two-variable fragment ( \textFO^2 ) and the three-variable fragment ( \textFO^3 ). It is known that WFOMC for \FOthree is \mathsf#P_1 -hard while polynomial-time algorithms exist for computing WFOMC for \textFO^2 and \textC^2 , possibly extended by certain axioms such as the linear order axiom, the acyclicity axiom, and the connectedness axiom. All existing research has concentrated on extending the fragment with axioms on a single distinguished relation, leaving a gap in understanding the complexity boundary of axioms on multiple relations. In this study, we explore the extension of the two-variable fragment by axioms on two relations, presenting both negative and positive results. We show that WFOMC for \textFO^2 with two linear order relations and \textFO^2 with two acyclic relations are \mathsf#P_1 -hard. Conversely, we provide an algorithm in time polynomial in the domain size for WFOMC of \textC^2 with a linear order relation, its successor relation and another successor relation.
zh

[AI-5] owards Faithful Class-level Self-explainability in Graph Neural Networks by Subgraph Dependencies

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在类级别解释性不足的问题,即现有自解释GNN模型(如ProtGNN和PGIB)虽然能生成实例级解释,但其学习的类别特定原型是否具备跨实例的泛化能力尚不明确。为解决此问题,作者提出GraphOracle框架,其核心创新在于通过联合训练机制同时学习GNN分类器与一组结构化、稀疏且具有判别性的子图,这些子图对应于每个类别,并采用基于掩码的评估策略验证其类级别解释的有效性。该方法利用熵正则化的子图选择与轻量级随机游走提取技术,显著提升了训练效率并避免了传统方法(如蒙特卡洛树搜索)的计算瓶颈,从而实现了高保真度、可扩展的类级别自解释能力。

链接: https://arxiv.org/abs/2508.11513
作者: Fanzhen Liu,Xiaoxiao Ma,Jian Yang,Alsharif Abuadbba,Kristen Moore,Surya Nepal,Cecile Paris,Quan Z. Sheng,Jia Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:Enhancing the interpretability of graph neural networks (GNNs) is crucial to ensure their safe and fair deployment. Recent work has introduced self-explainable GNNs that generate explanations as part of training, improving both faithfulness and efficiency. Some of these models, such as ProtGNN and PGIB, learn class-specific prototypes, offering a potential pathway toward class-level explanations. However, their evaluations focus solely on instance-level explanations, leaving open the question of whether these prototypes meaningfully generalize across instances of the same class. In this paper, we introduce GraphOracle, a novel self-explainable GNN framework designed to generate and evaluate class-level explanations for GNNs. Our model jointly learns a GNN classifier and a set of structured, sparse subgraphs that are discriminative for each class. We propose a novel integrated training that captures graph \unicodex2013 subgraph \unicodex2013 prediction dependencies efficiently and faithfully, validated through a masking-based evaluation strategy. This strategy enables us to retroactively assess whether prior methods like ProtGNN and PGIB deliver effective class-level explanations. Our results show that they do not. In contrast, GraphOracle achieves superior fidelity, explainability, and scalability across a range of graph classification tasks. We further demonstrate that GraphOracle avoids the computational bottlenecks of previous methods \unicodex2014 like Monte Carlo Tree Search \unicodex2014 by using entropy-regularized subgraph selection and lightweight random walk extraction, enabling faster and more scalable training. These findings position GraphOracle as a practical and principled solution for faithful class-level self-explainability in GNNs.
zh

[AI-6] Sim2Dust: Mastering Dynamic Waypoint Tracking on Granular Media

【速读】:该论文旨在解决在非结构化行星表面(如月球)上实现可靠自主导航的问题,特别是由于学习型控制器存在“仿真到现实的差距”(sim-to-real gap),尤其是在轮子与颗粒介质相互作用的复杂动力学场景下。解决方案的关键在于构建一个完整的“仿真到现实”(sim-to-real)框架,通过大规模并行仿真训练强化学习代理(reinforcement learning agents),覆盖大量程序化生成且物理参数随机化的环境;随后将训练好的策略零样本迁移(zero-shot transfer)至实际轮式探测车,在月球类比设施中验证其性能。实验证明,基于程序多样性训练的智能体在零样本迁移中表现更优,而高保真粒子物理微调虽提升低速精度但计算成本显著增加,从而确立了一套可验证的学习型导航系统开发流程。

链接: https://arxiv.org/abs/2508.11503
作者: Andrej Orsula,Matthieu Geist,Miguel Olivares-Mendez,Carol Martinez
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The source code is available at this https URL

点击查看摘要

Abstract:Reliable autonomous navigation across the unstructured terrains of distant planetary surfaces is a critical enabler for future space exploration. However, the deployment of learning-based controllers is hindered by the inherent sim-to-real gap, particularly for the complex dynamics of wheel interactions with granular media. This work presents a complete sim-to-real framework for developing and validating robust control policies for dynamic waypoint tracking on such challenging surfaces. We leverage massively parallel simulation to train reinforcement learning agents across a vast distribution of procedurally generated environments with randomized physics. These policies are then transferred zero-shot to a physical wheeled rover operating in a lunar-analogue facility. Our experiments systematically compare multiple reinforcement learning algorithms and action smoothing filters to identify the most effective combinations for real-world deployment. Crucially, we provide strong empirical evidence that agents trained with procedural diversity achieve superior zero-shot performance compared to those trained on static scenarios. We also analyze the trade-offs of fine-tuning with high-fidelity particle physics, which offers minor gains in low-speed precision at a significant computational cost. Together, these contributions establish a validated workflow for creating reliable learning-based navigation systems, marking a critical step towards deploying autonomous robots in the final frontier.
zh

[AI-7] Landmark-Assisted Monte Carlo Planning

【速读】:该论文旨在解决在随机决策过程(Markov Decision Process, MDP)中如何提升在线规划算法性能的问题,特别是针对传统基于贪婪策略的UCT(Upper Confidence bounds applied to Trees)算法在复杂随机环境中效率不足的局限。解决方案的关键在于引入概率性地标(probabilistic landmarks),并将其作为子目标来分解MDP问题;其核心创新在于对贪心式地标达成与最终目标达成之间的平衡机制进行优化——这一平衡策略直接影响算法性能,且具有问题依赖性。实验表明,合理选择的地标能显著提升UCT在基准测试中的表现,为任意时间(anytime)算法求解MDP提供了有效引导。

链接: https://arxiv.org/abs/2508.11493
作者: David H. Chan,Mark Roberts,Dana S. Nau
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To be published in the Proceedings of the 28th European Conference on Artificial Intelligence

点击查看摘要

Abstract:Landmarks \unicodex2013 conditions that must be satisfied at some point in every solution plan \unicodex2013 have contributed to major advancements in classical planning, but they have seldom been used in stochastic domains. We formalize probabilistic landmarks and adapt the UCT algorithm to leverage them as subgoals to decompose MDPs; core to the adaptation is balancing between greedy landmark achievement and final goal achievement. Our results in benchmark domains show that well-chosen landmarks can significantly improve the performance of UCT in online probabilistic planning, while the best balance of greedy versus long-term goal achievement is problem-dependent. The results suggest that landmarks can provide helpful guidance for anytime algorithms solving MDPs.
zh

[AI-8] RMSL: Weakly-Supervised Insider Threat Detection with Robust Multi-sphere Learning

【速读】:该论文旨在解决行为级内部威胁检测(behavior-level insider threat detection)中因缺乏细粒度行为标注而导致的异常检测困难问题。现有无监督方法因正常与异常行为间存在固有模糊性,常面临高误报率和漏报率。解决方案的关键在于引入弱标签(weak labels)——即以行为序列整体为单位进行标注(正常或异常),而非逐行为标注,从而降低标注成本并提升模型对行为级异常的区分能力。其核心创新是提出鲁棒多球学习框架(Robust Multi-sphere Learning, RMSL),通过构建多个超球体表示正常行为模式,并结合多实例学习与基于预测置信度的行为级自训练去偏机制,在弱序列标签指导下优化超球体边界与特征表示,显著增强模型在行为层级上的判别能力。

链接: https://arxiv.org/abs/2508.11472
作者: Yang Wang,Yaxin Zhao,Xinyu Jiao,Sihan Xu,Xiangrui Cai,Ying Zhang,Xiaojie Yuan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages

点击查看摘要

Abstract:Insider threat detection aims to identify malicious user behavior by analyzing logs that record user interactions. Due to the lack of fine-grained behavior-level annotations, detecting specific behavior-level anomalies within user behavior sequences is challenging. Unsupervised methods face high false positive rates and miss rates due to the inherent ambiguity between normal and anomalous behaviors. In this work, we instead introduce weak labels of behavior sequences, which have lower annotation costs, i.e., the training labels (anomalous or normal) are at sequence-level instead of behavior-level, to enhance the detection capability for behavior-level anomalies by learning discriminative features. To achieve this, we propose a novel framework called Robust Multi-sphere Learning (RMSL). RMSL uses multiple hyper-spheres to represent the normal patterns of behaviors. Initially, a one-class classifier is constructed as a good anomaly-supervision-free starting point. Building on this, using multiple instance learning and adaptive behavior-level self-training debiasing based on model prediction confidence, the framework further refines hyper-spheres and feature representations using weak sequence-level labels. This approach enhances the model’s ability to distinguish between normal and anomalous behaviors. Extensive experiments demonstrate that RMSL significantly improves the performance of behavior-level insider threat detection.
zh

[AI-9] Informative Post-Hoc Explanations Only Exist for Simple Functions

【速读】:该论文试图解决的问题是:如何在理论上界定解释算法(explanation algorithms)对复杂机器学习模型决策函数的“信息性”(informative),并评估现有主流解释方法是否能在复杂模型上提供可靠且有意义的洞察。其核心挑战在于,尽管局部后验解释算法被广泛用于理解黑箱模型行为,但此前缺乏严格的理论保障,尤其是在面对复杂决策函数时。解决方案的关键在于提出一个基于学习理论的通用框架,定义“信息性解释”为能够降低可合理决策函数空间复杂度的解释;在此基础上,证明多数常用解释算法(如梯度、反事实、SHAP 和锚定解释)在特定复杂模型类别(如可微函数或决策树)下不具备信息性,并进一步推导出使这些算法变得信息性的充分条件——这些条件通常比预期更强,从而揭示了当前解释方法的局限性,并为改进算法提供了理论依据。

链接: https://arxiv.org/abs/2508.11441
作者: Eric Günther,Balázs Szabados,Robi Bhattacharjee,Sebastian Bordt,Ulrike von Luxburg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many researchers have suggested that local post-hoc explanation algorithms can be used to gain insights into the behavior of complex machine learning models. However, theoretical guarantees about such algorithms only exist for simple decision functions, and it is unclear whether and under which assumptions similar results might exist for complex models. In this paper, we introduce a general, learning-theory-based framework for what it means for an explanation to provide information about a decision function. We call an explanation informative if it serves to reduce the complexity of the space of plausible decision functions. With this approach, we show that many popular explanation algorithms are not informative when applied to complex decision functions, providing a rigorous mathematical rejection of the idea that it should be possible to explain any model. We then derive conditions under which different explanation algorithms become informative. These are often stronger than what one might expect. For example, gradient explanations and counterfactual explanations are non-informative with respect to the space of differentiable functions, and SHAP and anchor explanations are not informative with respect to the space of decision trees. Based on these results, we discuss how explanation algorithms can be modified to become informative. While the proposed analysis of explanation algorithms is mathematical, we argue that it holds strong implications for the practical applicability of these algorithms, particularly for auditing, regulation, and high-risk applications of AI.
zh

[AI-10] AIM-Bench: Evaluating Decision-making Biases of Agent ic LLM as Inventory Manager

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在不确定供应链环境下进行库存决策时存在的决策偏差问题,特别是其行为是否类似人类的启发式偏差(如框架效应等),以及如何有效缓解诸如“拉向中心效应”(pull-to-centre effect)和“牛鞭效应”(bullwhip effect)等典型供应链失调现象。解决方案的关键在于构建一个名为AIM-Bench的新基准,通过多样化的库存补货实验系统评估LLM代理的决策行为,并验证两种策略的有效性:一是引入认知反思(cognitive reflection)以减少偏差;二是实施信息共享机制以抑制供应链波动。研究结果表明,LLM代理在不确定性情境下表现出与人类相似的决策偏差,而上述干预措施可显著改善其决策质量,为开发以人为本的供应链决策支持系统提供了实证依据与方法路径。

链接: https://arxiv.org/abs/2508.11416
作者: Xuhua Zhao,Yuxuan Xie,Caihua Chen,Yuxiang Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in mathematical reasoning and the long-term planning capabilities of large language models (LLMs) have precipitated the development of agents, which are being increasingly leveraged in business operations processes. Decision models to optimize inventory levels are one of the core elements of operations management. However, the capabilities of the LLM agent in making inventory decisions in uncertain contexts, as well as the decision-making biases (e.g. framing effect, etc.) of the agent, remain largely unexplored. This prompts concerns regarding the capacity of LLM agents to effectively address real-world problems, as well as the potential implications of biases that may be present. To address this gap, we introduce AIM-Bench, a novel benchmark designed to assess the decision-making behaviour of LLM agents in uncertain supply chain management scenarios through a diverse series of inventory replenishment experiments. Our results reveal that different LLMs typically exhibit varying degrees of decision bias that are similar to those observed in human beings. In addition, we explored strategies to mitigate the pull-to-centre effect and the bullwhip effect, namely cognitive reflection and implementation of information sharing. These findings underscore the need for careful consideration of the potential biases in deploying LLMs in Inventory decision-making scenarios. We hope that these insights will pave the way for mitigating human decision bias and developing human-centred decision support systems for supply chains.
zh

[AI-11] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

【速读】:该论文旨在解决现有结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL)的后训练方法中,因引入离策略(off-policy)专家数据而导致模型模式破坏和过拟合的问题。其解决方案的关键在于提出CHORD框架,通过动态加权机制统一建模SFT与RL:将SFT视为在线策略(on-policy)强化学习过程中的一个辅助目标,而非独立阶段;并设计双控机制——全局系数引导从离策略模仿向在线策略探索的过渡,以及基于token级别的权重函数实现细粒度地从专家数据中学习,从而在保持在线探索能力的同时缓解离策略数据带来的干扰,最终实现更稳定、高效的训练过程。

链接: https://arxiv.org/abs/2508.11408
作者: Wenhao Zhang,Yuexiang Xie,Yuchang Sun,Yanxi Chen,Guoyin Wang,Yaliang Li,Bolin Ding,Jingren Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established model patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for the Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from expert tokens, which preserves on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on widely used benchmarks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at this https URL to inspire further research.
zh

[AI-12] Open Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing IROS

【速读】:该论文旨在解决自主机器人在科学实验中缺乏透明性、可重复性和开放性的问题,从而阻碍其在科学研究中的可信应用。解决方案的关键在于提出两个核心贡献:一是语义执行追踪框架(semantic execution tracing framework),通过记录传感器数据与语义标注的机器人信念状态(robot belief states),确保自动化实验过程的透明性和可复现性;二是AICOR虚拟研究建筑(Virtual Research Building, VRB)云平台,支持大规模共享、复现和验证机器人任务执行,实现确定性执行、语义记忆与开放知识表示的集成,为自主系统参与科学发现奠定基础。

链接: https://arxiv.org/abs/2508.11406
作者: Benjamin Alt,Mareike Picklum,Sorin Arion,Franklin Kenghagho Kenfack,Michael Beetz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, submitted to the 1st IROS Workshop on Embodied AI and Robotics for Future Scientific Discovery

点击查看摘要

Abstract:We envision a future in which autonomous robots conduct scientific experiments in ways that are not only precise and repeatable, but also open, trustworthy, and transparent. To realize this vision, we present two key contributions: a semantic execution tracing framework that logs sensor data together with semantically annotated robot belief states, ensuring that automated experimentation is transparent and replicable; and the AICOR Virtual Research Building (VRB), a cloud-based platform for sharing, replicating, and validating robot task executions at scale. Together, these tools enable reproducible, robot-driven science by integrating deterministic execution, semantic memory, and open knowledge representation, laying the foundation for autonomous systems to participate in scientific discovery.
zh

[AI-13] An Exploratory Study on Crack Detection in Concrete through Human-Robot Collaboration

【速读】:该论文旨在解决核设施结构检测中传统人工巡检方法存在的安全风险高、认知负荷大及因人为因素导致的检测准确性不足等问题。其解决方案的关键在于引入人机协作(Human-Robot Collaboration, HRC)模式,利用搭载先进AI视觉裂纹识别算法的移动式Jackal机器人平台,实现对核设施结构的自动化、高精度检测,从而显著提升检测效率与准确性,并降低操作人员的工作负担。

链接: https://arxiv.org/abs/2508.11404
作者: Junyeon Kim,Tianshu Ruan,Cesar Alan Contreras,Manolis Chiou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Structural inspection in nuclear facilities is vital for maintaining operational safety and integrity. Traditional methods of manual inspection pose significant challenges, including safety risks, high cognitive demands, and potential inaccuracies due to human limitations. Recent advancements in Artificial Intelligence (AI) and robotic technologies have opened new possibilities for safer, more efficient, and accurate inspection methodologies. Specifically, Human-Robot Collaboration (HRC), leveraging robotic platforms equipped with advanced detection algorithms, promises significant improvements in inspection outcomes and reductions in human workload. This study explores the effectiveness of AI-assisted visual crack detection integrated into a mobile Jackal robot platform. The experiment results indicate that HRC enhances inspection accuracy and reduces operator workload, resulting in potential superior performance outcomes compared to traditional manual methods.
zh

[AI-14] rustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis CIKM2025

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在精神健康诊断等专业领域中表现不佳的问题,具体表现为:缺乏对临床医生主动探询能力的模拟、多轮对话理解能力不足,以及输出结果难以与专家临床推理对齐。其解决方案的关键在于提出DSM5AgentFlow框架,这是首个能够自主生成符合《精神障碍诊断与统计手册》第五版(Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, DSM-5)Level-1诊断标准的问卷的LLM代理工作流;该框架通过模拟治疗师与特定客户档案之间的对话,实现透明、分步的疾病预测,从而提升诊断的可解释性与可信度,并确保符合伦理和法律规范。

链接: https://arxiv.org/abs/2508.11398
作者: Mithat Can Ozgun,Jiahuan Pei,Koen Hindriks,Lucia Donatelli,Qingzhi Liu,Xin Sun,Junxiao Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by CIKM 2025 as a full paper

点击查看摘要

Abstract:LLM-based agents have emerged as transformative tools capable of executing complex tasks through iterative planning and action, achieving significant advancements in understanding and addressing user needs. Yet, their effectiveness remains limited in specialized domains such as mental health diagnosis, where they underperform compared to general applications. Current approaches to integrating diagnostic capabilities into LLMs rely on scarce, highly sensitive mental health datasets, which are challenging to acquire. These methods also fail to emulate clinicians’ proactive inquiry skills, lack multi-turn conversational comprehension, and struggle to align outputs with expert clinical reasoning. To address these gaps, we propose DSM5AgentFlow, the first LLM-based agent workflow designed to autonomously generate DSM-5 Level-1 diagnostic questionnaires. By simulating therapist-client dialogues with specific client profiles, the framework delivers transparent, step-by-step disorder predictions, producing explainable and trustworthy results. This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards. Through comprehensive experiments, we evaluate leading LLMs across three critical dimensions: conversational realism, diagnostic accuracy, and explainability. Our datasets and implementations are fully open-sourced.
zh

[AI-15] Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization

【速读】:该论文旨在解决决策导向学习(Decision-focused Learning, DFL)中线性规划(Linear Program, LP)问题的梯度消失问题,即在大多数参数空间区域中,决策后悔(regret)对预测参数的梯度为零,导致传统基于梯度的DFL方法失效。其关键解决方案是采用代理损失(surrogate loss)进行优化,即使在使用可微分优化层(differentiable optimization layer)直接最小化后悔时也有效;实验表明,这种方法不仅使不同可微分优化层达到与基于代理损失的DFL相当或更优的后悔性能,且结合DYS-Net这一高效前向与反向传播的LP求解技术,可在显著缩短训练时间的同时实现接近当前最优的后悔表现。

链接: https://arxiv.org/abs/2508.11365
作者: Jayanta Mandi,Ali İrfan Mahmutoğulları,Senne Berden,Tias Guns
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision-focused learning (DFL) trains a machine learning (ML) model to predict parameters of an optimization problem, to directly minimize decision regret, i.e., maximize decision quality. Gradient-based DFL requires computing the derivative of the solution to the optimization problem with respect to the predicted parameters. However, for many optimization problems, such as linear programs (LPs), the gradient of the regret with respect to the predicted parameters is zero almost everywhere. Existing gradient-based DFL approaches for LPs try to circumvent this issue in one of two ways: (a) smoothing the LP into a differentiable optimization problem by adding a quadratic regularizer and then minimizing the regret directly or (b) minimizing surrogate losses that have informative (sub)gradients. In this paper, we show that the former approach still results in zero gradients, because even after smoothing the regret remains constant across large regions of the parameter space. To address this, we propose minimizing surrogate losses – even when a differentiable optimization layer is used and regret can be minimized directly. Our experiments demonstrate that minimizing surrogate losses allows differentiable optimization layers to achieve regret comparable to or better than surrogate-loss based DFL methods. Further, we demonstrate that this also holds for DYS-Net, a recently proposed differentiable optimization technique for LPs, that computes approximate solutions and gradients through operations that can be performed using feedforward neural network layers. Because DYS-Net executes the forward and the backward pass very efficiently, by minimizing surrogate losses using DYS-Net, we are able to attain regret on par with the state-of-the-art while reducing training time by a significant margin.
zh

[AI-16] CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的GUI交互智能体在训练过程中面临的两个核心问题:一是将所有任务视为统一数据集,忽略了不同GUI任务之间存在的显著难度差异,导致智能体难以自适应调整学习策略;二是多数方法将任务特异性细节压缩为单一粗粒度奖励信号,使得策略更新效率低下。解决方案的关键在于提出CRAFT-GUI框架,该框架基于组相对策略优化(Group Relative Policy Optimization, GRPO)构建课程学习机制,显式建模轨迹间的难度差异,并设计了一种融合规则驱动信号与模型判别评估的细粒度奖励函数,从而实现更精准、高效的策略优化。实验表明,该方法在Android Control公开基准和内部在线基准上分别提升性能5.6%和10.3%,验证了结合强化学习与课程学习在GUI交互任务中的有效性。

链接: https://arxiv.org/abs/2508.11360
作者: Songqin Nong,Jingxuan Xu,Sheng Zhou,Jianfeng Chen,Xiaoxuan Tang,Tao Jiang,Wenhao Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As autonomous agents become adept at understanding and interacting with graphical user interface (GUI) environments, a new era of automated task execution is emerging. Recent studies have demonstrated that Reinforcement Learning (RL) can effectively enhance agents’ performance in dynamic interactive GUI environments. However, these methods face two key limitations: (1) they overlook the significant variation in difficulty across different GUI tasks by treating the entire training data as a uniform set, which hampers the agent’s ability to adapt its learning process; and (2) most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. To address these limitations, we propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories. To enable more fine-grained policy optimization, we design a reward function that combines simple rule-based signals with model-judged evaluation, providing richer and more nuanced feedback during training. Experimental results demonstrate that our method achieves significant improvements over previous state-of-the-art approaches, outperforming them by 5.6% on public benchmarks Android Control and 10.3% on our internal online benchmarks, respectively. These findings empirically validate the effectiveness of integrating reinforcement learning with curriculum learning in GUI interaction tasks.
zh

[AI-17] PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding

【速读】:该论文旨在解决跨被试脑电图(EEG)解码中的核心挑战,即由于个体间神经生理差异显著以及缺乏通用表征而导致的模型泛化能力不足问题。其解决方案的关键在于提出PTSM(Physiology-aware and Task-invariant Spatio-temporal Modeling)框架,通过双分支掩码机制独立学习个性化与共享的时空模式,并在时间和空间维度上因子分解掩码以实现对动态EEG信号的细粒度调控;同时引入信息论约束,将潜在表示分解为正交的任务相关和被试相关子空间,从而在保持个体特异性的同时提取任务不变特征,最终实现无需被试特定校准的零样本跨被试鲁棒解码。

链接: https://arxiv.org/abs/2508.11357
作者: Changhong Jing,Yan Liu,Shuqiang Wang,Bruce X.B. Yu,Gong Chen,Zhejing Hu,Zhi Zhang,Yanyan Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-subject electroencephalography (EEG) decoding remains a fundamental challenge in brain-computer interface (BCI) research due to substantial inter-subject variability and the scarcity of subject-invariant representations. This paper proposed PTSM (Physiology-aware and Task-invariant Spatio-temporal Modeling), a novel framework for interpretable and robust EEG decoding across unseen subjects. PTSM employs a dual-branch masking mechanism that independently learns personalized and shared spatio-temporal patterns, enabling the model to preserve individual-specific neural characteristics while extracting task-relevant, population-shared features. The masks are factorized across temporal and spatial dimensions, allowing fine-grained modulation of dynamic EEG patterns with low computational overhead. To further address representational entanglement, PTSM enforces information-theoretic constraints that decompose latent embeddings into orthogonal task-related and subject-related subspaces. The model is trained end-to-end via a multi-objective loss integrating classification, contrastive, and disentanglement objectives. Extensive experiments on cross-subject motor imagery datasets demonstrate that PTSM achieves strong zero-shot generalization, outperforming state-of-the-art baselines without subject-specific calibration. Results highlight the efficacy of disentangled neural representations for achieving both personalized and transferable decoding in non-stationary neurophysiological settings.
zh

[AI-18] ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在无监督场景下因依赖标注数据而导致的适应性不足,以及测试时强化学习(Test-Time Reinforcement Learning, TTRL)中存在的高推理成本和早期估计偏差问题。解决方案的关键在于引入基于熵的机制,通过两种策略实现探索与利用的平衡:一是熵分支树多数采样(Entropy-fork Tree Majority Rollout, ETMR),用于降低推理开销并提升样本多样性;二是基于优势重塑的熵调节(Entropy-based Advantage Reshaping, EAR),以缓解过早收敛和过自信问题,增强估计鲁棒性。实验表明,该方法在AIME 2024基准上使Llama3.1-8B模型的Pass@1指标相对提升68%,同时仅消耗60%的rollout token预算,显著优化了推理效率、多样性与估计稳定性之间的权衡。

链接: https://arxiv.org/abs/2508.11356
作者: Jia Liu,ChangYi He,YingQiao Lin,MingMin Yang,FeiYang Shen,ShaoGuo Liu,TingTing Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models have yielded significant improvements in complex reasoning tasks such as mathematics and programming. However, these models remain heavily dependent on annotated data and exhibit limited adaptability in unsupervised scenarios. To address these limitations, test-time reinforcement learning (TTRL) has been proposed, which enables self-optimization by leveraging model-generated pseudo-labels. Despite its promise, TTRL faces several key challenges, including high inference costs due to parallel rollouts and early-stage estimation bias that fosters overconfidence, reducing output diversity and causing performance plateaus. To address these challenges, we introduce an entropy-based mechanism to enhance the exploration-exploitation balance in test-time reinforcement learning through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our approach enables Llama3.1-8B to achieve a 68 percent relative improvement in Pass at 1 metric on the AIME 2024 benchmark, while consuming only 60 percent of the rollout tokens budget. This highlights our method’s ability to effectively optimize the trade-off between inference efficiency, diversity, and estimation robustness, thereby advancing unsupervised reinforcement learning for open-domain reasoning tasks.
zh

[AI-19] NeMo: A Neuron-Level Modularizing-While-Training Approach for Decomposing DNN Models

【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)模型在实际应用中因训练成本高而难以大规模部署的问题,尤其针对现有模块化训练方法(Modularizing-while-Training, MwT)在处理多样化的DNN架构(如Transformer模型)和大规模模型时存在的局限性。其解决方案的关键在于提出一种名为NeMo的可扩展且通用的MwT方法:该方法以神经元(neuron)为基本单元进行模块划分,确保对各类DNN结构(包括CNN与Transformer)均适用,并设计了一种基于对比学习的模块化训练策略及有效的复合损失函数,从而实现对大规模模型的有效模块化,显著提升模块分类准确率并大幅压缩模块规模。

链接: https://arxiv.org/abs/2508.11348
作者: Xiaohan Bi,Binhang Qi,Hailong Sun,Xiang Gao,Yue Yu,Xiaojun Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the growing incorporation of deep neural network (DNN) models into modern software systems, the prohibitive construction costs have become a significant challenge. Model reuse has been widely applied to reduce training costs, but indiscriminately reusing entire models may incur significant inference overhead. Consequently, DNN modularization has gained attention, enabling module reuse by decomposing DNN models. The emerging modularizing-while-training (MwT) paradigm, which incorporates modularization into training, outperforms modularizing-after-training approaches. However, existing MwT methods focus on small-scale CNN models at the convolutional kernel level and struggle with diverse DNNs and large-scale models, particularly Transformer-based models. To address these limitations, we propose NeMo, a scalable and generalizable MwT approach. NeMo operates at the neuron level fundamental component common to all DNNs-ensuring applicability to Transformers and various architectures. We design a contrastive learning-based modular training method with an effective composite loss function, enabling scalability to large-scale models. Comprehensive experiments on two Transformer-based models and four CNN models across two classification datasets demonstrate NeMo’s superiority over state-of-the-art MwT methods. Results show average gains of 1.72% in module classification accuracy and 58.10% reduction in module size, demonstrating efficacy across both CNN and large-scale Transformer-based models. A case study on open-source projects shows NeMo’s potential benefits in practical scenarios, offering a promising approach for scalable and generalizable DNN modularization.
zh

[AI-20] SAGE: Scale-Aware Gradual Evolution for Continual Knowledge Graph Embedding KDD2025

【速读】:该论文旨在解决动态知识图谱(Knowledge Graph, KG)嵌入中因更新规模不一致而导致的性能下降问题,尤其是现有持续知识图谱嵌入(Continual Knowledge Graph Embedding, CKGE)方法在面对不同尺度的新增实体、关系和事实时,难以有效适应嵌入空间的变化,且缺乏对整个更新过程的系统性评估。其解决方案的关键在于提出一种尺度感知的渐进演化框架 SAGE,通过两个核心机制实现:一是根据更新规模动态确定嵌入维度并扩展嵌入空间,确保模型容量与更新需求匹配;二是引入动态蒸馏(Dynamic Distillation)机制,在保留已有知识的同时高效融合新事实,从而平衡知识稳定性与增量学习能力。实验表明,SAGE 在多个基准数据集上显著优于现有方法,并验证了自适应嵌入维度对 CKGE 性能的重要性。

链接: https://arxiv.org/abs/2508.11347
作者: Yifei Li,Lingling Zhang,Hang Yan,Tianzhe Zhao,Zihan Ma,Muye Huang,Jun Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, Accepted at KDD 2025, code available at this https URL

点击查看摘要

Abstract:Traditional knowledge graph (KG) embedding methods aim to represent entities and relations in a low-dimensional space, primarily focusing on static graphs. However, real-world KGs are dynamically evolving with the constant addition of entities, relations and facts. To address such dynamic nature of KGs, several continual knowledge graph embedding (CKGE) methods have been developed to efficiently update KG embeddings to accommodate new facts while maintaining learned knowledge. As KGs grow at different rates and scales in real-world scenarios, existing CKGE methods often fail to consider the varying scales of updates and lack systematic evaluation throughout the entire update process. In this paper, we propose SAGE, a scale-aware gradual evolution framework for CKGE. Specifically, SAGE firstly determine the embedding dimensions based on the update scales and expand the embedding space accordingly. The Dynamic Distillation mechanism is further employed to balance the preservation of learned knowledge and the incorporation of new facts. We conduct extensive experiments on seven benchmarks, and the results show that SAGE consistently outperforms existing baselines, with a notable improvement of 1.38% in MRR, 1.25% in H@1 and 1.6% in H@10. Furthermore, experiments comparing SAGE with methods using fixed embedding dimensions show that SAGE achieves optimal performance on every snapshot, demonstrating the importance of adaptive embedding dimensions in CKGE. The codes of SAGE are publicly available at: this https URL.
zh

[AI-21] RegimeNAS: Regime-Aware Differentiable Architecture Search With Theoretical Guarantees for Financial Trading

【速读】:该论文旨在解决静态深度学习模型在高度动态的加密货币市场中表现不佳的问题,尤其是其无法有效适应不同市场状态(如波动性、趋势和范围变化)的局限性。解决方案的关键在于提出一种名为RegimeNAS的可微架构搜索框架,其核心创新包括:(1) 基于贝叶斯理论的搜索空间设计,确保架构优化具有可证明的收敛性;(2) 动态激活的专用神经模块(Volatility、Trend和Range块),分别适配不同的市场状态;(3) 多目标损失函数,融合市场特定惩罚项(如波动率匹配、状态转换平滑性)及数学强制的Lipschitz稳定性约束。通过多时间尺度的多头注意力机制进行市场状态识别,RegimeNAS实现了更精准的状态感知与不确定性估计,显著提升了交易性能与收敛速度。

链接: https://arxiv.org/abs/2508.11338
作者: Prathamesh Devadiga,Yashmitha Shailesh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce RegimeNAS, a novel differentiable architecture search framework specifically designed to enhance cryptocurrency trading performance by explicitly integrating market regime awareness. Addressing the limitations of static deep learning models in highly dynamic financial environments, RegimeNAS features three core innovations: (1) a theoretically grounded Bayesian search space optimizing architectures with provable convergence properties; (2) specialized, dynamically activated neural modules (Volatility, Trend, and Range blocks) tailored for distinct market conditions; and (3) a multi-objective loss function incorporating market-specific penalties (e.g., volatility matching, transition smoothness) alongside mathematically enforced Lipschitz stability constraints. Regime identification leverages multi-head attention across multiple timeframes for improved accuracy and uncertainty estimation. Rigorous empirical evaluation on extensive real-world cryptocurrency data demonstrates that RegimeNAS significantly outperforms state-of-the-art benchmarks, achieving an 80.3% Mean Absolute Error reduction compared to the best traditional recurrent baseline and converging substantially faster (9 vs. 50+ epochs). Ablation studies and regime-specific analysis confirm the critical contribution of each component, particularly the regime-aware adaptation mechanism. This work underscores the imperative of embedding domain-specific knowledge, such as market regimes, directly within the NAS process to develop robust and adaptive models for challenging financial applications.
zh

[AI-22] Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

【速读】:该论文旨在解决无线边缘计算环境中大语言模型(Large Language Models, LLMs)部署时面临的推理质量与端到端延迟之间的权衡问题。具体而言,简单任务若采用云端推理会导致显著延迟,而本地设备运行的轻量模型则难以胜任复杂计算任务。解决方案的关键在于提出一种动态的质量-延迟感知路由框架,通过在移动设备上的轻量模型与边缘服务器上的强大模型之间智能调度推理任务,实现资源优化配置。该框架构建了两种成本模型:针对单轮查询,融合BERT预测的语义得分与通信及计算开销;针对多轮对话,则进一步量化因模型切换和KV缓存管理带来的上下文相关成本,从而在不牺牲推理质量的前提下,显著降低响应延迟并减少对大型模型的调用次数。

链接: https://arxiv.org/abs/2508.11291
作者: Rui Bao,Nan Xue,Yaping Sun,Zhiyong Chen
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: accepted by IEEE/CIC ICCC workshop

点击查看摘要

Abstract:The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. While maintaining full inference quality, extensive experiments demonstrate that our framework cuts average response latency by 5-15% and reduces large model invocations by 10-20% against competitive baselines on MMLU, GSM8K, and MT-Bench-101 benchmarks.
zh

[AI-23] CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems

【速读】:该论文旨在解决在边缘设备上部署大语言模型(Large Language Models, LLMs)时因设备资源受限而导致的冷启动延迟(cold-start latency)问题,该延迟主要由按需加载模型引起的等待时间造成。解决方案的关键在于提出一种感知延迟的调度框架(latency-aware scheduling framework),通过动态调整模型层的划分与设备分配策略,将模型加载过程与计算和通信操作重叠执行,从而有效隐藏加载时间并减少空闲周期。作者将该问题建模为混合整数非线性规划(Mixed-Integer Non-Linear Program),并设计了一种高效的动态规划算法以优化模型分区和设备分配,实验表明该方法显著降低了冷启动延迟。

链接: https://arxiv.org/abs/2508.11287
作者: Xuran Liu,Nan Xue,Rui Bao,Yaping Sun,Zhiyong Chen,Meixia Tao,Xiaodong Xu,Shuguang Cui
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to Journal of Communications and Information Networks

点击查看摘要

Abstract:While deploying large language models on edge devices promises low-latency and privacy-preserving AI services, it is hindered by limited device resources. Although pipeline parallelism facilitates distributed inference, existing approaches often ignore the cold-start latency caused by on-demand model loading. In this paper, we propose a latency-aware scheduling framework that overlaps model loading with computation and communication to minimize total inference latency. Based on device and model parameters, the framework dynamically adjusts layer partitioning and allocation to effectively hide loading time, thereby eliminating as many idle periods as possible. We formulate the problem as a Mixed-Integer Non-Linear Program and design an efficient dynamic programming algorithm to optimize model partitioning and device assignment. Experimental results show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
zh

[AI-24] Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas

【速读】:该论文旨在解决通用人工智能(General-Purpose AI, GPAI)系统在软件工程(Software Engineering, SE)应用中可能继承或表现出人类认知偏差的问题。尽管GPAI因其非人类特性被寄予缓解人类偏见的期望,但其训练数据源自人类生成内容,使其自身也可能内嵌认知偏差,从而在实际开发流程中引发错误决策。为应对这一挑战,论文提出首个动态基准测试框架,通过设计包含8类典型认知偏差(如锚定效应、框架效应)的手工任务及其无偏版本,评估GPAI是否因语言线索而非逻辑推理而产生错误结论。关键创新在于构建一个按需增强流水线,利用GPAI自动生成任务变体以保持偏见诱导线索的同时控制表面细节和推理复杂度,结合Prolog逻辑推理与大模型作为裁判(LLM-as-a-judge)验证机制,确保任务正确性(88–99%)并揭示偏见对逻辑无关的浅层语言启发式依赖的敏感性。实验表明,主流GPAI系统(GPT、LLaMA、DeepSeek)普遍存在认知偏差(5.9%–35%),且随任务复杂度上升至49%,凸显了在真实软件工程部署中的重大风险。

链接: https://arxiv.org/abs/2508.11278
作者: Francesco Sovrano,Gabriele Dominici,Rita Sevastjanova,Alessandra Stramiglio,Alberto Bacchelli
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Human cognitive biases in software engineering can lead to costly errors. While general-purpose AI (GPAI) systems may help mitigate these biases due to their non-human nature, their training on human-generated data raises a critical question: Do GPAI systems themselves exhibit cognitive biases? To investigate this, we present the first dynamic benchmarking framework to evaluate data-induced cognitive biases in GPAI within software engineering workflows. Starting with a seed set of 16 hand-crafted realistic tasks, each featuring one of 8 cognitive biases (e.g., anchoring, framing) and corresponding unbiased variants, we test whether bias-inducing linguistic cues unrelated to task logic can lead GPAI systems from correct to incorrect conclusions. To scale the benchmark and ensure realism, we develop an on-demand augmentation pipeline relying on GPAI systems to generate task variants that preserve bias-inducing cues while varying surface details. This pipeline ensures correctness (88–99% on average, according to human evaluation), promotes diversity, and controls reasoning complexity by leveraging Prolog-based reasoning and LLM-as-a-judge validation. It also verifies that the embedded biases are both harmful and undetectable by logic-based, unbiased reasoners. We evaluate leading GPAI systems (GPT, LLaMA, DeepSeek) and find a consistent tendency to rely on shallow linguistic heuristics over deep reasoning. All systems exhibit cognitive biases (ranging from 5.9% to 35% across types), with bias sensitivity increasing sharply with task complexity (up to 49%), highlighting critical risks in real-world software engineering deployments. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) Cite as: arXiv:2508.11278 [cs.HC] (or arXiv:2508.11278v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2508.11278 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gabriele Dominici [view email] [v1] Fri, 15 Aug 2025 07:29:46 UTC (1,051 KB)
zh

[AI-25] Hallucination in LLM -Based Code Generation: An Automotive Case Study

【速读】:该论文试图解决生成式 AI(Generative AI)在代码生成任务中因幻觉(hallucination)现象导致的可靠性问题,尤其是在汽车软件等安全关键领域中,模型输出可能存在语法错误、无效引用或API知识冲突,从而影响系统安全性。解决方案的关键在于通过引入更丰富的上下文信息(如Covesa车辆信号规范VSS和代码骨架)来提升模型生成代码的准确性与一致性,实验表明仅在最复杂的提示策略下,GPT-4.1和GPT-4o才能生成正确解,凸显了增强上下文引导对缓解幻觉、提高代码生成质量的核心作用。

链接: https://arxiv.org/abs/2508.11257
作者: Marc Pavel,Nenad Petrovic,Lukasz Mazur,Vahid Zolfaghari,Fengjunjie Pan,Alois Knoll
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant potential in automating code generation tasks offering new opportunities across software engineering domains. However, their practical application remains limited due to hallucinations - outputs that appear plausible but are factually incorrect, unverifiable or nonsensical. This paper investigates hallucination phenomena in the context of code generation with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs for three different prompting complexities ranging from a minimal one-liner prompt to a prompt with Covesa Vehicle Signal Specifications (VSS) as additional context and finally to a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors and API knowledge conflicts in state-of-the-art models GPT-4.1, Codex and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o were able to produce a correct solution when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM generated code, especially in safety-critical domains such as automotive software systems.
zh

[AI-26] Graph Neural Diffusion via Generalized Opinion Dynamics

【速读】:该论文旨在解决现有基于扩散机制的图神经网络(Graph Neural Networks, GNNs)面临的三大关键问题:(1) 依赖同质扩散与静态动力学,难以适应多样化的图结构;(2) 网络深度受限于计算开销和可解释性下降;(3) 对收敛行为的理论理解不足。其解决方案的核心是提出一种广义意见动态神经框架(Generalized Opinion Dynamics Neural Framework, GODNF),通过将多种意见动态模型统一为可训练的扩散机制,实现节点特异性行为建模与动态邻域影响捕捉,从而有效建模异质扩散模式与时间演化特性,同时保障深层传播的高效性与可解释性,并提供严格的理论分析证明其对多样化收敛配置的建模能力。

链接: https://arxiv.org/abs/2508.11249
作者: Asela Hevapathige,Asiri Wijesinghe,Ahad N. Zehmakan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There has been a growing interest in developing diffusion-based Graph Neural Networks (GNNs), building on the connections between message passing mechanisms in GNNs and physical diffusion processes. However, existing methods suffer from three critical limitations: (1) they rely on homogeneous diffusion with static dynamics, limiting adaptability to diverse graph structures; (2) their depth is constrained by computational overhead and diminishing interpretability; and (3) theoretical understanding of their convergence behavior remains limited. To address these challenges, we propose GODNF, a Generalized Opinion Dynamics Neural Framework, which unifies multiple opinion dynamics models into a principled, trainable diffusion mechanism. Our framework captures heterogeneous diffusion patterns and temporal dynamics via node-specific behavior modeling and dynamic neighborhood influence, while ensuring efficient and interpretable message propagation even at deep layers. We provide a rigorous theoretical analysis demonstrating GODNF’s ability to model diverse convergence configurations. Extensive empirical evaluations of node classification and influence estimation tasks confirm GODNF’s superiority over state-of-the-art GNNs.
zh

[AI-27] Multi-Group Equivariant Augmentation for Reinforcement Learning in Robot Manipulation

【速读】:该论文旨在解决现实世界机器人操作中视觉-运动学习的采样效率问题。现有方法虽利用任务对称性(task symmetry)作为归纳偏置以提升效率,但局限于等距对称性(isometric symmetries),即在所有时间步对所有任务对象施加相同的群变换。为突破此限制,作者提出引入非等距对称性(non-isometric symmetries),在空间和时间维度上应用多个独立的群变换来增强灵活性。其解决方案的关键在于:一是构建一种包含非等距对称结构的新型部分可观测马尔可夫决策过程(POMDP)形式化;二是设计一种简单而有效的数据增强方法——多群等变增强(Multi-Group Equivariance Augmentation, MEA),并结合离线强化学习(offline reinforcement learning)与体素化视觉表示(voxel-based visual representation),以保留平移等变性(translational equivariance),从而显著提升采样效率。

链接: https://arxiv.org/abs/2508.11204
作者: Hongbin Lin,Juan Rojas,Kwok Wai Samuel Au
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sampling efficiency is critical for deploying visuomotor learning in real-world robotic manipulation. While task symmetry has emerged as a promising inductive bias to improve efficiency, most prior work is limited to isometric symmetries – applying the same group transformation to all task objects across all timesteps. In this work, we explore non-isometric symmetries, applying multiple independent group transformations across spatial and temporal dimensions to relax these constraints. We introduce a novel formulation of the partially observable Markov decision process (POMDP) that incorporates the non-isometric symmetry structures, and propose a simple yet effective data augmentation method, Multi-Group Equivariance Augmentation (MEA). We integrate MEA with offline reinforcement learning to enhance sampling efficiency, and introduce a voxel-based visual representation that preserves translational equivariance. Extensive simulation and real-robot experiments across two manipulation domains demonstrate the effectiveness of our approach.
zh

[AI-28] Visuomotor Grasping with World Models for Surgical Robots

【速读】:该论文旨在解决机器人辅助手术(RAS)中抓取任务的自动化问题,特别是针对现有方法在处理新物体、视觉干扰和可变形对象时泛化能力差、鲁棒性不足以及依赖特定任务模型等局限。其解决方案的关键在于提出Grasp Anything for Surgery V2(GASv2),一个基于世界模型架构和外科感知流水线的视觉-运动学习框架,结合混合控制策略以实现安全执行;通过领域随机化训练实现仿真到现实的迁移,并仅使用标准的单对立体相机,在离体和模拟手术场景中均实现了65%的成功率,且无需重新训练即可泛化至未见过的物体与夹具,展现出优异的性能、通用性和鲁棒性。

链接: https://arxiv.org/abs/2508.11200
作者: Hongbin Lin,Bin Li,Kwok Wai Samuel Au
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Grasping is a fundamental task in robot-assisted surgery (RAS), and automating it can reduce surgeon workload while enhancing efficiency, safety, and consistency beyond teleoperated systems. Most prior approaches rely on explicit object pose tracking or handcrafted visual features, limiting their generalization to novel objects, robustness to visual disturbances, and the ability to handle deformable objects. Visuomotor learning offers a promising alternative, but deploying it in RAS presents unique challenges, such as low signal-to-noise ratio in visual observations, demands for high safety and millimeter-level precision, as well as the complex surgical environment. This paper addresses three key challenges: (i) sim-to-real transfer of visuomotor policies to ex vivo surgical scenes, (ii) visuomotor learning using only a single stereo camera pair – the standard RAS setup, and (iii) object-agnostic grasping with a single policy that generalizes to diverse, unseen surgical objects without retraining or task-specific models. We introduce Grasp Anything for Surgery V2 (GASv2), a visuomotor learning framework for surgical grasping. GASv2 leverages a world-model-based architecture and a surgical perception pipeline for visual observations, combined with a hybrid control system for safe execution. We train the policy in simulation using domain randomization for sim-to-real transfer and deploy it on a real robot in both phantom-based and ex vivo surgical settings, using only a single pair of endoscopic cameras. Extensive experiments show our policy achieves a 65% success rate in both settings, generalizes to unseen objects and grippers, and adapts to diverse disturbances, demonstrating strong performance, generality, and robustness.
zh

[AI-29] Quantum-Boosted High-Fidelity Deep Learning

【速读】:该论文旨在解决概率深度学习模型普遍依赖高斯先验(Gaussian prior)所带来的局限性,这一假设难以准确刻画自然数据中复杂的非高斯分布特征,尤其在复杂生物数据等科学领域严重限制了模型的保真度与科学发现能力。解决方案的关键在于提出一种大规模、长时间稳定的量子-经典混合架构——量子玻尔兹曼机-变分自编码器(Quantum Boltzmann Machine-Variational Autoencoder, QBM-VAE),其利用量子处理器高效采样来自物理基础的玻尔兹曼分布(Boltzmann distribution),将其作为深度生成模型中的强大先验,从而显著提升对单细胞组学数据等复杂结构的建模能力,在数据整合、细胞类型分类和轨迹推断等任务中优于传统基于高斯先验的模型(如VAE和SCVI)。

链接: https://arxiv.org/abs/2508.11190
作者: Feng-ao Wang,Shaobo Chen,Yao Xuan,Junwei Liu,Qi Gao,Hongdong Zhu,Junjie Hou,Lixin Yuan,Jinyu Cheng,Chenxin Yi,Hai Wei,Yin Ma,Tao Xu,Kai Wen,Yixue Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically-grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference. It also provides a typical example of introducing a physics priori into deep learning to drive the model to acquire scientific discovery capabilities that breaks through data limitations. This work provides the demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models.
zh

[AI-30] On Strong and Weak Admissibility in Non-Flat Assumption-Based Argumentation

【速读】:该论文旨在解决假设基础论证框架(Assumption-Based Argumentation, ABA)中标准可接受性(admissibility)概念的扩展与形式化问题,特别是针对非平坦(non-flat)ABA场景下强可接受性(strong admissibility)和弱可接受性(weak admissibility)的定义、性质及语义构建问题。其解决方案的关键在于引入基于抽象双极集合论证框架(abstract bipolar set-based argumentation frameworks, BSAFs)作为统一的形式化平台,利用其表达能力充分刻画假设之间的依赖关系,并在此基础上首次提出适用于一般非平坦ABA的强可接受性定义及其对应的偏好(preferred)、完整(complete)和基底(grounded)语义;同时证明了在经典、强和弱可接受性下,模块化性质(modularization property)依然成立,从而为ABA理论提供了更丰富的可接受性分析工具与语义选择。

链接: https://arxiv.org/abs/2508.11182
作者: Matti Berthold,Lydia Blümel,Anna Rapberger
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we broaden the investigation of admissibility notions in the context of assumption-based argumentation (ABA). More specifically, we study two prominent alternatives to the standard notion of admissibility from abstract argumentation, namely strong and weak admissibility, and introduce the respective preferred, complete and grounded semantics for general (sometimes called non-flat) ABA. To do so, we use abstract bipolar set-based argumentation frameworks (BSAFs) as formal playground since they concisely capture the relations between assumptions and are expressive enough to represent general non-flat ABA frameworks, as recently shown. While weak admissibility has been recently investigated for a restricted fragment of ABA in which assumptions cannot be derived (flat ABA), strong admissibility has not been investigated for ABA so far. We introduce strong admissibility for ABA and investigate desirable properties. We furthermore extend the recent investigations of weak admissibility in the flat ABA fragment to the non-flat case. We show that the central modularization property is maintained under classical, strong, and weak admissibility. We also show that strong and weakly admissible semantics in non-flat ABA share some of the shortcomings of standard admissible semantics and discuss ways to address these.
zh

[AI-31] A Semi-supervised Generative Model for Incomplete Multi-view Data Integration with Missing Labels

【速读】:该论文旨在解决多视图学习中普遍存在的两个问题:视图缺失(missing views)和标签缺失(missing labels),尤其是在生物组学等真实数据场景下,样本常同时面临视图不完整和标注稀疏的挑战。现有基于信息瓶颈(Information Bottleneck, IB)的生成式方法虽能有效利用可用视图进行分类,但其本质上是全监督框架,难以利用大量未标注数据。论文提出了一种半监督生成模型,在统一框架内同时利用标注与未标注样本:首先通过最大化未标注样本的似然来学习一个与IB模型共享的潜在空间,从而增强对未标注数据的建模能力;其次在潜在空间中引入跨视图互信息最大化机制,以强化不同视图间共享信息的提取。该方案的关键在于将IB的监督学习能力与未标注数据的潜在空间建模相结合,显著提升了在视图缺失和标签稀缺条件下的预测性能与视图补全能力。

链接: https://arxiv.org/abs/2508.11180
作者: Yiyang Shen,Weiran Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-view learning is widely applied to real-life datasets, such as multiple omics biological data, but it often suffers from both missing views and missing labels. Prior probabilistic approaches addressed the missing view problem by using a product-of-experts scheme to aggregate representations from present views and achieved superior performance over deterministic classifiers, using the information bottleneck (IB) principle. However, the IB framework is inherently fully supervised and cannot leverage unlabeled data. In this work, we propose a semi-supervised generative model that utilizes both labeled and unlabeled samples in a unified framework. Our method maximizes the likelihood of unlabeled samples to learn a latent space shared with the IB on labeled data. We also perform cross-view mutual information maximization in the latent space to enhance the extraction of shared information across views. Compared to existing approaches, our model achieves better predictive and imputation performance on both image and multi-omics data with missing views and limited labeled samples.
zh

[AI-32] Role-Augmented Intent-Driven Generative Search Engine Optimization

【速读】:该论文旨在解决生成式搜索引擎(Generative Search Engines, GSEs)环境下传统搜索引擎优化(Search Engine Optimization, SEO)策略失效的问题。由于GSEs依赖大型语言模型(Large Language Models, LLMs)与检索增强生成(Retrieval-Augmented Generation, RAG)技术,其信息呈现方式具有黑箱特性,导致内容创作者基于传统SEO的优化手段无法有效提升在GSE中的可见性。解决方案的关键在于提出一种角色增强的意图驱动型生成式SEO(Role-Augmented Intent-Driven Generative Search Engine Optimization, G-SEO)方法:通过多角色视角对搜索意图进行反射式细化,从而实现针对GSE场景的内容精准优化。该方法以搜索意图作为核心信号,显著优于单一维度基线方法,在主观感知和客观内容可见性方面均取得提升。

链接: https://arxiv.org/abs/2508.11158
作者: Xiaolu Chen,Haojie Wu,Jie Bao,Zhen Chen,Yong Liao,Hu Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Generative Search Engines (GSEs), powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), are reshaping information retrieval. While commercial systems (e.g., BingChat, this http URL) demonstrate impressive semantic synthesis capabilities, their black-box nature fundamentally undermines established Search Engine Optimization (SEO) practices. Content creators face a critical challenge: their optimization strategies, effective in traditional search engines, are misaligned with generative retrieval contexts, resulting in diminished visibility. To bridge this gap, we propose a Role-Augmented Intent-Driven Generative Search Engine Optimization (G-SEO) method, providing a structured optimization pathway tailored for GSE scenarios. Our method models search intent through reflective refinement across diverse informational roles, enabling targeted content enhancement. To better evaluate the method under realistic settings, we address the benchmarking limitations of prior work by: (1) extending the GEO dataset with diversified query variations reflecting real-world search scenarios and (2) introducing G-Eval 2.0, a 6-level LLM-augmented evaluation rubric for fine-grained human-aligned assessment. Experimental results demonstrate that search intent serves as an effective signal for guiding content optimization, yielding significant improvements over single-aspect baseline approaches in both subjective impressions and objective content visibility within GSE responses.
zh

[AI-33] Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在长时程机器人操作任务中,尤其是在稀疏奖励场景下,难以稳定且高效地学习连续动作片段(continuous action chunks)的问题。解决方案的关键在于提出AC3(Actor-Critic for Continuous Chunks)框架,其核心创新包括:1)采用非对称更新机制训练策略网络(actor),仅从成功轨迹中学习以保障策略改进的可靠性;2)通过引入块内n步回报(intra-chunk n-step returns)和基于锚点的自监督模块提供内在奖励,显著增强价值网络(critic)在稀疏奖励下的学习稳定性与效率。实验表明,AC3在BiGym和RLBench基准上的25个任务中,仅需少量示范即可实现优于现有方法的成功率表现。

链接: https://arxiv.org/abs/2508.11143
作者: Jiarui Yang,Bin Zhu,Jingjing Chen,Yu-Gang Jiang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic’s update is stabilized using intra-chunk n -step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design.
zh

[AI-34] abularis Formatus: Predictive Formatting for Tables

【速读】:该论文旨在解决电子表格软件中条件格式(Conditional Formatting, CF)规则生成的复杂性问题,即用户在创建CF规则时面临的技术门槛高、规则设计困难以及现有用户界面支持不足等挑战。其解决方案的关键在于提出一种神经符号(neuro-symbolic)方法TaFo,该方法通过结合语言模型的语义知识与结构化规则合成机制,首次实现了基于数值内容的条件格式自动推断,无需依赖用户提供的示例或自然语言指令;TaFo能够同时学习规则触发条件和对应的视觉格式属性,从而实现对表格数据的预测性、自动化格式建议,显著提升了格式建议的准确性、多样性和完整性。

链接: https://arxiv.org/abs/2508.11121
作者: Mukul Singh,José Cambronero,Sumit Gulwani,Vu Le,Gust Verbruggen
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 14 pages

点击查看摘要

Abstract:Spreadsheet manipulation software are widely used for data management and analysis of tabular data, yet the creation of conditional formatting (CF) rules remains a complex task requiring technical knowledge and experience with specific platforms. In this paper we present TaFo, a neuro-symbolic approach to generating CF suggestions for tables, addressing common challenges such as user unawareness, difficulty in rule creation, and inadequate user interfaces. TaFo takes inspiration from component based synthesis systems and extends them with semantic knowledge of language models and a diversity preserving rule this http URL previous methods focused on structural formatting, TaFo uniquely incorporates value-based formatting, automatically learning both the rule trigger and the associated visual formatting properties for CF rules. By removing the dependency on user specification used by existing techniques in the form of formatted examples or natural language instruction, TaFo makes formatting completely predictive and automated for the user. To evaluate TaFo, we use a corpus of 1.8 Million public workbooks with CF and manual formatting. We compare TaFo against a diverse set of symbolic and neural systems designed for or adapted for the task of table formatting. Our results show that TaFo generates more accurate, diverse and complete formatting suggestions than current systems and outperforms these by 15.6%–26.5% on matching user added ground truth rules in tables.
zh

[AI-35] Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees

【速读】:该论文旨在解决离散或量化变量优化问题的挑战性难题,这类问题因搜索空间的组合性质而难以求解。其核心解决方案是基于连续优化框架的分段仿射正则化(Piecewise-affine regularization, PAR),通过将量化建模转化为连续优化问题,实现了对参数空间的有效约束与高效求解。关键创新在于:首先,在过参数化场景下证明了PAR正则化损失函数的所有临界点均表现出高度量化特性;其次,推导出多种凸、拟凸及非凸PAR的闭式近似映射(proximal mappings),并提出利用近端梯度法、加速版本及交替方向乘子法(ADMM)求解PAR正则化问题;最后,建立了PAR正则化线性回归的统计保障,表明可通过PAR逼近传统ℓ₁、平方ℓ₂和非凸正则化方法,并获得具有量化解的类似统计性能。

链接: https://arxiv.org/abs/2508.11112
作者: Jianhao Ma,Lin Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Optimization problems over discrete or quantized variables are very challenging in general due to the combinatorial nature of their search space. Piecewise-affine regularization (PAR) provides a flexible modeling and computational framework for quantization based on continuous optimization. In this work, we focus on the setting of supervised learning and investigate the theoretical foundations of PAR from optimization and statistical perspectives. First, we show that in the overparameterized regime, where the number of parameters exceeds the number of samples, every critical point of the PAR-regularized loss function exhibits a high degree of quantization. Second, we derive closed-form proximal mappings for various (convex, quasi-convex, and non-convex) PARs and show how to solve PAR-regularized problems using the proximal gradient method, its accelerated variant, and the Alternating Direction Method of Multipliers. Third, we study statistical guarantees of PAR-regularized linear regression problems; specifically, we can approximate classical formulations of \ell_1 -, squared \ell_2 -, and nonconvex regularizations using PAR and obtain similar statistical guarantees with quantized solutions.
zh

[AI-36] Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance

【速读】:该论文旨在解决人机协作中机器人快速推断用户意图、提供透明推理并协助用户达成目标的问题。其解决方案的关键在于通过引入视觉语言模型(VLM)和纯文本语言模型(LLM),构建语义先验机制,对候选物体与位置进行筛选:具体而言,YOLO与Segment Anything Model组成的视觉流水线提取目标区域并输入至VLM,由其根据操作者提示评估相关性得分;同时,检测到的物体标签列表由LLM排序打分。这两个分数用于加权GUIDER框架中的导航与操作模块,从而在满足阈值条件下触发自主行为切换,使机器人能够精准导航至指定区域并获取目标物体,同时适应用户意图的变化。

链接: https://arxiv.org/abs/2508.11093
作者: Cesar Alan Contreras,Manolis Chiou,Alireza Rastegarpanah,Michal Szulik,Rustam Stolkin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at Human-Centered Robot Autonomy for Human-Robot Teams (HuRoboT) at IEEE RO-MAN 2025, Eindhoven, the Netherlands

点击查看摘要

Abstract:Human-robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision-language model (VLM) and a text-only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text-only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context-relevant targets while suppressing unrelated objects. Once the combined belief exceeds a threshold, autonomy changes occur, enabling the robot to navigate to the desired area and retrieve the desired object, while adapting to any changes in the operator’s intent. Future work will evaluate the system on Isaac Sim using a Franka Emika arm on a Ridgeback base, with a focus on real-time assistance.
zh

[AI-37] Compressive Meta-Learning KDD’25

【速读】:该论文旨在解决大规模数据集下参数学习效率低的问题,传统压缩学习(compressive learning)虽能通过随机非线性特征将海量数据压缩为与样本数量无关的紧凑表示,但其编码和解码过程通常为随机且与数据无关,无法利用数据内在结构,导致性能受限。解决方案的关键在于提出一种压缩元学习(Compressive Meta-Learning)框架,通过神经网络联合元学习编码与解码阶段,使系统在保持高效性和隐私友好性的同时,显著提升参数估计的速度与准确性,优于当前最先进的方法。

链接: https://arxiv.org/abs/2508.11090
作者: Daniel Mas Montserrat,David Bonet,Maria Perera,Xavier Giró-i-Nieto,Alexander G. Ioannidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Databases (cs.DB)
备注: Extended version of a paper accepted at KDD '25

点击查看摘要

Abstract:The rapid expansion in the size of new datasets has created a need for fast and efficient parameter-learning techniques. Compressive learning is a framework that enables efficient processing by using random, non-linear features to project large-scale databases onto compact, information-preserving representations whose dimensionality is independent of the number of samples and can be easily stored, transferred, and processed. These database-level summaries are then used to decode parameters of interest from the underlying data distribution without requiring access to the original samples, offering an efficient and privacy-friendly learning framework. However, both the encoding and decoding techniques are typically randomized and data-independent, failing to exploit the underlying structure of the data. In this work, we propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods by using neural networks that provide faster and more accurate systems than the current state-of-the-art approaches. To demonstrate the potential of the presented Compressive Meta-Learning framework, we explore multiple applications – including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders.
zh

[AI-38] Learn to optimize for automatic proton PBS treatment planning for HN cancers

【速读】:该论文旨在解决质子笔形束(Proton PBS)治疗计划中因多目标冲突导致的人工规划效率低下问题,特别是逆向优化(inverse optimization)这一计算密集型环节仍依赖理论驱动方法、耗时较长的问题。解决方案的关键在于提出一种基于学习到的优化(Learning-to-Optimize, L2O)的数据驱动逆向优化器,并将其集成至基于近端策略优化(PPO)的自动治疗计划框架中:L2O模块通过从任务特定数据分布中学习预测更新步长,实现高效且高质量的MU值计算;同时首次将原本用于大语言模型(LLM)长上下文处理的技术引入基于Transformer的L2O框架,解决了现有L2O方法在规模扩展上的瓶颈。该方案使得整个系统能够在平均2.55小时内自动生成符合临床要求的高质量计划,显著优于人工计划在靶区覆盖和器官避开(OAR sparing)方面的表现。

链接: https://arxiv.org/abs/2508.11085
作者: Qingqing Wang,Liqiang Xiao,Chang Chang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 4 figures

点击查看摘要

Abstract:Proton PBS treatment planning for HN cancers involves numerous conflicting objectives, requiring significant effort from human planners to balance and satisfy multiple clinical goals during planning. To achieve this, experience-demanding objective parameter adjustment and computationally expensive inverse optimization are performed iteratively. Extensive efforts have been made to automatically adjust objective parameters, but the most time-consuming component, i.e., inverse optimization, still relies heavily on theory-driven approaches. We propose a data-driven inverse optimizer and integrate it into a PPO-based automatic treatment planning framework to automatically generate high-quality plans within a clinical acceptable planning time. The inverse optimizer is a L2O method that predicts update steps by learning from the task-specific data distribution. For the first time, we integrate techniques designed for long-context processing, originally developed for LLMs, into a Transformer-based L2O framework to address the scalability issue of existing L2O methods. The PPO framework functions as an outer-loop virtual planner, autonomously adjusting objective parameters through a policy network, and the dose predictor is used to initialize objective parameters. The inner-loop L2O inverse optimizer computes machine-deliverable MU values based on objectives refined by the PPO policy network. 97 patients are collected in this study, and compared with L-BFGSB, our L2O-based inverse optimizer improves the effectiveness and efficiency by 22.97% and 36.41%, respectively. In conjunction with the PPO-based learned virtual planner, plans generated by our framework within an average of 2.55 hours show improved or comparable OAR sparing with superior target coverage for patients with different prescription dose levels, number of target volumes, beam angles, etc., compared with human-generated plans.
zh

[AI-39] From Individual to Multi-Agent Algorithmic Recourse: Minimizing the Welfare Gap via Capacitated Bipartite Matching

【速读】:该论文旨在解决现有算法可溯性(algorithmic recourse)研究中忽视多主体交互问题的局限性,即在现实世界应用中,多个寻求可溯性的个体(recourse seekers)与多个提供可溯性方案的模型(recourse providers)之间存在资源竞争和协同关系,而传统方法仅关注单个个体与单一模型的匹配,无法实现系统级的社会福利最优。解决方案的关键在于提出一个三层次优化框架:首先将多对多互动建模为带容量约束的加权二分图匹配问题,其中边权重反映可溯成本;其次通过最优容量再分配最小化个体福利与集体可行解之间的差距;最后引入成本感知优化,在最大化社会福利的同时考虑调整提供方容量所带来的代价。该框架实现了从个体推荐向系统级设计的扩展,能够在最小改动系统配置的前提下达成近似最优的社会福利。

链接: https://arxiv.org/abs/2508.11070
作者: Zahra Khotanlou,Kate Larson,Amir-Hossein Karimi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision makers are increasingly relying on machine learning in sensitive situations. In such settings, algorithmic recourse aims to provide individuals with actionable and minimally costly steps to reverse unfavorable AI-driven decisions. While existing research predominantly focuses on single-individual (i.e., seeker) and single-model (i.e., provider) scenarios, real-world applications often involve multiple interacting stakeholders. Optimizing outcomes for seekers under an individual welfare approach overlooks the inherently multi-agent nature of real-world systems, where individuals interact and compete for limited resources. To address this, we introduce a novel framework for multi-agent algorithmic recourse that accounts for multiple recourse seekers and recourse providers. We model this many-to-many interaction as a capacitated weighted bipartite matching problem, where matches are guided by both recourse cost and provider capacity. Edge weights, reflecting recourse costs, are optimized for social welfare while quantifying the welfare gap between individual welfare and this collectively feasible outcome. We propose a three-layer optimization framework: (1) basic capacitated matching, (2) optimal capacity redistribution to minimize the welfare gap, and (3) cost-aware optimization balancing welfare maximization with capacity adjustment costs. Experimental validation on synthetic and real-world datasets demonstrates that our framework enables the many-to-many algorithmic recourse to achieve near-optimal welfare with minimum modification in system settings. This work extends algorithmic recourse from individual recommendations to system-level design, providing a tractable path toward higher social welfare while maintaining individual actionability.
zh

[AI-40] AI That Helps Us Help Each Other: A Proactive System for Scaffolding Mentor-Novice Collaboration in Entrepreneurship Coaching

【速读】:该论文旨在解决初创企业创始人在面对开放性、定义不清的问题时,因元认知能力不足而难以有效识别风险、挑战假设并做出战略决策的难题,同时应对导师时间有限且难以提供个性化指导的现实约束。解决方案的关键在于构建一个融合领域特定创业风险认知模型与大语言模型(Large Language Model, LLM)的人机协同教练系统,通过主动提出诊断性问题来引导新手反思,并帮助新手与导师共同规划更具情感敏感性和聚焦性的会谈。该系统的核心创新在于其可配置性——导师可审查并修改底层认知模型,使AI逻辑动态适配自身经验与需求,从而提升人机协作深度与有效性。

链接: https://arxiv.org/abs/2508.11052
作者: Evey Jiaxin Huang,Matthew Easterday,Elizabeth Gerber
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: To appear in CSCW 2025 Volume 9

点击查看摘要

Abstract:Entrepreneurship requires navigating open-ended, ill-defined problems: identifying risks, challenging assumptions, and making strategic decisions under deep uncertainty. Novice founders often struggle with these metacognitive demands, while mentors face limited time and visibility to provide tailored support. We present a human-AI coaching system that combines a domain-specific cognitive model of entrepreneurial risk with a large language model (LLM) to proactively scaffold both novice and mentor thinking. The system proactively poses diagnostic questions that challenge novices’ thinking and helps both novices and mentors plan for more focused and emotionally attuned meetings. Critically, mentors can inspect and modify the underlying cognitive model, shaping the logic of the system to reflect their evolving needs. Through an exploratory field deployment, we found that using the system supported novice metacognition, helped mentors plan emotionally attuned strategies, and improved meeting depth, intentionality, and focus–while also surfaced key tensions around trust, misdiagnosis, and expectations of AI. We contribute design principles for proactive AI systems that scaffold metacognition and human-human collaboration in complex, ill-defined domains, offering implications for similar domains like healthcare, education, and knowledge work.
zh

[AI-41] Learning with Confidence UAI2025

【速读】:该论文旨在解决如何在学习或信念更新过程中形式化“信心”(confidence)这一概念的问题,即量化个体对新信息的信任程度及其对信念状态的影响。传统上,信心常被误认为是概率或似然,但本文指出其本质不同,并将其与学习率、训练轮次、Shafer的证据权重及卡尔曼增益等已有概念统一起来。解决方案的关键在于通过公理化方法定义“带信心的学习”,并提出两种在连续尺度上测量信心的规范方式,证明信心始终可由此类方式表示;在此基础上,进一步在附加假设下推导出基于向量场和损失函数的紧凑表达形式,从而构建了一种扩展的复合“并行”观测语言,并将贝叶斯规则识别为损失函数为线性期望时的最优学习者特例。

链接: https://arxiv.org/abs/2508.11037
作者: Oliver Ethan Richardson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG)
备注: Accepted for oral UAI 2025, plus some additional modifications for clarity

点击查看摘要

Abstract:We characterize a notion of confidence that arises in learning or updating beliefs: the amount of trust one has in incoming information and its impact on the belief state. This learner’s confidence can be used alongside (and is easily mistaken for) probability or likelihood, but it is fundamentally a different concept – one that captures many familiar concepts in the literature, including learning rates and number of training epochs, Shafer’s weight of evidence, and Kalman gain. We formally axiomatize what it means to learn with confidence, give two canonical ways of measuring confidence on a continuum, and prove that confidence can always be represented in this way. Under additional assumptions, we derive more compact representations of confidence-based learning in terms of vector fields and loss functions. These representations induce an extended language of compound “parallel” observations. We characterize Bayes Rule as the special case of an optimizing learner whose loss representation is a linear expectation.
zh

[AI-42] Risk-Based Prognostics and Health Management

【速读】:该论文试图解决风险评估与预测性维护(prognostics)常被视为独立任务而导致信息割裂的问题,旨在通过构建更紧密耦合的风险驱动型预测框架来提升决策效率。其解决方案的关键在于采用连续时间贝叶斯网络(continuous-time Bayesian network)作为建模基础,从而实现对故障发生概率及其风险影响的动态联合推断,并提供从数据中自动构建此类模型的技术路径,支持如决策辅助和基于性能的后勤管理等实际应用。

链接: https://arxiv.org/abs/2508.11031
作者: John W. Sheppard
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Appears as Chapter 27 in Realizing Complex Integrated Systems, Anthony P. Ambler and John W. Sheppard (ads.), CRC Press, 2025

点击查看摘要

Abstract:It is often the case that risk assessment and prognostics are viewed as related but separate tasks. This chapter describes a risk-based approach to prognostics that seeks to provide a tighter coupling between risk assessment and fault prediction. We show how this can be achieved using the continuous-time Bayesian network as the underlying modeling framework. Furthermore, we provide an overview of the techniques that are available to derive these models from data and show how they might be used in practice to achieve tasks like decision support and performance-based logistics. This work is intended to provide an overview of the recent developments related to risk-based prognostics, and we hope that it will serve as a tutorial of sorts that will assist others in adopting these techniques.
zh

[AI-43] Zono-Conformal Prediction: Zonotope-Based Uncertainty Quantification for Regression and Classification Tasks

【速读】:该论文旨在解决传统共形预测(conformal prediction)方法在计算效率和多维输出依赖关系建模方面的局限性。现有方法通常需要复杂的不确定性建模与校准过程,且多以区间形式表示预测集,难以捕捉高维输出间的依赖结构。其解决方案的关键在于提出zono-conformal预测,通过将不确定性的zonotope(平行多面体)集合直接嵌入基础预测器模型中,利用单个线性规划即可高效识别具有统计保证覆盖概率的预测集。该方法不仅适用于任意非线性基础预测器(如前馈神经网络),还可扩展至分类任务中的集合输出建模,并在数值实验中展现出比传统区间预测和标准共形预测更少的保守性与相当的覆盖率。

链接: https://arxiv.org/abs/2508.11025
作者: Laura Lützow,Michael Eichelbeck,Mykel J. Kochenderfer,Matthias Althoff
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Preprint. Under review

点击查看摘要

Abstract:Conformal prediction is a popular uncertainty quantification method that augments a base predictor with prediction sets with statistically valid coverage guarantees. However, current methods are often computationally expensive and data-intensive, as they require constructing an uncertainty model before calibration. Moreover, existing approaches typically represent the prediction sets with intervals, which limits their ability to capture dependencies in multi-dimensional outputs. We address these limitations by introducing zono-conformal prediction, a novel approach inspired by interval predictor models and reachset-conformant identification that constructs prediction zonotopes with assured coverage. By placing zonotopic uncertainty sets directly into the model of the base predictor, zono-conformal predictors can be identified via a single, data-efficient linear program. While we can apply zono-conformal prediction to arbitrary nonlinear base predictors, we focus on feed-forward neural networks in this work. Aside from regression tasks, we also construct optimal zono-conformal predictors in classification settings where the output of an uncertain predictor is a set of possible classes. We provide probabilistic coverage guarantees and present methods for detecting outliers in the identification data. In extensive numerical experiments, we show that zono-conformal predictors are less conservative than interval predictor models and standard conformal prediction methods, while achieving a similar coverage over the test data.
zh

[AI-44] CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention

【速读】:该论文旨在解决强化学习中基于验证奖励(Reinforcement Learning with Verified Reward, RLVR)框架下,因重复使用静态初始状态采样导致模型行为过于确定、多样性不足的问题,进而引发熵崩溃(entropy collapse),限制了长期训练中的性能提升。解决方案的关键在于提出一种两阶段框架CURE(Critical-token-gUided Re concatenation for Entropy-collapse prevention):第一阶段通过高熵关键token重生成并联合优化原始与分支轨迹,主动引导模型探索新颖且连贯的上下文,从而维持高熵水平并提升数学推理能力;第二阶段则回归静态初始状态采样,利用DAPO方法强化已习得策略的 exploitation,实现探索与利用的动态平衡。实验表明,该方法在多个数学基准上相较其他RLVR方法平均提升5%性能,同时保持高熵特性,达到当前最优水平。

链接: https://arxiv.org/abs/2508.11016
作者: Qingbin Li,Rongkun Xue,Jie Wang,Ming Zhou,Zhi Li,Xiaofeng Ji,Yongqi Wang,Miao Liu,Zheming Yang,Minghui Qiu,Jing Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Reinforcement Learning with Verified Reward (RLVR) have driven the emergence of more sophisticated cognitive behaviors in large language models (LLMs), thereby enhancing their reasoning capabilities. However, in prior RLVR pipelines, the repeated use of static initial-state sampling drawn exactly from the dataset distribution during each sampling phase produced overly deterministic, low diversity model behavior, which manifested as rapid entropy collapse and hindered sustained performance gains during prolonged training. To address this issue, we introduce CURE (Critical-token-gUided Re concatenation for Entropy-collapse prevention), a two-stage framework that balances exploration and exploitation. Specifically, in the first stage, to deliberately steer the model toward novel yet coherent contexts, we re-generate at high-entropy critical tokens and jointly optimize the original and the branched trajectories. The further comparison with vanilla DAPO shows that the regeneration process achieves a better performance on math reasoning tasks while sustaining a high-level entropy degree for exploration. In the second stage, we continue training with static initial-state sampling by DAPO, intentionally placing the model in a familiar state to gradually strengthen exploitation. Extensive experiments on Qwen-2.5-Math-7B show that, compared to other RLVR methods, CURE achieves a 5% performance gain across six math benchmarks, establishing state-of-the-art performance in both entropy and accuracy. A series of experiments further validate the effectiveness of our approach. Code is available at this https URL.
zh

[AI-45] MCP-Guard: A Defense Framework for Model Context Protocol Integrity in Large Language Model Applications

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)通过模型上下文协议(Model Context Protocol, MCP)与外部工具交互时引入的关键安全漏洞问题,包括提示注入(prompt injection)、数据外泄(data exfiltration)等威胁。其解决方案的核心是提出MCP-Guard,一种分层防御架构,包含三阶段检测流程:首先进行轻量级静态扫描以识别显式威胁;其次采用深度神经网络检测语义攻击;最后利用微调后的E5模型实现高精度(96.01%准确率)的对抗性提示识别;同时引入一个轻量级LLM仲裁器融合多源信号,从而在保障检测准确性的同时显著降低误报率。

链接: https://arxiv.org/abs/2508.10991
作者: Wenpeng Xing,Zhonghao Qi,Yupeng Qin,Yilin Li,Caini Chang,Jiahui Yu,Changting Lin,Zhenzhen Xie,Meng Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-Guard, a robust, layered defense architecture designed for LLM–tool interactions. MCP-Guard employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats and a deep neural detector for semantic attacks, to our fine-tuned E5-based model achieves (96.01) accuracy in identifying adversarial prompts. Finally, a lightweight LLM arbitrator synthesizes these signals to deliver the final decision while minimizing false positives. To facilitate rigorous training and evaluation, we also introduce MCP-AttackBench, a comprehensive benchmark of over 70,000 samples. Sourced from public datasets and augmented by GPT-4, MCP-AttackBench simulates diverse, real-world attack vectors in the MCP format, providing a foundation for future research into securing LLM-tool ecosystems.
zh

[AI-46] Grounding Rule-Based Argumentation Using Datalog

【速读】:该论文旨在解决在基于规则的论证框架ASPIC+中,如何高效地对一阶逻辑规则进行推理的问题。由于现有方法主要支持命题规则,而ASPIC+常使用一阶规则,因此需要通过预处理的“实例化”(grounding)步骤将一阶规则转化为命题形式,但此过程可能导致输入理论规模呈指数级增长,从而引发计算瓶颈。解决方案的关键在于提出一种智能实例化程序:首先将一阶ASPIC+实例转化为Datalog程序,并借助Datalog引擎查询获得有效的变量替换以执行规则和对立关系(contraries)的实例化;同时,引入针对ASPIC+形式化的特化简化策略,避免对不影响最终推理结果的规则进行无谓实例化,从而在保证推理正确性的前提下显著控制实例化规模。实证评估表明该方法具备良好的可扩展性。

链接: https://arxiv.org/abs/2508.10976
作者: Martin Diller,Sarah Alice Gaggl,Philipp Hanisch,Giuseppina Monterosso,Fritz Rauschenbach
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:ASPIC+ is one of the main general frameworks for rule-based argumentation for AI. Although first-order rules are commonly used in ASPIC+ examples, most existing approaches to reason over rule-based argumentation only support propositional rules. To enable reasoning over first-order instances, a preliminary grounding step is required. As groundings can lead to an exponential increase in the size of the input theories, intelligent procedures are needed. However, there is a lack of dedicated solutions for ASPIC+. Therefore, we propose an intelligent grounding procedure that keeps the size of the grounding manageable while preserving the correctness of the reasoning process. To this end, we translate the first-order ASPIC+ instance into a Datalog program and query a Datalog engine to obtain ground substitutions to perform the grounding of rules and contraries. Additionally, we propose simplifications specific to the ASPIC+ formalism to avoid grounding of rules that have no influence on the reasoning process. Finally, we performed an empirical evaluation of a prototypical implementation to show scalability.
zh

[AI-47] Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

【速读】:该论文旨在解决现有逆合成预测模型依赖静态模式匹配范式、缺乏有效逻辑决策能力导致黑箱决策的问题。其解决方案的关键在于提出一种可解释的逆合成框架Retro-Expert,通过强化学习协同整合大型语言模型(Large Language Model, LLM)与专用模型的互补推理优势:首先由专用模型构建高质量的化学决策空间,再由LLM驱动关键推理生成预测及可解释的推理路径,最后通过强化学习优化可解释的决策策略,从而实现兼具高精度与透明性的逆合成预测。

链接: https://arxiv.org/abs/2508.10967
作者: Xinyi Li,Sai Wang,Yutian Lin,Yu Wu,Yi Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrosynthesis prediction aims to infer the reactant molecule based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing models rely on static pattern-matching paradigm, which limits their ability to perform effective logic decision-making, leading to black-box decision-making. Building on this, we propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary reasoning strengths of Large Language Models and specialized models via reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models perform shallow reasoning to construct high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions and corresponding interpretable reasoning path, and (3) reinforcement learning optimizing interpretable decision policy. Experiments show that Retro-Expert not only surpasses both LLM-based and specialized models across different metrics but also provides expert-aligned explanations that bridge the gap between AI predictions and actionable chemical insights.
zh

[AI-48] owards Efficient Prompt-based Continual Learning in Distributed Medical AI

【速读】:该论文旨在解决医疗领域中因伦理、社会及制度约束导致的数据共享困难问题,从而限制了集中式学习的可行性;同时应对持续学习(Continual Learning, CL)在医学场景下的挑战,包括模型在增量更新时对新样本过拟合、灾难性遗忘旧知识,以及因诊断设备和人群分布差异引发的数据分布漂移问题。其解决方案的关键在于提出一种基于提示(Prompt)的持续学习方法(Prompt-based Continual Learning, PCL),通过引入一个统一的提示池(prompt pool)并采用最小扩展策略——仅扩展并冻结部分提示,显著降低计算开销;同时设计了一种新颖的正则化项以平衡模型对已有知识的保留与对新任务的适应能力,从而在多个糖尿病视网膜病变数据集上实现分类准确率提升至少10%、F1分数提高9分,且推理成本更低。

链接: https://arxiv.org/abs/2508.10954
作者: Gyutae Oh,Jitae Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10p

点击查看摘要

Abstract:Modern AI models achieve state-of-the-art performance with large-scale, high-quality datasets; however, ethical, social, and institutional constraints in the medical domain severely restrict data sharing, rendering centralized learning nearly impossible. Each institution must incrementally update models using only local data. Traditional training overfits new samples and suffers from catastrophic forgetting, losing previously acquired knowledge. Medical data distributions also shift due to varying diagnostic equipment and demographics. Although continual learning (CL) has advanced, most methods address natural images, leaving medical-domain-specific CL underexplored. We propose a prompt-based continual learning (PCL) approach featuring a unified prompt pool with a minimal expansion strategy: by expanding and freezing a subset of prompts, our method reduces computational overhead, and a novel regularization term balances retention and adaptation. Experiments on three diabetic retinopathy datasets Aptos2019, LI2019, and Diabetic Retinopathy Detection show our model improves final classification accuracy by at least 10% and F1-score by 9 points over state-of-the-art approaches while lowering inference cost. We anticipate this study will drive sustainable medical AI advances, enabling real-time diagnosis, patient monitoring, and telemedicine applications in distributed healthcare. Code will be released upon acceptance
zh

[AI-49] Apriel-Nemotron-15B-Thinker

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在企业应用场景中因内存和计算资源消耗过大而难以部署的问题。其核心解决方案是提出 Apriel-Nemotron-15B-Thinker 模型,该模型通过四阶段训练流程(包括基础模型扩展、持续预训练、监督微调(Supervised Fine-tuning, SFT)以及基于 GRPO 的强化学习)实现高效性能,在仅拥有 150 亿参数的情况下,达到甚至超越 320 亿参数模型(如 o1-mini、QWQ32B 和 EXAONE-Deep-32B)的推理能力,同时保持不到其一半的内存占用,从而显著提升企业在实际部署中的可行性与效率。

链接: https://arxiv.org/abs/2508.10948
作者: Shruthan Radhakrishna,Soham Parikh,Gopal Sarda,Anil Turkkan,Quaizar Vohra,Raymond Li,Dhruv Jhamb,Kelechi Ogueji,Aanjaneya Shukla,Oluwanifemi Bamgbose,Toby Liang,Luke Kumar,Oleksiy Ostapenko,Shiva Krishna Reddy Malay,Aman Tiwari,Tara Bogavelli,Vikas Yadav,Jash Mehta,Saloni Mittal,Akshay Kalkunte,Pulkit Pattnaik,Khalil Slimi,Anirudh Sreeram,Jishnu Nair,Akintunde Oladipo,Shashank Maiya,Khyati Mahajan,Rishabh Maheshwary,Masoud Hashemi,Sai Rajeswar Mudumba,Sathwik Tejaswi Madhusudhan,Torsten Scholak,Sebastien Paquet,Sagar Davasam,Srinivas Sunkara
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have achieved remarkable reasoning capabilities across domains like code, math and other enterprise tasks, their significant memory and computational costs often preclude their use in practical enterprise settings. To this end, we introduce Apriel-Nemotron-15B-Thinker, a 15-billion parameter model in the ServiceNow Apriel SLM series that achieves performance against medium sized state-of-the-art models such as o1-mini, QWQ32B, and EXAONE-Deep-32B while maintaining only half the memory footprint of those alternatives. Apriel-Nemotron-15B-Thinker model is trained in a four stage training pipeline including 1) Base Model upscaling, 2) Continual Pre-training 3) Supervised Fine-tuning (SFT) and 4) Reinforcement Learning using GRPO. Comprehensive evaluations across a diverse suite of benchmarks consistently demonstrate that our Apriel-Nemotron-15B-Thinker model matches or exceeds the performance of its 32-billion parameter counterparts, despite being less than half their size.
zh

[AI-50] Human-AI collaboration or obedient and often clueless AI in instruct serve repeat dynamics?

【速读】:该论文旨在解决当前人-AI协作研究中对高认知负荷任务下交互动态演变关注不足的问题,特别是缺乏对协作模式及其与学生表现关系的深入分析。其解决方案的关键在于采用多种定量与定性方法相结合的混合分析框架:通过过渡网络分析(transition network analysis)、序列分析(sequence analysis)和部分相关网络(partial correlation networks),辅以卡方检验与Person残差着色马赛克图(Person-residual shaded Mosaic plots),系统刻画学生与大语言模型(Large Language Models, LLMs)在复杂问题求解中的互动模式、演化轨迹及与任务难度和学业成绩的关系。研究发现,主导的“指导型”(Instructive)交互模式表现为迭代指令而非协同协商,且常出现学生提示与AI输出之间存在显著错位,揭示出当前LLMs更擅长执行指令而非实现认知协同,从而挑战了将其视为认知伙伴的主流假设。

链接: https://arxiv.org/abs/2508.10919
作者: Mohammed Saqr,Kamila Misiejuk,Sonsoles López-Pernas
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While research on human-AI collaboration exists, it mainly examined language learning and used traditional counting methods with little attention to evolution and dynamics of collaboration on cognitively demanding tasks. This study examines human-AI interactions while solving a complex problem. Student-AI interactions were qualitatively coded and analyzed with transition network analysis, sequence analysis and partial correlation networks as well as comparison of frequencies using chi-square and Person-residual shaded Mosaic plots to map interaction patterns, their evolution, and their relationship to problem complexity and student performance. Findings reveal a dominant Instructive pattern with interactions characterized by iterative ordering rather than collaborative negotiation. Oftentimes, students engaged in long threads that showed misalignment between their prompts and AI output that exemplified a lack of synergy that challenges the prevailing assumptions about LLMs as collaborative partners. We also found no significant correlations between assignment complexity, prompt length, and student grades suggesting a lack of cognitive depth, or effect of problem difficulty. Our study indicates that the current LLMs, optimized for instruction-following rather than cognitive partnership, compound their capability to act as cognitively stimulating or aligned collaborators. Implications for designing AI systems that prioritize cognitive alignment and collaboration are discussed.
zh

[AI-51] Managing the unexpected: Operator behavioural data and its value in predicting correct alarm responses

【速读】:该论文旨在解决如何在不干扰日常操作的前提下,通过非侵入式手段实时监测和预测控制室操作员在异常工况下的行为与响应结果的问题。其核心挑战在于传统生理测量工具(如眼动追踪和脑电图 EEG)虽能提供认知负荷等信息,但因侵入性强而难以应用于实际运行环境。解决方案的关键在于利用分布式控制系统(Distributed Control System, DCS)的历史数据或过程日志中记录的实时操作行为数据,结合步骤式逻辑回归和贝叶斯网络模型,识别出具有预测能力的行为指标(predictive behavioural metrics),从而实现对关键报警响应场景下操作员表现的早期预警与决策支持。

链接: https://arxiv.org/abs/2508.10917
作者: Chidera W. Amazu,Joseph Mietkiewicz,Ammar N. Abbas,Gabriele Baldissone,Davide Fissore,Micaela Demichela,Anders L. Madsen,Maria Chiara Leva
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data from psychophysiological measures can offer new insight into control room operators’ behaviour, cognition, and mental workload status. This can be particularly helpful when combined with appraisal of capacity to respond to possible critical plant conditions (i.e. critical alarms response scenarios). However, wearable physiological measurement tools such as eye tracking and EEG caps can be perceived as intrusive and not suitable for usage in daily operations. Therefore, this article examines the potential of using real-time data from process and operator-system interactions during abnormal scenarios that can be recorded and retrieved from the distributed control system’s historian or process log, and their capacity to provide insight into operator behavior and predict their response outcomes, without intruding on daily tasks. Data for this study were obtained from a design of experiment using a formaldehyde production plant simulator and four human-in-the-loop experimental support configurations. A comparison between the different configurations in terms of both behaviour and performance is presented in this paper. A step-wise logistic regression and a Bayesian network models were used to achieve this objective. The results identified some predictive metrics and the paper discuss their value as precursor or predictor of overall system performance in alarm response scenarios. Knowledge of relevant and predictive behavioural metrics accessible in real time can better equip decision-makers to predict outcomes and provide timely support measures for operators.
zh

[AI-52] Multimodal Quantitative Measures for Multiparty Behaviour Evaluation

【速读】:该论文旨在解决当前数字人(Digital Humans)在多人社交交互中缺乏有效评估指标的问题,尤其是现有方法未能充分捕捉情境协调动态(contextual coordination dynamics)。其解决方案的关键在于提出一个统一的、基于干预驱动的框架,用于客观评估骨骼运动数据中的多主体社会行为,涵盖三个互补维度:(1) 通过交叉递归定量分析(Cross-Recurrence Quantification Analysis, CRQA)衡量同步性,(2) 基于多尺度经验模态分解的节拍一致性(Multiscale Empirical Mode Decomposition-based Beat Consistency)评估时间对齐,(3) 利用软动态时间规整(Soft Dynamic Time Warping, Soft-DTW)量化结构相似性。该框架通过理论驱动的扰动实验验证了各指标的敏感性和独立性,为社交智能代理的评估与优化提供了可解释且稳健的工具集。

链接: https://arxiv.org/abs/2508.10916
作者: Ojas Shirekar,Wim Pouw,Chenxu Hao,Vrushank Phadnis,Thabo Beeler,Chirag Raman
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decompositionbased Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations – gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction-applied to \approx 145 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study ( N=27 ) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability. Thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code available on \hrefthis https URLGitHub.
zh

[AI-53] SDSNN: A Single-Timestep Spiking Neural Network with Self-Dropping Neuron and Bayesian Optimization

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在边缘计算场景中因多时间步推理模型导致的高延迟与高能耗问题。其核心挑战在于如何在保持分类精度的同时显著降低计算复杂度和能量消耗。解决方案的关键在于提出一种单时间步SNN架构,通过设计自适应Drop神经元机制(Self-Dropping Neuron),利用动态阈值调整和选择性脉冲抑制增强信息承载能力,并结合贝叶斯优化全局搜索最优时间参数,从而实现单时间步内高效推理。实验表明,该方法在Fashion-MNIST、CIFAR-10和CIFAR-100数据集上分别达到93.72%、92.20%和69.45%的准确率,同时相较传统多时间步LIF模型分别降低56%、21%和22%的能量消耗。

链接: https://arxiv.org/abs/2508.10913
作者: Changqing Xu,Buxuan Song,Yi Liu,Xinfang Liao,Wenbin Zheng,Yintang Yang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs), as an emerging biologically inspired computational model, demonstrate significant energy efficiency advantages due to their event-driven information processing mechanism. Compared to traditional Artificial Neural Networks (ANNs), SNNs transmit information through discrete spike signals, which substantially reduces computational energy consumption through their sparse encoding approach. However, the multi-timestep computation model significantly increases inference latency and energy, limiting the applicability of SNNs in edge computing scenarios. We propose a single-timestep SNN, which enhances accuracy and reduces computational energy consumption in a single timestep by optimizing spike generation and temporal parameters. We design a Self-Dropping Neuron mechanism, which enhances information-carrying capacity through dynamic threshold adjustment and selective spike suppression. Furthermore, we employ Bayesian optimization to globally search for time parameters and obtain an efficient inference mode with a single time step. Experimental results on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that, compared to traditional multi-timestep SNNs employing the Leaky Integrate-and-Fire (LIF) model, our method achieves classification accuracies of 93.72%, 92.20%, and 69.45%, respectively, using only single-timestep spikes, while maintaining comparable or even superior accuracy. Additionally, it reduces energy consumption by 56%, 21%, and 22%, respectively.
zh

[AI-54] FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning

【速读】:该论文旨在解决多模态分类中视觉与文本信号融合策略脆弱、易受模态特异性噪声干扰的问题,尤其在标签噪声、长尾类别不平衡和语义异质性等挑战下表现不佳。其解决方案的关键在于提出一种基于token级建模的统一集成框架——FLUID(Flow-Latent Unified Integration via Token Distillation for Expert Specialization),核心包括:(1) 可学习的Q-transforms机制,用于从各模态骨干网络中蒸馏并保留关键token特征;(2) 两阶段融合策略,先通过对比对齐强化跨模态一致性,再利用门控机制与Q-bottleneck实现任务感知的自适应信息压缩;(3) 预测时轻量级且负载均衡的专家混合(Mixture-of-Experts)结构,支持对多样化语义模式的高效专业化处理。这一设计显著提升了模型在复杂场景下的鲁棒性和可扩展性。

链接: https://arxiv.org/abs/2508.07264
作者: Van Duc Cuong,Ta Dinh Tam,Tran Duc Chinh,Nguyen Thi Hanh
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present \textscFLUID-Flow-Latent Unified Integration via Token Distillation for Expert Specialization, a principled token-level pipeline that improves cross-modal robustness and scalability. \textscFLUID contributes three core elements: (1) \emphQ-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a \emphQ-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time that enables efficient specialization to diverse semantic patterns. Extensive experiments demonstrate that \textscFLUID attains (91%) accuracy on the GLAMI-1M benchmark, significantly outperforming prior baselines and exhibiting strong resilience to label noise, long-tail class imbalance, and semantic heterogeneity. Targeted ablation studies corroborate both the individual and synergistic benefits of the proposed components, positioning \textscFLUID as a scalable, noise-resilient solution for multimodal product classification.
zh

[AI-55] ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)训练中数据混合比例(data mixture)优化这一关键问题,即如何在有限计算资源下高效确定最优训练数据组合以提升模型性能。传统方法依赖启发式探索,缺乏系统性与可扩展性。其解决方案的关键在于将数据混合选择建模为黑箱超参数优化问题,并引入多保真度贝叶斯优化(Multi-fidelity Bayesian Optimization),通过在低保真度(如小规模模型训练)和高保真度(大规模训练)实验之间权衡计算成本与模型拟合效果,实现对最优数据混合的快速搜索。该方法显著减少了所需实验次数,在多个模型规模(1M–7B参数)和任务场景中均表现出优于现有基线的性能,且通过公开 ADMIRE IFT Runs 数据集(460次完整训练评估,超13,000 GPU小时)降低了研究门槛。

链接: https://arxiv.org/abs/2508.11551
作者: Shengzhuang Chen,Xu Ouyang,Michael Arthur Leopold Pearce,Thomas Hartvigsen,Jonathan Richard Schwarz
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showingspeed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training evaluation runs across various model sizes worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area.
zh

[AI-56] AlphaAgents : Large Language Model based Multi-Agents for Equity Portfolio Constructions

【速读】:该论文旨在解决传统股票选择方法在复杂市场环境中效率低、适应性差的问题,尤其是在股权研究与投资组合管理中如何提升决策质量与自动化水平。其解决方案的关键在于构建基于角色的多智能体系统(role-based multi-agent system),通过多个专业化AI代理协作完成股票筛选任务,利用大型语言模型(LLM)的自主执行与迭代优化能力,在不同风险偏好下实现更高效、灵活且可解释的选股策略,从而验证了多智能体框架在金融分析中的实践有效性与潜在挑战。

链接: https://arxiv.org/abs/2508.11152
作者: Tianjiao Zhao,Jingrao Lyu,Stokes Jones,Harrison Garber,Stefano Pasquali,Dhagash Mehta
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of artificial intelligence (AI) agents is evolving rapidly, driven by the capabilities of Large Language Models (LLMs) to autonomously perform and refine tasks with human-like efficiency and adaptability. In this context, multi-agent collaboration has emerged as a promising approach, enabling multiple AI agents to work together to solve complex challenges. This study investigates the application of role-based multi-agent systems to support stock selection in equity research and portfolio management. We present a comprehensive analysis performed by a team of specialized agents and evaluate their stock-picking performance against established benchmarks under varying levels of risk tolerance. Furthermore, we examine the advantages and limitations of employing multi-agent frameworks in equity analysis, offering critical insights into their practical efficacy and implementation challenges.
zh

[AI-57] Note on Selection Bias in Observational Estimates of Algorithmic Progress

【速读】:该论文试图解决的问题是量化语言模型(Language Models)中算法进步的程度,具体而言是评估随着计算资源投入的增加,模型在损失函数(loss)上的表现是否提升,即是否存在算法效率的提高。其解决方案的关键在于通过收集语言模型在不同时间点的损失值和计算量(compute)数据,分析固定计算量下损失的下降趋势,以此推断算法效率的改进。然而,该方法可能因算法质量的潜在性(latent algorithmic quality)与计算资源配置的内生性(endogenous compute choices)而产生估计偏差。

链接: https://arxiv.org/abs/2508.11033
作者: Parker Whitfill
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ho et. al (2024) is an interesting paper that attempts to estimate the degree of algorithmic progress from language models. They collect observational data on language models’ loss and compute over time, and argue that as time has passed, language models’ algorithmic efficiency has been rising. That is, the loss achieved for fixed compute has been dropping over time. In this note, I want to raise one potential methodological problem with the estimation strategy. Intuitively, if part of algorithmic quality is latent, and compute choices are endogenous to algorithmic quality, then resulting estimates of algorithmic quality will be biased.
zh

[AI-58] Generalized Similarity U: A Non-parametric Test of Association Based on Similarity

【速读】:该论文旨在解决全基因组测序(Whole Genome Sequencing, WGS)数据中复杂对象间关联性检验的统计难题,特别是如何有效识别与高维表型(如影像学特征)相关的遗传变异集合。其核心问题是传统方法在处理基因型与复杂表型之间的非线性、高维及结构化关系时存在功率不足和稳健性差的问题。解决方案的关键在于提出一种基于相似性的广义相似性U检验(Generalized Similarity U, GSU),并通过理论分析证明使用拉普拉斯核(Laplacian kernel)构建相似性矩阵可显著提升检验功效并增强对模型假设的鲁棒性。该方法已在阿尔茨海默病神经影像计划(ADNI)数据中成功识别出与影像表型显著相关的三个基因(APOE、APOC1 和 TOMM40),并开发了配套的C++工具包用于实际WGS数据分析。

链接: https://arxiv.org/abs/1801.01220
作者: Changshuai Wei,Qing Lu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Second generation sequencing technologies are being increasingly used for genetic association studies, where the main research interest is to identify sets of genetic variants that contribute to various phenotype. The phenotype can be univariate disease status, multivariate responses and even high-dimensional outcomes. Considering the genotype and phenotype as two complex objects, this also poses a general statistical problem of testing association between complex objects. We here proposed a similarity-based test, generalized similarity U (GSU), that can test the association between complex objects. We first studied the theoretical properties of the test in a general setting and then focused on the application of the test to sequencing association studies. Based on theoretical analysis, we proposed to use Laplacian kernel based similarity for GSU to boost power and enhance robustness. Through simulation, we found that GSU did have advantages over existing methods in terms of power and robustness. We further performed a whole genome sequencing (WGS) scan for Alzherimer Disease Neuroimaging Initiative (ADNI) data, identifying three genes, APOE, APOC1 and TOMM40, associated with imaging phenotype. We developed a C++ package for analysis of whole genome sequencing data using GSU. The source codes can be downloaded at this https URL.
zh

[AI-59] rees Assembling Mann Whitney Approach for Detecting Genome-wide Joint Association among Low Marginal Effect loci

【速读】:该论文旨在解决复杂疾病中低边际效应遗传变异(Low Marginal Effect, LME)的联合关联分析难题,尤其是在全基因组关联研究(Genome-Wide Association Studies, GWAS)等高维数据背景下,如何高效且有力地识别多个LME位点及其交互作用。解决方案的关键在于提出一种名为Trees Assembling Mann Whitney(TAMW)的新方法,其核心创新是通过构建决策树集合(ensemble of decision trees)并结合Mann-Whitney检验统计量,实现对大量LME变异的计算高效性和统计强大性,从而显著提升检测多因子交互作用与疾病风险关联的能力。

链接: https://arxiv.org/abs/1505.01206
作者: Changshuai Wei,Daniel J. Schaid,Qing Lu
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Common complex diseases are likely influenced by the interplay of hundreds, or even thousands, of genetic variants. Converging evidence shows that genetic variants with low marginal effects (LME) play an important role in disease development. Despite their potential significance, discovering LME genetic variants and assessing their joint association on high dimensional data (e.g., genome wide association studies) remain a great challenge. To facilitate joint association analysis among a large ensemble of LME genetic variants, we proposed a computationally efficient and powerful approach, which we call Trees Assembling Mann whitney (TAMW). Through simulation studies and an empirical data application, we found that TAMW outperformed multifactor dimensionality reduction (MDR) and the likelihood ratio based Mann whitney approach (LRMW) when the underlying complex disease involves multiple LME loci and their interactions. For instance, in a simulation with 20 interacting LME loci, TAMW attained a higher power (power=0.931) than both MDR (power=0.599) and LRMW (power=0.704). In an empirical study of 29 known Crohn’s disease (CD) loci, TAMW also identified a stronger joint association with CD than those detected by MDR and LRMW. Finally, we applied TAMW to Wellcome Trust CD GWAS to conduct a genome wide analysis. The analysis of 459K single nucleotide polymorphisms was completed in 40 hours using parallel computing, and revealed a joint association predisposing to CD (p-value=2.763e-19). Further analysis of the newly discovered association suggested that 13 genes, such as ATG16L1 and LACC1, may play an important role in CD pathophysiological and etiological processes.
zh

[AI-60] A Weighted U Statistic for Genetic Association Analyses of Sequencing Data

【速读】:该论文旨在解决高维测序数据中罕见变异(rare variants)在复杂疾病遗传病因学研究中的统计分析难题,传统方法因变异频率低和数据维度极高而面临检验效能显著下降的问题。其解决方案的关键在于提出了一种基于非参数U统计量的加权方法(WU-seq),该方法无需假设疾病模型或表型分布,适用于多种表型类型,并在模拟和真实数据中均展现出优于常用SKAT方法的性能,尤其在假设不成立时(如表型服从重尾分布)优势更为明显,同时在假设满足时仍保持相当的检验效能。

链接: https://arxiv.org/abs/1505.01204
作者: Changshuai Wei,Ming Li,Zihuai He,Olga Vsevolozhskaya,Daniel J. Schaid,Qing Lu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:With advancements in next generation sequencing technology, a massive amount of sequencing data are generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, this poses a great challenge for the statistical analysis of high-dimensional sequencing data. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a weighted U statistic, referred to as WU-seq, for the high-dimensional association analysis of sequencing data. Based on a non-parametric U statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used SKAT method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-seq to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL 4 and very low density lipoprotein cholesterol.
zh

[AI-61] A Generalized Similarity U Test for Multivariate Analysis of Sequencing Data

【速读】:该论文旨在解决高维基因型数据与多表型复杂疾病关联分析中传统统计方法(如基于回归的单位点分析)面临的挑战,包括数据维度高、遗传变异频率低以及多表型可能服从不同分布导致现有方法假设不成立的问题。其解决方案的关键在于提出一种广义相似性U检验(Generalized Similarity U test, GSU),该方法基于相似性构建检验统计量,能够有效处理高维基因型和多维表型数据,并在理论层面提供了高效的p值计算及样本量与功效估算方法,从而在模拟研究中展现出比现有方法更高的检验效能和对表型分布的稳健性。

链接: https://arxiv.org/abs/1505.01179
作者: Changshuai Wei,Qing Lu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sequencing-based studies are emerging as a major tool for genetic association studies of complex diseases. These studies pose great challenges to the traditional statistical methods (e.g., single-locus analyses based on regression methods) because of the high-dimensionality of data and the low frequency of genetic variants. In addition, there is a great interest in biology and epidemiology to identify genetic risk factors contributed to multiple disease phenotypes. The multiple phenotypes can often follow different distributions, which violates the assumptions of most current methods. In this paper, we propose a generalized similarity U test, referred to as GSU. GSU is a similarity-based test and can handle high-dimensional genotypes and phenotypes. We studied the theoretical properties of GSU, and provided the efficient p-value calculation for association test as well as the sample size and power calculation for the study design. Through simulation, we found that GSU had advantages over existing methods in terms of power and robustness to phenotype distributions. Finally, we used GSU to perform a multivariate analysis of sequencing data in the Dallas Heart Study and identified a joint association of 4 genes with 5 metabolic related phenotypes.
zh

[AI-62] A weighted U statistic for association analysis considering genetic heterogeneity

【速读】:该论文旨在解决复杂疾病遗传研究中因遗传异质性(genetic heterogeneity)导致的统计效能下降问题。当前多数统计方法假设疾病具有同质的遗传效应,但在实际中,临床表现相似的复杂疾病可能由不同的遗传机制引起,这使得传统方法在检测关联时灵敏度不足。解决方案的关键在于提出一种新的异质性加权U统计量(Heterogeneity Weighted U, HWU)方法,该方法能有效整合不同遗传亚型的效应,适用于多种表型类型(如二分类和连续型),且计算效率高,适合大规模基因组数据处理。HWU通过识别并加权不同遗传模式下的信号,在模拟和真实数据(如SAGE队列中的尼古丁依赖)分析中均展现出优越性能。

链接: https://arxiv.org/abs/1504.08319
作者: Changshuai Wei,Robert C. Elston,Qing Lu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Converging evidence suggests that common complex diseases with the same or similar clinical manifestations could have different underlying genetic etiologies. While current research interests have shifted toward uncovering rare variants and structural variations predisposing to human diseases, the impact of heterogeneity in genetic studies of complex diseases has been largely overlooked. Most of the existing statistical methods assume the disease under investigation has a homogeneous genetic effect and could, therefore, have low power if the disease undergoes heterogeneous pathophysiological and etiological processes. In this paper, we propose a heterogeneity weighted U (HWU) method for association analyses considering genetic heterogeneity. HWU can be applied to various types of phenotypes (e.g., binary and continuous) and is computationally effcient for high- dimensional genetic data. Through simulations, we showed the advantage of HWU when the underlying genetic etiology of a disease was heterogeneous, as well as the robustness of HWU against different model assumptions (e.g., phenotype distributions). Using HWU, we conducted a genome-wide analysis of nicotine dependence from the Study of Addiction: Genetics and Environments (SAGE) dataset. The genome-wide analysis of nearly one million genetic markers took 7 hours, identifying heterogeneous effects of two new genes (i.e., CYP3A5 and IKBKB) on nicotine dependence.
zh

机器学习

[LG-0] Optimal CO2 storag e management considering safety constraints in multi-stakeholder multi-site CCS projects: a game theoretic perspective

链接: https://arxiv.org/abs/2508.11618
作者: Jungang Chen,Seyyed A. Hosseini
类目: Machine Learning (cs.LG)
*备注: 38 pages, 16 figures

点击查看摘要

Abstract:Carbon capture and storage (CCS) projects typically involve a diverse array of stakeholders or players from public, private, and regulatory sectors, each with different objectives and responsibilities. Given the complexity, scale, and long-term nature of CCS operations, determining whether individual stakeholders can independently maximize their interests or whether collaborative coalition agreements are needed remains a central question for effective CCS project planning and management. CCS projects are often implemented in geologically connected sites, where shared geological features such as pressure space and reservoir pore capacity can lead to competitive behavior among stakeholders. Furthermore, CO2 storage sites are often located in geologically mature basins that previously served as sites for hydrocarbon extraction or wastewater disposal in order to leverage existing infrastructures, which makes unilateral optimization even more complicated and unrealistic. In this work, we propose a paradigm based on Markov games to quantitatively investigate how different coalition structures affect the goals of stakeholders. We frame this multi-stakeholder multi-site problem as a multi-agent reinforcement learning problem with safety constraints. Our approach enables agents to learn optimal strategies while compliant with safety regulations. We present an example where multiple operators are injecting CO2 into their respective project areas in a geologically connected basin. To address the high computational cost of repeated simulations of high-fidelity models, a previously developed surrogate model based on the Embed-to-Control (E2C) framework is employed. Our results demonstrate the effectiveness of the proposed framework in addressing optimal management of CO2 storage when multiple stakeholders with various objectives and goals are involved. Comments: 38 pages, 16 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2508.11618 [cs.LG] (or arXiv:2508.11618v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.11618 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-1] Investigating Sensors and Methods in Grasp State Classification in Agricultural Manipulation

链接: https://arxiv.org/abs/2508.11588
作者: Benjamin Walt,Jordan Westphal,Girish Krishnan
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective and efficient agricultural manipulation and harvesting depend on accurately understanding the current state of the grasp. The agricultural environment presents unique challenges due to its complexity, clutter, and occlusion. Additionally, fruit is physically attached to the plant, requiring precise separation during harvesting. Selecting appropriate sensors and modeling techniques is critical for obtaining reliable feedback and correctly identifying grasp states. This work investigates a set of key sensors, namely inertial measurement units (IMUs), infrared (IR) reflectance, tension, tactile sensors, and RGB cameras, integrated into a compliant gripper to classify grasp states. We evaluate the individual contribution of each sensor and compare the performance of two widely used classification models: Random Forest and Long Short-Term Memory (LSTM) networks. Our results demonstrate that a Random Forest classifier, trained in a controlled lab environment and tested on real cherry tomato plants, achieved 100% accuracy in identifying slip, grasp failure, and successful picks, marking a substantial improvement over baseline performance. Furthermore, we identify a minimal viable sensor combination, namely IMU and tension sensors that effectively classifies grasp states. This classifier enables the planning of corrective actions based on real-time feedback, thereby enhancing the efficiency and reliability of fruit harvesting operations.

[LG-2] SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling

链接: https://arxiv.org/abs/2508.11553
作者: Jinghui Wang,Shaojie Wang,Yinghan Cui,Xuxing Chen,Chao Wang,Xiaojiang Zhang,Minglei Zhang,Jiarong Zhang,Wenhao Zhuang,Yuchen Cao,Wankang Bao,Haimo Li,Zheng Lin,Huiming Wang,Haoyang Huang,Zongxian Feng,Zizheng Zhan,Ken Deng,Wen Xiang,Huaixi Tang,Kun Wu,Mengtong Li,Mengfei Xie,Junyi Peng,Haotian Zhang,Bin Chen,Bing Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce SeamlessFlow, a server based reinforcement learning (RL) framework that addresses two core challenges in industrial scale RL: (1) decoupling RL training from the complex execution flow of agents; (2) maximizing GPU utilization with minimal idle time while preserving the stability and scalability required for large-scale deployments. First, SeamlessFlow introduces a data plane that decouples the RL trainer from diverse, complex agent implementations while sustaining high throughput. A central trajectory manager maintains complete interaction histories and supports partial rollout, allowing rollout to pause for weight updates and resume seamlessly, keeping agents unaware of service interruptions. Second, we propose a tag driven scheduling paradigm that abstracts hardware into capability tagged resources, unifying colocated and disaggregated architectures. Based on this, SeamlessFlow introduces a spatiotemporal multiplexing pipeline that dynamically reassigns idle training nodes to rollout in a train rollout separated setup, eliminating pipeline bubbles and fully exploiting heterogeneous cluster resources. By combining these innovations, SeamlessFlow delivers both stability and high performance, making it well suited for multi agent, long horizon, and other complex RL tasks.

[LG-3] Nested Operator Inference for Adaptive Data-Driven Learning of Reduced-order Models

链接: https://arxiv.org/abs/2508.11542
作者: Nicole Aretz,Karen Willcox
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper presents a data-driven, nested Operator Inference (OpInf) approach for learning physics-informed reduced-order models (ROMs) from snapshot data of high-dimensional dynamical systems. The approach exploits the inherent hierarchy within the reduced space to iteratively construct initial guesses for the OpInf learning problem that prioritize the interactions of the dominant modes. The initial guess computed for any target reduced dimension corresponds to a ROM with provably smaller or equal snapshot reconstruction error than with standard OpInf. Moreover, our nested OpInf algorithm can be warm-started from previously learned models, enabling versatile application scenarios involving dynamic basis and model form updates. We demonstrate the performance of our algorithm on a cubic heat conduction problem, with nested OpInf achieving a four times smaller error than standard OpInf at a comparable offline time. Further, we apply nested OpInf to a large-scale, parameterized model of the Greenland ice sheet where, despite model form approximation errors, it learns a ROM with, on average, 3% error and computational speed-up factor above 19,000.

[LG-4] DFed-SST: Building Semantic- and Structure-aware Topologies for Decentralized Federated Graph Learning

链接: https://arxiv.org/abs/2508.11530
作者: Lianshuai Guo,Zhongzheng Yuan,Xunkai Li,Yinlin Zhu,Meixia Qu,Wenyu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decentralized Federated Learning (DFL) has emerged as a robust distributed paradigm that circumvents the single-point-of-failure and communication bottleneck risks of centralized architectures. However, a significant challenge arises as existing DFL optimization strategies, primarily designed for tasks such as computer vision, fail to address the unique topological information inherent in the local subgraph. Notably, while Federated Graph Learning (FGL) is tailored for graph data, it is predominantly implemented in a centralized server-client model, failing to leverage the benefits of this http URL bridge this gap, we propose DFed-SST, a decentralized federated graph learning framework with adaptive communication. The core of our method is a dual-topology adaptive communication mechanism that leverages the unique topological features of each client’s local subgraph to dynamically construct and optimize the inter-client communication topology. This allows our framework to guide model aggregation efficiently in the face of heterogeneity. Extensive experiments on eight real-world datasets consistently demonstrate the superiority of DFed-SST, achieving 3.26% improvement in average accuracy over baseline methods.

[LG-5] Physics-Informed Diffusion Models for Unsupervised Anomaly Detection in Multivariate Time Series

链接: https://arxiv.org/abs/2508.11528
作者: Juhi Soni,Markus Lange-Hegermann,Stefan Windmann
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:We propose an unsupervised anomaly detection approach based on a physics-informed diffusion model for multivariate time series data. Over the past years, diffusion model has demonstrated its effectiveness in forecasting, imputation, generation, and anomaly detection in the time series domain. In this paper, we present a new approach for learning the physics-dependent temporal distribution of multivariate time series data using a weighted physics-informed loss during diffusion model training. A weighted physics-informed loss is constructed using a static weight schedule. This approach enables a diffusion model to accurately approximate underlying data distribution, which can influence the unsupervised anomaly detection performance. Our experiments on synthetic and real-world datasets show that physics-informed training improves the F1 score in anomaly detection; it generates better data diversity and log-likelihood. Our model outperforms baseline approaches, additionally, it surpasses prior physics-informed work and purely data-driven diffusion models on a synthetic dataset and one real-world dataset while remaining competitive on others.

[LG-6] Finite-Width Neural Tangent Kernels from Feynman Diagrams

链接: https://arxiv.org/abs/2508.11522
作者: Max Guillen,Philipp Misof,Jan E. Gerken
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 11 pages + appendices

点击查看摘要

Abstract:Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursive relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We validate our results with numerical experiments.

[LG-7] DiCriTest: Testing Scenario Generation for Decision-Making Agents Considering Diversity and Criticality

链接: https://arxiv.org/abs/2508.11514
作者: Qitong Chu,Yufeng Yue,Danya Yao,Huaxin Pei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing deployment of decision-making agents in dynamic environments increases the demand for safety verification. While critical testing scenario generation has emerged as an appealing verification methodology, effectively balancing diversity and criticality remains a key challenge for existing methods, particularly due to local optima entrapment in high-dimensional scenario spaces. To address this limitation, we propose a dual-space guided testing framework that coordinates scenario parameter space and agent behavior space, aiming to generate testing scenarios considering diversity and criticality. Specifically, in the scenario parameter space, a hierarchical representation framework combines dimensionality reduction and multi-dimensional subspace evaluation to efficiently localize diverse and critical subspaces. This guides dynamic coordination between two generation modes: local perturbation and global exploration, optimizing critical scenario quantity and diversity. Complementarily, in the agent behavior space, agent-environment interaction data are leveraged to quantify behavioral criticality/diversity and adaptively support generation mode switching, forming a closed feedback loop that continuously enhances scenario characterization and exploration within the parameter space. Experiments show our framework improves critical scenario generation by an average of 56.23% and demonstrates greater diversity under novel parameter-behavior co-driven metrics when tested on five decision-making agents, outperforming state-of-the-art baselines.

[LG-8] Predicting and Explaining Traffic Crash Severity Through Crash Feature Selection

链接: https://arxiv.org/abs/2508.11504
作者: Andrea Castellani,Zacharias Papadovasilakis,Giorgos Papoutsoglou,Mary Cole,Brian Bautsch,Tobias Rodemann,Ioannis Tsamardinos,Angela Harden
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Preprint. Manuscript under review at “Accident Analysis Prevention” journal

点击查看摘要

Abstract:Motor vehicle crashes remain a leading cause of injury and death worldwide, necessitating data-driven approaches to understand and mitigate crash severity. This study introduces a curated dataset of more than 3 million people involved in accidents in Ohio over six years (2017-2022), aggregated to more than 2.3 million vehicle-level records for predictive analysis. The primary contribution is a transparent and reproducible methodology that combines Automated Machine Learning (AutoML) and explainable artificial intelligence (AI) to identify and interpret key risk factors associated with severe crashes. Using the JADBio AutoML platform, predictive models were constructed to distinguish between severe and non-severe crash outcomes. The models underwent rigorous feature selection across stratified training subsets, and their outputs were interpreted using SHapley Additive exPlanations (SHAP) to quantify the contribution of individual features. A final Ridge Logistic Regression model achieved an AUC-ROC of 85.6% on the training set and 84.9% on a hold-out test set, with 17 features consistently identified as the most influential predictors. Key features spanned demographic, environmental, vehicle, human, and operational categories, including location type, posted speed, minimum occupant age, and pre-crash action. Notably, certain traditionally emphasized factors, such as alcohol or drug impairment, were less influential in the final model compared to environmental and contextual variables. Emphasizing methodological rigor and interpretability over mere predictive performance, this study offers a scalable framework to support Vision Zero with aligned interventions and advanced data-informed traffic safety policy.

[LG-9] Calibrated and uncertain? Evaluating uncertainty estimates in binary classification models

链接: https://arxiv.org/abs/2508.11460
作者: Aurora Grefsrud,Nello Blaser,Trygve Buanes
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Rigorous statistical methods, including parameter estimation with accompanying uncertainties, underpin the validity of scientific discovery, especially in the natural sciences. With increasingly complex data models such as deep learning techniques, uncertainty quantification has become exceedingly difficult and a plethora of techniques have been proposed. In this case study, we use the unifying framework of approximate Bayesian inference combined with empirical tests on carefully created synthetic classification datasets to investigate qualitative properties of six different probabilistic machine learning algorithms for class probability and uncertainty estimation: (i) a neural network ensemble, (ii) neural network ensemble with conflictual loss, (iii) evidential deep learning, (iv) a single neural network with Monte Carlo Dropout, (v) Gaussian process classification and (vi) a Dirichlet process mixture model. We check if the algorithms produce uncertainty estimates which reflect commonly desired properties, such as being well calibrated and exhibiting an increase in uncertainty for out-of-distribution data points. Our results indicate that all algorithms are well calibrated, but none of the deep learning based algorithms provide uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points. We hope our study may serve as a clarifying example for researchers developing new methods of uncertainty estimation for scientific data-driven modeling.

[LG-10] Multi-Sensory Cognitive Computing for Learning Population-level Brain Connectivity

链接: https://arxiv.org/abs/2508.11436
作者: Mayssa Soussia,Mohamed Ali Mahjoub,Islem Rekik
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The generation of connectional brain templates (CBTs) has recently garnered significant attention for its potential to identify unique connectivity patterns shared across individuals. However, existing methods for CBT learning such as conventional machine learning and graph neural networks (GNNs) are hindered by several limitations. These include: (i) poor interpretability due to their black-box nature, (ii) high computational cost, and (iii) an exclusive focus on structure and topology, overlooking the cognitive capacity of the generated CBT. To address these challenges, we introduce mCOCO (multi-sensory COgnitive COmputing), a novel framework that leverages Reservoir Computing (RC) to learn population-level functional CBT from BOLD (Blood-Oxygen-level-Dependent) signals. RC’s dynamic system properties allow for tracking state changes over time, enhancing interpretability and enabling the modeling of brain-like dynamics, as demonstrated in prior literature. By integrating multi-sensory inputs (e.g., text, audio, and visual data), mCOCO captures not only structure and topology but also how brain regions process information and adapt to cognitive tasks such as sensory processing, all in a computationally efficient manner. Our mCOCO framework consists of two phases: (1) mapping BOLD signals into the reservoir to derive individual functional connectomes, which are then aggregated into a group-level CBT - an approach, to the best of our knowledge, not previously explored in functional connectivity studies - and (2) incorporating multi-sensory inputs through a cognitive reservoir, endowing the CBT with cognitive traits. Extensive evaluations show that our mCOCO-based template significantly outperforms GNN-based CBT in terms of centeredness, discriminativeness, topological soundness, and multi-sensory memory retention. Our source code is available at this https URL.

[LG-11] Generative Co-Design of Antibody Sequences and Structures via Black-Box Guidance in a Shared Latent Space IJCAI2025

链接: https://arxiv.org/abs/2508.11424
作者: Yinghua Yao,Yuangang Pan,Xixian Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Advancements in deep generative models have enabled the joint modeling of antibody sequence and structure, given the antigen-antibody complex as context. However, existing approaches for optimizing complementarity-determining regions (CDRs) to improve developability properties operate in the raw data space, leading to excessively costly evaluations due to the inefficient search process. To address this, we propose LatEnt blAck-box Design (LEAD), a sequence-structure co-design framework that optimizes both sequence and structure within their shared latent space. Optimizing shared latent codes can not only break through the limitations of existing methods, but also ensure synchronization of different modality designs. Particularly, we design a black-box guidance strategy to accommodate real-world scenarios where many property evaluators are non-differentiable. Experimental results demonstrate that our LEAD achieves superior optimization performance for both single and multi-property objectives. Notably, LEAD reduces query consumption by a half while surpassing baseline methods in property optimization. The code is available at this https URL.

[LG-12] A Remedy for Over-Squashing in Graph Learning via Forman-Ricci Curvature based Graph-to-Hypergraph Structural Lifting

链接: https://arxiv.org/abs/2508.11390
作者: Michael Banf,Dominik Filipiak,Max Schattauer,Liliya Imasheva
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks are highly effective at learning from relational data, leveraging node and edge features while maintaining the symmetries inherent to graph structures. However, many real-world systems, such as social or biological networks, exhibit complex interactions that are more naturally represented by higher-order topological domains. The emerging field of Geometric and Topological Deep Learning addresses this challenge by introducing methods that utilize and benefit from higher-order structures. Central to TDL is the concept of lifting, which transforms data representations from basic graph forms to more expressive topologies before the application of GNN models for learning. In this work, we propose a structural lifting strategy using Forman-Ricci curvature, which defines an edge-based network characteristic based on Riemannian geometry. Curvature reveals local and global properties of a graph, such as a network’s backbones, i.e. coarse, structure-preserving graph geometries that form connections between major communities - most suitably represented as hyperedges to model information flows between clusters across large distances in the network. To this end, our approach provides a remedy to the problem of information distortion in message passing across long distances and graph bottlenecks - a phenomenon known in graph learning as over-squashing.

[LG-13] Fusing Rewards and Preferences in Reinforcement Learning

链接: https://arxiv.org/abs/2508.11363
作者: Sadegh Khorasani,Saber Salehkaleybar,Negar Kiyavash,Matthias Grossglauser
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy’s log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human-annotators (at state-level or trajectory-level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA’s preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.

[LG-14] Harmonized Gradient Descent for Class Imbalanced Data Stream Online Learning

链接: https://arxiv.org/abs/2508.11353
作者: Han Zhou,Hongpeng Yin,Xuanhong Deng,Yuyu Huang,Hao Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many real-world data are sequentially collected over time and often exhibit skewed class distributions, resulting in imbalanced data streams. While existing approaches have explored several strategies, such as resampling and reweighting, for imbalanced data stream learning, our work distinguishes itself by addressing the imbalance problem through training modification, particularly focusing on gradient descent techniques. We introduce the harmonized gradient descent (HGD) algorithm, which aims to equalize the norms of gradients across different classes. By ensuring the gradient norm balance, HGD mitigates under-fitting for minor classes and achieves balanced online learning. Notably, HGD operates in a streamlined implementation process, requiring no data-buffer, extra parameters, or prior knowledge, making it applicable to any learning models utilizing gradient descent for optimization. Theoretical analysis, based on a few common and mild assumptions, shows that HGD achieves a satisfied sub-linear regret bound. The proposed algorithm are compared with the commonly used online imbalance learning methods under several imbalanced data stream scenarios. Extensive experimental evaluations demonstrate the efficiency and effectiveness of HGD in learning imbalanced data streams.

[LG-15] A Global Dataset of Location Data Integrity-Assessed Reforestation Efforts

链接: https://arxiv.org/abs/2508.11349
作者: Angela John,Selvyn Allotey,Till Koebe,Alexandra Tyukavina,Ingmar Weber
类目: Machine Learning (cs.LG)
*备注: 10 figures

点击查看摘要

Abstract:Afforestation and reforestation are popular strategies for mitigating climate change by enhancing carbon sequestration. However, the effectiveness of these efforts is often self-reported by project developers, or certified through processes with limited external validation. This leads to concerns about data reliability and project integrity. In response to increasing scrutiny of voluntary carbon markets, this study presents a dataset on global afforestation and reforestation efforts compiled from primary (meta-)information and augmented with time-series satellite imagery and other secondary data. Our dataset covers 1,289,068 planting sites from 45,628 projects spanning 33 years. Since any remote sensing-based validation effort relies on the integrity of a planting site’s geographic boundary, this dataset introduces a standardized assessment of the provided site-level location information, which we summarize in one easy-to-communicate key indicator: LDIS – the Location Data Integrity Score. We find that approximately 79% of the georeferenced planting sites monitored fail on at least 1 out of 10 LDIS indicators, while 15% of the monitored projects lack machine-readable georeferenced data in the first place. In addition to enhancing accountability in the voluntary carbon market, the presented dataset also holds value as training data for e.g. computer vision-related tasks with millions of linked Sentinel-2 and Planetscope satellite images.

[LG-16] Conformal Prediction Meets Long-tail Classification

链接: https://arxiv.org/abs/2508.11345
作者: Shuqi Liu,Jianguo Huang,Luke Ong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal Prediction (CP) is a popular method for uncertainty quantification that converts a pretrained model’s point prediction into a prediction set, with the set size reflecting the model’s confidence. Although existing CP methods are guaranteed to achieve marginal coverage, they often exhibit imbalanced coverage across classes under long-tail label distributions, tending to over cover the head classes at the expense of under covering the remaining tail classes. This under coverage is particularly concerning, as it undermines the reliability of the prediction sets for minority classes, even with coverage ensured on average. In this paper, we propose the Tail-Aware Conformal Prediction (TACP) method to mitigate the under coverage of the tail classes by utilizing the long-tail structure and narrowing the head-tail coverage gap. Theoretical analysis shows that it consistently achieves a smaller head-tail coverage gap than standard methods. To further improve coverage balance across all classes, we introduce an extension of TACP: soft TACP (sTACP) via a reweighting mechanism. The proposed framework can be combined with various non-conformity scores, and experiments on multiple long-tail benchmark datasets demonstrate the effectiveness of our methods.

[LG-17] Enhancing Interactive Voting-Based Map Matching: Improving Efficiency and Robustness for Heterogeneous GPS Trajectories

链接: https://arxiv.org/abs/2508.11235
作者: William Alemanni,Arianna Burzacchi,Davide Colombi,Elena Giarratano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents an enhanced version of the Interactive Voting-Based Map Matching algorithm, designed to efficiently process trajectories with varying sampling rates. The main aim is to reconstruct GPS trajectories with high accuracy, independent of input data quality. Building upon the original algorithm, developed exclusively for aligning GPS signals to road networks, we extend its capabilities by integrating trajectory imputation. Our improvements also include the implementation of a distance-bounded interactive voting strategy to reduce computational complexity, as well as modifications to address missing data in the road network. Furthermore, we incorporate a custom-built asset derived from OpenStreetMap, enabling this approach to be smoothly applied in any geographic region covered by OpenStreetMap’s road network. These advancements preserve the core strengths of the original algorithm while significantly extending its applicability to diverse real-world scenarios.

[LG-18] Air Quality PM2.5 Index Prediction Model Based on CNN-LSTM

链接: https://arxiv.org/abs/2508.11215
作者: Zicheng Guo,Shuqi Wu,Meixing Zhu,He Guandi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the intensification of global climate change, accurate prediction of air quality indicators, especially PM2.5 concentration, has become increasingly important in fields such as environmental protection, public health, and urban management. To address this, we propose an air quality PM2.5 index prediction model based on a hybrid CNN-LSTM architecture. The model effectively combines Convolutional Neural Networks (CNN) for local spatial feature extraction and Long Short-Term Memory (LSTM) networks for modeling temporal dependencies in time series data. Using a multivariate dataset collected from an industrial area in Beijing between 2010 and 2015 – which includes hourly records of PM2.5 concentration, temperature, dew point, pressure, wind direction, wind speed, and precipitation – the model predicts the average PM2.5 concentration over 6-hour intervals. Experimental results show that the model achieves a root mean square error (RMSE) of 5.236, outperforming traditional time series models in both accuracy and generalization. This demonstrates its strong potential in real-world applications such as air pollution early warning systems. However, due to the complexity of multivariate inputs, the model demands high computational resources, and its ability to handle diverse atmospheric factors still requires optimization. Future work will focus on enhancing scalability and expanding support for more complex multivariate weather prediction tasks.

[LG-19] Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning ALT

链接: https://arxiv.org/abs/2508.11210
作者: Minghui Sun,Matthew M. Engelhard,Benjamin A. Goldstein
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: accepted by Machine Learning for Healthcare 2025

点击查看摘要

Abstract:Risk assessments for a pediatric population are often conducted across multiple stages. For example, clinicians may evaluate risks prenatally, at birth, and during Well-Child visits. Although predictions made at later stages typically achieve higher precision, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on improving prediction performance in early-stage risk assessments. Our solution, \textbfBorrowing From the Future (BFF), is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all available data throughout the time while performing a risk assessment using up-to-date information. This contrastive framework allows the model to ``borrow’’ informative signals from later stages (e.g., Well-Child visits) to implicitly supervise the learning at earlier stages (e.g., prenatal/birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessments. The code is available at this https URL.

[LG-20] Meta-learning Structure-Preserving Dynamics

链接: https://arxiv.org/abs/2508.11205
作者: Cheng Jing,Uvini Balasuriya Mudiyanselage,Woojin Cho,Minju Jo,Anthony Gruber,Kookjin Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Structure-preserving approaches to dynamics modeling have demonstrated great potential for modeling physical systems due to their strong inductive biases that enforce conservation laws and dissipative behavior. However, the resulting models are typically trained for fixed system configurations, requiring explicit knowledge of system parameters as well as costly retraining for each new set of parameters – a major limitation in many-query or parameter-varying scenarios. Meta-learning offers a potential solution, but existing approaches like optimization-based meta-learning often suffer from training instability or limited generalization capability. Inspired by ideas from computer vision, we introduce a modulation-based meta-learning framework that directly conditions structure-preserving models on compact latent representations of potentially unknown system parameters, avoiding the need for gray-box system knowledge and explicit optimization during adaptation. Through the application of novel modulation strategies to parametric energy-conserving and dissipative systems, we enable scalable and generalizable learning across parametric families of dynamical systems. Experiments on standard benchmark problems demonstrate that our approach achieves accurate predictions in few-shot learning settings, without compromising on the essential physical constraints necessary for dynamical stability and effective generalization performance across parameter space.

[LG-21] Mitigating Modality Quantity and Quality Imbalance in Multimodal Online Federated Learning

链接: https://arxiv.org/abs/2508.11159
作者: Heqiang Wang,Weihong Yang,Xiaoxiong Zhong,Jia Zhou,Fangming Liu,Weizhe Zhang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2505.16138

点击查看摘要

Abstract:The Internet of Things (IoT) ecosystem produces massive volumes of multimodal data from diverse sources, including sensors, cameras, and microphones. With advances in edge intelligence, IoT devices have evolved from simple data acquisition units into computationally capable nodes, enabling localized processing of heterogeneous multimodal data. This evolution necessitates distributed learning paradigms that can efficiently handle such data. Furthermore, the continuous nature of data generation and the limited storage capacity of edge devices demand an online learning framework. Multimodal Online Federated Learning (MMO-FL) has emerged as a promising approach to meet these requirements. However, MMO-FL faces new challenges due to the inherent instability of IoT devices, which often results in modality quantity and quality imbalance (QQI) during data collection. In this work, we systematically investigate the impact of QQI within the MMO-FL framework and present a comprehensive theoretical analysis quantifying how both types of imbalance degrade learning performance. To address these challenges, we propose the Modality Quantity and Quality Rebalanced (QQR) algorithm, a prototype learning based method designed to operate in parallel with the training process. Extensive experiments on two real-world multimodal datasets show that the proposed QQR algorithm consistently outperforms benchmarks under modality imbalance conditions with promising learning performance.

[LG-22] owards the Next-generation Bayesian Network Classifiers

链接: https://arxiv.org/abs/2508.11145
作者: Huan Zhang,Daokun Zhang,Kexin Meng,Geoffrey I. Webb
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian network classifiers provide a feasible solution to tabular data classification, with a number of merits like high time and memory efficiency, and great explainability. However, due to the parameter explosion and data sparsity issues, Bayesian network classifiers are restricted to low-order feature dependency modeling, making them struggle in extrapolating the occurrence probabilities of complex real-world data. In this paper, we propose a novel paradigm to design high-order Bayesian network classifiers, by learning distributional representations for feature values, as what has been done in word embedding and graph representation learning. The learned distributional representations are encoded with the semantic relatedness between different features through their observed co-occurrence patterns in training data, which then serve as a hallmark to extrapolate the occurrence probabilities of new test samples. As a classifier design realization, we remake the K-dependence Bayesian classifier (KDB) by extending it into a neural version, i.e., NeuralKDB, where a novel neural network architecture is designed to learn distributional representations of feature values and parameterize the conditional probabilities between interdependent features. A stochastic gradient descent based algorithm is designed to train the NeuralKDB model efficiently. Extensive classification experiments on 60 UCI datasets demonstrate that the proposed NeuralKDB classifier excels in capturing high-order feature dependencies and significantly outperforms the conventional Bayesian network classifiers, as well as other competitive classifiers, including two neural network based classifiers without distributional representation learning.

[LG-23] CTRL Your Shift: Clustered Transfer Residual Learning for Many Small Datasets

链接: https://arxiv.org/abs/2508.11144
作者: Gauri Jain,Dominik Rothenhäusler,Kirk Bansak,Elisabeth Paulson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) tasks often utilize large-scale data that is drawn from several distinct sources, such as different locations, treatment arms, or groups. In such settings, practitioners often desire predictions that not only exhibit good overall accuracy, but also remain reliable within each source and preserve the differences that matter across sources. For instance, several asylum and refugee resettlement programs now use ML-based employment predictions to guide where newly arriving families are placed within a host country, which requires generating informative and differentiated predictions for many and often small source locations. However, this task is made challenging by several common characteristics of the data in these settings: the presence of numerous distinct data sources, distributional shifts between them, and substantial variation in sample sizes across sources. This paper introduces Clustered Transfer Residual Learning (CTRL), a meta-learning method that combines the strengths of cross-domain residual learning and adaptive pooling/clustering in order to simultaneously improve overall accuracy and preserve source-level heterogeneity. We provide theoretical results that clarify how our objective navigates the trade-off between data quantity and data quality. We evaluate CTRL alongside other state-of-the-art benchmarks on 5 large-scale datasets. This includes a dataset from the national asylum program in Switzerland, where the algorithmic geographic assignment of asylum seekers is currently being piloted. CTRL consistently outperforms the benchmarks across several key metrics and when using a range of different base learners.

[LG-24] Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation

链接: https://arxiv.org/abs/2508.11105
作者: Sajjad Saed,Babak Teimourpour
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The rapid expansion of the fashion industry and the growing variety of products have made it challenging for users to find compatible items on e-commerce platforms. Effective fashion recommendation systems are crucial for filtering irrelevant items and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are often treated independently in existing studies, often overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, inspired by the HFGN model, which leverages graph neural networks and graph attention mechanisms to tackle this issue. The proposed framework constructs a three-tier hierarchical graph of users, outfits, and items, integrating visual and textual features to simultaneously model outfit compatibility and user preferences. A graph attention mechanism dynamically weights node importance during representation propagation, enabling the capture of key interactions and generating precise representations for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms baseline models such as HFGN, achieving improved results in precision, HR, recall, NDCG, and this http URL results demonstrate that combining multimodal visual-textual features with a hierarchical graph structure and attention mechanisms significantly enhances the accuracy and efficiency of personalized fashion recommendation systems.

[LG-25] Predictive Multimodal Modeling of Diagnoses and Treatments in EHR

链接: https://arxiv.org/abs/2508.11092
作者: Cindy Shih-Ting Huang,Clarence Boon Liang Ng,Marek Rei
类目: Machine Learning (cs.LG)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:While the ICD code assignment problem has been widely studied, most works have focused on post-discharge document classification. Models for early forecasting of this information could be used for identifying health risks, suggesting effective treatments, or optimizing resource allocation. To address the challenge of predictive modeling using the limited information at the beginning of a patient stay, we propose a multimodal system to fuse clinical notes and tabular events captured in electronic health records. The model integrates pre-trained encoders, feature pooling, and cross-modal attention to learn optimal representations across modalities and balance their presence at every temporal point. Moreover, we present a weighted temporal loss that adjusts its contribution at each point in time. Experiments show that these strategies enhance the early prediction model, outperforming the current state-of-the-art systems.

[LG-26] Relative Advantage Debiasing for Watch-Time Prediction in Short-Video Recommendation

链接: https://arxiv.org/abs/2508.11086
作者: Emily Liu,Kuan Han,Minfeng Zhan,Bocheng Zhao,Guanyu Mu,Yang Song
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Watch time is widely used as a proxy for user satisfaction in video recommendation platforms. However, raw watch times are influenced by confounding factors such as video duration, popularity, and individual user behaviors, potentially distorting preference signals and resulting in biased recommendation models. We propose a novel relative advantage debiasing framework that corrects watch time by comparing it to empirically derived reference distributions conditioned on user and item groups. This approach yields a quantile-based preference signal and introduces a two-stage architecture that explicitly separates distribution estimation from preference learning. Additionally, we present distributional embeddings to efficiently parameterize watch-time quantiles without requiring online sampling or storage of historical data. Both offline and online experiments demonstrate significant improvements in recommendation accuracy and robustness compared to existing baseline methods.

[LG-27] A Feasibility Experiment on the Application of Predictive Coding to Instant Messaging Corpora

链接: https://arxiv.org/abs/2508.11084
作者: Thanasis Schoinas,Ghulam Qadir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predictive coding, the term used in the legal industry for document classification using machine learning, presents additional challenges when the dataset comprises instant messages, due to their informal nature and smaller sizes. In this paper, we exploit a data management workflow to group messages into day chats, followed by feature selection and a logistic regression classifier to provide an economically feasible predictive coding solution. We also improve the solution’s baseline model performance by dimensionality reduction, with focus on quantitative features. We test our methodology on an Instant Bloomberg dataset, rich in quantitative information. In parallel, we provide an example of the cost savings of our approach.

[LG-28] Abundance-Aware Set Transformer for Microbiome Sample Embedding

链接: https://arxiv.org/abs/2508.11075
作者: Hyunwoo Yoo,Gail Rosen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Microbiome sample representation to input into LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of each microbiome sample, most rely on simple averaging over sequence embeddings, often overlooking the biological importance of taxa abundance. In this work, we propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate embedding vectors proportional to their abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.

[LG-29] Human-in-the-Loop Systems for Adaptive Learning Using Generative AI

链接: https://arxiv.org/abs/2508.11062
作者: Bhavishya Tarun,Haoze Du,Dinesh Kannan,Edward F. Gehringer
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for presentation at the Frontiers in Education Conference, Nashville, Tennessee, USA, 2-5 November 2025

点击查看摘要

Abstract:A Human-in-the-Loop (HITL) approach leverages generative AI to enhance personalized learning by directly integrating student feedback into AI-generated solutions. Students critique and modify AI responses using predefined feedback tags, fostering deeper engagement and understanding. This empowers students to actively shape their learning, with AI serving as an adaptive partner. The system uses a tagging technique and prompt engineering to personalize content, informing a Retrieval-Augmented Generation (RAG) system to retrieve relevant educational material and adjust explanations in real time. This builds on existing research in adaptive learning, demonstrating how student-driven feedback loops can modify AI-generated responses for improved student retention and engagement, particularly in STEM education. Preliminary findings from a study with STEM students indicate improved learning outcomes and confidence compared to traditional AI tools. This work highlights AI’s potential to create dynamic, feedback-driven, and personalized learning environments through iterative refinement.

[LG-30] SHLIME: Foiling adversarial attacks fooling SHAP and LIME

链接: https://arxiv.org/abs/2508.11053
作者: Sam Chauhan,Estelle Duguet,Karthik Ramakrishnan,Hugh Van Deventer,Jack Kruger,Ranjan Subbaraman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 7 pages, 7 figures

点击查看摘要

Abstract:Post hoc explanation methods, such as LIME and SHAP, provide interpretable insights into black-box classifiers and are increasingly used to assess model biases and generalizability. However, these methods are vulnerable to adversarial manipulation, potentially concealing harmful biases. Building on the work of Slack et al. (2020), we investigate the susceptibility of LIME and SHAP to biased models and evaluate strategies for improving robustness. We first replicate the original COMPAS experiment to validate prior findings and establish a baseline. We then introduce a modular testing framework enabling systematic evaluation of augmented and ensemble explanation approaches across classifiers of varying performance. Using this framework, we assess multiple LIME/SHAP ensemble configurations on out-of-distribution models, comparing their resistance to bias concealment against the original methods. Our results identify configurations that substantially improve bias detection, highlighting their potential for enhancing transparency in the deployment of high-stakes machine learning systems.

[LG-31] Conditional Independence Estimates for the Generalized Nonparanormal

链接: https://arxiv.org/abs/2508.11050
作者: Ujas Shah(1),Manuel Lladser(1),Rebecca Morrison(1) ((1) University of Colorado Boulder)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages, 7 figures, 3 tables

点击查看摘要

Abstract:For general non-Gaussian distributions, the covariance and precision matrices do not encode the independence structure of the variables, as they do for the multivariate Gaussian. This paper builds on previous work to show that for a class of non-Gaussian distributions – those derived from diagonal transformations of a Gaussian – information about the conditional independence structure can still be inferred from the precision matrix, provided the data meet certain criteria, analogous to the Gaussian case. We call such transformations of the Gaussian as the generalized nonparanormal. The functions that define these transformations are, in a broad sense, arbitrary. We also provide a simple and computationally efficient algorithm that leverages this theory to recover conditional independence structure from the generalized nonparanormal data. The effectiveness of the proposed algorithm is demonstrated via synthetic experiments and applications to real-world data.

[LG-32] Quantization vs Pruning: Insights from the Strong Lottery Ticket Hypothesis

链接: https://arxiv.org/abs/2508.11020
作者: Aakash Kumar,Emanuele Natale
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization is an essential technique for making neural networks more efficient, yet our theoretical understanding of it remains limited. Previous works demonstrated that extremely low-precision networks, such as binary networks, can be constructed by pruning large, randomly-initialized networks, and showed that the ratio between the size of the original and the pruned networks is at most polylogarithmic. The specific pruning method they employed inspired a line of theoretical work known as the Strong Lottery Ticket Hypothesis (SLTH), which leverages insights from the Random Subset Sum Problem. However, these results primarily address the continuous setting and cannot be applied to extend SLTH results to the quantized setting. In this work, we build on foundational results by Borgs et al. on the Number Partitioning Problem to derive new theoretical results for the Random Subset Sum Problem in a quantized setting. Using these results, we then extend the SLTH framework to finite-precision networks. While prior work on SLTH showed that pruning allows approximation of a certain class of neural networks, we demonstrate that, in the quantized setting, the analogous class of target discrete neural networks can be represented exactly, and we prove optimal bounds on the necessary overparameterization of the initial network as a function of the precision of the target network. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2508.11020 [cs.LG] (or arXiv:2508.11020v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.11020 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Aakash Kumar [view email] [v1] Thu, 14 Aug 2025 18:51:34 UTC (27 KB)

[LG-33] A Cooperative Game-Based Multi-Criteria Weighted Ensemble Approach for Multi-Class Classification

链接: https://arxiv.org/abs/2508.10926
作者: DongSeong-Yoon
类目: Machine Learning (cs.LG)
*备注: English translation of the author’s pre-revision version of the article published in J-KICS 50(4):561-571 (2025), DOI https://doi.org/10.7840/kics.2025.50.4.561 . Posted with permission from KICS (Aug 7, 2025). The published version may differ

点击查看摘要

Abstract:Since the Fourth Industrial Revolution, AI technology has been widely used in many fields, but there are several limitations that need to be overcome, including overfitting/underfitting, class imbalance, and the limitations of representation (hypothesis space) due to the characteristics of different models. As a method to overcome these problems, ensemble, commonly known as model combining, is being extensively used in the field of machine learning. Among ensemble learning methods, voting ensembles have been studied with various weighting methods, showing performance improvements. However, the existing methods that reflect the pre-information of classifiers in weights consider only one evaluation criterion, which limits the reflection of various information that should be considered in a model realistically. Therefore, this paper proposes a method of making decisions considering various information through cooperative games in multi-criteria situations. Using this method, various types of information known beforehand in classifiers can be simultaneously considered and reflected, leading to appropriate weight distribution and performance improvement. The machine learning algorithms were applied to the Open-ML-CC18 dataset and compared with existing ensemble weighting methods. The experimental results showed superior performance compared to other weighting methods.

[LG-34] Insect-Wing Structured Microfluidic System for Reservoir Computing

链接: https://arxiv.org/abs/2508.10915
作者: Jacob Clouse(1),Thomas Ramsey(2),Samitha Somathilaka(1),Nicholas Kleinsasser(1),Sangjin Ryu(2),Sasitharan Balasubramaniam(1) ((1) School of Computing, University of Nebraska-Lincoln, Lincoln, Nebraska, USA, (2) Department of Mechanical and Materials Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, USA)
类目: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the demand for more efficient and adaptive computing grows, nature-inspired architectures offer promising alternatives to conventional electronic designs. Microfluidic platforms, drawing on biological forms and fluid dynamics, present a compelling foundation for low-power, high-resilience computing in environments where electronics are unsuitable. This study explores a hybrid reservoir computing system based on a dragonfly-wing inspired microfluidic chip, which encodes temporal input patterns as fluid interactions within the micro channel network. The system operates with three dye-based inlet channels and three camera-monitored detection areas, transforming discrete spatial patterns into dynamic color output signals. These reservoir output signals are then modified and passed to a simple and trainable readout layer for pattern classification. Using a combination of raw reservoir outputs and synthetically generated outputs, we evaluated system performance, system clarity, and data efficiency. The results demonstrate consistent classification accuracies up to 91% , even with coarse resolution and limited training data, highlighting the viability of the microfluidic reservoir computing. Subjects: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Machine Learning (cs.LG) Cite as: arXiv:2508.10915 [cs.NE] (or arXiv:2508.10915v1 [cs.NE] for this version) https://doi.org/10.48550/arXiv.2508.10915 Focus to learn more arXiv-issued DOI via DataCite

[LG-35] Uncovering Latent Connections in Indigenous Heritage: Semantic Pipelines for Cultural Preservation in Brazil AAAI2026

链接: https://arxiv.org/abs/2508.10911
作者: Luis Vitor Zerkowski,Nina S. T. Hirata
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 8 tables, 7 figures, submitted to AAAI2026

点击查看摘要

Abstract:Indigenous communities face ongoing challenges in preserving their cultural heritage, particularly in the face of systemic marginalization and urban development. In Brazil, the Museu Nacional dos Povos Indigenas through the Tainacan platform hosts the country’s largest online collection of Indigenous objects and iconographies, providing a critical resource for cultural engagement. Using publicly available data from this repository, we present a data-driven initiative that applies artificial intelligence to enhance accessibility, interpretation, and exploration. We develop two semantic pipelines: a visual pipeline that models image-based similarity and a textual pipeline that captures semantic relationships from item descriptions. These embedding spaces are projected into two dimensions and integrated into an interactive visualization tool we also developed. In addition to similarity-based navigation, users can explore the collection through temporal and geographic lenses, enabling both semantic and contextualized perspectives. The system supports curatorial tasks, aids public engagement, and reveals latent connections within the collection. This work demonstrates how AI can ethically contribute to cultural preservation practices.

[LG-36] Nonparametric learning of stochastic differential equations from sparse and noisy data

链接: https://arxiv.org/abs/2508.11597
作者: Arnab Ganguly,Riten Mitra,Jinpu Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Methodology (stat.ME)
*备注: 35 pages, 6 figures

点击查看摘要

Abstract:The paper proposes a systematic framework for building data-driven stochastic differential equation (SDE) models from sparse, noisy observations. Unlike traditional parametric approaches, which assume a known functional form for the drift, our goal here is to learn the entire drift function directly from data without strong structural assumptions, making it especially relevant in scientific disciplines where system dynamics are partially understood or highly complex. We cast the estimation problem as minimization of the penalized negative log-likelihood functional over a reproducing kernel Hilbert space (RKHS). In the sparse observation regime, the presence of unobserved trajectory segments makes the SDE likelihood intractable. To address this, we develop an Expectation-Maximization (EM) algorithm that employs a novel Sequential Monte Carlo (SMC) method to approximate the filtering distribution and generate Monte Carlo estimates of the E-step objective. The M-step then reduces to a penalized empirical risk minimization problem in the RKHS, whose minimizer is given by a finite linear combination of kernel functions via a generalized representer theorem. To control model complexity across EM iterations, we also develop a hybrid Bayesian variant of the algorithm that uses shrinkage priors to identify significant coefficients in the kernel expansion. We establish important theoretical convergence results for both the exact and approximate EM sequences. The resulting EM-SMC-RKHS procedure enables accurate estimation of the drift function of stochastic dynamical systems in low-data regimes and is broadly applicable across domains requiring continuous-time modeling under observational constraints. We demonstrate the effectiveness of our method through a series of numerical experiments.

[LG-37] Repetitive TMS-based Identification of Methamphetamine-Dependent Individuals Using EEG Spectra

链接: https://arxiv.org/abs/2508.11312
作者: Ziyi Zeng,Yun-Hsuan Chen,Xurong Gao,Wenyao Zheng,Hemmings Wu,Zhoule Zhu,Jie Yang,Chengkai Wang,Lihua Zhong,Weiwei Cheng,Mohamad Sawan
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages, 9 figures

点击查看摘要

Abstract:The impact of repetitive transcranial magnetic stimulation (rTMS) on methamphetamine (METH) users’ craving levels is often assessed using questionnaires. This study explores the feasibility of using neural signals to obtain more objective results. EEG signals recorded from 20 METH-addicted participants Before and After rTMS (MBT and MAT) and from 20 healthy participants (HC) are analyzed. In each EEG paradigm, participants are shown 15 METH-related and 15 neutral pictures randomly, and the relative band power (RBP) of each EEG sub-band frequency is derived. The average RBP across all 31 channels, as well as individual brain regions, is analyzed. Statistically, MAT’s alpha, beta, and gamma RBPs are more like those of HC compared to MBT, as indicated by the power topographies. Utilizing a random forest (RF), the gamma RBP is identified as the optimal frequency band for distinguishing between MBT and HC with a 90% accuracy. The performance of classifying MAT versus HC is lower than that of MBT versus HC, suggesting that the efficacy of rTMS can be validated using RF with gamma RBP. Furthermore, the gamma RBP recorded by the TP10 and CP2 channels dominates the classification task of MBT versus HC when receiving METH-related image cues. The gamma RBP during exposure to METH-related cues can serve as a biomarker for distinguishing between MBT and HC and for evaluating the effectiveness of rTMS. Therefore, real-time monitoring of gamma RBP variations holds promise as a parameter for implementing a customized closed-loop neuromodulation system for treating METH addiction.

[LG-38] Approximating the universal thermal climate index using sparse regression with orthogonal polynomials

链接: https://arxiv.org/abs/2508.11307
作者: Sabin Roman,Gregor Skok,Ljupco Todorovski,Saso Dzeroski
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:This article explores novel data-driven modeling approaches for analyzing and approximating the Universal Thermal Climate Index (UTCI), a physiologically-based metric integrating multiple atmospheric variables to assess thermal comfort. Given the nonlinear, multivariate structure of UTCI, we investigate symbolic and sparse regression techniques as tools for interpretable and efficient function approximation. In particular, we highlight the benefits of using orthogonal polynomial bases-such as Legendre polynomials-in sparse regression frameworks, demonstrating their advantages in stability, convergence, and hierarchical interpretability compared to standard polynomial expansions. We demonstrate that our models achieve significantly lower root-mean squared losses than the widely used sixth-degree polynomial benchmark-while using the same or fewer parameters. By leveraging Legendre polynomial bases, we construct models that efficiently populate a Pareto front of accuracy versus complexity and exhibit stable, hierarchical coefficient structures across varying model capacities. Training on just 20% of the data, our models generalize robustly to the remaining 80%, with consistent performance under bootstrapping. The decomposition effectively approximates the UTCI as a Fourier-like expansion in an orthogonal basis, yielding results near the theoretical optimum in the L2 (least squares) sense. We also connect these findings to the broader context of equation discovery in environmental modeling, referencing probabilistic grammar-based methods that enforce domain consistency and compactness in symbolic expressions. Taken together, these results illustrate how combining sparsity, orthogonality, and symbolic structure enables robust, interpretable modeling of complex environmental indices like UTCI - and significantly outperforms the state-of-the-art approximation in both accuracy and efficiency.

[LG-39] Uniform convergence for Gaussian kernel ridge regression

链接: https://arxiv.org/abs/2508.11274
作者: Paul Dommel,Rajmadan Lakshmanan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper establishes the first polynomial convergence rates for Gaussian kernel ridge regression (KRR) with a fixed hyperparameter in both the uniform and the L^2 -norm. The uniform convergence result closes a gap in the theoretical understanding of KRR with the Gaussian kernel, where no such rates were previously known. In addition, we prove a polynomial L^2 -convergence rate in the case, where the Gaussian kernel’s width parameter is fixed. This also contributes to the broader understanding of smooth kernels, for which previously only sub-polynomial L^2 -rates were known in similar settings. Together, these results provide new theoretical justification for the use of Gaussian KRR with fixed hyperparameters in nonparametric regression.

[LG-40] he Role of Entanglement in Quantum Reservoir Computing with Coupled Kerr Nonlinear Oscillators

链接: https://arxiv.org/abs/2508.11175
作者: Ali Karimi,Hadi Zadeh-Haghighi,Youssef Kora,Christoph Simon
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Quantum Reservoir Computing (QRC) uses quantum dynamics to efficiently process temporal data. In this work, we investigate a QRC framework based on two coupled Kerr nonlinear oscillators, a system well-suited for time-series prediction tasks due to its complex nonlinear interactions and potentially high-dimensional state space. We explore how its performance in time-series prediction depends on key physical parameters: input drive strength, Kerr nonlinearity, and oscillator coupling, and analyze the role of entanglement in improving the reservoir’s computational performance, focusing on its effect on predicting non-trivial time series. Using logarithmic negativity to quantify entanglement and normalized root mean square error (NRMSE) to evaluate predictive accuracy, our results suggest that entanglement provides a computational advantage on average-up to a threshold in the input frequency-that persists under some levels of dissipation and dephasing. In particular, we find that higher dissipation rates can enhance performance. While the entanglement advantage manifests as improvements in both average and worst-case performance, it does not lead to improvements in the best-case error. These findings contribute to the broader understanding of quantum reservoirs for high performance, efficient quantum machine learning and time-series forecasting.

[LG-41] Functional Analysis of Variance for Association Studies

链接: https://arxiv.org/abs/2508.11069
作者: Olga A. Vsevolozhskaya,Dmitri V. Zaykin,Mark C. Greenwood,Changshuai Wei,Qing Lu
类目: Applications (stat.AP); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:While progress has been made in identifying common genetic variants associated with human diseases, for most of common complex diseases, the identified genetic variants only account for a small proportion of heritability. Challenges remain in finding additional unknown genetic variants predisposing to complex diseases. With the advance in next-generation sequencing technologies, sequencing studies have become commonplace in genetic research. The ongoing exome-sequencing and whole-genome-sequencing studies generate a massive amount of sequencing variants and allow researchers to comprehensively investigate their role in human diseases. The discovery of new disease-associated variants can be enhanced by utilizing powerful and computationally efficient statistical methods. In this paper, we propose a functional analysis of variance (FANOVA) method for testing an association of sequence variants in a genomic region with a qualitative trait. The FANOVA has a number of advantages: (1) it tests for a joint effect of gene variants, including both common and rare; (2) it fully utilizes linkage disequilibrium and genetic position information; and (3) allows for either protective or risk-increasing causal variants. Through simulations, we show that FANOVA outperform two popularly used methods - SKAT and a previously proposed method based on functional linear models (FLM), - especially if a sample size of a study is small and/or sequence variants have low to moderate effects. We conduct an empirical study by applying three methods (FANOVA, SKAT and FLM) to sequencing data from Dallas Heart Study. While SKAT and FLM respectively detected ANGPTL 4 and ANGPTL 3 associated with obesity, FANOVA was able to identify both genes associated with obesity.

[LG-42] Counterfactual Survival Q Learning for Longitudinal Randomized Trials via Buckley James Boosting

链接: https://arxiv.org/abs/2508.11060
作者: Jeongjin Lee,Jong-Min Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a Buckley James (BJ) Boost Q learning framework for estimating optimal dynamic treatment regimes under right censored survival data, tailored for longitudinal randomized clinical trial settings. The method integrates accelerated failure time models with iterative boosting techniques, including componentwise least squares and regression trees, within a counterfactual Q learning framework. By directly modeling conditional survival time, BJ Boost Q learning avoids the restrictive proportional hazards assumption and enables unbiased estimation of stage specific Q functions. Grounded in potential outcomes, this framework ensures identifiability of the optimal treatment regime under standard causal assumptions. Compared to Cox based Q learning, which relies on hazard modeling and may suffer from bias under misspecification, our approach provides robust and flexible estimation. Simulation studies and analysis of the ACTG175 HIV trial demonstrate that BJ Boost Q learning yields higher accuracy in treatment decision making, especially in multistage settings where bias can accumulate.

[LG-43] Non-asymptotic convergence bound of conditional diffusion models

链接: https://arxiv.org/abs/2508.10944
作者: Mengze Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning and generating various types of data based on conditional diffusion models has been a research hotspot in recent years. Although conditional diffusion models have made considerable progress in improving acceleration algorithms and enhancing generation quality, the lack of non-asymptotic properties has hindered theoretical research. To address this gap, we focus on a conditional diffusion model within the domains of classification and regression (CARD), which aims to learn the original distribution with given input x (denoted as Y|X). It innovatively integrates a pre-trained model f_\phi(x) into the original diffusion model framework, allowing it to precisely capture the original conditional distribution given f (expressed as Y|f_\phi(x)). Remarkably, when f_\phi(x) performs satisfactorily, Y|f_\phi(x) closely approximates Y|X. Theoretically, we deduce the stochastic differential equations of CARD and establish its generalized form predicated on the Fokker-Planck equation, thereby erecting a firm theoretical foundation for analysis. Mainly under the Lipschitz assumptions, we utilize the second-order Wasserstein distance to demonstrate the upper error bound between the original and the generated conditional distributions. Additionally, by appending assumptions such as light-tailedness to the original distribution, we derive the convergence upper bound between the true value analogous to the score function and the corresponding network-estimated value.

[LG-44] CleanCTG: A Deep Learning Model for Multi-Artefact Detection and Reconstruction in Cardiotocography

链接: https://arxiv.org/abs/2508.10928
作者: Sheng Wong,Beth Albert,Gabriel Davis Jones
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Cardiotocography (CTG) is essential for fetal monitoring but is frequently compromised by diverse artefacts which obscure true fetal heart rate (FHR) patterns and can lead to misdiagnosis or delayed intervention. Current deep-learning approaches typically bypass comprehensive noise handling, applying minimal preprocessing or focusing solely on downstream classification, while traditional methods rely on simple interpolation or rule-based filtering that addresses only missing samples and fail to correct complex artefact types. We present CleanCTG, an end-to-end dual-stage model that first identifies multiple artefact types via multi-scale convolution and context-aware cross-attention, then reconstructs corrupted segments through artefact-specific correction branches. Training utilised over 800,000 minutes of physiologically realistic, synthetically corrupted CTGs derived from expert-verified “clean” recordings. On synthetic data, CleanCTG achieved perfect artefact detection (AU-ROC = 1.00) and reduced mean squared error (MSE) on corrupted segments to 2.74 x 10^-4 (clean-segment MSE = 2.40 x 10^-6), outperforming the next best method by more than 60%. External validation on 10,190 minutes of clinician-annotated segments yielded AU-ROC = 0.95 (sensitivity = 83.44%, specificity 94.22%), surpassing six comparator classifiers. Finally, when integrated with the Dawes-Redman system on 933 clinical CTG recordings, denoised traces increased specificity (from 80.70% to 82.70%) and shortened median time to decision by 33%. These findings suggest that explicit artefact removal and signal reconstruction can both maintain diagnostic accuracy and enable shorter monitoring sessions, offering a practical route to more reliable CTG interpretation.

[LG-45] Data-driven global ocean model resolving ocean-atmosphere coupling dynamics

链接: https://arxiv.org/abs/2508.10908
作者: Jeong-Hwan Kim,Daehyun Kang,Young-Min Yang,Jae-Heung Park,Yoo-Geun Ham
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: The manuscript contains 4 main figures. The Extended Data contains 7 figures and 3 tables. The Supplementary Information contains 3 text sections, 7 figures, 1 table

点击查看摘要

Abstract:Artificial intelligence has advanced global weather forecasting, outperforming traditional numerical models in both accuracy and computational efficiency. Nevertheless, extending predictions beyond subseasonal timescales requires the development of deep learning (DL)-based ocean-atmosphere coupled models that can realistically simulate complex oceanic responses to atmospheric forcing. This study presents KIST-Ocean, a DL-based global three-dimensional ocean general circulation model using a U-shaped visual attention adversarial network architecture. KIST-Ocean integrates partial convolution, adversarial training, and transfer learning to address coastal complexity and predictive distribution drift in auto-regressive models. Comprehensive evaluations confirmed the model’s robust ocean predictive skill and efficiency. Moreover, it accurately captures realistic ocean response, such as Kelvin and Rossby wave propagation in the tropical Pacific, and vertical motions induced by cyclonic and anticyclonic wind stress, demonstrating its ability to represent key ocean-atmosphere coupling mechanisms underlying climate phenomena, including the El Nino-Southern Oscillation. These findings reinforce confidence in DL-based global weather and climate models and their extending DL-based approaches to broader Earth system modeling, offering potential for enhancing climate prediction capabilities.

信息检索

[IR-0] INFNet: A Task-aware Information Flow Network for Large-Scale Recommendation Systems

链接: https://arxiv.org/abs/2508.11565
作者: Kaiyuan Li,Dongdong Mao,Yongxiang Tang,Yanhua Cheng,Yanxiang Zeng,Chao Wang,Xialong Liu,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Feature interaction has long been a cornerstone of ranking models in large-scale recommender systems due to its proven effectiveness in capturing complex dependencies among features. However, existing feature interaction strategies face two critical challenges in industrial applications: (1) The vast number of categorical and sequential features makes exhaustive interaction computationally prohibitive, often resulting in optimization difficulties. (2) Real-world recommender systems typically involve multiple prediction objectives, yet most current approaches apply feature interaction modules prior to the multi-task learning layers. This late-fusion design overlooks task-specific feature dependencies and inherently limits the capacity of multi-task modeling. To address these limitations, we propose the Information Flow Network (INFNet), a task-aware architecture designed for large-scale recommendation scenarios. INFNet distinguishes features into three token types, categorical tokens, sequence tokens, and task tokens, and introduces a novel dual-flow design comprising heterogeneous and homogeneous alternating information blocks. For heterogeneous information flow, we employ a cross-attention mechanism with proxy that facilitates efficient cross-modal token interaction with balanced computational cost. For homogeneous flow, we design type-specific Proxy Gated Units (PGUs) to enable fine-grained intra-type feature processing. Extensive experiments on multiple offline benchmarks confirm that INFNet achieves state-of-the-art performance. Moreover, INFNet has been successfully deployed in a commercial online advertising system, yielding significant gains of +1.587% in Revenue (REV) and +1.155% in Click-Through Rate (CTR).

[IR-1] When Algorithms Mirror Minds: A Confirmation-Aware Social Dynamic Model of Echo Chamber and Homogenization Traps

链接: https://arxiv.org/abs/2508.11516
作者: Ming Tang,Xiaowen Huang,Jitao Sang
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems increasingly suffer from echo chambers and user homogenization, systemic distortions arising from the dynamic interplay between algorithmic recommendations and human behavior. While prior work has studied these phenomena through the lens of algorithmic bias or social network structure, we argue that the psychological mechanisms of users and the closed-loop interaction between users and recommenders are critical yet understudied drivers of these emergent effects. To bridge this gap, we propose the Confirmation-Aware Social Dynamic Model which incorporates user psychology and social relationships to simulate the actual user and recommender interaction process. Our theoretical analysis proves that echo chambers and homogenization traps, defined respectively as reduced recommendation diversity and homogenized user representations, will inevitably occur. We also conduct extensive empirical simulations on two real-world datasets and one synthetic dataset with five well-designed metrics, exploring the root factors influencing the aforementioned phenomena from three level perspectives: the stochasticity and social integration degree of recommender (system-level), the psychological mechanisms of users (user-level), and the dataset scale (platform-level). Furthermore, we demonstrate four practical mitigation strategies that help alleviate echo chambers and user homogenization at the cost of some recommendation accuracy. Our findings provide both theoretical and empirical insights into the emergence and drivers of echo chambers and user homogenization, as well as actionable guidelines for human-centered recommender design.

[IR-2] RAG for Geoscience: What We Expect Gaps and Opportunities

链接: https://arxiv.org/abs/2508.11246
作者: Runlong Yu,Shiyuan Luo,Rahul Ghosh,Lingyao Li,Yiqun Xie,Xiaowei Jia
类目: Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances language models by combining retrieval with generation. However, its current workflow remains largely text-centric, limiting its applicability in geoscience. Many geoscientific tasks are inherently evidence-hungry. Typical examples involve imputing missing observations using analog scenes, retrieving equations and parameters to calibrate models, geolocating field photos based on visual cues, or surfacing historical case studies to support policy analyses. A simple ``retrieve-then-generate’’ pipeline is insufficient for these needs. We envision Geo-RAG, a next-generation paradigm that reimagines RAG as a modular retrieve \rightarrow reason \rightarrow generate \rightarrow verify loop. Geo-RAG supports four core capabilities: (i) retrieval of multi-modal Earth data; (ii) reasoning under physical and domain constraints; (iii) generation of science-grade artifacts; and (iv) verification of generated hypotheses against numerical models, ground measurements, and expert assessments. This shift opens new opportunities for more trustworthy and transparent geoscience workflows.

[IR-3] Mitigating Filter Bubble from the Perspective of Community Detection: A Universal Framework

链接: https://arxiv.org/abs/2508.11239
作者: Ming Tang,Xiaowen Huang,Jitao Sang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In recent years, recommender systems have primarily focused on improving accuracy at the expense of diversity, which exacerbates the well-known filter bubble effect. This paper proposes a universal framework called CD-CGCN to address the filter bubble issue in recommender systems from a community detection perspective. By analyzing user-item interaction histories with a community detection algorithm, we reveal that state-of-the-art recommendations often focus on intra-community items, worsening the filter bubble effect. CD-CGCN, a model-agnostic framework, integrates a Conditional Discriminator and a Community-reweighted Graph Convolutional Network which can be plugged into most recommender models. Using adversarial learning based on community labels, it counteracts the extracted community attributes and incorporates an inference strategy tailored to the user’s specific filter bubble state. Extensive experiments on real-world datasets with multiple base models validate its effectiveness in mitigating filter bubbles while preserving recommendation quality. Additionally, by applying community debiasing to the original test set to construct an unbiased test set, we observe that CD-CGCN demonstrates superior performance in capturing users’ inter-community preferences.

[IR-4] Representation Quantization for Collaborative Filtering Augmentation

链接: https://arxiv.org/abs/2508.11194
作者: Yunze Luo,Yinjie Jiang,Gaode Chen,Jingchi Wang,Shicheng Wang,Ruina Sun,Jiang Yuezihan,Jun Zhang,Jian Liang,Han Li,Kun Gai,Kaigui Bian
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:As the core algorithm in recommendation systems, collaborative filtering (CF) algorithms inevitably face the problem of data sparsity. Since CF captures similar users and items for recommendations, it is effective to augment the lacking user-user and item-item homogeneous linkages. However, existing methods are typically limited to connecting through overlapping interacted neighbors or through similar attributes and contents. These approaches are constrained by coarse-grained, sparse attributes and fail to effectively extract behavioral characteristics jointly from interaction sequences and attributes. To address these challenges, we propose a novel two-stage collaborative recommendation algorithm, DQRec: Decomposition-based Quantized Variational AutoEncoder (DQ-VAE) for Recommendation. DQRec augments features and homogeneous linkages by extracting the behavior characteristics jointly from interaction sequences and attributes, namely patterns, such as user multi-aspect interests. Inspired by vector quantization (VQ) technology, we propose a new VQ algorithm, DQ-VAE, which decomposes the pre-trained representation embeddings into distinct dimensions, and quantize them to generates semantic IDs. We utilize the generated semantic IDs as the extracted patterns mentioned above. By integrating these semantic ID patterns into the recommendation process through feature and linkage augmentation, the system enriches both latent and explicit user and item features, identifies pattern-similar neighbors, and thereby improves the efficiency of information diffusion. Experimental comparisons with baselines across multiple datasets demonstrate the superior performance of the proposed DQRec method.

[IR-5] JobPulse: A Big Data Approach to Real-Time Engineering Workforce Analysis and National Industrial Policy

链接: https://arxiv.org/abs/2508.11014
作者: Karen S. Markel,Mihir Tale,Andrea Belz
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Employment on a societal scale contributes heavily to national and global affairs; consequently, job openings and unemployment estimates provide important information to financial markets and governments alike. However, such reports often describe only the supply (employee job seeker) side of the job market, and skill mismatches are poorly understood. Job postings aggregated on recruiting platforms illuminate marketplace demand, but to date have primarily focused on candidate skills described in their personal profiles. In this paper, we report on a big data approach to estimating job market mismatches by focusing on demand, as represented in publicly available job postings. We use commercially available web scraping tools and a new data processing scheme to build a job posting data set for the semiconductor industry, a strategically critical sector of the United States economy; we focus on Southern California as a central hub of advanced technologies. We report on the employer base and relative needs of various job functions. Our work contributes on three fronts: First, we provide nearly real-time insight into workforce demand; second, we discuss disambiguation and semantic challenges in analysis of employer data bases at scale; and third, we report on the Southern California semiconductor engineering ecosystem.

附件下载

点击下载今日全部论文列表