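This post presents the latest paper list retrieved from arXiv.org on 2025-08-27. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.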

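Overview (2025-08-27)

A total of 491 papers were updated today, including:

  • Natural Language Processing: 73 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 164 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 102 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 143 papers (Machine Learning (cs.LG))

Natural Language Processing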

[NLP-0] StepWiser: Stepwise Generative Judges for Wiser Reasoning

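[Quick read]: This paper addresses how to effectively supervise the logical validity of intermediate reasoning steps when multi-step reasoning models solve complex problems. Existing methods rely on classifier-style reward models that offer no explanations and are limited by static annotated datasets, which hurts generalization. The key to the solution is to recast stepwise reward modeling from a classification task into a reasoning task in its own right: the authors propose StepWiser, a generative judge that meta-reasons, generating thinking tokens to explain whether a reasoning step is sound before it outputs a final verdict. Training uses reinforcement learning based on the relative outcomes of rollouts, which improves judgment accuracy on intermediate steps and can in turn improve both policy-model training and inference-time search.

Link: https://arxiv.org/abs/2508.19229
Authors: Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: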

Abstract:As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model’s reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.

[NLP-1] Generative Interfaces for Language Models

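[Quick read]: This paper addresses the inefficiency of current large language models (LLMs) in multi-turn, information-dense, and exploratory tasks, which stems from the linear request-response interaction format. The key to the solution is the Generative Interfaces for Language Models paradigm, in which the LLM proactively generates user interfaces (UIs) to enable more adaptive and interactive engagement. The framework uses structured interface-specific representations and iterative refinement to translate user queries into task-specific UIs, improving interaction efficiency and user experience.

Link: https://arxiv.org/abs/2508.19227
Authors: Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Preprint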

Abstract:Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.

[NLP-2] Evaluating the Evaluators: Are readability metrics good measures of readability?

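[Quick read]: This paper addresses the limitations of current readability evaluation practice in Plain Language Summarization (PLS): traditional readability metrics such as Flesch-Kincaid Grade Level (FKGL) correlate poorly with human readability judgments and fail to capture deeper readability features, such as the background knowledge a non-expert reader needs. The key to the solution is to use language models (LMs) as more reliable readability judges. Empirically, the best LM reaches a Pearson correlation of 0.56 with human judgments, outperforming the traditional metrics, and LMs better identify deeper readability dimensions in PLS summaries such as required background knowledge, which gives an evaluation approach closer to real user needs.

Link: https://arxiv.org/abs/2508.19221
Authors: Isabel Cachola, Daniel Khashabi, Mark Dredze
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: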

Abstract:Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
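
The two quantities this abstract relies on, FKGL and the Pearson correlation, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' released analysis code: the FKGL formula is the standard one, but the vowel-group syllable counter is a rough heuristic assumed here for simplicity, and the sample texts and human ratings are made up.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up summaries and made-up human readability ratings, for illustration only.
summaries = [
    "The cat sat. The dog ran.",
    "Researchers evaluated several readability formulas on summaries.",
    "Pharmacokinetic heterogeneity complicates individualized anticoagulation.",
]
human_ratings = [5.0, 3.0, 1.0]
metric_scores = [fkgl(s) for s in summaries]
correlation = pearson(metric_scores, human_ratings)
```

A metric that tracks human judgments well would give a correlation with large magnitude; the abstract reports that most traditional metrics do not.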

[NLP-3] VibeVoice Technical Report

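[Quick read]: This paper addresses the efficiency and fidelity challenges of long-form multi-speaker speech synthesis, namely how to generate long continuous speech with high computational efficiency while preserving audio quality. The key to the solution is a new continuous speech tokenizer that improves data compression by 80 times over the popular Encodec model while maintaining comparable performance. Combined with next-token diffusion, a unified method for modeling continuous data, the model can synthesize up to 90 minutes of multi-speaker speech within a 64K context window, capturing the conversational "vibe" and surpassing open-source and proprietary dialogue models.

Link: https://arxiv.org/abs/2508.19205
Authors: Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: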

Abstract:This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.

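[NLP-4] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

[Quick read]: This paper addresses two core challenges for large language models (LLMs) in scientific problem solving: the lack of a unified, holistic benchmark for evaluating scientific reasoning, and the fact that existing approaches do not systematically disentangle the distinct roles of knowledge and reasoning in scientific tasks. The authors introduce SciReas, a benchmark suite covering diverse scientific reasoning tasks, together with SciReas-Pro, a subset that requires more complex reasoning, and design KRUX, a probing framework for analyzing how knowledge and reasoning each affect performance. The key findings are: first, holistic evaluation surfaces performance characteristics that stay hidden when relying on any single benchmark; second, retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning, and external knowledge supplied in context consistently helps reasoning models; third, enhancing verbalized reasoning improves the model's ability to surface task-relevant knowledge. These findings point to actionable ways to improve scientific reasoning.

Link: https://arxiv.org/abs/2508.19202
Authors: Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan
Affiliations: Yale University; Harvard University; Northwestern University; Allen Institute of AI
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 16 figures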

Abstract:Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs’ ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.

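[NLP-5] The Ramon Llull's Thinking Machine for Automated Ideation

[Quick read]: This paper addresses the inefficiency of research idea generation and its lack of a systematic framework. It takes Ramon Llull's medieval Ars combinatoria as the conceptual foundation for a modern "Llull's thinking machine" that helps researchers explore research ideas. The key to the solution is to define three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). Elements are mined from human experts or conference papers and combined in a structured way, and prompting LLMs with these combinations produces research ideas that are diverse, relevant, and grounded in current literature. The result is a lightweight, interpretable tool for augmenting scientific creativity and supporting collaborative ideation between humans and AI.

Link: https://arxiv.org/abs/2508.19200
Authors: Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu
Affiliations: CMU (Carnegie Mellon University); OSU (The Ohio State University); Stanford (Stanford University); UIUC (University of Illinois Urbana-Champaign); UT Dallas (University of Texas at Dallas); UW (University of Washington)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 3 figures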

Abstract:This paper revisits Ramon Llull’s Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull’s thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
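
The three-axis recombination described in this abstract can be sketched as a Cartesian product over the axes. This is only an illustration of the combinatorial idea: the axis values are the examples quoted in the abstract, while the prompt template and the `build_prompts` helper are hypothetical and not taken from the paper.

```python
from itertools import product

# Example axis values quoted in the abstract.
themes = ["efficiency", "adaptivity"]
domains = ["question answering", "machine translation"]
methods = ["adversarial training", "linear attention"]

# Hypothetical prompt template; the paper's actual prompts are not given in the abstract.
TEMPLATE = "Propose a research idea that improves {theme} in {domain} using {method}."


def build_prompts(themes, domains, methods):
    """Return one prompt per (Theme, Domain, Method) combination."""
    return [
        TEMPLATE.format(theme=t, domain=d, method=m)
        for t, d, m in product(themes, domains, methods)
    ]


prompts = build_prompts(themes, domains, methods)
for prompt in prompts:
    print(prompt)
```

With two values per axis this yields eight prompts. The abstract mentions curated combinations, so in practice a filtering step would likely sit between the product and the LLM call.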

[NLP-6] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs EMNLP2025

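[Quick read]: This paper addresses hallucination in large vision-language models (LVLMs) on visual question answering (VQA), focusing on the model's perception of its knowledge boundaries, that is, accurately recognizing what it knows and what it does not. The key to the solution is a systematic evaluation of three types of confidence signals (probabilistic confidence, answer consistency-based confidence, and verbalized confidence), which finds that probabilistic and consistency-based signals are more reliable while verbalized confidence often leads to overconfidence. Building on this, the authors adapt confidence calibration methods from large language models (LLMs) and propose three effective methods. Comparing LVLMs with their LLM counterparts, they find that jointly processing visual and textual inputs lowers question-answering performance but also lowers confidence, which yields a better perception level than LLMs.

Link: https://arxiv.org/abs/2508.19111
Authors: Zhikai Ding, Shiyu Ni, Keping Bi
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: EMNLP2025 Findings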

Abstract:Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries, knowing what it knows and what it does not. This paper investigates LVLMs’ perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs’ perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
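
Two of the confidence signals named in this abstract can be sketched as follows. These are common formulations assumed here for illustration, since the abstract does not give the paper's exact definitions: consistency-based confidence is taken as the share of sampled answers that agree with the majority answer, and probabilistic confidence as the geometric mean of the answer's token probabilities. Both function names are hypothetical.

```python
import math
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return (majority answer, fraction of samples agreeing with it).

    Assumes answers can be compared after simple normalization;
    real VQA answers may need fuzzier matching.
    """
    normalized = [a.strip().lower() for a in sampled_answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def probabilistic_confidence(token_logprobs):
    """Geometric mean of token probabilities, computed from log-probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Four answers sampled for the same question (made-up values).
answer, agreement = consistency_confidence(["Paris", "paris ", "London", "Paris"])

# Log-probabilities of the tokens in one generated answer (made-up values).
sequence_confidence = probabilistic_confidence([math.log(0.9), math.log(0.8)])
```

Verbalized confidence, the third signal, would instead be parsed from the model's own text, for example a stated percentage.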

[NLP-7] Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic

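[Quick read]: This paper addresses the risks that reliance on black-box software such as MAXQDA and NVivo poses to Quantitative Discourse Analysis (QDA): insufficient methodological transparency and poor alignment with research goals. The key to the solution is a hybrid, interpretable framework that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Specifically, the author builds custom Python pipelines that use NLTK, spaCy, and Sentence Transformers for fine-grained preprocessing and embedding generation, and applies an iterative BERTopic modelling process (UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction) optimised through parameter tuning and multiple runs to improve topic coherence and coverage. This keeps the workflow transparent at code level while strengthening researcher control and methodological rigor.

Link: https://arxiv.org/abs/2508.19099
Authors: Thomas Compton
Affiliations: University of York
Categories: Computation and Language (cs.CL)
Comments: 5 pages conference paper, 4 tables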

Abstract:Quantitative Discourse Analysis has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing from a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic coherence and coverage. By juxtaposing precise lexical searches with context-aware semantic clustering, we argue for a multi-layered approach that mitigates the limitations of either method in isolation. Our workflow underscores the importance of code-level transparency, researcher agency, and methodological triangulation in computational discourse studies. Code and supplementary materials are available via GitHub.
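
The c-TF-IDF keyword extraction step mentioned in this abstract can be sketched in plain Python. The weighting below follows the class-based TF-IDF form as I understand it from BERTopic's documentation (term frequency within a topic, scaled by log(1 + A / f_t), where A is the average word count per topic and f_t is the term's frequency across all topics); treat that form as an assumption. The sketch takes cluster assignments as given and does not reproduce the embedding, UMAP, or HDBSCAN stages, and the tiny corpus is made up.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per topic.

    clusters: dict mapping topic id -> list of tokenized documents.
    Returns dict mapping topic id -> {term: score}.
    """
    # Treat each topic as one merged document and count its terms.
    class_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in clusters.items()
    }
    # Frequency of each term across all topics.
    total_per_term = Counter()
    for counts in class_counts.values():
        total_per_term.update(counts)
    # Average number of words per topic.
    avg_words = sum(total_per_term.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (tf / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, tf in counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per topic (ties broken alphabetically)."""
    return {
        topic: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for topic, s in scores.items()
    }


# Made-up, already-clustered toy corpus.
clusters = {
    0: [["tax", "budget", "tax"], ["budget", "the", "tax"]],
    1: [["war", "treaty", "the"], ["treaty", "war", "war"]],
}
keywords = top_terms(c_tf_idf(clusters), k=2)
```

Terms that appear in every topic, such as "the" in this toy corpus, receive lower scores than topic-specific terms, which is what makes the extracted keywords usable as topic labels.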

[NLP-8] Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index

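[Quick read]: This paper addresses inefficient retrieval in art provenance research caused by fragmented, multilingual archival data; traditional search portals depend on precise metadata, which limits exploratory queries. The key to the solution is a Retrieval-Augmented Generation (RAG) framework that supports natural-language and multilingual queries through semantic retrieval and contextual summarization, reducing dependence on structured metadata and making complex archives such as artwork auction records more accessible and usable.

Link: https://arxiv.org/abs/2508.19093
Authors: Mathew Henrickson
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: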

Abstract:This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG’s capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.

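[NLP-9] It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs EMNLP2025

[Quick read]: This paper addresses the poor support in large language models (LLMs) for extremely low-resource languages, especially those written in rare scripts; the core challenges are scarce training data and the models' limited representation of these languages and scripts. The key to the solution is a systematic comparison of three approaches, zero-shot in-context learning (ICL), few-shot ICL, and parameter-efficient fine-tuning (PEFT), with language alignment signals as an auxiliary input. The study finds that PEFT is limited when both the language and its script are poorly covered by the model, whereas zero-shot ICL with language alignment performs well on extremely low-resource languages. This suggests that, for these languages, prompting on top of pretrained knowledge is more effective than fine-tuning, and it gives empirical guidance for adapting LLMs to low-resource languages.

Link: https://arxiv.org/abs/2508.19089
Authors: Yue Li, Zhixue Zhao, Carolina Scarton
Affiliations: University of Sheffield
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025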

Abstract:Extremely low-resource languages, especially those written in rare scripts, as shown in Figure 1, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both language and its script are extremely under-represented by the LLM. In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.

[NLP-10] “Where does it hurt?” - Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues ECAI2025

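[Quick read]: This paper addresses the identification and modeling of physician intent trajectories in doctor-patient dialogues, with the aim of improving medical dialogue understanding and diagnosis-support systems. The core challenge is to capture how a physician's intent moves across the SOAP structure (Subjective, Objective, Assessment, and Plan) over the course of a consultation. The key to the solution is threefold. First, the authors build a fine-grained taxonomy of physician intents on top of the SOAP framework and recruit a large number of medical experts through Prolific to annotate more than 5000 dialogue turns, producing a large labeled dataset. Second, they use this dataset to benchmark generative and encoder models, finding that current models understand the general structure of medical dialogues well but often fail to identify transitions between SOAP categories. Third, they show that intent filtering substantially improves medical dialogue summarization, and the common trajectories they report offer insights for designing differential diagnosis systems.

Link: https://arxiv.org/abs/2508.19077
Authors: Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser
Affiliations: Berlin University of Applied Sciences; Indian Institute of Technology Kharagpur; Charité – Universitätsmedizin Berlin Rheumatologie; L3S Research Center, Hannover
Categories: Computation and Language (cs.CL)
Comments: Accepted at ECAI 2025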

Abstract:In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the 'Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing 'differential diagnosis' systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at this https URL.

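[NLP-11] HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance

[Quick read]: This paper addresses the decision failures and poor environmental adaptability of large language model (LLM) agents in complex, long-horizon planning tasks, which stem from a lack of macroscopic guidance and of continuous oversight during execution. The key to the solution is HiPlan, a hierarchical planning framework with two levels of guidance: milestone action guides and step-wise hints. In the offline phase, a milestone library is built from expert demonstrations so that experience can be reused by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted into step-wise hints that fit the current observations, aligning the agent with the global objective and correcting local deviations, which improves the robustness and planning ability of LLM agents on complex tasks.

Link: https://arxiv.org/abs/2508.19076
Authors: Ziyue Li, Yuan Chang, Gaihong Yu, Xiaoqiu Le
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: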

Abstract:Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents’ decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.

[NLP-12] MovieCORE: COgnitive REasoning in Movies EMNLP’2025

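[Quick read]: This paper addresses the fact that current video question answering (VQA) models rely on surface-level information when understanding movie content, while their deeper cognitive reasoning goes unevaluated. Existing datasets focus on surface-level factual questions and cannot test whether a model has human-like System-2 thinking, which requires reasoning, abstraction, and critical thought. To address this, the authors propose the MovieCORE dataset. Its core innovation is an agentic brainstorming approach in which multiple large language models (LLMs) act as thought agents to generate and refine question-answer pairs that are cognitively deep, thought-provoking, and syntactically complex. They also design an evaluation scheme that includes cognitive tests, and introduce the Agentic Choice Enhancement (ACE) module, which is applied post-training and improves the reasoning of video-language models (VLMs) by up to 25%. The work pushes movie understanding toward higher-level cognition and offers a new way to evaluate the deep semantic understanding of AI systems.

Link: https://arxiv.org/abs/2508.19026
Authors: Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu
Affiliations: National Taiwan University; NVIDIA; National Tsing Hua University; National Chengchi University
Categories: Computation and Language (cs.CL)
Comments: Accepted for EMNLP’2025 Main Conference. Project Page: this https URL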

Abstract:This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at this https URL.

[NLP-13] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark

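[Quick read]: This paper addresses the challenge, on the path toward artificial general intelligence (AGI), of moving from systems optimized for static tasks to open-ended agents that learn continuously and evolve on their own. The core problem is how to build agents with long-term adaptability that accumulate knowledge, abstract skills, and internalize behavior through real interaction in dynamic environments, supporting continuous growth rather than one-off task completion. The key to the solution is the Experience-driven Lifelong Learning (ELL) framework, built on four core principles: Experience Exploration, Long-term Memory, Skill Learning, and Knowledge Internalization. Together they form a closed loop in which agents generate rich experiential trajectories, store historical knowledge in structured form, abstract transferable skills, and turn explicit experience into implicit, intuitive capability. The authors also build the StuLife benchmark dataset, which simulates a student's college journey from enrollment to academic and personal development, as a standardized platform for evaluating lifelong learning, and they explore the role of context engineering in advancing AGI.

Link: https://arxiv.org/abs/2508.19005
Authors: Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: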

Abstract:As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI. 

[NLP-14] Automatic Prompt Optimization with Prompt Distillation

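[Quick read]: This paper addresses the efficiency and effectiveness of automatic prompt selection (autoprompting), that is, how to explore a large prompt space and produce task-optimized prompts without gradient information. The key to the solution is DistillPrompt, a new method that integrates task-specific information into prompts in multiple stages and uses distillation, compression, and aggregation operations to explore and optimize the prompt space systematically, improving performance on text classification and generation tasks.

Link: https://arxiv.org/abs/2508.18992
Authors: Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
Affiliations: ITMO University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: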

Abstract:Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt – a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.

[NLP-15] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

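[Quick read]: This paper addresses the limited interpretability and symbolic capability of neural models during reasoning, that is, how to obtain transparent, verifiable reasoning paths while preserving model performance. The core of the solution is the AI Mother Tongue framework, which embeds a symbolic language mechanism so that the model supports intuitive reasoning, compositional symbol chains, and inherent interpretability: symbols capture semantic patterns, symbol chains trace decision paths, and gated induction mechanisms guide selective focus. The key innovation is to embed reasoning directly in the model's representations rather than rely on post-hoc explanation methods. Complementary training objectives improve symbol purity and decision sparsity, and a sequential specialization strategy first builds symbolic competence and then refines intuitive judgment, yielding competitive accuracy together with verifiable reasoning traces on AI tasks.

Link: https://arxiv.org/abs/2508.18988
Authors: Hung Ming Liu
Affiliations: PARRAWA AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures. The AI Intuition Explorer dashboard is available at: this https URL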

Abstract:We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model’s representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

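[NLP-16] The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization CCS2025

[Quick read]: This paper addresses the contextual vulnerability of word-level Differential Privacy (DP) text sanitization methods: randomization can leave contextual clues from the original text that an adversary can exploit. The key to the solution is the dual role of large language models (LLMs) in data reconstruction, as an attack and as a post-processing step. On one hand, LLMs can exploit contextual vulnerability to recover the original semantics, which exposes a privacy risk. On the other hand, the authors propose using LLM reconstruction as a post-processing step that, by thinking adversarially, strengthens the privacy protection of DP-sanitized text and improves its quality, balancing privacy and utility.

Link: https://arxiv.org/abs/2508.18976
Authors: Stephen Meisenbacher, Alexandra Klymenko, Andreea-Elena Bodea, Florian Matthes
Affiliations: Technical University of Munich
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures, 8 tables. Accepted to WPES @ CCS 2025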

Abstract: Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization – this we refer to as contextual vulnerability. Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.

[NLP-17] Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework ECAI2025

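[Quick read]: This paper addresses two core problems in the evaluation of Retrieval-Augmented Generation (RAG) systems: evaluation datasets receive too little design attention to support reliable assessment, and existing methods largely overlook the real-world need to protect sensitive information, so evaluation can introduce privacy-leak risks. The key to the solution is a multi-agent framework for generating synthetic question-answering (QA) datasets, built from three components: (1) a Diversity agent that uses clustering techniques to maximize topical coverage and semantic variability; (2) a Privacy agent that detects and masks sensitive information across multiple domains; and (3) a QA curation agent that combines these outputs into private and diverse QA pairs suitable as ground truth for RAG evaluation. Experiments show that the resulting evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets, offering a practical path toward RAG evaluation aligned with ethical and regulatory requirements.

Link: https://arxiv.org/abs/2508.18929
Authors: Ilias Driouich, Hongliu Cao, Eoin Thomas
Affiliations: AMADEUS France
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ECAI 2025 TRUST AI workshop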

Abstract: Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depend on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy agent that detects and masks sensitive information across multiple domains and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.

[NLP-18] Affective Polarization across European Parliaments

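[Quick read]: This paper examines whether affective polarization is present in European parliaments and which mechanisms contribute to it. Affective polarization refers to increased negativity and hostility towards opposing groups. Using a fully automated approach on a corpus of speeches from the parliaments of six European countries, the authors apply natural language processing techniques to estimate parliamentarian sentiment and compare the negativity expressed in references to members of opposing groups with that expressed in references to one's own group. The key findings are threefold: the in-group versus out-group comparison reveals consistent affective polarization across all six parliaments; although activity correlates with negativity, less active and more active members show no difference in affective polarization; and reciprocity is a contributing mechanism in affective polarization between parliamentarians.

Link: https://arxiv.org/abs/2508.18916
Authors: Bojan Evkoski, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak
Affiliations: Central European University; Jožef Stefan Institute; University of Ljubljana; Institute of Contemporary History
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: 6 pages, 4 figures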

Abstract:Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one’s own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
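
The in-group versus out-group comparison described in the abstract can be sketched as a difference of mean negativity scores. This is a minimal illustration and not the authors' pipeline: the `Mention` record, the negativity scores in [0, 1], and the party labels are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Mention:
    speaker_party: str   # party of the speaking parliamentarian
    target_party: str    # party of the person being referred to
    negativity: float    # hypothetical sentiment score in [0, 1]


def affective_polarization(mentions):
    """Mean out-group negativity minus mean in-group negativity."""
    in_group = [m.negativity for m in mentions if m.speaker_party == m.target_party]
    out_group = [m.negativity for m in mentions if m.speaker_party != m.target_party]
    if not in_group or not out_group:
        raise ValueError("need at least one in-group and one out-group mention")
    return mean(out_group) - mean(in_group)


# Toy data: two parties, four references between their members.
mentions = [
    Mention("A", "A", 0.25),
    Mention("A", "B", 0.75),
    Mention("B", "A", 0.5),
    Mention("B", "B", 0.0),
]
score = affective_polarization(mentions)
print(score)
```

A positive score means references to the opposing group are, on average, more negative than references to one's own group.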

[NLP-19] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis

[Quick read]: This paper addresses the limited scale of much computing education research (CER): many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. The key to the solution is a variation of LLM-assisted content analysis (LACA), which combines traditional content analysis with large language models (LLMs) to enable systematic, reproducible, and rigorous analysis of large volumes of textual data without increasing the burden on the researcher. Using a computing education dataset, the paper illustrates how LACA could be applied and argues that it can support more generalisable findings and improve both practice and research quality in CER.

Link: https://arxiv.org/abs/2508.18872
Authors: Laurie Gale, Sebastian Mateos Nicolajsen
Affiliations: Raspberry Pi Computing Education Research Centre; University of Cambridge; Center for Computing Education Research; IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures

Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline’s teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.

[NLP-20] ReflectivePrompt: Reflective evolution in autoprompting algorithms

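[Quick read]: This paper addresses automated prompt optimization (autoprompting), that is, how to automatically search for and select the prompts that perform best for a given task on a language model. The proposed solution is ReflectivePrompt, an autoprompting method based on evolutionary algorithms that uses a reflective evolution approach for a more precise and comprehensive search of optimal prompts. It applies short-term and long-term reflection operations before crossover and elitist mutation to improve the quality of the modifications they introduce, and it accumulates knowledge throughout the evolution process, updating it at each epoch based on the current population. Tested on 33 classification and text generation datasets, the method reports significant average improvements over current state-of-the-art approaches, for example 28% on BBH compared to EvoPrompt.

Link: https://arxiv.org/abs/2508.18870
Authors: Viktor N. Zhuravlev, Artur R. Khairullin, Ernest A. Dyagin, Alena N. Sitkina, Nikita I. Kulin
Affiliations: ITMO University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: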

Abstract:Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt - a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.

[NLP-21] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

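[Quick read]: This paper addresses overconfidence in Large Language Models (LLMs) deployed in high-stakes domains, where models often give incorrect answers with high confidence. Existing calibration approaches rely on prompt engineering or on fine-tuning with heuristically generated uncertainty estimates, and both have limited effectiveness and generalizability. The key to the solution is ConfTuner, a lightweight fine-tuning method built on a new loss function, the tokenized Brier score, which is theoretically proven to be a proper scoring rule and therefore incentivizes the model to report its true probability of being correct. The method needs no ground-truth confidence scores or proxy confidence estimates. It improves calibration across diverse reasoning tasks, generalizes to black-box models such as GPT-4o, and enables downstream gains in self-correction and model cascade.

Link: https://arxiv.org/abs/2508.18847
Authors: Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: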

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as “overconfidence”. Recent efforts have focused on calibrating LLMs’ verbalized confidence: i.e., their expressions of confidence in text form, such as “I am 80% confident that…”. Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it “correctly incentivizes the model to report its true probability of being correct”. ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at this https URL.
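
To make the notion of a proper scoring rule concrete, the sketch below uses the classical Brier score, not the paper's tokenized variant, whose exact definition is not given in the abstract. It checks numerically that the expected score is lowest when the reported confidence equals the true probability of being correct. The value `p_true = 0.7` and the confidence grid are hypothetical.

```python
def brier(confidence: float, correct: bool) -> float:
    """Classical Brier score for one prediction: squared error against the 0/1 outcome."""
    return (confidence - (1.0 if correct else 0.0)) ** 2


def expected_brier(confidence: float, p_correct: float) -> float:
    """Expected Brier score when the answer is truly correct with probability p_correct."""
    return p_correct * brier(confidence, True) + (1 - p_correct) * brier(confidence, False)


p_true = 0.7  # hypothetical true probability that the model's answer is correct
grid = [i / 100 for i in range(101)]  # candidate reported confidences 0.00 .. 1.00
best = min(grid, key=lambda c: expected_brier(c, p_true))
print(best)  # the confidence on the grid with the lowest expected score
```

Because the expected score p(c - 1)^2 + (1 - p)c^2 is a quadratic in c with its minimum at c = p, overstating confidence (for example reporting 0.9 when the true probability is 0.7) is penalized in expectation.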

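[NLP-22] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

[Quick read]: This paper addresses the scarcity of high-quality training data for improving the mathematical reasoning of Large Language Models (LLMs), in particular the scalability, cost, and data-reliability limits of conventional data construction methods. The key to the solution is a program-assisted synthesis framework that integrates mathematical knowledge systems and domain-specific tools to create executable programs and then translates them into natural language problem-solution pairs. A bilateral validation mechanism checks data quality from two sides: solution correctness against program outputs, and consistency between problem and program. The framework produced 12.3 million problem-solving triples, and models fine-tuned on this data achieve state-of-the-art performance on several benchmark datasets.

Link: https://arxiv.org/abs/2508.18824
Authors: Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Affiliations: Zhejiang University; Ant Group
Categories: Computation and Language (cs.CL)
Comments: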

Abstract:Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.
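
The two-sided check described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's mechanism: the `program` function, the example problem, and the number-matching consistency heuristic are all hypothetical.

```python
import re


def program(a: int, b: int) -> int:
    """Hypothetical executable program behind a synthesized problem."""
    return a * b + a


def validate(problem: str, solution_answer: int, params: dict) -> bool:
    # Check 1: the stated solution must match the program's output.
    solution_ok = program(**params) == solution_answer
    # Check 2 (toy consistency heuristic): every program parameter value
    # must appear as a number in the problem text.
    numbers = {int(n) for n in re.findall(r"\d+", problem)}
    consistent = set(params.values()) <= numbers
    return solution_ok and consistent


problem = "A shelf holds 4 boxes with 6 pens each, plus 4 loose pens. How many pens are there?"
params = {"a": 4, "b": 6}
print(validate(problem, 28, params))  # solution agrees with the program
print(validate(problem, 24, params))  # solution disagrees with the program
```

A sample is kept only when both checks pass, so an incorrect solution or a problem text that drifted away from its program is filtered out.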

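[NLP-23] LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection

[Quick read]: This paper addresses the societal challenge of misinformation in the digital age, in particular the difficulty existing methods have in capturing long-range dependencies, complex semantic relations, and the social dynamics of news dissemination, together with their dependence on large labelled datasets. The key to the solution is a self-supervised misinformation detection framework that integrates complex semantic relations based on Abstract Meaning Representation (AMR) with news propagation dynamics. It introduces an LLM-based graph contrastive loss (LGCL) that uses negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner, and it employs a multi-view graph masked autoencoder to learn propagation features from the social context graph. The combination improves fake news detection performance and generalizability even with limited labelled data.

Link: https://arxiv.org/abs/2508.18819
Authors: Shubham Gupta, Shraban Kumar Chatterjee, Suman Kundu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: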

Abstract:The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. Furthermore, these methods require extensive labelled datasets, making their deployment resource-intensive. In this study, we propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations using Abstract Meaning Representation (AMR) and news propagation dynamics. We introduce an LLM-based graph contrastive loss (LGCL) that utilizes negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner. To incorporate social context, we employ a multi view graph masked autoencoder, which learns news propagation features from social context graph. By combining these semantic and propagation-based features, our approach effectively differentiates between fake and real news in a self-supervised manner. Extensive experiments demonstrate that our self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labelled datasets while improving generalizability.

[NLP-24] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

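[Quick read]: This paper addresses the challenges of machine translation (MT) for structured LaTeX documents, namely how to produce high-quality translations while preserving the format and semantic integrity of mathematical equations, tables, figures, and cross-references. The key to the solution is LaTeXTrans, a collaborative multi-agent system in which six specialized agents cooperate: a Parser decomposes the LaTeX document into translation-friendly units via placeholder substitution and syntax filtering; a Translator, Validator, Summarizer, and Terminology Extractor together ensure context-aware, self-correcting, and terminology-consistent translation; and a Generator reconstructs the translated content into a well-structured LaTeX document. Experiments indicate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity.

Link: https://arxiv.org/abs/2508.18791
Authors: Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu
Affiliations: Northeastern University; NiuTrans Research
Categories: Computation and Language (cs.CL)
Comments: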

Abstract:Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.
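
The placeholder substitution that the Parser is described as performing can be illustrated with a small sketch. It is an assumption-laden toy, not the LaTeXTrans implementation: the `<PHn>` placeholder format, the regular expression (inline math plus `\ref`, `\cite`, `\label` only), and the `fake_translate` stand-in for an MT system are all hypothetical.

```python
import re

# Inline math ($...$) and a few reference commands; a real parser would cover far more.
PROTECTED = re.compile(r"\$[^$]+\$|\\(?:ref|cite|label)\{[^}]*\}")


def protect(text):
    """Replace protected LaTeX fragments with numbered placeholders."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"<PH{len(saved) - 1}>"

    return PROTECTED.sub(stash, text), saved


def restore(text, saved):
    """Put the original LaTeX fragments back in place of the placeholders."""
    return re.sub(r"<PH(\d+)>", lambda m: saved[int(m.group(1))], text)


def fake_translate(text):
    # Stand-in for a real MT system: a two-phrase English-to-German lookup.
    return text.replace("The energy is", "Die Energie ist").replace("see", "siehe")


source = r"The energy is $E = mc^2$, see \ref{eq:energy}."
masked, saved = protect(source)
restored = restore(fake_translate(masked), saved)
print(masked)
print(restored)
```

Because the translator only ever sees `<PH0>` and `<PH1>`, the equation and the cross-reference come back byte-for-byte unchanged, which is what keeps the output compilable.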

[NLP-25] Controllable Conversational Theme Detection Track at DSTC 12 SIGDIAL2025

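[Quick read]: This paper addresses the automation of theme detection in conversational analytics, that is, automatically identifying and categorizing topics in large volumes of conversations to reduce manual analysis effort, particularly in domains such as customer support or sales. The key to the solution is the Controllable Conversational Theme Detection task, which frames theme detection as joint clustering and theme labeling of dialog utterances and makes the granularity of the resulting theme clusters controllable through provided user preference data. The task ran as a public competition track at DSTC 12 with both automatic and human evaluation metrics, moving theme detection beyond fixed-intent recognition toward user-facing summaries of a conversation's core inquiry.

Link: https://arxiv.org/abs/2508.18783
Authors: Igor Shalyminov, Hang Su, Jake Vincent, Siffi Singh, Jason Cai, James Gung, Raphael Shu, Saab Mansour
Affiliations: Amazon
Categories: Computation and Language (cs.CL)
Comments: DSTC12@SigDial2025; data and code available at this https URL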

Abstract:Conversational analytics has been on the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has taken the problems that can be automated to a new level of complexity and scale. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation’s core inquiry. This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose Controllable Conversational Theme Detection problem as a public competition track at Dialog System Technology Challenge (DSTC) 12 – it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters’ granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams’ submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.

[NLP-26] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

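[Quick read]: This paper addresses the way reliance on supervised fine-tuning limits the reasoning ability of large language models (LLMs) in Grammatical Error Correction (GEC). Existing methods typically train an LLM by supervised learning to generate the corrected sentence directly, which does not make full use of the model's reasoning ability. The paper proposes a framework based on Rule-Based Reinforcement Learning (Rule-Based RL), in which rules serve as reward signals that steer the LLM toward more controllable and reliable corrections. On Chinese datasets, the framework achieves state-of-the-art performance with a notable increase in recall.

Link: https://arxiv.org/abs/2508.18780
Authors: Yilin Li, Xunjian Yin, Yilin Chen, Xiaojun Wan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code will be released upon publication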

Abstract: Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model’s powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.

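[NLP-27] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

[Quick read]: This paper addresses the difficulty of controlling the computational effort of Large Language Models (LLMs) in practical deployment, specifically how to achieve the kind of controllable reasoning through discrete operational modes offered by OpenAI's gpt-oss series. The key to the solution is ThinkDial, described as the first open-recipe end-to-end framework of this kind. It builds controllable reasoning into the training pipeline through budget-mode supervised fine-tuning and two-phase budget-aware reinforcement learning with adaptive reward shaping. The resulting system switches among three reasoning modes: High (full reasoning capability), Medium (50% token reduction with 10% performance degradation), and Low (75% token reduction with 15% performance degradation).

Link: https://arxiv.org/abs/2508.18773
Authors: Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, Jiangjie Chen
Affiliations: ByteDance
Categories: Computation and Language (cs.CL)
Comments: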

Abstract:Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI’s gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with 10 percent performance degradation), and Low mode (75 percent token reduction with 15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
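
The three modes amount to simple arithmetic on a baseline, which the sketch below works through. It assumes the percentages are targets relative to High mode and that the performance degradation is relative rather than absolute; the abstract does not say which. The baseline of 8000 tokens and 0.80 accuracy is hypothetical.

```python
# Mode targets as stated in the abstract, expressed as fractions of the High-mode baseline.
MODES = {
    "high": {"token_reduction": 0.00, "perf_degradation": 0.00},
    "medium": {"token_reduction": 0.50, "perf_degradation": 0.10},
    "low": {"token_reduction": 0.75, "perf_degradation": 0.15},
}


def project(mode: str, base_tokens: int, base_accuracy: float):
    """Projected token count and accuracy for a mode, assuming relative degradation."""
    cfg = MODES[mode]
    tokens = round(base_tokens * (1 - cfg["token_reduction"]))
    accuracy = base_accuracy * (1 - cfg["perf_degradation"])
    return tokens, accuracy


# Hypothetical High-mode baseline: 8000 reasoning tokens at 0.80 accuracy.
for mode in MODES:
    tokens, accuracy = project(mode, 8000, 0.80)
    print(f"{mode}: {tokens} tokens, accuracy {accuracy:.2f}")
```

Under these assumptions Medium mode would spend 4000 tokens for about 0.72 accuracy and Low mode 2000 tokens for about 0.68.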

[NLP-28] Beyond the Textual: Generating Coherent Visual Options for MCQs EMNLP2025

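[Quick read]: This paper addresses two core problems in generating multiple-choice questions (MCQs) for education: existing methods focus on textual options and overlook visual options, and high-quality distractors are costly to author and hard to scale. The key to the solution is Cross-modal Options Synthesis (CmOS), a framework that combines a Multimodal Chain-of-Thought (MCoT) reasoning process with Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors, and that includes a discrimination module to identify content suitable for visual options. Experiments show CmOS outperforming existing methods in content discrimination, question generation, and visual option generation.

Link: https://arxiv.org/abs/2508.18772
Authors: Wanqiang Wang, Longzhu He, Wei Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2025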

Abstract:Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options, but it largely overlooks the visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose a Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answer and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation and visual option generation over existing methods across various subjects and educational levels.

[NLP-29] Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

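[Quick read]: This paper addresses the failure of Large Reasoning Models (LRMs) to abstain when faced with unanswerable questions, such as math problems lacking sufficient conditions, which undermines trustworthy AI. The study shows that LRMs have sufficient cognitive capability to recognize the flaws in such questions, but their external responses are misaligned with this internal cognition, so they produce answers instead of abstaining. The key to the solution is a lightweight two-stage method that combines cognitive monitoring, which identifies that a question is unanswerable, with inference-time intervention, which prompts the model to abstain. Experiments show that the method significantly improves the abstention rate while maintaining overall reasoning performance.

Link: https://arxiv.org/abs/2508.18760
Authors: Yi Liu, Xiangyu Liu, Zequn Sun, Wei Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: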

Abstract:Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.

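[NLP-30] Text to Query Plans for Question Answering on Large Tables

[Quick read]: This paper addresses the efficiency and flexibility of natural language querying and complex analysis over large tabular datasets, in particular the limits that traditional Text-to-SQL methods inherit from SQL: inefficiency on large datasets and limited support for advanced analyses such as principal component analysis and anomaly detection. The key to the solution is an LLM-based framework that transforms natural language queries into executable query plans and is implemented outside traditional databases, which avoids SQL's inherent limitations. The LLM interprets the query iteratively and builds the operation sequence step by step, and operations are executed directly on the data, so the model's context length is not a constraint and complex analytical functions can be supported.

Link: https://arxiv.org/abs/2508.18758
Authors: Yipeng Zhang, Chen Wang, Yuzhe Zhang, Jacky Jiang
Affiliations: CSIRO Data61
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: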

Abstract:Efficient querying and analysis of large tabular datasets remain significant challenges, especially for users without expertise in programming languages like SQL. Text-to-SQL approaches have shown promising performance on benchmark data; however, they inherit SQL’s drawbacks, including inefficiency with large datasets and limited support for complex data analyses beyond basic querying. We propose a novel framework that transforms natural language queries into query plans. Our solution is implemented outside traditional databases, allowing us to support classical SQL commands while avoiding SQL’s inherent limitations. Additionally, we enable complex analytical functions, such as principal component analysis and anomaly detection, providing greater flexibility and extensibility than traditional SQL capabilities. We leverage LLMs to iteratively interpret queries and construct operation sequences, addressing computational complexity by incrementally building solutions. By executing operations directly on the data, we overcome context length limitations without requiring the entire dataset to be processed by the model. We validate our framework through experiments on both standard databases and large scientific tables, demonstrating its effectiveness in handling extensive datasets and performing sophisticated data analyses.
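
The idea of a query plan as an operation sequence executed directly on the data can be sketched as follows. This is a minimal illustration, not the paper's system: the operation names, the plan format, the toy sensor table, and the hand-written plan (standing in for what an LLM might emit) are all hypothetical.

```python
# Each operation takes a list of row dicts plus arguments and returns a new list of row dicts.
OPERATIONS = {
    "filter_at_least": lambda rows, column, value: [r for r in rows if r[column] >= value],
    "sort_desc": lambda rows, column: sorted(rows, key=lambda r: r[column], reverse=True),
    "limit": lambda rows, n: rows[:n],
}


def execute(plan, rows):
    """Run each plan step in order on the data; the plan itself never contains the rows."""
    for step in plan:
        rows = OPERATIONS[step["op"]](rows, **step["args"])
    return rows


table = [
    {"sensor": "s1", "temperature": 18.5},
    {"sensor": "s2", "temperature": 25.0},
    {"sensor": "s3", "temperature": 31.2},
    {"sensor": "s4", "temperature": 22.4},
]

# Hypothetical plan for: "Which two sensors reading at least 20 degrees are the hottest?"
plan = [
    {"op": "filter_at_least", "args": {"column": "temperature", "value": 20}},
    {"op": "sort_desc", "args": {"column": "temperature"}},
    {"op": "limit", "args": {"n": 2}},
]

result = [row["sensor"] for row in execute(plan, table)]
print(result)
```

Because only the short plan passes through the language model, the table can be arbitrarily large, and new analytical operations can be added to the registry without touching SQL.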

[NLP-31] Chronological Passage Assembling in RAG framework for Temporal Question Answering

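[Quick read]: This paper addresses the difficulty of reconstructing an accurate timeline of events in long-context narrative question answering under a limited context window, and in particular the failure of existing Retrieval-Augmented Generation (RAG) methods to preserve the sequential relationships between passages and the coherence of the broader context when handling narrative text. The key to the solution is ChronoRAG, a framework built on two mechanisms: refining dispersed document information into structured, coherent passages, and explicitly capturing and maintaining the temporal order among retrieved passages to preserve narrative flow. Experiments on the NarrativeQA dataset show substantial improvements on tasks that require both factual identification and comprehension of complex sequential relationships.

Link: https://arxiv.org/abs/2508.18748
Authors: Byeongjeong Kim, Jeonghyun Park, Joonho Yang, Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 3 figures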

Abstract:Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
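
The second mechanism, keeping retrieved passages in temporal order, can be illustrated with a short sketch. It covers only the ordering step and does not reproduce ChronoRAG's passage refinement: the `Passage` record, the use of source position as a proxy for story time, and the relevance scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    position: int      # index of the passage in the source narrative
    text: str
    relevance: float   # hypothetical retriever score


def assemble_context(passages, top_k):
    """Pick the top_k most relevant passages, then present them in story order."""
    selected = sorted(passages, key=lambda p: p.relevance, reverse=True)[:top_k]
    return sorted(selected, key=lambda p: p.position)


retrieved = [
    Passage(7, "She returns home.", 0.9),
    Passage(2, "She leaves the village.", 0.8),
    Passage(5, "She crosses the sea.", 0.4),
    Passage(4, "She meets a sailor.", 0.7),
]
context = assemble_context(retrieved, top_k=3)
print([p.position for p in context])
```

A plain relevance ranking would put the homecoming first; re-sorting by position gives the reader model the departure, the meeting, and the return in the order they happen.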

[NLP-32] CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks EMNLP2025

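[Quick read]: This paper addresses the tension between efficiency and performance caused by long Chain-of-Thought (CoT) prompting in Large Language Models (LLMs): long CoT improves accuracy on complex reasoning (System-2) tasks, but very long reasoning traces slow down, and can even degrade, performance on fast, intuitive (System-1) tasks. The key to the solution is Connector-Aware Compact CoT (CAC-CoT), which restricts reasoning to a small, fixed set of connector phrases and thereby steers the model toward compact, well-structured reasoning. Average reasoning trace length (ART) drops to roughly 300 tokens while accuracy is retained.

Link: https://arxiv.org/abs/2508.18743
Authors: Sunguk Choi, Yonghoon Kwon, Heondeuk Lee
Affiliations: selectstar.ai
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 findings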

Abstract: Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware Compact CoT (CAC-CoT) – a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields high-quality training data. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.

[NLP-33] M3HG: Multimodal Multi-scale and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations ACL2025

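【Quick Read】: This paper addresses two problems in Multimodal Emotion Cause Triplet Extraction in Conversations (MECTEC). First, annotated data lacks diversity: the only published dataset covers highly uniform dialogue scenarios, which limits generalization. Second, existing methods do not explicitly model emotional and causal contexts and neglect the fusion of semantic information across levels, which limits performance. For the first problem, the authors introduce MECAD, a multimodal, multi-scenario dataset of 989 conversations from 56 TV series. For the second, the key is M3HG, a model that builds a multimodal heterogeneous graph to capture emotional and causal contexts explicitly and to fuse contextual information at both the inter- and intra-utterance levels, improving triplet extraction accuracy.

Link: https://arxiv.org/abs/2508.18740
Authors: Qiao Liang, Ying Shen, Tiantian Chen, Lin Zhang
Affiliation: Tongji University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures. Accepted to Findings of ACL 2025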

Abstract:Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at this https URL.

[NLP-34] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

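【Quick Read】: This paper addresses the difficulty of achieving both quality and diversity in ad headline generation. Existing methods typically optimize language models only for headline quality or click-through rate (CTR), so outputs become homogeneous and fail to reach varied audience segments. The key to the solution is DIVER, an LLM-based framework that jointly optimizes quality and diversity. It first uses a semantic- and stylistic-aware data generation pipeline to build high-quality training pairs automatically (ad content paired with multiple diverse headlines). It then applies a multi-stage, multi-objective optimization strategy combining supervised fine-tuning (SFT) and reinforcement learning (RL), so that high-quality, diverse headlines are produced in a single forward pass. Experiments on real-world industrial datasets show that DIVER balances quality and diversity; deployed on a large-scale content-sharing platform, it improves advertiser value (ADVV) and CTR by 4.0% and 1.4%, respectively.

Link: https://arxiv.org/abs/2508.18739
Authors: Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu, Ruiwen Xu
Affiliation: Xiaohongshu Inc.; East China Normal University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: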

Abstract:The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.

[NLP-35] Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval KDD’2025

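【Quick Read】: This paper addresses unfair retrieved content in generative AI systems caused by bias in internal and external information sources, which erodes user trust and skews the balance of knowledge dissemination. The key to the solution is a Bias Mitigation Agent, a multi-agent system in which specialized agents cooperate to optimize source selection so that retrieved content is highly relevant and minimally biased. Experiments show an 81.82% reduction in bias compared with a baseline naive retrieval strategy.

Link: https://arxiv.org/abs/2508.18724
Authors: Karanbir Singh, Deepak Muppiri, William Ngu
Affiliation: Salesforce
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at KDD’2025 Agent4IR workshop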

Abstract:Large Language Models (LLMs) have transformed the field of artificial intelligence by unlocking the era of generative applications. Built on top of generative AI capabilities, Agentic AI represents a major shift toward autonomous, goal-driven systems that can reason, retrieve, and act. However, they also inherit the bias present in both internal and external information sources. This significantly affects the fairness and balance of retrieved information, and hence reduces user trust. To address this critical challenge, we introduce a novel Bias Mitigation Agent, a multi-agent system designed to orchestrate the workflow of bias mitigation through specialized agents that optimize the selection of sources to ensure that the retrieved content is both highly relevant and minimally biased to promote fair and balanced knowledge dissemination. The experimental results demonstrate an 81.82% reduction in bias compared to a baseline naive retrieval strategy.

[NLP-36] EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

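【Quick Read】: This paper addresses the malicious use of large language models (LLMs) for user impersonation in customer service. Existing machine-generated text (MGT) detection methods are inaccurate and hard to interpret in online conversational settings, so they do not meet the needs of non-expert operators who must trust the deployed system. The key to the solution is EMMM, an explanation-then-detection framework that balances latency, accuracy, and interpretability aimed at non-expert users. It provides human-understandable explanations while keeping detection competitive: 70% of human evaluators preferred its outputs, and it produces results within 1 second.

Link: https://arxiv.org/abs/2508.18715
Authors: Angela Yifei Yuan, Haoyi Li, Soyeon Caren Han, Christopher Leckie
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages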

Abstract:The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at this https URL.

[NLP-37] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

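【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in balancing cultural fit with creative abstraction when generating riddles in multiple languages. Existing prompting strategies (zero-shot, few-shot, chain-of-thought) tend to reproduce memorized riddles or paraphrase them shallowly, so outputs lack real originality and cross-lingual consistency. The key to the solution is Adaptive Originality Filtering (AOF), which rejects redundant generations using cosine similarity while enforcing lexical novelty and cross-lingual fidelity. It improves the diversity and originality of generated riddles and supports culturally grounded, creative output without task-specific fine-tuning.

Link: https://arxiv.org/abs/2508.18709
Authors: Duy Le, Kent Ziti, Evan Girard-Sun, Sean O’Brien, Vasu Sharma, Kevin Zhu
Affiliation: Algoverse AI Research
Categories: Computation and Language (cs.CL)
Comments: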

Abstract:Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies – zero-shot, few-shot, chain-of-thought – tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
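
The cosine-based rejection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words vectors stand in for whatever embedding AOF actually uses, and the function names and the 0.7 threshold are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Bag-of-words term counts; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_originals(candidates, references, threshold=0.7):
    """Keep candidates whose similarity to every reference and to every
    previously accepted candidate stays below `threshold` (assumed value)."""
    seen = [bow_vector(r) for r in references]
    accepted = []
    for cand in candidates:
        vec = bow_vector(cand)
        if all(cosine(vec, s) < threshold for s in seen):
            accepted.append(cand)
            seen.append(vec)
    return accepted


known = ["What has keys but cannot open locks? A piano."]
generated = [
    "What has keys but cannot open any locks? A piano.",  # near-copy of a known riddle
    "I drink the night and pour out morning. What am I? The horizon.",
    "I drink the night and pour out the morning. What am I? The horizon.",  # near-duplicate
]
kept = filter_originals(generated, known)
print(kept)
```

Only the second candidate survives here: the first is too close to a known riddle and the third is too close to an already accepted one. A practical filter would likely compare multilingual sentence embeddings rather than word counts.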

[NLP-38] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

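【Quick Read】: This paper addresses the limited accuracy of speech large language models (SLMs) on domain-specific terms and neologisms, which restricts speech-to-text systems in specialized settings. The key to the solution is Attention2Probability, an attention-based probability estimation method that converts the cross-attention weights between speech and terminology into term presence probabilities and uses curriculum learning to improve retrieval accuracy. The method is lightweight and flexible. On the authors' test set it reaches maximum recall of 92.57% for Chinese and 86.83% for English at a latency of only 8.71 ms per query, significantly outperforming the VectorDB method, and intervening with the retrieved terms improves terminology accuracy by 6-17%.

Link: https://arxiv.org/abs/2508.18701
Authors: Yanfan Du, Jun Zhang, Bin Wang, Jin Qiu, Lu Huang, Yuan Ge, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Affiliation: ByteDance
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 5 tables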

Abstract:Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs’ recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.

[NLP-39] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

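【Quick Read】: This paper addresses the poor answer consistency of current Medical Vision-Language Models (Med-VLMs) when medical questions are rephrased in semantically equivalent ways, a weakness that undermines their reliability in high-stakes medical settings. The authors trace the problem to two limitations: insufficient alignment of medical concepts, which leads to divergent reasoning paths, and hidden biases in training data that favor syntactic shortcuts over semantic understanding. They propose Consistency and Contrastive Learning (CCL), which has two core components: (1) knowledge-anchored consistency learning, which steers the model to reason from medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, which mitigates data-specific priors through discriminative representation refinement. CCL achieves state-of-the-art performance on three popular VQA benchmarks and improves answer consistency by 50% on the challenging RoMed test set.

Link: https://arxiv.org/abs/2508.18687
Authors: Songtao Jiang, Yuxi Chen, Sibo Song, Yan Zhang, Yeying Jin, Yang Feng, Jian Wu, Zuozhu Liu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: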

Abstract:In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.

[NLP-40] FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation

【Quick Read】: This paper addresses the weakened security readiness of signature-based Intrusion Detection Systems (IDS) when threats evolve constantly and the frequent rule updates they require delay deployment. The key to the solution is FALCON, an autonomous agentic framework that uses large language models (LLMs) to generate deployable IDS rules from Cyber Threat Intelligence (CTI) data and evaluates them with built-in multi-phased validators, closing the loop between rule generation and quality control. Evaluations report an average accuracy of 95% for automatic rule generation, validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts, supporting the feasibility of LLM-driven data mining for real-time threat mitigation.

Link: https://arxiv.org/abs/2508.18684
Authors: Shaswata Mitra, Azim Bazarov, Martin Duclos, Sudip Mittal, Aritran Piplai, Md Rayhanur Rahman, Edward Zieglar, Shahram Rahimi
Affiliation: The University of Alabama; Mississippi State University; The University of Texas at El Paso; National Security Agency
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 11 pages, 5 figures, 4 tables

Abstract:Signature-based Intrusion Detection Systems (IDS) detect malicious activities by matching network or host activity against predefined rules. These rules are derived from extensive Cyber Threat Intelligence (CTI), which includes attack signatures and behavioral patterns obtained through automated tools and manual threat analysis, such as sandboxing. The CTI is then transformed into actionable rules for the IDS engine, enabling real-time detection and prevention. However, the constant evolution of cyber threats necessitates frequent rule updates, which delay deployment time and weaken overall security readiness. Recent advancements in agentic systems powered by Large Language Models (LLMs) offer the potential for autonomous IDS rule generation with internal evaluation. We introduce FALCON, an autonomous agentic framework that generates deployable IDS rules from CTI data in real-time and evaluates them using built-in multi-phased validators. To demonstrate versatility, we target both network (Snort) and host-based (YARA) mediums and construct a comprehensive dataset of IDS rules with their corresponding CTIs. Our evaluations indicate FALCON excels in automatic rule generation, with an average of 95% accuracy validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts across all metrics. These results underscore the feasibility and effectiveness of LLM-driven data mining for real-time cyber threat mitigation.

[NLP-41] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

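【Quick Read】: This paper addresses the unstable performance of Multimodal Chain-of-Thought (MCoT) prompting when examples are chosen randomly or by hand, since such examples ignore both the model's own knowledge distribution and the intrinsic complexity of the task. The key to the solution is a framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty", which recasts prompt selection as prompt curriculum design: building a well-ordered set of examples matched to the model's current capabilities. It combines two complementary signals: model-perceived difficulty, measured through prediction disagreement in an active learning setup, and intrinsic sample complexity, which is independent of any model. Analyzing the two jointly yields a difficulty-balanced sampling strategy that keeps the selected examples diverse along both dimensions, improving the stability and effectiveness of multimodal reasoning.

Link: https://arxiv.org/abs/2508.18673
Authors: Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: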

Abstract:The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of “tailored teaching with balanced difficulty”. We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model’s current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
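
To make the idea of sampling across both difficulty dimensions concrete, here is a rough sketch. It is an illustration only, not the paper's procedure: measuring disagreement as one minus the majority-answer share, the three-way bucketing, the round-robin selection, and all field names are assumptions.

```python
import random
from collections import Counter, defaultdict


def disagreement(predictions):
    """Model-perceived difficulty: 1 minus the share of the majority answer
    among several sampled predictions for the same question."""
    top = Counter(predictions).most_common(1)[0][1]
    return 1.0 - top / len(predictions)


def bucket(value, edges=(0.34, 0.67)):
    """Map a score in [0, 1] to 0 (low), 1 (medium), or 2 (high)."""
    return sum(value >= e for e in edges)


def balanced_sample(pool, k, seed=0):
    """Pick k examples round-robin over (perceived, intrinsic) difficulty cells."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex in pool:
        key = (bucket(disagreement(ex["predictions"])), bucket(ex["complexity"]))
        cells[key].append(ex)
    for items in cells.values():
        rng.shuffle(items)
    chosen = []
    keys = sorted(cells)
    while len(chosen) < k and any(cells[key] for key in keys):
        for key in keys:
            if cells[key] and len(chosen) < k:
                chosen.append(cells[key].pop())
    return chosen


# Hypothetical pool: "complexity" is an intrinsic score assumed to lie in [0, 1].
pool = [
    {"id": "q1", "predictions": ["A", "A", "A", "A"], "complexity": 0.1},
    {"id": "q2", "predictions": ["A", "A", "A", "A"], "complexity": 0.2},
    {"id": "q3", "predictions": ["A", "B", "A", "B"], "complexity": 0.9},
    {"id": "q4", "predictions": ["A", "B", "C", "D"], "complexity": 0.5},
]
picked = balanced_sample(pool, 3)
print([ex["id"] for ex in picked])
```

Because q1 and q2 fall into the same cell, only one of them is drawn before the other cells are visited, so the three selected examples cover three distinct difficulty profiles.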

[NLP-42] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks ICML

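【Quick Read】: This paper examines how the sparsity introduced by Mixture-of-Experts (MoE) architectures affects two capability regimes of large language models (LLMs), memorization and reasoning, given that empirical scaling-law coefficients shift whenever the architecture changes. The key to the approach is to vary total parameters, active parameters, and top-k routing systematically under a fixed compute budget, which separates the train-test generalization gap from the loss-accuracy gap. Memorization improves monotonically with total parameters, whereas reasoning performance saturates and can even regress as sparsity grows. Altering top-k alone has little effect when active parameters are held constant, classic hyperparameters such as learning rate and initialization shift the generalization gap in the same direction as sparsity, and neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models.

Link: https://arxiv.org/abs/2508.18672
Authors: Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Presented at the Second AI for Math Workshop at ICML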

Abstract:Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-k routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-k alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at this https URL.
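
The distinction between total parameters, active parameters, and top-k can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a simplified accounting in which every expert has the same size and all non-expert weights are always active; the class name and every number are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    shared_params: int   # attention, embeddings, etc. (always active)
    expert_params: int   # parameters of one expert, summed over all MoE layers
    num_experts: int     # experts available per MoE layer
    top_k: int           # experts each token is routed to

    @property
    def total_params(self):
        return self.shared_params + self.num_experts * self.expert_params

    @property
    def active_params(self):
        return self.shared_params + self.top_k * self.expert_params

    @property
    def sparsity(self):
        """Fraction of experts left idle for each token."""
        return 1 - self.top_k / self.num_experts


# Two hypothetical models with identical per-token compute but different sparsity.
cfg_a = MoEConfig(shared_params=100_000_000, expert_params=50_000_000, num_experts=8, top_k=2)
cfg_b = MoEConfig(shared_params=100_000_000, expert_params=25_000_000, num_experts=64, top_k=4)

for name, cfg in (("a", cfg_a), ("b", cfg_b)):
    print(name, cfg.total_params, cfg.active_params, cfg.sparsity)
```

Both configurations activate 200M parameters per token, yet the second holds 1.7B parameters in total against 500M for the first. Comparisons of this kind, with active parameters held fixed, are what the abstract refers to when it separates the effect of top-k from that of sparsity.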

[NLP-43] Membership Inference Attacks on LLM-based Recommender Systems

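【Quick Read】: This paper addresses privacy leakage in recommender systems (RecSys) built on large language models (LLMs), specifically the risk of membership inference attacks (MIAs) when users' historical interactions are placed in system prompts for in-context learning (ICL). The key contribution is the design and evaluation of four membership inference attacks against LLM-based RecSys: direct inquiry, hallucination, similarity, and poisoning. Each exploits behavior specific to LLMs or recommender systems. Experiments on three LLMs and two well-known RecSys benchmark datasets confirm that the threat is realistic, with direct inquiry and poisoning attacks showing significantly high attack advantages.

Link: https://arxiv.org/abs/2508.18665
Authors: Jiajie He, Yuechun Gu, Min-Chun Chen, Keke Chen
Affiliation: University of Maryland, Baltimore County
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: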

Abstract:Large language model (LLM)-based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. They utilize in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attacks. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims’ historical interactions have been used by system prompts. They are direct inquiry, hallucination, similarity, and poisoning attacks, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks show significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.

[NLP-44] Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models ICASSP2026

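【Quick Read】: This paper addresses the lack of emotional understanding in current speech large language models (speech LLMs). Existing models usually just convert a text response into speech without capturing the emotional and paralinguistic cues in the user's spoken input, so they cannot produce empathetic responses that reflect tone and intonation, which hurts user experience in applications that need emotional resonance. The key to the solution is Emotion Omni, a model architecture that recognizes the emotional content of user speech and generates empathetic speech responses. The team also built a data generation pipeline on an open-source text-to-speech (TTS) framework to construct a 200k emotional dialogue dataset, so that an empathetic speech assistant can be trained with limited data and without large-scale training.

Link: https://arxiv.org/abs/2508.18655
Authors: Haoyu Wang, Guangyan Zhang, Jiale Chen, Jingyu Li, Yuehai Wang, Yiwen Guo
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, submitted to ICASSP 2026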

Abstract:With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user’s query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-machine interaction. Currently, most speech LLMs with empathetic capabilities are trained on massive datasets. This approach requires vast amounts of data and significant computational resources. Therefore, a key challenge lies in how to develop a speech LLM capable of generating empathetic responses with limited data and without the need for large-scale training. To address this challenge, we propose Emotion Omni, a novel model architecture designed to understand the emotional content of user speech input and generate empathetic speech responses. Additionally, we developed a data generation pipeline based on an open-source TTS framework to construct a 200k emotional dialogue dataset, which supports the construction of an empathetic speech assistant. The demos are available at this https URL

[NLP-45] UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation

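【Quick Read】: This paper studies the vulnerability of retrieval-augmented generation (RAG) systems to knowledge corruption attacks, and in particular the limitation that existing attacks target only specific queries or queries on similar topics. The authors propose UniC-RAG, which formulates the attack as a joint optimization problem: a small number of carefully crafted adversarial texts simultaneously induce malicious outputs for a large number of diverse, cross-domain user queries, enabling goals such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. The key component is a balanced similarity-based clustering method that improves the coverage and effectiveness of the adversarial texts. Experiments show that injecting 100 adversarial texts into a knowledge database with millions of texts achieves an attack success rate above 90% against a large query set (e.g., 2,000 queries), significantly outperforming baselines.

Link: https://arxiv.org/abs/2508.18652
Authors: Runpeng Geng, Yanting Wang, Ying Chen, Jinyuan Jia
Affiliation: Pennsylvania State University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 21 pages, 4 figures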

Abstract:Retrieval-augmented generation (RAG) systems are widely deployed in real-world applications in diverse domains such as finance, healthcare, and cybersecurity. However, many studies showed that they are vulnerable to knowledge corruption attacks, where an attacker can inject adversarial texts into the knowledge database of a RAG system to induce the LLM to generate attacker-desired outputs. Existing studies mainly focus on attacking specific queries or queries with similar topics (or keywords). In this work, we propose UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike prior work, UniC-RAG jointly optimizes a small number of adversarial texts that can simultaneously attack a large number of user queries with diverse topics and domains, enabling an attacker to achieve various malicious objectives, such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. We formulate UniC-RAG as an optimization problem and further design an effective solution to solve it, including a balanced similarity-based clustering method to enhance the attack’s effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly effective and significantly outperforms baselines. For instance, UniC-RAG could achieve over 90% attack success rate by injecting 100 adversarial texts into a knowledge database with millions of texts to simultaneously attack a large set of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and show that they are insufficient to defend against UniC-RAG, highlighting the need for new defense mechanisms in RAG systems.

[NLP-46] Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models

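【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in keeping responses both faithful to the facts and natural in expression. Current outputs either lack support from external knowledge and hallucinate, or lean on that knowledge so heavily that they read as verbose and unnatural. The key to the solution is Collaborative Decoding (CoDe), a framework that dynamically fuses the output probability distributions generated with and without external knowledge, guided by distribution divergence and model confidence, so that relevant and reliable expressions from the model's own parameters are activated selectively. A knowledge-aware reranking mechanism additionally prevents over-reliance on parametric knowledge and ensures that the provided external information is used properly.

Link: https://arxiv.org/abs/2508.18651
Authors: Chenxu Yang, Qingyi Si, Zheng Lin
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: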

Abstract:Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrificing expressiveness. In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. This integration is guided by distribution divergence and model confidence, enabling the selective activation of relevant and reliable expressions from the model’s internal parameters. Furthermore, we introduce a knowledge-aware reranking mechanism that prevents over-reliance on prior parametric knowledge while ensuring proper utilization of provided external information. Through comprehensive experiments, our plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, validating both its effectiveness and generalizability.
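
As a rough illustration of blending two next-token distributions under the guidance of divergence and confidence, consider the toy sketch below. The gating rule (confidence multiplied by one minus the Jensen-Shannon divergence) is an assumption chosen for simplicity, not the formula used by CoDe, and the three-token vocabulary is hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def blend(p_with, p_without):
    """Mix distributions produced with and without external knowledge.

    The weight on the knowledge-free distribution grows when that distribution
    is confident and agrees with the knowledge-grounded one (assumed rule).
    """
    divergence = js_divergence(p_with, p_without)
    confidence = max(p_without)
    weight = confidence * (1.0 - divergence)
    mixed = [(1 - weight) * a + weight * b for a, b in zip(p_with, p_without)]
    total = sum(mixed)
    return [x / total for x in mixed], weight


# The two distributions roughly agree: the knowledge-free one gets a large say.
agree_mix, agree_w = blend([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
# The two distributions conflict: the knowledge-grounded one dominates.
clash_mix, clash_w = blend([0.9, 0.05, 0.05], [0.05, 0.9, 0.05])
print(round(agree_w, 3), round(clash_w, 3))
```

Under this rule the token preferred by the knowledge-grounded distribution still wins in the conflicting case, while in the agreeing case the model's own phrasing preferences carry more weight.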

[NLP-47] Thinking Before You Speak: A Proactive Test-time Scaling Approach

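【Quick Read】: This paper addresses the weak performance of large language models (LLMs) on complex reasoning tasks such as mathematical problem solving. The authors attribute it to training data that lacks the intermediate thinking humans do while reasoning (the "insights"), which is usually left unstated, so models struggle to bridge reasoning steps. The key to the solution is a reasoning framework called Thinking Before You Speak (TBYS), which proactively inserts model-generated insights between consecutive reasoning steps to review the current status and initiate the next step, mirroring the human habit of thinking before speaking. Unlike static prompting strategies, it uses an automated pipeline to collect and filter in-context examples for generating insights, which reduces human labeling effort and fine-tuning overhead. Experiments on challenging mathematical datasets verify its effectiveness.

Link: https://arxiv.org/abs/2508.18648
Authors: Cong Li, Wenchang Chai, Hejun Wu, Yan Pan, Pengxu Wei, Liang Lin
Affiliation: Sun Yat-sen University; Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Comments: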

Abstract:Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs’ training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: this https URL

[NLP-48] Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

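【Quick Read】: This paper addresses the disconnect between the benchmark performance of large language models (LLMs) and their real-world utility: current evaluation frameworks are fragmented and favor technical metrics over holistic assessment for deployment. The core proposal is an anthropomorphic evaluation paradigm modeled on human intelligence, with a three-dimensional taxonomy of Intelligence Quotient (IQ) for foundational capacity, Emotional Quotient (EQ) for alignment ability, and Professional Quotient (PQ) for specialized expertise. The authors further design a Value-oriented Evaluation (VQ) framework that assesses practical value along four dimensions: economic viability, social impact, ethical alignment, and environmental sustainability. A modular architecture integrates six components with an implementation roadmap, offering systematic guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound.

Link: https://arxiv.org/abs/2508.18646
Authors: Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan
Affiliation: China Mobile Communications Group Co., Ltd.; Xidian University; Oklahoma State University; Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review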

Abstract:For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: this https URL.

[NLP-49] RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

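[Quick read]: This paper addresses the difficulty generative AI has in creative writing of optimizing subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits) at the same time. Existing reinforcement learning methods either rely on a single reward strategy that cannot serve both goals, or use fixed-weight mixed rewards that do not adapt to different writing scenarios. The proposed solution is Reinforcement Learning with Mixed Rewards (RLMR), a dynamically mixed reward method. Its key innovation is to adjust the weight of the constraint-following reward dynamically according to the writing quality within each sampled group: a writing reward model and a constraint verification model jointly produce the reward signal, and during GRPO training samples that violate constraints receive a negative advantage and are therefore penalized, so that the multiple objectives are optimized together in online training.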

Link: https://arxiv.org/abs/2508.18642
Authors: Jianxing Liao,Tian Zhang,Xiao Feng,Yusong Zhang,Rui Yang,Haorui Wang,Bosi Wen,Ziying Wang,Runzhi Shi
Affiliations: Tencent
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
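
To make the role of the negative advantage concrete, the sketch below computes GRPO-style group-relative advantages for one sampled group. It is a minimal illustration under stated assumptions, not the paper's reward formula: the function name `group_advantages`, the `margin` parameter, and the rule that places every violating sample just below the weakest compliant sample are hypothetical choices made for this example.

```python
from statistics import mean, pstdev

def group_advantages(writing_scores, violates, margin=0.1):
    """GRPO-style advantages for one sampled group (illustrative only).

    writing_scores: subjective quality score per sample (hypothetical reward model output).
    violates: True where the sample breaks an objective constraint.
    margin: hypothetical gap placed between compliant and violating rewards.
    """
    compliant = [s for s, v in zip(writing_scores, violates) if not v]
    # Assumption: the penalty adapts to the group, so every violating sample is
    # ranked below the weakest compliant sample of the same group.
    floor = min(compliant) if compliant else min(writing_scores)
    rewards = [floor - margin if v else s for s, v in zip(writing_scores, violates)]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # No contrast inside the group, so no learning signal.
        return [0.0 for _ in rewards]
    # Group-relative advantage: standardize rewards within the group.
    return [(r - mu) / sigma for r in rewards]

advantages = group_advantages([0.9, 0.5, 0.7, 0.8], [True, False, False, False])
```

In this example the first sample has the highest writing score but violates a constraint, so its reward is pushed below the group mean and its advantage comes out negative, which is the behavior the abstract describes.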

[NLP-50] Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

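[Quick read]: This paper addresses the poorly understood effect of Post-Training Quantization (PTQ) on the different knowledge capabilities of Large Language Models (LLMs); in particular, existing scaling laws overlook PTQ-specific parameters (such as effective bit-width and calibration set size) and differences in task sensitivity. Through a systematic empirical study, the authors disentangle LLM knowledge into memorization and utilization capabilities and build a unified quantitative framework over model size, effective bit-width, calibration set size, and group size. The study finds that memorization is markedly more sensitive to changes in effective bit-width, calibration set size, and model size, whereas utilization is comparatively robust, which provides a basis and practical guidance for designing knowledge-aware quantization strategies targeted at specific cognitive functions.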

Link: https://arxiv.org/abs/2508.18609
Authors: Chenxi Zhou,Pengfei Cao,Jiang Li,Jun Zhao,Kang Liu
Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Inner Mongolia University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ’s impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.
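
The abstract names effective bit-width and group size as variables of the scaling law but does not define them. As background only, the sketch below shows one common convention: symmetric round-to-nearest quantization with one scale per group, where the effective bit-width adds the amortized cost of storing that scale. The function names and the 16-bit scale are assumptions for illustration, not the paper's experimental setup.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # largest positive integer level, e.g. 7 for 4 bits
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # An all-zero group would give scale 0; fall back to 1.0 to avoid dividing by zero.
        scale = max(abs(w) for w in group) / qmax or 1.0
        dequantized.extend(round(w / scale) * scale for w in group)
    return dequantized

def effective_bits(bits, group_size, scale_bits=16):
    """Bits per weight once each group's scale is amortized over the group."""
    return bits + scale_bits / group_size

weights = [0.1, -0.7, 0.35, 0.2, 1.4, -0.2, 0.0, 0.9]
approx = quantize_groupwise(weights, bits=4, group_size=4)
```

Under this convention a smaller group size lowers the rounding error of each group but raises the effective bit-width, which is one reason the two quantities are usually studied together.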

[NLP-51] A New NMT Model for Translating Clinical Texts from English to Spanish ML4H ALT NEURIPS2018

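[Quick read]: This paper addresses the translation of Electronic Health Record (EHR) narratives from English to Spanish, where the main challenges are the lack of a parallel-aligned corpus and the abundance of Out-of-Vocabulary (OOV) words. The authors propose NOOV, a new Neural Machine Translation (NMT) system. Its key idea is to combine a bilingual lexicon learned automatically from parallel corpora with a phrase look-up table extracted from a large biomedical knowledge resource, which alleviates both the unknown-word problem and the word-repeat problem in NMT and improves phrase generation and overall translation quality on EHR text.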

Link: https://arxiv.org/abs/2508.18607
Authors: Rumeng Li,Xun Wang,Hong Yu
Affiliations: University of Massachusetts Amherst; University of Massachusetts Lowell; University of Massachusetts Medical School; Bedford VAMC and CHOIR
Subjects: Computation and Language (cs.CL)
Comments: This work was accepted by the Machine Learning for Health (ML4H) Workshop at NeurIPS 2018

Abstract:Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundant unknown words contained. To address such challenges, we propose NOOV (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, enhancing better phrase generation of NMT systems. Evaluation shows that NOOV is able to generate better translation of EHR with improvement in both accuracy and fluency.

[NLP-52] What do language models model? Transformers automata and the format of thought

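[Quick read]: The core question of this paper is what Large Language Models (LLMs) actually model: do they reflect human linguistic capacities, or are they merely models of the corpus they were trained on? The paper gives a non-deflationary defence of the latter position, arguing that LLMs model the training corpus rather than human cognitive capacities. The key argument is that human linguistic capabilities rely on supralinear formats for computation, whereas the transformer architecture supports at best linear formats; this fundamental architectural difference suggests that LLMs cannot reproduce the human mechanism for processing language. Building on the speculations of Liu et al. (2022) about shortcut automata, the author then offers a positive account of what transformers are doing: they produce new language by learning pattern associations in context. On this view language is a kind of "discourse machine", not only a means of expressing inner state but also a tool for making new language given appropriate context. LLMs have learned to use this technology too, though by very different means than humans.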

Link: https://arxiv.org/abs/2508.18598
Authors: Colin Klein
Affiliations: The Australian National University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we’ve trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely on supralinear formats for computation. The transformer architecture, by contrast, supports at best linear formats for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don’t think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of ‘discourse machine’ that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have also learned to use it, but via very different means.

[NLP-53] The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

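[Quick read]: This paper addresses visual metaphor generation, i.e., generating an image that is semantically consistent and visually coherent from an input text metaphor. The core challenge is to understand the mapping between the source concept and the target concept in the language and to preserve that alignment during image generation. The authors propose a self-evaluating visual metaphor generation framework that combines existing metrics with a newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within it they explore two approaches: a training-free pipeline that explicitly decomposes prompts into a source-target-meaning (S-T-M) mapping to guide image synthesis, and a training-based pipeline that uses lightweight reinforcement learning with the self-evaluation reward to improve alignment without large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. In the user study, participants preferred GPT-4o overall, while S-T-M prompting helped most on longer or more abstract metaphors.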

Link: https://arxiv.org/abs/2508.18569
Authors: Girish A. Koushik,Fatemeh Nazarieh,Katherine Birch,Shenbin Qian,Diptesh Kanojia
Affiliations: NICE Research Group; Centre for Translation Studies; Institute for People-Centred AI; School of Computer Science & Electronic Engineering, University of Surrey, UK
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.

[NLP-54] COMET-poly: Machine Translation Metric Grounded in Other Candidates

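[Quick read]: This paper addresses a gap between automated machine translation metrics and human judgment: most metrics score a translation from the source sentence and a single translation only, whereas humans often judge a translation against multiple alternatives, and this mismatch in evaluation setup may hurt metric performance. The solution is to give the metric additional information. COMET-polycand compares the translation at hand with alternative translations of the same source sentence. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts together with their human-labeled quality scores to guide the evaluation. Experiments show that both improve segment-level correlation (Kendall's tau-b): from 0.079 to 0.118 with a single additional translation in COMET-polycand, and from 0.079 to 0.116 with retrieved examples in COMET-polyic.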

Link: https://arxiv.org/abs/2508.18549
Authors: Maike Züfle,Vilém Zouhar,Tu Anh Dinh,Felipe Maia Polo,Jan Niehues,Mrinmaya Sachan
Affiliations: Karlsruhe Institute of Technology; ETH Zurich; University of Michigan
Subjects: Computation and Language (cs.CL)
Comments: Maike Züfle, Vilém Zouhar, and Tu Anh Dinh contributed equally

Abstract:Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall’s tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall’s tau-b correlation). We release our models publicly.
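
For readers unfamiliar with the reported statistic, the sketch below computes Kendall's tau-b between two lists of scores with a plain pairwise scan. It is a generic textbook implementation, not the authors' evaluation code; the function name and the choice to return 0.0 when the denominator vanishes are assumptions of this example.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither denominator term
            elif dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    # Convention of this sketch: report 0.0 when the statistic is undefined.
    return (concordant - discordant) / denom if denom else 0.0

metric_scores = [1, 2, 2, 3]  # hypothetical metric outputs
human_scores = [1, 2, 3, 3]   # hypothetical human ratings
tau = kendall_tau_b(metric_scores, human_scores)
```

A value of 1 means the metric orders every untied pair of segments the same way as the human ratings, and a value near 0 means the two orderings are largely unrelated.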

[NLP-55] Principled Detection of Hallucinations in Large Language Models via Multiple Testing

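[Quick read]: This paper addresses the tendency of Large Language Models (LLMs) to hallucinate, i.e., to generate responses that sound confident but are actually incorrect or even nonsensical. The key idea is to formulate hallucination detection as a hypothesis testing problem, draw a parallel to out-of-distribution detection in machine learning, and propose a multiple-testing-inspired detection method. The authors report experiments validating its robustness against state-of-the-art methods.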

Link: https://arxiv.org/abs/2508.18473
Authors: Jiawei Li,Akshayaa Magesh,Venugopal V. Veeravalli
Affiliations: University of Illinois Urbana-Champaign; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages

Abstract:While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.
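
The abstract does not spell out the test statistics or the testing procedure, so the following is only background on what a multiple-testing formulation can look like. It assumes several hypothetical hallucination scores per response, converts each to a conformal p-value against calibration scores from responses known to be correct, and flags the response if a Benjamini-Hochberg style step-up rule rejects any null. All names and the choice of procedure are assumptions of this sketch.

```python
def conformal_p_value(calibration_scores, test_score):
    """P-value of a test score against calibration scores (higher score = more suspicious)."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (len(calibration_scores) + 1)

def bh_rejects_any(p_values, alpha=0.1):
    """True if the Benjamini-Hochberg step-up rule rejects at least one null."""
    m = len(p_values)
    # The i-th smallest p-value is compared with alpha * i / m.
    return any(p <= alpha * (i + 1) / m for i, p in enumerate(sorted(p_values)))

# Hypothetical calibration scores for one detector, taken from correct responses.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
p_high = conformal_p_value(calibration, 1.0)   # more extreme than all calibration scores
p_low = conformal_p_value(calibration, 0.05)   # less extreme than all calibration scores
flagged = bh_rejects_any([0.01, 0.5, 0.9], alpha=0.1)
```

Combining several scores this way lets one strongly anomalous score trigger detection while the step-up thresholds limit false alarms caused by testing many scores at once.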

[NLP-56] Integrating gender inclusivity into large language models via instruction tuning

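[Quick read]: This paper addresses gender bias in Polish that stems from historical and political conventions: masculine forms are predominantly used to refer to men, women, and mixed-gender groups, so Large Language Models (LLMs) trained on Polish text inherit and reinforce this imbalance. The solution is to tune multilingual LLMs (Llama-8B, Mistral-7B, Mistral-Nemo) and Polish-specific LLMs (Bielik, PLLuM) on the IPIS dataset, a collection of human-crafted instructions for gender-inclusive proofreading in Polish and Polish-to-English translation, together with a system prompt, grounded in a theoretical linguistic framework, that states explicit gender-inclusive guidelines. The aim is to make gender inclusivity an inherent feature of the models and so mitigate gender bias in Polish language generation.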

Link: https://arxiv.org/abs/2508.18466
Authors: Alina Wróblewska,Bartosz Żuk
Affiliations: Institute of Computer Science; Polish Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.

[NLP-57] How Reliable are LLMs for Reasoning on the Re-ranking task?

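[Quick read]: This paper addresses two problems with using Large Language Models (LLMs) for re-ranking: the loss of transparency that accompanies their improved semantic understanding, and the difficulty of re-ranking accurately in newly developed systems with limited user engagement and insufficient ranking data. The approach is to analyze how different training methods affect the semantic understanding of LLMs on the re-ranking task, and to examine whether the models can generate more informed textual reasoning that improves transparency and offsets the scarcity of data. Experiments use a relatively small ranking dataset from the environment and Earth science domain. The analysis finds that some training methods yield better explainability than others, which suggests that some models acquire abstract knowledge that optimizes the evaluation metric rather than an accurate semantic understanding, and raises questions about the true reliability of LLMs.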

Link: https://arxiv.org/abs/2508.18444
Authors: Nafis Tanveer Islam,Zhiming Zhao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at FQAS Conference 2024. DOI will be provided in 3 weeks after the conference has published the paper

Abstract:With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM’s internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency of LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

[NLP-58] A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs

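[Quick read]: This paper addresses the lack of real-world impact information in vulnerability databases such as the NVD, and specifically how to automatically map Common Vulnerabilities and Exposures (CVEs) to the Tactics, Techniques, and Procedures (TTPs) that adversaries use. Manually linking CVEs to techniques in the ATT&CK knowledge base is time-consuming and cannot keep pace with the volume of new vulnerabilities published each year. The proposed solution, TRIAGE, combines two strategies based on Large Language Models (LLMs): an LLM is first prompted with instructions based on MITRE's CVE Mapping Methodology to predict an initial list of techniques, and this list is then combined with the output of a second LLM-based module that uses in-context learning, merging rule-based reasoning with data-driven inference. The evaluation shows that the hybrid approach improves recall of exploitation techniques and that GPT-4o-mini outperforms Llama3.3-70B on this task.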

Link: https://arxiv.org/abs/2508.18439
Authors: Anders Mølmen Høst,Pierre Lison,Leon Moonen
Affiliations: Simula & University of Oslo; Norwegian Computing Center; Simula
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Vulnerability databases, such as the National Vulnerability Database (NVD), offer detailed descriptions of Common Vulnerabilities and Exposures (CVEs), but often lack information on their real-world impact, such as the tactics, techniques, and procedures (TTPs) that adversaries may use to exploit the vulnerability. However, manually linking CVEs to their corresponding TTPs is a challenging and time-consuming task, and the high volume of new vulnerabilities published annually makes automated support desirable. This paper introduces TRIAGE, a two-pronged automated approach that uses Large Language Models (LLMs) to map CVEs to relevant techniques from the ATT&CK knowledge base. We first prompt an LLM with instructions based on MITRE’s CVE Mapping Methodology to predict an initial list of techniques. This list is then combined with the results from a second LLM-based module that uses in-context learning to map a CVE to relevant techniques. This hybrid approach strategically combines rule-based reasoning with data-driven inference. Our evaluation reveals that in-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. We also find that GPT-4o-mini performs better than Llama3.3-70B on this task. Overall, our results show that LLMs can be used to automatically predict the impact of cybersecurity vulnerabilities and TRIAGE makes the process of mapping CVEs to ATT&CK more efficient. Keywords: vulnerability impact, CVE, ATT&CK techniques, large language models, automated mapping.

[NLP-59] Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering EMNLP2025

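[Quick read]: This paper questions whether the prevailing practice of evaluating generalization on out-of-distribution (OOD) datasets accurately reflects how models fail in real-world deployment. Such evaluations assume that OOD performance captures a model's vulnerabilities in deployment. The authors confront OOD results with specific failure modes documented in question-answering (QA) models, namely reliance on spurious features or prediction shortcuts, and find that OOD datasets differ vastly in how well they estimate robustness to shortcuts, with some under-performing even a simple in-distribution (ID) evaluation. Two observations partly explain this: (1) spurious shortcuts are shared across ID and OOD datasets, so OOD evaluation struggles to tell whether a model is truly robust; and (2) for some datasets, quality for training and quality for evaluation are largely disconnected. The paper provides methodology and recommendations for evaluating generalization more robustly within and beyond QA.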

Link: https://arxiv.org/abs/2508.18407
Authors: Michal Štefánik,Timothee Mickus,Marek Kadlčík,Michal Spiegel,Josef Kuchař
Affiliations: University of Helsinki; Masaryk University; TransformersClub; Kempelen Institute of Intelligent Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in Findings of EMNLP 2025

Abstract:A majority of recent work in AI assesses models’ generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models’ robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset’s quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.

[NLP-60] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

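[Quick read]: This paper addresses the inconsistent outputs that probabilistic decoding produces in Large Language Models (LLMs), which is especially pronounced on complex or long-form questions. Self-Consistency (SC) applies only to short-form answers, while Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form tasks. The proposed Latent Self-Consistency (LSC) selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks, LSC surpasses SC, USC, and WUCS on average while maintaining low Expected Calibration Error, so it works reliably across answer formats.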

Link: https://arxiv.org/abs/2508.18395
Authors: Jeong-seok Oh,Jay-yoon Lee
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
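
The baseline that LSC is compared against, majority voting over exact answer strings, is simple enough to sketch directly. The code below illustrates only that Self-Consistency baseline, not LSC itself; the function name and the whitespace stripping are assumptions of this example.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer string and its share of the votes."""
    counts = Counter(a.strip() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Short-form answers: identical strings pool their votes.
short_answer, short_share = self_consistency_vote(["42", "41", "42 ", "7"])

# Long-form answers: paraphrases never match exactly, so no majority forms.
_, long_share = self_consistency_vote([
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
])
```

The second call shows why exact-string voting breaks down on long-form responses: three answers that agree in meaning each receive a single vote, which is the gap that semantic selection methods such as LSC aim to close.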

[NLP-61] Integral Transformer: Denoising Attention Not Too Much Not Too Little EMNLP2025

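[Quick read]: This paper addresses attention noise in softmax self-attention, where the model assigns disproportionate weight to semantically uninformative tokens (such as special tokens and punctuation), which hurts performance. The proposed Integral Transformer is a new self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. It suppresses noise while preserving the contributions of special tokens that are critical for model performance, balancing attention distributions and reducing rank collapse in upper layers.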

Link: https://arxiv.org/abs/2508.18387
Authors: Ivan Kobyzev,Abbas Ghaddar,Dingtao Hu,Boxing Chen
Affiliations: Huawei Noah’s Ark Lab, Montreal Research Center
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main

Abstract:Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.

[NLP-62] Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails

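[Quick read]: This paper addresses the risks enterprises face when deploying Large Language Models (LLMs), and in particular the lack of high-quality labeled data for detecting harmful or inappropriate content in LLM output, such as health advice. Production-quality labeled data on real LLM outputs is hard to obtain before deployment, which limits the training of guardrail detectors. The proposed method, backprompting, is a simple and intuitive way to generate labeled examples that resemble real LLM output; it is paired with a sparse human-in-the-loop clustering technique for labeling, yielding a parallel corpus that is roughly representative of the original dataset yet resembles real LLM output. Infusing existing datasets with these synthetic examples produces more robust training data, and on health advice identification, a difficult and nuanced guardrail task, the resulting detector outperforms GPT-4o by up to 3.73% despite having 400x fewer parameters.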

Link: https://arxiv.org/abs/2508.18384
Authors: Kellen Tan Cheng,Anna Lisa Gentile,Chad DeLuca,Guang-Jie Ren
Affiliations: IBM Research - Almaden; Princeton University; Adobe Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs’ input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique in one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x less parameters.

[NLP-63] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models EMNLP2025

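[Quick read]: This paper addresses the imbalance in the multilingual capabilities of Large Vision-Language Models (LVLMs), i.e., large performance differences across languages. The key finding is a salient correlation between multilingual understanding ability and language-specific neuron activations in shallow layers. Building on it, the authors propose PLAST, a training recipe that first identifies the layers involved in multilingual understanding and then fine-tunes them with question-translation pairs to achieve multilingual alignment. The method improves the multilingual capabilities of LVLMs while tuning only 14% of the parameters, and it generalizes to low-resource and complex visual reasoning tasks.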

Link: https://arxiv.org/abs/2508.18381
Authors: Yuchun Fan,Yilin Wang,Yongyu Mu,Lei Huang,Bei Li,Xiaocheng Feng,Tong Xiao,Jingbo Zhu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025 findings

Abstract:Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.

[NLP-64] Training Language Model Agents to Find Vulnerabilities with CTF-Dojo

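[Quick read]: This paper addresses the scarcity of scalable, generalizable execution-grounded environments with verifiable feedback for training Large Language Models (LLMs), which limits progress toward more capable ML agents. The authors introduce CTF-Dojo, the first large-scale executable runtime of its kind, with 658 Capture-The-Flag (CTF)-style challenges containerized in Docker and supporting verifiable feedback. They also develop CTF-Forge, an automated pipeline that turns publicly available artifacts into ready-to-use execution environments in minutes, removing the weeks of expert configuration traditionally required. Training on only 486 high-quality, execution-verified trajectories yields up to 11.6% absolute gains over strong baselines across three benchmarks, indicating that execution-grounded training signals are pivotal for building high-performance ML agents.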

Link: https://arxiv.org/abs/2508.18370
Authors: Terry Yue Zhuo,Dingmin Wang,Hantian Ding,Varun Kumar,Zijian Wang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state-of-the-art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal in advancing high-performance ML agents without dependence on costly proprietary systems.

[NLP-65] Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective

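[Quick read]: This paper addresses the accessibility barriers that multilingual web content creates for users with visual impairments. When a page mixes Latin and non-Latin scripts (such as Chinese or Arabic), screen readers that lack robust support for non-Latin scripts misread or mispronounce the text, which compounds the difficulty of accessible browsing. The key to its solution is LangCrUX, the first large-scale multilingual web dataset (120,000 websites across 12 languages that primarily use non-Latin scripts). A systematic analysis of this dataset shows that accessibility hints on current web pages often fail to reflect the languages of the visible content. The paper then proposes Kizuki, a language-aware automated accessibility testing extension that accounts for the limited utility of language-inconsistent accessibility hints.

Link: https://arxiv.org/abs/2508.18328
Authors: Masudul Hasan Masud Bhuiyan, Matteo Varvello, Yasir Zaki, Cristian-Alexandru Staicu
Affiliations: CISPA Helmholtz Center for Information Security; Nokia Bell Labs; New York University Abu Dhabi
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 6 figures

Click to view abstract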

Abstract:English is the predominant language on the web, powering nearly half of the world’s top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments, as assistive technologies like screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet, large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. We find that these hints often fail to reflect the language diversity of visible content, reducing the effectiveness of screen readers and limiting web accessibility. We finally propose Kizuki, a language-aware automated accessibility testing extension to account for the limited utility of language-inconsistent accessibility hints.

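[NLP-66] LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

[Quick read]: This paper studies how large language models (LLMs) in multi-agent systems (MAS) form trust through social interaction, resist misinformation, and integrate peer input, all of which matter for collective intelligence. Its key contribution is KAIROS, a benchmark that simulates quiz contests with peers of varying reliability and offers fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers, so that changes in LLM decisions given historical interactions and current peer responses can be evaluated systematically. It further compares three mitigation strategies: prompting, supervised fine-tuning, and Group Relative Policy Optimisation (GRPO). GRPO with multi-agent context, outcome-based rewards, and unconstrained reasoning achieves the best overall performance but reduces robustness to social influence relative to the base models.

Link: https://arxiv.org/abs/2508.18321
Authors: Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria
Affiliations: 1. University of Antwerp; 2. University of Southern California; 3. University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract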

Abstract:Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning, Group Relative Policy Optimisation (GRPO), across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: this https URL.
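The abstract names GRPO with outcome-based rewards but gives no formulas. A minimal sketch of the group-relative advantage that GRPO is generally described as using, assuming each prompt has a group of sampled answers scored by a scalar outcome reward (the function name and reward values are hypothetical, and the paper's exact reward design is not stated here):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the other samples for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    the normalisation usually associated with GRPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards: 1.0 if the final quiz answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, without needing a separate learned value model.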

[NLP-67] SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds

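[Quick read]: This paper addresses the robustness of large pretrained transformer language models under input perturbations. Existing robustness methods diverge between small-parameter models and large language models (LLMs), and they mostly rely on labor-intensive, sample-specific adversarial designs. The key to its solution is SALMAN, a unified sample-level robustness evaluation framework. Its core innovation is a new Distance Mapping Distortion (DMD) measure that ranks each sample's susceptibility by comparing input-to-output distance mappings with near-linear complexity, so model stability can be assessed without modifying internal parameters or designing complex perturbations. This gives a practical, model-agnostic tool for improving the reliability of transformer-based NLP systems.

Link: https://arxiv.org/abs/2508.18306
Authors: Wuxinlin Cheng, Yupeng Cao, Jinwen Wu, Koduvayur Subbalakshmi, Tian Han, Zhuo Feng
Affiliations: Stevens Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract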

Abstract:Recent strides in pretrained transformer-based language models have propelled state-of-the-art performance in numerous NLP tasks. Yet, as these models grow in size and deployment, their robustness under input perturbations becomes an increasingly urgent question. Existing robustness methods often diverge between small-parameter and large-scale models (LLMs), and they typically rely on labor-intensive, sample-specific adversarial designs. In this paper, we propose a unified, local (sample-level) robustness framework (SALMAN) that evaluates model stability without modifying internal parameters or resorting to complex perturbation heuristics. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample’s susceptibility by comparing input-to-output distance mappings in a near-linear complexity manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.

[NLP-68] Can VLMs Recall Factual Associations From Visual References? EMNLP2025

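[Quick read]: This paper identifies a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs): when the reference to an entity is an image rather than text, the model's ability to recall factual knowledge about that entity drops sharply. Forcing the model to rely on an image representation roughly halves its factual recall, which suggests that VLMs struggle to link their internal knowledge of an entity with its image representation. The key to the solution is to identify distinct patterns in the model's internal states that correlate with these linking failures. Probes on those states, applied without retraining, flag unreliable responses with over 92% accuracy, which raises coverage and lowers the risk of error in selective prediction on a visual question answering task.

Link: https://arxiv.org/abs/2508.18297
Authors: Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason
Affiliations: University of Southern California; Information Sciences Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear at EMNLP 2025 (Findings)

Click to view abstract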

Abstract:Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
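Selective prediction trades coverage (the share of questions the model answers) against risk (the error rate on the answered share). A minimal sketch of those two quantities, assuming a probe that outputs a reliability score per question and a fixed abstention threshold (the function name, scores, and threshold are hypothetical and not taken from the paper):

```python
def selective_metrics(reliability, correct, threshold):
    """Return (coverage, risk) when answering only if reliability >= threshold.

    reliability: probe scores, one per question (higher means more reliable).
    correct:     booleans, whether the model's answer was right.
    """
    answered = [ok for score, ok in zip(reliability, correct) if score >= threshold]
    coverage = len(answered) / len(reliability)
    # Risk is the error rate among answered questions; 0.0 if nothing is answered.
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


# Hypothetical probe scores for four visual questions.
coverage, risk = selective_metrics(
    reliability=[0.9, 0.8, 0.3, 0.6],
    correct=[True, False, True, True],
    threshold=0.5,
)
```

A better probe shifts this trade-off so that more questions are answered at the same or lower risk, which is the kind of improvement the abstract reports.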

[NLP-69] H-PRM: A Pluggable Hotword Pre-Retrieval Module for Various Speech Recognition Systems

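[Quick read]: This paper addresses the recognition accuracy of domain-specific terms in automatic speech recognition (ASR), in particular the sharp drop in recognition rate that existing models suffer as the number of hotwords grows. The key to its solution is a hotword pre-retrieval module (H-PRM) that measures the acoustic similarity between hotwords and the speech segment to select the most relevant hotword candidates, which improves the hotword post-recall rate (PRR). The module plugs into traditional models such as SeACo-Paraformer and, through a prompt-based approach, into Audio large language models (Audio LLMs), improving recognition in large-scale hotword settings.

Link: https://arxiv.org/abs/2508.18295
Authors: Huangyu Dai, Lingtao Mao, Ben Chen, Zihan Wang, Zihan Liang, Ying Han, Chenyi Lei, Han Li
Affiliations: Kuaishou Technology; Zhejiang Gongshang University
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract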

Abstract:Hotword customization is crucial in ASR to enhance the accuracy of domain-specific terms. It has been primarily driven by the advancements in traditional models and Audio large language models (LLMs). However, existing models often struggle with large-scale hotwords, as the recognition rate drops dramatically with the number of hotwords increasing. In this paper, we introduce a novel hotword customization system that utilizes a hotword pre-retrieval module (H-PRM) to identify the most relevant hotword candidate by measuring the acoustic similarity between the hotwords and the speech segment. This plug-and-play solution can be easily integrated into traditional models such as SeACo-Paraformer, significantly enhancing hotwords post-recall rate (PRR). Additionally, we incorporate H-PRM into Audio LLMs through a prompt-based approach, enabling seamless customization of hotwords. Extensive testing validates that H-PRM can outperform existing methods, showing a new direction for hotword customization in ASR.
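The abstract does not specify how acoustic similarity is measured. As a rough sketch of the pre-retrieval idea only, assuming hotwords and the speech segment have already been mapped to fixed-length embeddings and that cosine similarity is used for ranking (both are assumptions, and all names and vectors below are hypothetical):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_hotwords(segment_emb, hotword_embs, k=2):
    """Keep the k hotwords whose embeddings are most similar to the segment."""
    ranked = sorted(
        hotword_embs.items(),
        key=lambda item: cosine(segment_emb, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


# Toy 2-D embeddings standing in for real acoustic representations.
hotwords = {"paraformer": [1.0, 0.0], "kuaishou": [0.0, 1.0], "hotword": [0.7, 0.7]}
candidates = retrieve_hotwords([1.0, 0.1], hotwords, k=2)
```

Only the shortlisted candidates would then be passed to the recognizer, so its hotword list stays small even when the full vocabulary is large.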

[NLP-70] Semantic Attractors and the Emergence of Meaning: Towards a Teleological Model of AGI

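[Quick read]: This paper addresses the limits of current generative AI models based on statistical prediction in semantic understanding, in particular their inability to capture deeper structures of meaning such as irony, homonymy, and ambiguity. The key to its proposal is a theoretical framework built on semantic attractors in complex-valued meaning spaces. The attractor acts as a teleological operator (Microvitum) that, through gradient flows, tensor deformations, and iterative matrix dynamics, guides meaning toward stability, clarity, and expressive depth, marking a shift in cognitive architecture from predicting language to shaping meaning.

Link: https://arxiv.org/abs/2508.18290
Authors: Hans-Joachim Rudolph
Affiliations: Microvita Research e.V.
Categories: Computation and Language (cs.CL)
Comments: 10 pages

Click to view abstract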

Abstract:This essay develops a theoretical framework for a semantic Artificial General Intelligence (AGI) based on the notion of semantic attractors in complex-valued meaning spaces. Departing from current transformer-based language models, which operate on statistical next-token prediction, we explore a model in which meaning is not inferred probabilistically but formed through recursive tensorial transformation. Using cyclic operations involving the imaginary unit $i$, we describe a rotational semantic structure capable of modeling irony, homonymy, and ambiguity. At the center of this model, however, is a semantic attractor – a teleological operator that, unlike statistical computation, acts as an intentional agent (Microvitum), guiding meaning toward stability, clarity, and expressive depth. Conceived in terms of gradient flows, tensor deformations, and iterative matrix dynamics, the attractor offers a model of semantic transformation that is not only mathematically suggestive, but also philosophically significant. We argue that true meaning emerges not from simulation, but from recursive convergence toward semantic coherence, and that this requires a fundamentally new kind of cognitive architecture – one designed to shape language, not just predict it.

[NLP-71] Designing across domains with declarative thinking: Insights from the 96-Eyes ptychographic imager project

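[Quick read]: This paper addresses the design drift and communication overhead that arise when multidisciplinary teams (optics, algorithms, hardware-accelerated compute, and life sciences) co-design a complex imaging system, such as the 96-Eyes project for high-throughput drug discovery, and state their requirements inconsistently. The key to its solution is to use a declarative, fifth-generation language (5GL) to formalize project requirements, from hardware constraints to life sciences needs, into machine-readable problem statements. This makes requirements transparent across teams, keeps the design traceable, and reduces costly misalignment. The paper also argues that programming paradigms implicitly shape research workflows, and that the approach is best suited to concurrent R&D settings rather than traditional sequential, phase-driven workflows.

Link: https://arxiv.org/abs/2508.18512
Authors: Antony C Chan
Affiliations: Unknown
Categories: Optics (physics.optics); Computation and Language (cs.CL)
Comments:

Click to view abstract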

Abstract:This article presents a practitioner’s reflection on applying declarative, 5th generation, problem formulation language (5GL) to de novo imaging system design, informed by experiences across the interdisciplinary research in academia and cross-functional product development within the private sector. Using the 96-Eyes project: 96-camera parallel multi-modal imager for high-throughput drug discovery as a representative case, I illustrate how project requirements, ranging from hardware constraints to life sciences needs, can be formalized into machine-readable problem statements to preserve mission-critical input from diverse domain stakeholders. This declarative approach enhances transparency, ensures design traceability, and minimizes costly misalignment across optical, algorithmic, hardware-accelerated compute, and life sciences teams. Alongside the technical discussion of 5GL with real-world code examples, I reflect on the practical barriers to adopting 5GL in environments where imperative, 3rd-generation languages (3GL) remain the default medium for inter-team collaboration. Rather than offering a one-size-fits-all solution, these learned lessons highlight how programming paradigms implicitly shape research workflows through existing domain hierarchies. The discussion aims to invite further explorations into how declarative problem formulations can facilitate innovation in settings where concurrent R&D workflows are gaining traction, as opposed to environments where sequential, phase-driven workflows remain the norm.

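[NLP-72] Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology AAAI

[Quick read]: This paper addresses fairness, bias, and inequity in automatic speech recognition (ASR) systems serving speakers of African American English (AAE) and other linguistically diverse communities, with particular attention to how these systems deepen language marginalization. Its core proposal is a governance-centered ASR lifecycle framework that foregrounds community agency, linguistic justice, and participatory accountability. The framework fills the governance gap left by current technically oriented fairness interventions and supports more inclusive and responsible speech AI systems.

Link: https://arxiv.org/abs/2508.18288
Authors: Jay L. Cunningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 10 pages, 9 Pages (References and Appendices). The archival version has been accepted to AAAI (AIES 2025) without the extended Appendices. This extended version includes Appendices

Click to view abstract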

Abstract:This scoping literature review examines how fairness, bias, and equity are conceptualized and operationalized in Automatic Speech Recognition (ASR) and adjacent speech and language technologies (SLT) for African American English (AAE) speakers and other linguistically diverse communities. Drawing from 44 peer-reviewed publications across Human-Computer Interaction (HCI), Machine Learning/Natural Language Processing (ML/NLP), and Sociolinguistics, we identify four major areas of inquiry: (1) how researchers understand ASR-related harms; (2) inclusive data practices spanning collection, curation, annotation, and model training; (3) methodological and theoretical approaches to linguistic inclusion; and (4) emerging practices and design recommendations for more equitable systems. While technical fairness interventions are growing, our review highlights a critical gap in governance-centered approaches that foreground community agency, linguistic justice, and participatory accountability. We propose a governance-centered ASR lifecycle as an emergent interdisciplinary framework for responsible ASR development and offer implications for researchers, practitioners, and policymakers seeking to address language marginalization in speech AI systems.

Computer Vision

[CV-0] VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

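[Quick read]: This paper addresses the difficulty of precisely preserving unedited regions and overall coherence in local 3D editing. Existing methods usually edit rendered multi-view images and then reconstruct the 3D model, which tends to distort unedited regions or break overall coherence. The key to its solution is VoxHammer, a training-free method that operates directly in 3D latent space. It first predicts the inversion trajectory of the 3D model and obtains the inverted latents and key-value tokens at each timestep. Then, in the denoising and editing phase, it replaces the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens, which keeps preserved regions consistent and integrates the edited parts coherently.

Link: https://arxiv.org/abs/2508.19247
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
Affiliations: Beihang University; Renmin University of China; Tsinghua University; Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract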

Abstract:3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at this https URL.

[CV-1] Articulate3D: Zero-Shot Text-Driven 3D Object Posing

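[Quick read]: This paper addresses pose manipulation of 3D assets through natural-language instructions, that is, semantically controllable posing of a 3D model without any training. The key to its solution is a two-step method. First, a self-attention rewiring mechanism (RSActrl) modifies a powerful image generator so that it decouples source structure from pose and produces target images that follow the text instruction while keeping the structure consistent. Second, a multi-view pose optimisation step aligns the 3D mesh to the target pose, using keypoint matching to establish correspondences between input and target images rather than relying on the unreliable signal from differentiable rendering. The method works across diverse 3D objects and free-form text prompts, and quantitative evaluation plus a user study (in which it was preferred over 85% of the time) indicate that it outperforms existing approaches.

Link: https://arxiv.org/abs/2508.19244
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
Affiliations: University of Oxford; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract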

Abstract:We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: this https URL

[CV-2] Style4D-Bench: A Benchmark Suite for 4D Stylization

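[Quick read]: This paper addresses 4D stylization of dynamic 3D scenes: high-quality artistic style transfer for 4D scenes with complex motion and backgrounds while preserving spatio-temporal consistency. The core challenges are retaining fine-grained style detail, keeping temporal dynamics stable, and keeping multi-view rendering consistent. The key to its solution is the Style4D framework, built on 4D Gaussian Splatting (4DGS), with three components: 1) a basic 4DGS scene representation for reliable geometry; 2) a Style Gaussian Representation that uses lightweight per-Gaussian MLPs for temporally and spatially aware appearance control; and 3) a Holistic Geometry-Preserved Style Transfer module that improves spatio-temporal consistency through contrastive coherence learning and structural content preservation. Experiments show that Style4D achieves state-of-the-art performance on the proposed Style4D-Bench benchmark.

Link: https://arxiv.org/abs/2508.19243
Authors: Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, Guangcong Wang
Affiliations: Harbin Institute of Technology; Vision, Graphics, and X Group, Great Bay University; Nanjing University; Sun Yat-Sen University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL . Code: this https URL

Click to view abstract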

Abstract:We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt for 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: this https URL . Code: this https URL .

[CV-3] Autoregressive Universal Video Segmentation Model

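[Quick read]: This paper addresses the split between prompted and unprompted video segmentation: existing methods struggle to handle, in one model, both segmentation driven by external prompts and general streaming segmentation that detects and tracks all objects in a video without prompts. The key to its solution is the Autoregressive Universal Segmentation Model (AUSM), which casts video segmentation as sequential mask prediction, analogous to language modeling. Built on state-space models (SSMs), AUSM maintains a fixed-size spatial state, so it scales to video streams of arbitrary length, and its components are designed for parallel training across frames, which speeds up training substantially.

Link: https://arxiv.org/abs/2508.19242
Authors: Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma
Affiliations: NVIDIA; Yonsei University; Carnegie Mellon University; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract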

Abstract:Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today’s landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

[CV-4] MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

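[Quick read]: This paper addresses long-horizon temporal dependence in robotic manipulation: mainstream vision-language-action (VLA) models usually ignore temporal context and therefore perform poorly on non-Markovian, long-horizon tasks. The core of its solution is MemoryVLA, a Cognition-Memory-Action framework inspired by cognitive science. A pretrained vision-language model (VLM) encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with the current tokens, and updates the bank. A memory-conditioned diffusion action expert then generates temporally aware action sequences.

Link: https://arxiv.org/abs/2508.19236
Authors: Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Affiliations: Tsinghua University; Dexmal; MEGVII Technology; Tianjin University; Harbin Institute of Technology; StepFun
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: The project is available at this https URL

Click to view abstract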

Abstract:Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: this https URL

[CV-5] Automated Feature Tracking for Real-Time Kinematic Analysis and Shape Estimation of Carbon Nanotube Growth ICCV2025

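[Quick read]: This paper addresses the difficulty of characterizing the dynamic growth of carbon nanotubes (CNTs) automatically and in real time. Ex situ methods offer only static analysis, and existing in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. The key to its solution is the Visual Feature Tracking (VFTrack) framework, which integrates handcrafted or deep feature detectors and matchers to detect and track CNT particles automatically in scanning electron microscopy (SEM) image sequences. Motion is decomposed into axial growth, lateral drift, and oscillations, from which heterogeneous regional growth rates are computed and the evolving pillar morphology is reconstructed. This bridges physics-based models and experimental observation and supports real-time optimization of CNT synthesis.

Link: https://arxiv.org/abs/2508.19232
Authors: Kaveh Safavigerdini, Ramakrishna Surya, Jaired Collins, Prasad Calyam, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
Affiliations: University of Missouri-Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE/CVF ICCV 2025, CV4MS Workshop (Computer Vision for Materials Science), Code available at: this https URL

Click to view abstract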

Abstract:Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in-situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with LightGlue matcher as an optimal combination (F1-score of 0.78, $\alpha$-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work enables advancement in automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.
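The abstract says motion vectors are decomposed into axial growth, lateral drift, and oscillations, without giving the procedure. A minimal 2-D sketch of the first two components only, assuming the pillar's growth direction is known and the decomposition is a plain vector projection (the function name and values are hypothetical; oscillation analysis over time is not modelled here):

```python
import math


def decompose_motion(displacement, growth_axis):
    """Split a 2-D displacement into (axial, lateral) signed components.

    axial:   component along the unit growth axis.
    lateral: component along the axis rotated 90 degrees counter-clockwise.
    """
    ax, ay = growth_axis
    norm = math.hypot(ax, ay)
    if norm == 0:
        raise ValueError("growth_axis must be non-zero")
    ux, uy = ax / norm, ay / norm        # unit vector along the growth axis
    dx, dy = displacement
    axial = dx * ux + dy * uy            # projection onto the growth axis
    lateral = dx * (-uy) + dy * ux       # projection onto the perpendicular
    return axial, lateral


# Hypothetical tracked displacement in pixels, with growth along +y.
axial, lateral = decompose_motion((3.0, 4.0), (0.0, 2.0))
```

Averaging the axial component of many tracked particles within a region, divided by the frame interval, would give a regional growth-rate estimate.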

[CV-6] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

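[Quick read]: This paper addresses the fact that existing video avatar models capture physical likeness but lack semantic depth: their motions typically synchronize only with low-level cues such as audio rhythm and do not reflect the character's emotion, intent, or context. Its solution rests on two technical contributions. First, Multimodal Large Language Models (MLLMs) synthesize a structured textual representation of conditions that gives high-level semantic guidance to the motion generator. Second, a specialized Multimodal DiT architecture with a Pseudo Last Frame design fuses audio, image, and text inputs and mitigates inter-modality conflicts, so the generated motion stays semantically consistent with the character, scene, and linguistic content.

Link: https://arxiv.org/abs/2508.19209
Authors: Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Click to view abstract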

Abstract:Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character’s authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: this https URL

[CV-7] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

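[Quick read]: This paper addresses the limits of large-scale scene data generation for robot learning. Neural reconstruction methods recover physically grounded outdoor scenes but are bound to static environments and offer limited scene and trajectory diversity, while data generated with image or video diffusion models is controllable but lacks geometry grounding and causality. The key to its solution is to combine the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. This yields high-fidelity, geometrically consistent, and controllable 3D driving scenes, with prompt-guided generation conditioned on map layouts, object permanence, and causal novel view synthesis.

Link: https://arxiv.org/abs/2508.19204
Authors: Julian Ost, Andrea Ramazzina, Amogh Joshi, Maximilian Bömer, Mario Bijelic, Felix Heide
Affiliations: 1. University of Oxford; 2. ETH Zurich; 3. Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project webpage: this https URL

Click to view abstract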

Abstract:Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control – they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, however, at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling the prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts – producing realistic and geometrically consistent 3D generations of complex driving scenes.

[CV-8] All-in-One Slider for Attribute Manipulation in Diffusion Models

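Quick read: This paper addresses the difficulty of progressive, fine-grained control over specific attributes in text-to-image (T2I) diffusion models, which is most pronounced for detail-rich content such as human faces. Existing slider methods follow a One-for-One design in which an independent slider is trained per attribute, which causes parameter redundancy, limits scalability, and rules out zero-shot manipulation of unseen attributes. The key to the solution is the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions, so that a single slider gives interpretable, continuous control over many attributes. By recombining the learned directions, it also supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes, improving the accuracy and scalability of attribute manipulation.

Link: https://arxiv.org/abs/2508.19195
Authors: Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: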

Abstract:Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: this https URL.
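
The idea of steering generation by moving a text embedding along attribute directions can be illustrated with a minimal sketch. It assumes the directions are already available as plain vectors; the function `apply_sliders`, the direction names, and all numbers are hypothetical and are not taken from the paper, whose module learns a sparse decomposition rather than using hand-written vectors.

```python
from typing import Dict, List

Vector = List[float]


def apply_sliders(embedding: Vector,
                  directions: Dict[str, Vector],
                  strengths: Dict[str, float]) -> Vector:
    """Shift an embedding along named attribute directions.

    Each slider contributes strength * direction; several sliders
    compose by simple addition (an assumption of this sketch).
    """
    edited = list(embedding)
    for name, strength in strengths.items():
        for i, component in enumerate(directions[name]):
            edited[i] += strength * component
    return edited


# Hypothetical 3-dimensional embedding and two attribute directions.
directions = {"age": [1.0, 0.0, 0.0], "smile": [0.0, 0.0, 1.0]}
base = [0.5, 0.5, 0.5]
edited = apply_sliders(base, directions, {"age": 0.5, "smile": -0.25})
print(edited)
```

Positive and negative strengths move the embedding in opposite directions along the same attribute, which is what makes the control continuous.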

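[CV-9] FastMesh: Efficient Artistic Mesh Generation via Component Decoupling

Quick read: This paper addresses the redundancy in current autoregressive triangle-mesh generation: when a mesh is tokenized into a sequence, each vertex is shared by several faces and is therefore emitted repeatedly, producing overly long token sequences and inefficient generation. The key to the solution is to generate vertices and faces separately. An autoregressive model generates only the vertices, reducing the token count to roughly 23% of that required by the most compact existing tokenizer. A bidirectional transformer then builds the adjacency matrix that defines the faces in a single step. A fidelity enhancer refines vertex positions, and a post-processing framework removes undesirable edge connections. The result is more than 8× faster generation with higher mesh quality.

Link: https://arxiv.org/abs/2508.19188
Authors: Jeonghwan Kim, Yushi Lan, Armando Fortes, Yongwei Chen, Xingang Pan
Affiliations: S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: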

Abstract:Recent mesh generation approaches typically tokenize triangle meshes into sequences of tokens and train autoregressive models to generate these tokens sequentially. Despite substantial progress, such token sequences inevitably reuse vertices multiple times to fully represent manifold meshes, as each vertex is shared by multiple faces. This redundancy leads to excessively long token sequences and inefficient generation processes. In this paper, we propose an efficient framework that generates artistic meshes by treating vertices and faces separately, significantly reducing redundancy. We employ an autoregressive model solely for vertex generation, decreasing the token count to approximately 23% of that required by the most compact existing tokenizer. Next, we leverage a bidirectional transformer to complete the mesh in a single step by capturing inter-vertex relationships and constructing the adjacency matrix that defines the mesh faces. To further improve the generation quality, we introduce a fidelity enhancer to refine vertex positioning into more natural arrangements and propose a post-processing framework to remove undesirable edge connections. Experimental results show that our method achieves more than 8× faster speed on mesh generation compared to state-of-the-art approaches, while producing higher mesh quality.
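
To make the vertex/face decoupling concrete, here is a small sketch of how triangle faces could be read off a vertex adjacency matrix. It assumes that every mutually connected vertex triple is a face, which is a simplification chosen for illustration; the helper `faces_from_adjacency` is hypothetical, and the paper's own face construction and post-processing may differ.

```python
from itertools import combinations
from typing import List, Tuple


def faces_from_adjacency(adj: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Return every vertex triple whose three edges are all present."""
    n = len(adj)
    return [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if adj[i][j] and adj[j][k] and adj[i][k]
    ]


# A square (vertices 0..3) split into two triangles by the diagonal 0-2.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
faces = faces_from_adjacency(adj)
print(faces)
```

In this toy mesh the four vertices are stored once each, while a face-by-face listing would mention vertices 0 and 2 twice; that repetition is the redundancy the abstract describes.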

[CV-10] SoccerNet 2025 Challenges Results

Quick read: This report covers several computer vision challenges in football video understanding: Team Ball Action Spotting, Monocular Depth Estimation, Multi-View Foul Recognition, and Game State Reconstruction. The key to the approach is to give participants large-scale annotated datasets, unified evaluation protocols, and strong baselines, which supports reproducible research on these tasks. Through these four tasks, the SoccerNet 2025 Challenges provide an open, standardized benchmarking platform at the intersection of computer vision, artificial intelligence, and sports.

Link: https://arxiv.org/abs/2508.19182
Authors: Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clapés, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, Xiaoyang Bi, Jiawang Cao, Vanyi Chao, Kamil Czarnogórski, Fabian Deuser, Mingyang Du, Tianrui Feng, Patrick Frenzel, Mirco Fuchs, Jorge García, Konrad Habel, Takaya Hashiguchi, Sadao Hirose, Xinting Hu, Yewon Hwang, Ririko Inoue, Riku Itsuji, Kazuto Iwai, Hongwei Ji, Yangguang Ji, Licheng Jiao, Yuto Kageyama, Yuta Kamikawa, Yuuki Kanasugi, Hyungjung Kim, Jinwook Kim, Takuya Kurihara, Bozheng Li, Lingling Li, Xian Li, Youxing Lian, Dingkang Liang, Hongkai Lin, Jiadong Lin, Jian Liu, Liang Liu, Shuaikun Liu, Zhaohong Liu, Yi Lu, Federico Méndez, Huadong Ma, Wenping Ma, Jacek Maksymiuk, Henry Mantilla, Ismail Mathkour, Daniel Matthes, Ayaha Motomochi, Amrulloh Robbani Muhammad, Haruto Nakayama, Joohyung Oh, Yin May Oo, Marcelo Ortega, Norbert Oswald, Rintaro Otsubo, Fabian Perez, Mengshi Qi, Cristian Rey, Abel Reyes-Angulo, Oliver Rose, Hoover Rueda-Chacón, Hideo Saito, Jose Sarmiento, Kanta Sawafuji, Atom Scott, Xi Shen, Pragyan Shrestha, Jae-Young Sim, Long Sun, Yuyang Sun, Tomohiro Suzuki, Licheng Tang, Masato Tonouchi, Ikuma Uchida, Henry O. Velesaca, Tiancheng Wang
Affiliations: King Abdullah University of Science and Technology; University of Liége (ULiége); Institut de Robòtica i Informàtica Industrial (CSIC-UPC); UCLouvain; EVS Broadcast Equipment; EPFL; Sportradar; Universitat de Barcelona; Computer Vision Center; Aalborg University; Keio University; TAHAKOM; Escuela Superior Politecnica del Litoral; HUST-iPad, Huazhong University of Science and Technology; State Key Laboratory of Networking and Switching Technology; Beijing University of Posts and Telecommunications; Opus AI Research; AI-Robotics, KIST School, University of Science and Technology; Korea Institute of Science and Technology; int8.io; University of the Bundeswehr Munich; Laboratory for Biosignal Processing, Leipzig University of Applied Sciences; Department of Computer Science, Universidad Industrial de Santander; MIXI Inc.; University of Tokyo; Max Planck Institute for Informatics; Intelligent Perception and Image Understanding Lab; Playbox Inc.; Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology; Shenzhen Institutes of Advanced Technology; eidos.ai; University of Tsukuba; Michigan Technological University; Nagoya University; Intellindust AI Lab; Southeast University; Suzhou Institute for Advanced Research, University of Science and Technology of China; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year’s challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at this https URL, with baselines and development kits available at this https URL.

[CV-11] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

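Quick read: This paper addresses the loss of two-dimensional spatial structure in Vision Transformers (ViT), which flatten patches and rely on learnable one-dimensional positional embeddings. Conventional positional encodings lack geometric constraints and do not establish a monotonic correspondence between Euclidean spatial distance and sequence-index distance, which limits how well the model can exploit spatial proximity priors. The key to the solution is Weierstrass Elliptic Function Positional Encoding (WEF-PE), which represents two-dimensional coordinates directly in the complex domain. The double periodicity of elliptic functions matches the translational invariance common in visual data, and their non-linear geometry encodes spatial distance relationships. The algebraic addition formula allows relative positional information between any pair of patches to be derived from their absolute encodings, which strengthens the model's geometric inductive bias and semantic focus.

Link: https://arxiv.org/abs/2508.19167
Authors: Zhihang Xin, Xitong Hu, Rui Wang
Affiliations: Jiangnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: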

Abstract:Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model’s capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
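
For readers unfamiliar with the function the abstract refers to, the standard textbook definition of the Weierstrass elliptic function on a lattice $\Lambda$, its double periodicity, and its addition formula are written out below. Mapping a patch at grid position $(x, y)$ to the complex number $z = x + iy$ is an assumption of this note; how the paper scales and embeds coordinates is not stated in the abstract.

```latex
\begin{align}
\wp(z) &= \frac{1}{z^{2}}
  + \sum_{\omega \in \Lambda \setminus \{0\}}
    \left( \frac{1}{(z-\omega)^{2}} - \frac{1}{\omega^{2}} \right),
  \qquad \wp(z + \omega) = \wp(z) \ \text{ for all } \omega \in \Lambda, \\
\wp(z_i - z_j) &= \frac{1}{4}
  \left( \frac{\wp'(z_i) + \wp'(z_j)}{\wp(z_i) - \wp(z_j)} \right)^{2}
  - \wp(z_i) - \wp(z_j).
\end{align}
```

The second line follows from the usual addition formula for $\wp(z_1 + z_2)$ together with the facts that $\wp$ is even and $\wp'$ is odd, and it holds whenever $z_i \not\equiv \pm z_j \pmod{\Lambda}$. It shows why a relative encoding for the offset $z_i - z_j$ is computable from the absolute quantities $\wp(z_i)$, $\wp'(z_i)$, $\wp(z_j)$, and $\wp'(z_j)$ alone.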

[CV-12] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

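Quick read: This paper studies monocular 3D visual grounding, where pre-trained language models prove insensitive to measurement units in text descriptions that carry geometry information. When the same distance is expressed in different units (meters, decimeters, centimeters), the text embeddings react to the numerical magnitude and largely ignore the unit, producing misleading features that weaken 3D understanding. The solution rests on two modules. 3D-text Enhancement (3DTE) diversifies the distance descriptors in text queries so that the model learns the mapping between units. Text-Guided Geometry Enhancement (TGE) projects the basic text features into a geometrically consistent space, and the resulting 3D-enhanced text features guide the attention of the geometry features. On the Mono3DRefer dataset the method sets a new state of the art, with an accuracy gain of 11.94% in the "Far" scenario.

Link: https://arxiv.org/abs/2508.19165
Authors: Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang
Affiliations: Hunan University; Hunan Normal University; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages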

Abstract:Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply equidistant mapping the length with unit “meter” to “decimeters” or “centimeters” leads to severe performance degradation, even though the physical length remains equivalent. This observation signifies the weak 3D comprehension of pre-trained language model, which generates misguiding text features to hinder 3D perception. Therefore, we propose to enhance the 3D perception of model on text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the “Far” scenario. Our code will be made publicly available.
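
The unit-equivalence idea behind the pre-processing step can be sketched as a text augmentation that rewrites a distance into another unit while keeping the physical length unchanged. This is only an illustration under simple assumptions: the function `convert_units`, the regular expression, and the sample sentence are hypothetical and do not reproduce the paper's 3DTE procedure.

```python
import re

# How many of each unit make up one meter.
UNITS_PER_METER = {"meters": 1, "decimeters": 10, "centimeters": 100}

DISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*(meters|decimeters|centimeters)")


def convert_units(text: str, target_unit: str) -> str:
    """Rewrite every distance in `text` into `target_unit`."""
    def rewrite(match: re.Match) -> str:
        value = float(match.group(1))
        unit = match.group(2)
        converted = value / UNITS_PER_METER[unit] * UNITS_PER_METER[target_unit]
        # "g" formatting drops a trailing ".0" for whole numbers.
        return f"{converted:g} {target_unit}"

    return DISTANCE.sub(rewrite, text)


query = "the black car 12.5 meters away"
augmented = convert_units(query, "centimeters")
print(augmented)
```

Pairing a query with such rewritten variants gives a model several descriptions of one physical distance, which is the kind of diversity the abstract says is missing.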

[CV-13] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

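Quick read: This paper addresses text line segmentation for the digital analysis of historical documents. The core difficulty is that deep learning models usually need large annotated datasets, which are rarely available for historical material, and manual annotation is costly and slow. The authors propose a data-efficient method that pairs a lightweight UNet++ with a topology-aware loss, originally developed for neuron morphology, which explicitly penalizes structural errors such as line fragmentation and unintended line merges. Training uses small patches extracted from only three annotated pages per manuscript. On the U-DIADS-TL dataset the method improves Recognition Accuracy by 200% and Line Intersection over Union by 75%, and its F-Measure matches or exceeds that of the competition winner on the DIVA-HisDB baseline detection task.

Link: https://arxiv.org/abs/2508.19162
Authors: Rafael Sterzinger, Tingyu Lin, Robert Sablatnig
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, accepted at ACPR2025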

Abstract:A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: this https URL.

[CV-14] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

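Quick read: This paper addresses two limitations of sRGB-domain diffusion models for image restoration: the trade-off between high fidelity and realistic generation, and the sub-optimal performance that results from ignoring sensor RAW data, which is accessible in many scenarios such as capture on edge devices. The core solution is an end-to-end RAW domain diffusion model (RDDM) that restores images directly in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + image restoration (IR) pipeline. The key contributions are: (1) a RAW-domain VAE (RVAE) that learns optimal latent representations; (2) a differentiable Post Tone Processing (PTP) module that enables joint optimization in RAW and sRGB space; (3) a scalable degradation pipeline that synthesizes RAW low-quality/high-quality (LQ-HQ) pairs from existing sRGB datasets for large-scale training; and (4) a configurable multi-bayer (CMB) LoRA module that handles different RAW patterns such as RGGB and BGGR. Experiments show that RDDM yields higher fidelity and fewer artifacts than state-of-the-art sRGB diffusion methods.

Link: https://arxiv.org/abs/2508.19154
Authors: Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen
Affiliations: Huawei Noah’s Ark Lab; Max Planck Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: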

Abstract:We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. As these models process lossy sRGB inputs and neglect the accessibility of the sensor RAW images in many scenarios, e.g., in image and video capturing in edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts the out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.
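
The abstract mentions RAW patterns such as RGGB and BGGR. As a rough illustration of why the layout matters, the sketch below mosaics a tiny RGB image under either pattern, keeping one colour channel per pixel. The function `mosaic` and the sample pixel values are hypothetical and are not taken from the paper; this is not the CMB LoRA module, only the sensor layout it has to cope with.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def mosaic(rgb: List[List[Pixel]], pattern: str = "RGGB") -> List[List[int]]:
    """Keep one colour channel per pixel following a 2x2 Bayer pattern.

    `pattern` lists the channels of the 2x2 tile in row-major order,
    e.g. "RGGB" means R G on even rows and G B on odd rows.
    """
    raw = []
    for r, row in enumerate(rgb):
        raw_row = []
        for c, pixel in enumerate(row):
            letter = pattern[2 * (r % 2) + (c % 2)]
            raw_row.append(pixel["RGB".index(letter)])
        raw.append(raw_row)
    return raw


image = [
    [(10, 20, 30), (11, 21, 31)],
    [(12, 22, 32), (13, 23, 33)],
]
print(mosaic(image, "RGGB"))
print(mosaic(image, "BGGR"))
```

The same scene produces different RAW values under the two layouts, so a model trained on one pattern cannot be applied to another without some form of adaptation.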

[CV-15] A Bag of Tricks for Efficient Implicit Neural Point Clouds

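Quick read: This paper addresses the comparatively slow rendering that limits the practical usability of Implicit Neural Point Cloud (INPC). INPC is a hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering and achieves high-quality novel view synthesis, but it queries neural networks during rendering, which creates a performance bottleneck. The key to the solution is a collection of optimizations: an improved rasterizer implementation, more effective sampling techniques, and pre-training of the convolutional neural network used for hole-filling. In addition, modeling points as small Gaussians during inference improves quality in extrapolated views such as close-ups. Together these changes give up to 25% faster training, 2x faster rendering, and 20% lower VRAM usage without sacrificing visual fidelity.

Link: https://arxiv.org/abs/2508.19140
Authors: Florian Hahlbohm, Linus Franke, Leon Overkämping, Paula Wespe, Susana Castillo, Martin Eisemann, Marcus Magnor
Affiliations: TU Braunschweig; FAU Erlangen-Nürnberg; University of New Mexico
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL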

Abstract:Implicit Neural Point Cloud (INPC) is a recent hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering, achieving state-of-the-art image quality in novel view synthesis. However, as with other high-quality approaches that query neural networks during rendering, the practical usability of INPC is limited by comparatively slow rendering. In this work, we present a collection of optimizations that significantly improve both the training and inference performance of INPC without sacrificing visual fidelity. The most significant modifications are an improved rasterizer implementation, more effective sampling techniques, and the incorporation of pre-training for the convolutional neural network used for hole-filling. Furthermore, we demonstrate that points can be modeled as small Gaussians during inference to further improve quality in extrapolated, e.g., close-up views of the scene. We design our implementations to be broadly applicable beyond INPC and systematically evaluate each modification in a series of experiments. Our optimized INPC pipeline achieves up to 25% faster training, 2x faster rendering, and 20% reduced VRAM usage paired with slight image quality improvements.

[CV-16] ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Quick read: This paper addresses the difficulty and hazard of collecting training data for terrain traversability prediction in autonomous robot navigation. Traditional methods send robots into potentially dangerous environments to gather data, which is risky and inefficient. The key to the solution is ZeST, which uses the visual reasoning capabilities of Large Language Models (LLMs) to build a traversability map in real time without exposing the robot to danger. This gives zero-shot traversability prediction and a safer, cost-effective, and scalable development path.

Link: https://arxiv.org/abs/2508.19131
Authors: Shreya Gummadi, Mateus V. Gasparino, Gianluca Capezzuto, Marcelo Becker, Girish Chowdhary
Affiliations: Field Robotics Engineering and Science Hub (FRESH), Illinois Autonomous Farm, University of Illinois at Urbana-Champaign (UIUC); Mobile Robotics Group, São Carlos School of Engineering, University of São Paulo (EESC-USP)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advancement of robotics and autonomous navigation systems hinges on the ability to accurately predict terrain traversability. Traditional methods for generating datasets to train these prediction models often involve putting robots into potentially hazardous environments, posing risks to equipment and safety. To solve this problem, we present ZeST, a novel approach leveraging visual reasoning capabilities of Large Language Models (LLMs) to create a traversability map in real-time without exposing robots to danger. Our approach not only performs zero-shot traversability and mitigates the risks associated with real-world data collection but also accelerates the development of advanced navigation systems, offering a cost-effective and scalable solution. To support our findings, we present navigation results, in both controlled indoor and unstructured outdoor environments. As shown in the experiments, our method provides safer navigation when compared to other state-of-the-art methods, constantly reaching the final goal.

[CV-17] VibES: Induced Vibration for Persistent Event-Based Sensing

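Quick read: This paper addresses the failure of event cameras to produce events in static or low-motion scenes under fixed illumination, which limits their use in computer vision tasks. The key to the solution is a lightweight vibration mechanism: a simple rotating unbalanced mass induces periodic vibration that sustains event generation, and a motion-compensation pipeline removes the injected motion to recover clean, motion-corrected events, improving downstream perception tasks such as image reconstruction and edge detection.

Link: https://arxiv.org/abs/2508.19094
Authors: Vincenzo Polizzi, Stephen Yang, Quentin Clark, Jonathan Kelly, Igor Gilitschenski, David B. Lindell
Affiliations: University of Toronto, Robotics Institute; University of Toronto, Department of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: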

Abstract:Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

[CV-18] Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation

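Quick read: This paper addresses how to learn task-specific binary illumination patterns that improve reconstruction in single-pixel imaging, with single-pixel fluorescence microscopy as the target application. The key to the solution is a bilevel optimisation method that uses the Straight-Through Estimator to handle the non-differentiability of binary pattern optimisation and a Total Deep Variation regulariser within the bilevel formulation, giving better reconstructions than baseline methods, especially in highly undersampled regimes.

Link: https://arxiv.org/abs/2508.19068
Authors: Serban C. Tudosie, Alexander Denker, Zeljko Kereta, Simon Arridge
Affiliations: University College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC); Optics (physics.optics)
Comments: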

Abstract:Single-Pixel Imaging enables reconstructing objects using a single detector through sequential illuminations with structured light patterns. We propose a bilevel optimisation method for learning task-specific, binary illumination patterns, optimised for applications like single-pixel fluorescence microscopy. We address the non-differentiable nature of binary pattern optimisation using the Straight-Through Estimator and leveraging a Total Deep Variation regulariser in the bilevel formulation. We demonstrate our method on the CytoImageNet microscopy dataset and show that learned patterns achieve superior reconstruction performance compared to baseline methods, especially in highly undersampled regimes.
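
The Straight-Through Estimator named in the abstract can be illustrated on a toy problem. The sketch below thresholds real-valued weights into a binary pattern in the forward pass and, in the backward pass, treats the threshold as the identity so the gradient reaches the weights. The quadratic objective, the target pattern, and the function names are hypothetical stand-ins chosen for this example; the paper's actual upper-level loss is a reconstruction error, not this one.

```python
from typing import List


def binarize(weights: List[float]) -> List[float]:
    """Forward pass: hard threshold at zero (non-differentiable)."""
    return [1.0 if w > 0 else 0.0 for w in weights]


def ste_step(weights: List[float], target: List[float],
             lr: float = 0.1) -> List[float]:
    """One gradient step on 0.5 * ||binarize(w) - target||^2.

    The true derivative of binarize is zero almost everywhere, so the
    straight-through estimator replaces it with 1 (identity variant).
    """
    pattern = binarize(weights)
    grad_wrt_pattern = [b - t for b, t in zip(pattern, target)]
    grad_wrt_weights = grad_wrt_pattern  # STE: pass the gradient through
    return [w - lr * g for w, g in zip(weights, grad_wrt_weights)]


weights = [0.05, -0.05, 0.3]
target = [0.0, 1.0, 1.0]
for _ in range(5):
    weights = ste_step(weights, target)
pattern = binarize(weights)
print(pattern)
```

Without the straight-through substitution the gradient with respect to the weights would be zero and the pattern could never change, which is the non-differentiability the abstract refers to.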

[CV-19] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

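Quick read: This paper addresses the poor adaptability, unstable performance, and limited efficiency of existing industrial surface defect detection methods across supervision regimes, in particular the unsupervised, weakly supervised, mixed supervision, and fully supervised settings that arise in real manufacturing. The key to the solution is SuperSimpleNet, an efficient and adaptable discriminative model built on SimpleNet. Its main contributions are a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, which allow efficient training in all four supervision scenarios and make it the first model able to use all available data annotations. It sets a new standard for performance on four benchmark datasets with an inference time below 10 ms.

Link: https://arxiv.org/abs/2508.19060
Authors: Blaž Rolih, Matic Fučka, Danijel Skočaj
Affiliations: University of Ljubljana; Faculty of Electrical Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by The Journal of Intelligent Manufacturing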

Abstract:Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: this https URL

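[CV-20] GReAT: leveraging geometric artery data to improve wall shear stress assessment MICCAI2025

Quick read: This paper addresses the difficulty of training machine learning models to assess coronary hemodynamic biomarkers such as wall shear stress when clinical data is scarce. The key to the solution is self-supervised pre-training on a large dataset of geometric vessel models (8449 3D vessel shapes), using the heat kernel signature as the self-supervised target to learn generalizable geometric representations. These representations improve the segmentation of coronary arteries into regions of low, mid, and high wall shear stress even when training on limited clinical data (49 patients).

Link: https://arxiv.org/abs/2508.19030
Authors: Julian Suk, Jolanda J. Wentzel, Patryk Rygiel, Joost Daemen, Daniel Rueckert, Jelmer M. Wolterink
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: (MICCAI 2025) Workshop on Shape in Medical Imaging (ShapeMI)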

Abstract:Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
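
For reference, the heat kernel signature mentioned in the abstract has a compact standard definition in terms of the Laplace-Beltrami eigenpairs $(\lambda_i, \phi_i)$ of a shape. The formula below is the textbook form; which time scales $t$ and how many eigenpairs the authors use are not stated in the abstract.

```latex
\begin{align}
k_t(x, y) &= \sum_{i \ge 0} e^{-\lambda_i t}\, \phi_i(x)\, \phi_i(y), \\
\mathrm{HKS}_t(x) &= k_t(x, x) = \sum_{i \ge 0} e^{-\lambda_i t}\, \phi_i(x)^{2}.
\end{align}
```

Here $k_t(x, y)$ is the heat kernel, and $\mathrm{HKS}_t(x)$ measures how much heat remains at point $x$ after time $t$ when a unit of heat starts there. Because it depends only on the intrinsic geometry, it can be computed for every vertex of an unlabeled vessel model, which is what makes it usable as a self-supervised target.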

[CV-21] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval EMNLP2025

Quick read: This paper addresses Partially Relevant Video Retrieval (PRVR), the task of retrieving videos from queries that are relevant to only specific segments, which is practical but challenging. Existing methods mostly model unimodal features, and powerful pretrained vision-language models such as CLIP remain underexplored for this task. The authors propose ProPy, a systematic architectural adaptation of CLIP for PRVR, with two core innovations: (1) a Prompt Pyramid structure that organizes event prompts across multiple granularity levels to capture semantics at each level; and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves state-of-the-art (SOTA) performance on three public datasets.

Link: https://arxiv.org/abs/2508.19024
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Department of Physics and Technology, UiT The Arctic University of Norway
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by EMNLP 2025 Findings

Abstract:Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at this https URL.

[CV-22] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood a Step Towards Human Blood Analysis

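[Quick Read]: This paper addresses the difficulty of detecting microplastics in blood, using clam blood as a first step toward human blood analysis and the assessment of potential health risks. Conventional methods struggle to identify and quantify microplastic particles in blood samples efficiently and accurately, so the paper proposes MicroDetect-Net (MDN), a new model that combines fluorescence microscopy with deep learning. The key to the solution is pairing Nile Red staining, which enhances the fluorescence signal, with Convolutional Neural Network (CNN) image segmentation to automatically localize and count microplastic fragments in blood samples. On 276 stained blood images, MDN reaches 92% accuracy, 87.4% Intersection over Union, and a 92.1% F1 score, which the authors take as evidence of its precision and robustness.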

Link: https://arxiv.org/abs/2508.19021
Authors: Riju Marwah, Riya Arora, Navneet Yadav, Himank Arora
Affiliations: Guru Gobind Singh Indraprastha University; Maharaja Agrasen Institute of Technology; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Accepted to ICICC 2025 (Innovative Computation in Biomedical Imaging)

Abstract:With the prevalence of plastics exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. The combination of convolutional networks and Nile Red dye for segmentation produced strong image detection and accuracy. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of ninety two percent. Robust performance was observed with an Intersection over Union of 87.4 percent, F1 score of 92.1 percent, Precision of 90.6 percent, and Recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
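
The segmentation metrics reported above (accuracy, Intersection over Union, F1, precision, recall) can be illustrated with a minimal sketch. It assumes pixel-wise evaluation of binary masks, which the abstract does not state; the function name `segmentation_metrics` and the toy masks are hypothetical and are not taken from the paper.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for two equally sized binary masks (lists of 0/1 rows).

    Assumption: masks are compared pixel by pixel; the paper may evaluate differently.
    """
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted microplastic, truly microplastic
            elif p and not t:
                fp += 1          # predicted microplastic, truly background
            elif t:
                fn += 1          # missed microplastic pixel
            else:
                tn += 1          # correctly predicted background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "iou": iou, "f1": f1,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    predicted = [[1, 1, 0], [0, 1, 0]]   # toy predicted mask
    reference = [[1, 0, 0], [0, 1, 1]]   # toy ground-truth mask
    print(segmentation_metrics(predicted, reference))
```

IoU penalizes both false positives and false negatives in a single ratio, which is why it is usually lower than F1 on the same prediction, as in the figures quoted above.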

[CV-23] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

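[Quick Read]: This paper addresses three key problems in current deep learning-based roof plane segmentation: most methods are not truly end-to-end, so the segmentation results may be suboptimal; point features near edges have low discriminability, which yields inaccurate plane boundaries; and planar geometric characteristics are not sufficiently used to constrain network training. The key to the solution is RoofSeg, a new edge-aware transformer architecture, whose core contributions are: (1) a transformer encoder-decoder framework that hierarchically predicts plane instance masks using learnable plane queries, enabling end-to-end training; (2) an Edge-Aware Mask Module (EAMM) that incorporates planar geometric priors to strengthen feature discriminability in edge regions and improve mask refinement; and (3) an adaptively weighted mask loss that reduces the influence of misclassified points, together with a new plane geometric loss that strengthens the network's modeling of planar geometry.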

Link: https://arxiv.org/abs/2508.19003
Authors: Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao, Li Li
Affiliations: Wuhan University; Wuhan University Shenzhen Research Institute; Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 38 pages, 10 figures, 9 tables

Abstract:Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on the manually designed or learned features followed by some specifically designed geometric clustering strategies. Because the learned features are more powerful than the manually designed features, the deep learning-based approaches usually perform better than the traditional approaches. However, the current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, the plane segmentation results may be not optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In the RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that sufficiently incorporates planar geometric prior of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and also propose a new plane geometric loss to constrain the network training.

[CV-24] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender Race Age and Skin Tone

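[Quick Read]: This paper addresses the demographic biases (race, gender, age, and skin tone) that Vision Language Models (VLMs) may exhibit in real-world applications, with the goal of improving their fairness and reliability. The key to the solution is the GRAS benchmark, which covers the most diverse set of samples to date, and the interpretable GRAS Bias Score for quantifying bias. The paper also finds that, when bias in VLMs is evaluated with Visual Question Answering (VQA), multiple formulations of the same question must be considered, a methodological point that matters for identifying and measuring bias accurately.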

Link: https://arxiv.org/abs/2508.18989
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
Affiliations: AI Institute, University of South Carolina, USA; Texas A&M University, USA; IIT Patna, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

[CV-25] Enhancing Document VQA Models via Retrieval-Augmented Generation ICDAR

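[Quick Read]: This paper addresses the high memory cost of handling multi-page documents in Document Visual Question Answering (Document VQA); existing methods typically concatenate all pages or rely on very large vision-language models, both of which are inefficient. The key to the solution is Retrieval-Augmented Generation (RAG): different retrieval strategies (OCR-text-based and purely visual) select relevant segments of the document as evidence, and the answer is generated from that reduced evidence. Experiments show that text-centric RAG improves the baseline by up to 22.5 ANLS, while purely visual RAG, which needs no text extraction, still yields a 5.0 ANLS improvement, confirming the effectiveness and generality of careful evidence selection in multi-page settings.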

Link: https://arxiv.org/abs/2508.18984
Authors: Eric López, Artemis Llabrés, Ernest Valveny
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abstract:Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the “concatenate-all-pages” baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
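
The gains above are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration, the sketch below follows the metric as it is commonly defined for document VQA benchmarks, with a 0.5 threshold and case-insensitive matching; those settings, and the helper names `levenshtein` and `anls`, are assumptions here and are not taken from the paper's evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity to any reference.

    predictions: list of predicted answer strings.
    references: list of lists of accepted answers, one list per question.
    Returns a value in [0, 1]; papers often report it multiplied by 100.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    print(anls(["Paris", "pariz", "london"],
               [["paris"], ["paris"], ["paris"]]))
```

The threshold means a near miss from OCR noise still earns partial credit, while an unrelated answer earns nothing.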

[CV-26] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

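[Quick Read]: This paper addresses how to segment accelerated (undersampled) magnetic resonance imaging (MRI) data, that is, how to make effective use of undersampled MRI data while preserving segmentation accuracy. Its key contribution is the first unified benchmark for this task, which systematically compares 7 methods and in particular contrasts one-stage strategies (joint end-to-end reconstruction and segmentation) with two-stage strategies (an established MRI reconstruction method followed by segmentation). The study finds that simple two-stage methods that enforce data consistency outperform complex, specialized one-stage models on segmentation, indicating that robust reconstruction matters more than elaborate model design for segmenting accelerated MRI.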

Link: https://arxiv.org/abs/2508.18975
Authors: Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner
Affiliations: University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:MR imaging is a valuable diagnostic tool allowing to non-invasively visualize patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly, when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, that combine reconstruction and segmentation into a unified model, with two-stage approaches, that utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.
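
To make the notion of data consistency concrete, here is a toy sketch of a hard data-consistency step: wherever k-space was actually sampled, the measured value overrides the value implied by the current image estimate. It is a one-dimensional, single-coil simplification with a naive DFT, chosen only to stay dependency-free; the benchmarked methods work on multi-coil 2D data, and the function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def data_consistency(estimate, measured_kspace, mask):
    """Keep measured k-space samples where mask is True, the estimate's elsewhere."""
    estimated_kspace = dft(estimate)
    merged = [measured if sampled else estimated
              for estimated, measured, sampled
              in zip(estimated_kspace, measured_kspace, mask)]
    return idft(merged)


if __name__ == "__main__":
    true_signal = [1.0, 2.0, 3.0, 4.0]
    kspace = dft(true_signal)
    mask = [True, False, False, True]   # toy undersampling pattern
    initial_guess = [0.0, 0.0, 0.0, 0.0]
    print(data_consistency(initial_guess, kspace, mask))
```

After this step the output agrees exactly with the scanner measurements at the sampled frequencies, so a reconstruction network can only change what was never measured.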

[CV-27] Can we make NeRF-based visual localization privacy-preserving?

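[Quick Read]: This paper addresses the privacy leakage that arises when NeRF-based visual localization (VL) methods are deployed in cloud services. Although existing NeRF methods deliver high-quality novel view synthesis, their geometry representations implicitly encode fine scene details, so an attacker may still recover sensitive information even after the color prediction head is removed. The solution has two parts. First, a new privacy evaluation protocol shows that NeRFs trained with photometric losses alone carry a privacy risk. Second, ppNeSF (Privacy-Preserving Neural Segmentation Field) trains the NeRF with segmentation labels learned in a self-supervised manner as the supervision signal, so that the scene representation stays discriminative in 3D while obscuring identifiable details, enabling accurate visual localization with state-of-the-art results while preserving privacy.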

Link: https://arxiv.org/abs/2508.18971
Authors: Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, Gabriela Csurka
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods, a.o., can be distinguished based on how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services as sensitive information could be recovered. In this paper, we tackle this challenge on two ends. We first propose a new protocol to assess privacy-preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while remaining discriminativeness in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

[CV-28] Enhanced UAV Path Planning Using the Tangent Intersection Guidance (TIG) Algorithm

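[Quick Read]: This paper addresses efficient and safe path planning for Unmanned Aerial Vehicles (UAVs) in static and dynamic environments, in support of applications such as combat support, package delivery, and search and rescue. The core of the solution is Tangent Intersection Guidance (TIG), a new path planning algorithm based on the elliptic tangent intersection method: it generates two sub-paths for each threat, selects the better one using a heuristic rule, and iteratively refines the path until the target is reached. A modified smoothing technique based on quadratic Bézier curves then produces a smooth, efficient trajectory that respects the UAV's kinematic and dynamic constraints. Experiments show that in static environments TIG generates the shortest path faster than A*, PRM, RRT*, Tangent Graph, and Static APPATT, with fewer turning angles; in unknown or partially known environments, its real-time collision avoidance outperforms the Artificial Potential Field (APF) method and Dynamic APPATT.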

Link: https://arxiv.org/abs/2508.18967
Authors: Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in JAMRIS Journal

Abstract:Efficient and safe navigation of Unmanned Aerial Vehicles (UAVs) is critical for various applications, including combat support, package delivery and Search and Rescue Operations. This paper introduces the Tangent Intersection Guidance (TIG) algorithm, an advanced approach for UAV path planning in both static and dynamic environments. The algorithm uses the elliptic tangent intersection method to generate feasible paths. It generates two sub-paths for each threat, selects the optimal route based on a heuristic rule, and iteratively refines the path until the target is reached. Considering the UAV kinematic and dynamic constraints, a modified smoothing technique based on quadratic Bézier curves is adopted to generate a smooth and efficient route. Experimental results show that the TIG algorithm can generate the shortest path in less time, starting from 0.01 seconds, with fewer turning angles compared to A*, PRM, RRT*, Tangent Graph, and Static APPATT algorithms in static environments. Furthermore, in completely unknown and partially known environments, TIG demonstrates efficient real-time path planning capabilities for collision avoidance, outperforming APF and Dynamic APPATT algorithms.
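
The smoothing step relies on quadratic Bézier curves. The sketch below shows only the standard curve, B(t) = (1 - t)^2 P0 + 2(1 - t)t P1 + t^2 P2, applied to a single sharp corner between waypoints; the paper's modifications are not described in the abstract, so `smooth_corner` and its `offset` parameter are illustrative assumptions rather than the authors' method.

```python
def quadratic_bezier(p0, p1, p2, num_points=20):
    """Sample the quadratic Bezier curve with control points p0, p1, p2 (2D tuples)."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)
        a = (1 - t) ** 2       # weight of the start point
        b = 2 * (1 - t) * t    # weight of the control point
        c = t ** 2             # weight of the end point
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points


def smooth_corner(prev_wp, corner, next_wp, offset=0.25, num_points=10):
    """Replace a sharp corner with a curve (hypothetical helper).

    The curve starts and ends a fraction `offset` of the way along each leg,
    measured from the corner, and uses the corner itself as the control point.
    """
    start = (corner[0] + offset * (prev_wp[0] - corner[0]),
             corner[1] + offset * (prev_wp[1] - corner[1]))
    end = (corner[0] + offset * (next_wp[0] - corner[0]),
           corner[1] + offset * (next_wp[1] - corner[1]))
    return quadratic_bezier(start, corner, end, num_points)


if __name__ == "__main__":
    for point in smooth_corner((0.0, 0.0), (4.0, 0.0), (4.0, 4.0)):
        print(point)
```

Because the curve is tangent to both legs at its endpoints, heading changes continuously through the turn, which is the property that makes such curves attractive under turning-rate limits.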

[CV-29] USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

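[Quick Read]: This paper addresses the apparent opposition between style-driven and subject-driven image generation: conventional methods treat them as mutually exclusive goals, with the former emphasizing style similarity and the latter subject consistency, which makes them hard to optimize together. The key to the solution is USO (Unified Style-Subject Optimized customization model), a unified customization model that disentangles and recombines content and style through three core mechanisms. First, it builds a large-scale triplet dataset of content images, style images, and their stylized results. Second, it uses a disentangled learning scheme that combines style-alignment training with content-style disentanglement training to optimize style consistency and subject fidelity simultaneously. Third, it adds Style Reward Learning (SRL) to further improve performance. On USO-Bench, the first benchmark to jointly evaluate style similarity and subject consistency, the method achieves state-of-the-art results among open-source models on both dimensions.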

Link: https://arxiv.org/abs/2508.18966
Authors: Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL Code and model: this https URL

Abstract:Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model’s performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: this https URL

[CV-30] Enhancing compact convolutional transformers with super attention

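[Quick Read]: This paper addresses the difficulty of balancing efficiency and performance for Transformer models on vision tasks with fixed context length, particularly for efficient inference under limited compute. The solution rests on three design elements: token mixing to strengthen feature interaction, sequence pooling to balance sequence compression against information retention, and convolutional tokenizers to extract local spatial features efficiently. On CIFAR100 these changes raise top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, respectively. The model is more efficient than standard Scaled Dot Product Attention (SDPA) Transformers when the context length is smaller than the embedding dimension, is only 60% of their size, trains stably, and does not rely on techniques such as data augmentation or learning rate scheduling.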

Link: https://arxiv.org/abs/2508.18960
Authors: Simpenzwe Honore Leandre, Natenaile Asmamaw Shiferaw, Dillip Rout
Affiliations: C.V. Raman Global University; Assam Royal Global University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract:In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. In the CIFAR100 benchmark, our model significantly improves the baseline of the top 1% and top 5% validation accuracy from 36.50% to 46.29% and 66.33% to 76.31%, while being more efficient than the Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension and only 60% the size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation like mixup, positional embeddings, or learning rate scheduling. We make our code available on Github.
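
Of the three components named above, sequence pooling is the easiest to show in isolation. The sketch below follows the form commonly described for compact convolutional transformers: a learned linear map scores each token, a softmax over the sequence turns the scores into weights, and the output is the weighted sum of tokens. Whether this paper uses exactly that form is an assumption, and the weights are passed in by hand here rather than learned.

```python
import math


def sequence_pool(tokens, weights, bias=0.0):
    """Attention-style pooling of a token sequence into one vector.

    tokens: list of n token embeddings, each a list of d floats.
    weights: list of d floats standing in for the learned scoring layer.
    Returns (pooled_vector, attention_weights).
    """
    # One scalar score per token from the linear scoring layer.
    scores = [sum(w * x for w, x in zip(weights, token)) + bias
              for token in tokens]
    # Numerically stable softmax over the sequence dimension.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Weighted sum of token embeddings.
    dim = len(tokens[0])
    pooled = [sum(a * token[k] for a, token in zip(attention, tokens))
              for k in range(dim)]
    return pooled, attention


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    pooled, attention = sequence_pool(toy_tokens, weights=[2.0, -1.0])
    print(pooled, attention)
```

Compared with a dedicated class token, this kind of pooling lets every output token contribute to the final representation, with the contribution learned from data.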

[CV-31] Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers

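[Quick Read]: This paper addresses the heavy reliance of traditional map-making on Geographic Information Systems (GIS), which demands domain expertise and is time-consuming, especially for repetitive tasks. The key to the solution is to feed vector data into generative AI (GenAI) models: text prompts control the map style, while vector data guides spatial composition and semantic layout, improving the controllability and accuracy of map generation. The authors describe the model as the first to generate accurate maps in controlled styles, and they integrate it into a web application to improve usability and accessibility.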

Link: https://arxiv.org/abs/2508.18959
Authors: Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by the textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings have suggested the potential of our developed application and, more generally, the GenAI models in helping both non-expert users and professionals in creating maps more efficiently. We have also outlined further technical improvements and emphasized the new role of cartographers to advance the paradigm of AI-assisted map-making.

[CV-32] The point is the mask: scaling coral reef segmentation with weak supervision

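[Quick Read]: This paper addresses two obstacles to large-scale coral reef monitoring: the limited resolution of drone aerial imagery makes it hard to reliably distinguish fine-scale ecological classes such as coral morphotypes, and pixel-level annotation is costly and labor-intensive, which limits the scalability of deep learning segmentation. The key to the solution is a multi-scale weakly supervised semantic segmentation framework that transfers fine-scale ecological information from underwater imagery to aerial imagery. By combining classification-based supervision, spatial interpolation, and self-distillation, it enables large-area, high-resolution reef mapping with minimal manual annotation, giving a low-cost and scalable approach to reef monitoring.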

Link: https://arxiv.org/abs/2508.18958
Authors: Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly
Affiliations: IFREMER; INRIA; CNRS; IRD
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: None

Abstract:Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and demonstrating flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

[CV-33] PanoHair: Detailed Hair Strand Synthesis on Volumetric Heads

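[Quick Read]: This paper addresses realistic hair strand generation for digital humans, in particular the limitations of existing methods: complex data acquisition and slow hair volume estimation and strand synthesis. The key to the solution is PanoHair, which uses knowledge distillation from a pretrained generative teacher model for head synthesis to estimate head geometry as signed distance fields. It predicts semantic segmentation masks and 3D orientation maps for the hair region and supports diverse hairstyles through latent space manipulation. For real images, an inversion process infers the latent code; given that code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, which the authors report as a significant improvement over existing methods.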

Link: https://arxiv.org/abs/2508.18944
Authors: Shashikant Verma, Shanmuganathan Raman
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high-fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi-view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre-trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi-view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

[CV-34] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories

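[Quick Read]: This paper addresses how to distinguish group pedestrians from single pedestrians in real-world trajectory data, in order to analyze differences in their space utilization and behavioral patterns. The key to the solution is a Transformer-based pair classification model: pedestrian trajectories are segmented into fixed time bins and passed through a structured, sequence-based filtering process to identify groups and single pedestrians. The paper also establishes a metric framework covering space utilization (convex hull area, smallest enclosing circle radius, heatmap-based density) and behavior (velocity change, motion angle deviation, clearance radius, trajectory straightness), and introduces three encounter types (single-to-single, single-to-group, group-to-group) to categorize interaction scenarios. Together these provide a scalable foundation for pedestrian simulation and space design validation in crowd dynamics research.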

Link: https://arxiv.org/abs/2508.18939
Authors: Amartaivan Sanjjamts, Morita Hiroshi
Affiliations: The University of Osaka
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: None

Abstract:This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians over a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types-single-to-single, single-to-group, and group-to-group to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across different sequence lengths 60, 100, and 200 frames. Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
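
Two of the metrics listed above, trajectory straightness and convex hull area, are simple to compute from raw positions. The sketch below uses one common definition of straightness (net displacement divided by path length) and Andrew's monotone chain for the hull; the paper's exact definitions are not given in the abstract, so treat these as assumptions.

```python
import math


def path_length(trajectory):
    """Total distance travelled along a list of (x, y) positions."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


def straightness(trajectory):
    """Net displacement over path length; 1.0 means a perfectly straight walk."""
    length = path_length(trajectory)
    if length == 0:
        return 1.0  # convention for a stationary or single-point trajectory
    return math.dist(trajectory[0], trajectory[-1]) / length


def convex_hull(points):
    """Convex hull via Andrew's monotone chain, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_hull_area(points):
    """Area of the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0


if __name__ == "__main__":
    walk = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    print(straightness(walk), convex_hull_area(walk))
```

For a group, the hull would typically be taken over all members' positions within one time bin, so its area tracks how much space the group occupies as it moves.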

[CV-35] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025

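[Quick Read]: This paper addresses the fact that traditional image understanding tasks, such as image captioning and cross-modal retrieval, focus on surface-level recognition of people, objects, and scenes while overlooking the contextual, temporal, and semantic dimensions that define real-world events. The key to the solution is the Event-Enriched Image Analysis (EVENTA) Grand Challenge, the first large-scale benchmark for event-level multimodal understanding. It integrates contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built on the OpenEvents V1 dataset, it has two tracks, Event-Enriched Image Retrieval and Captioning and Event-Based Image Retrieval, and aims to advance context-aware, narrative-driven multimedia AI.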

Link: https://arxiv.org/abs/2508.18904
Authors: Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM, Vietnam; University of Dayton, Ohio, United States
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2025

Abstract:The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: this https URL.

[CV-36] Interpretable Decision-Making for End-to-End Autonomous Driving ICCV2025

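[Quick Read]: This paper addresses the lack of interpretability in the decisions of end-to-end autonomous driving models, especially in complex urban scenarios, where deep neural networks with non-linear decision boundaries make the logic behind control commands hard to understand. The key to the solution is a set of loss functions that promote interpretability by producing sparse and localized feature maps, so the model can show which image regions contribute to the predicted control command while control performance is still optimized. Experiments on the CARLA benchmarks show that the method improves interpretability and reduces infractions, yielding safer, high-performance driving.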

Link: https://arxiv.org/abs/2508.18898
Authors: Mona Mirzaie, Bodo Rosenhahn
Affiliations: Leibniz University Hannover
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)

Abstract:Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[CV-37] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

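[Quick Read]: This paper addresses inaccurate interaction recognition in DETR (Detection Transformer)-based Human-Object Interaction (HOI) detection that stems from poorly initialized queries. Existing methods usually initialize queries randomly, which leaves object and interaction representations vague and limits performance. The key to the solution is the Dual Query Enhancement Network (DQEN). Object queries are enhanced with object-aware encoder features so that the model focuses more effectively on humans interacting with objects. An Interaction Semantic Fusion module uses semantic features extracted with the CLIP model to improve the initialization of interaction queries and thus the understanding of interactions. An Auxiliary Prediction Unit further strengthens the representation of interaction features. The method achieves competitive performance on the HICO-Det and V-COCO datasets.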

Link: https://arxiv.org/abs/2508.18896
Authors: Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian, Jiong Wang, Jiafei Wu
Affiliations: Ningbo University; Zhejiang Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model’s effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model’s ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at this https URL.

[CV-38] Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models

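[Quick read]: This paper addresses fairness problems in medical image diagnosis under distribution shifts, in particular diagnostic performance gaps across demographic groups (for example gender and race) that arise from differences in imaging equipment and clinical practice. Existing debiasing methods usually treat the vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. The key to the solution is the DualFairVL framework, which uses a parallel dual-branch architecture. Approximately orthogonal text anchors are built via linear projections and guide cross-attention to produce fused features; a hypernetwork disentangles attribute-related information and generates instance-aware visual prompts; and prototype-based regularization in the visual branch enforces separation of sensitive features and alignment with the text anchors. Together these jointly optimize fairness and robustness across modalities.

Link: https://arxiv.org/abs/2508.18886
Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia
Affiliations: Northwestern Polytechnical University; Huaiyi Huaying (慧医慧影); University of Sydney; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: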

Abstract:Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.

[CV-39] C-Flat: Towards a More Efficient and Powerful Framework for Continual Learning

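[Quick read]: This paper addresses the core continual learning (CL) challenge of adapting to new tasks while retaining past knowledge. It focuses on the observation that relying on zeroth-order sharpness alone may favor sharper minima in some settings, which hurts robustness and performance. The key to the solution is C-Flat, a method that promotes flatter loss landscapes to improve generalization and memory retention. It comes with a general framework that plugs into all major CL paradigms, and C-Flat++ adds selective flatness-driven promotion, which significantly reduces the update cost while maintaining strong performance.

Link: https://arxiv.org/abs/2508.18860
Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun
Affiliations: Sichuan University; Zhejiang University; ETH Zürich; Beijing University of Posts and Telecommunications; Tsinghua University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: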

Abstract:Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in continual learning (CL) to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose Continual Flatness (C-Flat), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at this https URL.
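
The abstract builds on sharpness-aware minimization (SAM). As a rough illustration of what a generic SAM update looks like, the sketch below takes one ascent step to a nearby worst-case point and then descends using the gradient found there. This is not C-Flat itself, whose update rule is not given in the abstract; the toy quadratic loss, `rho`, and `lr` are assumptions chosen only for illustration.

```python
import math

def loss(w):
    # Toy quadratic with very different curvature along the two axes (assumption).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], 0.1 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    """One generic SAM update: perturb towards higher loss, then descend."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    # Ascent step to an approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend from the original weights using the gradient at the perturbed point.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w)
```

Each SAM step costs two gradient evaluations, which is the kind of update overhead that a selective scheme such as C-Flat++ is described as reducing.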

[CV-40] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

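[Quick read]: This paper addresses the limited generalization of pixel-level synthesis methods for video stabilization: with fixed parameters, they struggle to perform robustly across diverse motion profiles and visual content. The key to the solution is a rapid test-time adaptation mechanism that exploits low-level visual cues available during inference and noticeably improves the stability and visual quality of the output even with a single adaptation pass. A jerk localization module and a targeted adaptation strategy focus the adaptation on high-jerk segments, maximizing stability with fewer adaptation steps. This lets modern full-frame synthesis models surpass long-standing state-of-the-art approaches while keeping their full-frame nature and offering control mechanisms similar to classical methods.

Link: https://arxiv.org/abs/2508.18859
Authors: Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, Tianrui Li
Affiliations: Southwest Jiaotong University; Hanyang University; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: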

Abstract:Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments to maximize stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to overcome the longstanding SOTA approaches while maintaining the full-frame nature of the modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.

[CV-41] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis

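[Quick read]: This paper addresses the reliability and efficiency problems of skill assessment in microsurgical anastomosis, which currently depends on subjective judgment. Existing methods such as the outcome-oriented anastomosis lapse index are prone to rater bias and do not provide an objective, quantitative measure of competence. The key to the solution is a quantitative assessment framework based on image-processing techniques: geometric modeling of errors combined with an automatic detection and scoring mechanism yields an objective analysis of microsurgical outcomes, improving the reliability of assessment and supporting better training protocols.

Link: https://arxiv.org/abs/2508.18836
Authors: Luyin Hu, Soheil Gholami, George Dindelegan, Torstein R. Meling, Aude Billard
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Iuliu Hatieganu University of Medicine and Pharmacy; Erasmus University Medical Center; National Hospital of Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures, accepted at EMBC2025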

Abstract:Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters’ scoring for the errors considered in this work.

[CV-42] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression

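[Quick read]: This paper addresses the difficulty of capturing the complex temporal dynamics of micro-expressions (MEs) with conventional window-level classification, which relies on fixed window sizes and hard decisions. Recent video-level regression frameworks help, but their interval decoding still depends on manually predefined, window-based methods, so the problem is only partly mitigated. The key to the solution is a prior-guided video-level regression method. First, a scalable interval selection strategy takes into account the temporal evolution, duration, and class distribution of MEs to spot the onset, apex, and offset phases precisely. Second, a synergistic optimization framework lets the spotting and recognition tasks share all parameters except the classification heads, exploiting complementary information and improving performance when data is limited.

Link: https://arxiv.org/abs/2508.18834
Authors: Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma
Affiliations: University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: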

Abstract:Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual’s genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model’s capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)^3 and 0.2000 on SAMMLV. The code is available at this https URL.

[CV-43] Automated Classification of Normal and Atypical Mitotic Figures Using ConvNeXt V2: MIDOG 2025 Track 2

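[Quick read]: This paper addresses binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images. The main challenges are severe class imbalance, high morphological variability, and domain heterogeneity across tumor types, species, and scanners. The key to the solution is a ConvNeXt V2 base model combined with center-crop preprocessing (60% center cropping) and a 5-fold cross-validation ensemble, with mixed precision training for efficiency, giving robust and efficient classification on a diverse data distribution.

Link: https://arxiv.org/abs/2508.18831
Authors: Yosuke Yamagishi, Shouhei Hanaoka
Affiliations: Graduate School of Medicine, The University of Tokyo, Japan; Department of Radiology, The University of Tokyo Hospital, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MIDOG 2025 solution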

Abstract:This paper presents our solution for the MIDOG 2025 Challenge Track 2, which focuses on binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images. Our approach leverages a ConvNeXt V2 base model with center cropping preprocessing and 5-fold cross-validation ensemble strategy. The method addresses key challenges including severe class imbalance, high morphological variability, and domain heterogeneity across different tumor types, species, and scanners. Through strategic preprocessing with 60% center cropping and mixed precision training, our model achieved robust performance on the diverse MIDOG 2025 dataset. The solution demonstrates the effectiveness of modern convolutional architectures for mitotic figure subtyping while maintaining computational efficiency through careful architectural choices and training optimizations.
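
To make the preprocessing concrete, here is a minimal sketch of a 60% center crop and a mean ensemble over folds. It assumes that "60%" refers to each side length of the patch and that fold predictions are combined by simple averaging; the abstract specifies neither, and the function names are hypothetical.

```python
def center_crop_box(width, height, keep=0.6):
    """Return (left, top, right, bottom) of a centered crop keeping `keep` of each side."""
    crop_w = round(width * keep)
    crop_h = round(height * keep)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

def ensemble_mean(fold_probs):
    """Average the per-fold probabilities predicted for one image."""
    return sum(fold_probs) / len(fold_probs)

# A 100x100 patch keeps its central 60x60 region.
box = center_crop_box(100, 100)
# Five hypothetical fold outputs for the atypical class.
prob = ensemble_mean([0.2, 0.4, 0.6, 0.8, 1.0])
```

The returned box uses the (left, top, right, bottom) convention, so it can be passed to an image library's crop call that accepts the same ordering.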

[CV-44] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory

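[Quick read]: This paper addresses the fact that updating National Forest Inventory (NFI) data depends on manual field campaigns that are costly and infrequent, and proposes combining remote sensing with deep learning to improve tree species classification. The key to the solution is as follows: time series are extracted through Google Earth Engine from Sentinel-1, Sentinel-2, ERA5, and SRTM data, and a publicly available remote sensing time series foundation model is fine-tuned on them, which substantially improves classification when few labeled samples are available. Experiments show gains of up to 10% over the current state of the art in NFI classification in the Netherlands, indicating that deep features outperform classic hand-designed harmonic features and offering an efficient route to forest monitoring in data-limited settings.

Link: https://arxiv.org/abs/2508.18829
Authors: Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm
Affiliations: Wageningen University and Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: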

Abstract:National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data, are meant to be used for general-purpose applications, and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands with few annotated data. Data-wise, we extracted time-series data from Sentinel-1, Sentinel-2, ERA5, and SRTM data using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets. This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of using deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.
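
For context on the "hand-defined harmonic features" that the abstract compares against, the following sketch computes the mean, amplitude, and phase of the annual harmonic of a time series. It assumes evenly spaced samples covering exactly one year, and the synthetic series is illustrative rather than real satellite data; the baseline's actual feature set is not described in the abstract.

```python
import math

def first_harmonic(values):
    """Mean, amplitude, and phase of the annual harmonic of evenly spaced samples."""
    n = len(values)
    mean = sum(values) / n
    # Project the series onto one cosine and one sine cycle per year.
    a = 2.0 / n * sum(v * math.cos(2 * math.pi * k / n) for k, v in enumerate(values))
    b = 2.0 / n * sum(v * math.sin(2 * math.pi * k / n) for k, v in enumerate(values))
    return mean, math.hypot(a, b), math.atan2(b, a)

# Synthetic vegetation-index-like series: 12 monthly values peaking at index 6.
series = [0.5 + 0.3 * math.cos(2 * math.pi * (k - 6) / 12) for k in range(12)]
mean, amplitude, phase = first_harmonic(series)
```

Three numbers per band summarize a whole season, which is compact but discards finer temporal structure; that is the limitation the abstract attributes to such features.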

[CV-45] SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation

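[Quick read]: This paper addresses bias in Machine Learning (ML) models in ethically sensitive domains such as healthcare, where bias harms fairness and generalization and may amplify social discrimination. Existing debiasing methods usually need access to the original training data and extensive retraining, and they often trade fairness against discriminative performance. The key to the solution is the Soft-Mask Weight Fine-Tuning (SWiFT) framework, which debiases efficiently using only a small external dataset and a few epochs of fine-tuning. Its core idea is to first quantify the distinct contributions of model parameters to bias and to predictive performance, and then to run a two-step fine-tuning process that updates each parameter with a different gradient flow according to its contribution, improving fairness while matching or exceeding the original discriminative performance.

Link: https://arxiv.org/abs/2508.18826
Authors: Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL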

Abstract:Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness, model generalization abilities and further risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance with much less debiasing costs. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.

[CV-46] Embedding Font Impression Word Tags Based on Co-occurrence

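[Quick read]: This paper addresses the under-modeled relationship between font shapes and the tags that describe their impressions, with the goal of improving impression-based font generation and retrieval. The key to the solution is to build a graph whose nodes are impression tags and whose edges encode co-occurrence, and then apply spectral embedding to learn vector representations of the tags. These representations assign similar vectors to tags that frequently co-occur, unlike standard word embeddings such as BERT and CLIP, and are especially useful for impression-guided font generation.

Link: https://arxiv.org/abs/2508.18825
Authors: Yugo Kubota, Seiichi Uchida
Affiliations: Kyushu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: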

Abstract:Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.
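
The graph construction step described above can be sketched as follows. This covers only the counting of tag co-occurrences; the spectral embedding would then be computed from the resulting weighted graph with an eigen-solver, which is omitted here. The tag sets and the function name are hypothetical, and the paper's exact edge weighting is not stated in the abstract.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(font_tags):
    """Count how often each unordered pair of tags is attached to the same font."""
    edges = Counter()
    for tags in font_tags:
        # Sort so that each pair has one canonical (a, b) key; set() drops repeats.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical tag sets for three fonts (illustrative only).
fonts = [["elegant", "formal"], ["elegant", "formal", "thin"], ["playful", "bold"]]
edges = cooccurrence_edges(fonts)
```

In this toy input, "elegant" and "formal" share the heaviest edge, so an embedding derived from the graph would be expected to place them close together.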

[CV-47] Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models

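[Quick read]: This paper addresses the vulnerability of Vision-Language Models (VLMs) to resource consumption attacks, which stems from their high inference cost. Existing attacks optimize adversarial images to lengthen output sequences and raise inference cost, but the extended outputs often contain irrelevant abnormal content, which reduces stealthiness and creates a trade-off between effectiveness and stealth. The key to the solution is Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images inducing the VLM to generate maximum-length outputs by appending special tokens that are invisible to users. Its core contribution is a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized with a dynamic weighting strategy. Output length increases by up to 19.2× and reaches the model's maximum token limit while the output semantics are preserved.

Link: https://arxiv.org/abs/2508.18805
Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: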

Abstract:Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2× and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at this https URL.

[CV-48] Robust and Label-Efficient Deep Waste Detection BMVC2025

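[Quick read]: This paper addresses the slow progress of AI research on waste sorting, which is held back by limited datasets and reliance on legacy object detectors. The key to the solution is an ensemble-based semi-supervised learning framework. LLM-optimized prompts raise the zero-shot accuracy of open-vocabulary object detectors, fine-tuned transformer-based detectors set a new baseline of 51.6 mAP, and a soft pseudo-labeling strategy fuses ensemble predictions with spatial and consensus-aware weighting. Applied to the unlabeled ZeroWaste-s subset, the resulting pseudo-annotations yield performance gains that surpass fully supervised training, which points to the value of scalable annotation pipelines.

Link: https://arxiv.org/abs/2508.18799
Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025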

Abstract:Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: this https URL.

[CV-49] PseudoMapTrainer: Learning Online Mapping without HD Maps ICCV2025

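[Quick read]: This paper addresses the dependence of online mapping on high-definition (HD) maps as training labels, which are expensive to obtain and often not geographically diverse enough. Existing methods require ground-truth HD maps during training, which limits generalization and practical deployment. The key to the solution is the PseudoMapTrainer framework, which generates pseudo-labels from unlabeled sensor data: 1) the road surface is reconstructed from multi-camera imagery using Gaussian splatting, and semantics from a pre-trained 2D segmentation network are used to produce the pseudo-labels; 2) a mask-aware assignment algorithm and loss function handle partially masked pseudo-labels, making it possible for the first time to train online mapping models without any ground-truth maps; 3) the pseudo-labels can also be used for semi-supervised pre-training, so that large-scale unlabeled crowdsourced data can be exploited.

Link: https://arxiv.org/abs/2508.18788
Authors: Christian Löwens, Thorben Funke, Jingchao Xie, Alexandru Paul Condurache
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted at ICCV 2025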

Abstract:Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at this http URL.

[CV-50] Design Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring

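[Quick read]: This paper addresses the scalability, interoperability, and performance challenges of running a real-time remote photoplethysmography (rPPG) system on resource-constrained, low-power devices, with the aim of extracting heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2) from facial video streams. The key to the solution is a multithreaded architecture built on the Face2PPG pipeline that handles video capture, real-time signal processing, network communication, and graphical user interface (GUI) updates concurrently, sustaining stable operation at 30 frames per second (fps). A hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model provides event-driven processing and efficient task parallelization, keeping computational overhead low while preserving accuracy and robustness.

Link: https://arxiv.org/abs/2508.18787
Authors: Constantino Álvarez Casado, Sasan Sharifipour, Manuel Lage Cañellas, Nhi Nguyen, Le Nguyen, Miguel Bordallo López
Affiliations: University of Oulu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 2 figures, 10 formulas, 3 tables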

Abstract:The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real-time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead. Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.
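
As an illustration of the final analysis step in an rPPG pipeline, the sketch below estimates heart rate as the dominant frequency of a pulse signal sampled at the video frame rate. It is a generic spectral-peak estimator, not the Face2PPG implementation; the search band of 42 to 240 BPM, the function name, and the synthetic signal are assumptions.

```python
import math

def estimate_heart_rate(signal, fps=30.0, lo_bpm=42, hi_bpm=240):
    """Return the BPM whose frequency carries the most spectral power in `signal`."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_bpm, best_power = lo_bpm, -1.0
    for bpm in range(lo_bpm, hi_bpm + 1):
        freq = bpm / 60.0  # candidate frequency in Hz
        re = sum(c * math.cos(2 * math.pi * freq * k / fps) for k, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * freq * k / fps) for k, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm

# Synthetic 10-second pulse at 1.2 Hz (72 BPM) sampled at 30 fps; not real data.
pulse = [math.sin(2 * math.pi * 1.2 * k / 30.0) for k in range(300)]
bpm = estimate_heart_rate(pulse)
```

Scanning only physiologically plausible frequencies keeps the cost proportional to the window length times the number of candidates, which suits a low-power device better than a dense spectrum over all frequencies.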

[CV-51] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

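[Quick read]: This paper addresses a fundamental mismatch between existing Human-Object Interaction (HOI) detection benchmarks and the evaluation of generative vision-language models (VLMs). Benchmarks such as HICO-DET require exact matches to annotated classes, so in scenes with several plausible interpretations (for example "throwing" versus "catching") valid predictions are penalized and VLMs cannot be assessed fairly. The key to the solution is a new evaluation protocol that reformulates HOI detection as a multiple-answer multiple-choice task, in which each question contains only ground-truth positive options and a curated set of negatives. This avoids penalizing valid predictions in ambiguous cases and, for the first time, enables a direct comparison between general-purpose VLMs and HOI-specific methods.

Link: https://arxiv.org/abs/2508.18753
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
Affiliations: National University of Singapore; University of Mississippi; ASUS Intelligent Cloud Services (AICS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: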

Abstract:Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either “throwing” or “catching”. When only “catching” is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when “catching” is annotated, “throwing” is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
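
One way to score a multiple-answer multiple-choice question is sketched below. The abstract does not state which metric the benchmark uses, so per-question precision, recall, and F1 over the selected options are an assumption, and the option labels are hypothetical.

```python
def score_question(selected, positives):
    """Precision, recall, and F1 of the selected options against the positive options."""
    selected, positives = set(selected), set(positives)
    hits = len(selected & positives)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(positives) if positives else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical question with options A-D, where A and C are annotated positives.
precision, recall, f1 = score_question(selected=["A", "B"], positives=["A", "C"])
```

Because ambiguous interactions are kept out of the negative set, choosing only among the listed options can never mark a plausible but unannotated reading as wrong.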

[CV-52] Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction BMVC2025

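[Quick read]: This paper addresses a central challenge in Open-Set Test-Time Adaptation (OSTTA): when test data is domain-shifted and contains classes unseen during training, how to separate known from unknown classes while keeping classification accuracy on the known ones. Existing methods rely on the source model to filter open-set data, which gives poor filtering accuracy under domain shift, whereas filtering with the adapting model is unstable and leads to error accumulation. The solution has two core mechanisms. Primary-Auxiliary Filtering (PAF) uses an auxiliary filter to validate the data filtered by the primary filter, making open-set identification more robust. Knowledge-Integrated Prediction (KIP) calibrates the outputs of the adapting model, the exponential moving average (EMA) model, and the source model to integrate their complementary knowledge, improving both open-set discrimination and closed-set accuracy.

Link: https://arxiv.org/abs/2508.18751
Authors: Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2025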

Abstract:Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at this https URL .

[CV-53] Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

【速读】:该论文旨在解决噪声环境下音频-视觉语音识别(Audio-Visual Speech Recognition, AVSR)系统鲁棒性不足的问题,特别是现有方法难以准确估计音频可靠性并动态调整多模态依赖关系。其解决方案的关键在于提出一种基于路由器门控的跨模态特征融合机制(router-gated cross-modal feature fusion),通过在每个解码器层中使用基于音频-视觉特征融合的路由器来计算音素级声学退化分数,并据此对音频特征进行加权抑制、强化视觉线索,从而实现模态间自适应的注意力控制,使模型在音频质量下降时能有效转向视觉模态,显著提升复杂噪声场景下的识别性能。

Link: https://arxiv.org/abs/2508.18734
Authors: DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang
Affiliation: Hanyang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted to IEEE ASRU 2025

Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
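The gating idea can be illustrated with a minimal sketch. The abstract does not describe the router architecture, so the linear router, its weights, and the convex audio/visual blend below are assumptions made for illustration; the paper itself applies the gate inside cross-attention in each decoder layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio_tokens, visual_tokens, router_weights, router_bias=0.0):
    """Blend per-token audio and visual features using a corruption score.

    audio_tokens, visual_tokens: lists of equal-length feature vectors, one per token.
    router_weights: hypothetical linear router over the concatenated
    [audio, visual] features (an assumption, not the paper's router).
    """
    fused, scores = [], []
    for a, v in zip(audio_tokens, visual_tokens):
        joint = a + v  # list concatenation of the two feature vectors
        logit = sum(w * x for w, x in zip(router_weights, joint)) + router_bias
        corruption = sigmoid(logit)  # near 1.0 means the audio token looks unreliable
        scores.append(corruption)
        # Down-weight audio and lean on the visual features as corruption grows.
        fused.append([(1.0 - corruption) * ai + corruption * vi for ai, vi in zip(a, v)])
    return fused, scores
```

With a score close to 1 the fused token is dominated by the visual features, which is the "pivot toward the visual modality" behaviour described in the abstract.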

[CV-54] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings ACM-MM2025

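[Quick read] This paper tackles the automatic generation of parametric CAD models from 2D vector engineering drawings, a critical step in traditional industrial design workflows that remains underexplored in generative modeling. The key idea is to reframe CAD generation as a sequence-to-sequence learning task built on three components: (1) a network-friendly vector primitive representation that preserves precise geometric information; (2) a dual-decoder Transformer architecture that decouples command-type generation from parameter generation while keeping the two in precise correspondence; and (3) a soft target distribution loss that accommodates the inherent flexibility of CAD parameters. Together these preserve geometric precision and design intent when converting 2D vector drawings into parametric CAD models.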

Link: https://arxiv.org/abs/2508.18733
Authors: Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu
Affiliations: Hangzhou Dianzi University; Zhejiang University; Shenzhen Research Institute of Big Data; Guangzhou University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2025

Abstract:Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at this https URL.
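One way to read the "soft target distribution loss" is as cross-entropy against a target that spreads some probability mass over neighbouring quantized parameter values instead of using a one-hot label. The sketch below assumes that reading; the bin count, tolerance, and decay factor are hypothetical and not taken from the paper.

```python
import math


def soft_target(num_bins, target_bin, tolerance=2, decay=0.5):
    """Build a target distribution peaked at target_bin (assumed shape)."""
    weights = [
        decay ** abs(i - target_bin) if abs(i - target_bin) <= tolerance else 0.0
        for i in range(num_bins)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, target_dist):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(p * (x - log_z) for p, x in zip(target_dist, logits))
```

Under this loss, predicting a parameter bin adjacent to the ground truth is penalized less than predicting a distant one.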

[CV-55] Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

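[Quick read] This paper addresses the large per-species performance disparities in underwater object detection, with a systematic analysis of an under-performing class (scallops). The key is to decouple detection into localization and classification. The experiments show that, even with ample data or a balanced class distribution, the low precision on scallops stems mainly from foreground-background discrimination during localization and from intrinsic feature-based difficulty during classification, rather than from data scarcity or inter-class dependencies alone. The authors therefore recommend focusing on algorithmic improvements, especially in localization modules, and choosing the class distribution by goal: imbalanced when prioritizing precision, balanced when prioritizing recall.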

Link: https://arxiv.org/abs/2508.18729
Authors: Melanie Wille, Tobias Fischer, Scarlett Raine
Affiliation: Queensland University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 10 pages

Abstract:Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.

[CV-56] Flatness-aware Curriculum Learning via Adversarial Difficulty BMVC2025

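[Quick read] This paper addresses the poor generalization of neural networks trained by empirical risk minimization (ERM), which tend to overfit specific samples or domains, and in particular the difficulty of combining Curriculum Learning (CL) with Sharpness-Aware Minimization (SAM): as the model converges toward flat minima, loss values and gradient norms become uniformly small, so loss- or gradient-based difficulty measures stop being informative. The key is the Adversarial Difficulty Measure (ADM), which exploits the robustness properties of models trained toward flat minima and quantifies adversarial vulnerability as the normalized loss gap between original and adversarial examples. ADM stays informative late in training and is embedded in CL-based training with SAM to assess sample difficulty dynamically. Experiments on image classification, fine-grained recognition, and domain generalization show gains over existing curriculum-based and flatness-aware training strategies.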

Link: https://arxiv.org/abs/2508.18726
Authors: Hiroaki Aizawa, Yoshikazu Hayashi
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025

Abstract:Neural networks trained by empirical risk minimization often suffer from overfitting, especially to specific samples or domains, which leads to poor generalization. Curriculum Learning (CL) addresses this issue by selecting training samples based on the difficulty. From the optimization perspective, methods such as Sharpness-Aware Minimization (SAM) improve robustness and generalization by seeking flat minima. However, combining CL with SAM is not straightforward. In flat regions, both the loss values and the gradient norms tend to become uniformly small, which makes it difficult to evaluate sample difficulty and design an effective curriculum. To overcome this problem, we propose the Adversarial Difficulty Measure (ADM), which quantifies adversarial vulnerability by leveraging the robustness properties of models trained toward flat minima. Unlike loss- or gradient-based measures, which become ineffective as training progresses into flatter regions, ADM remains informative by measuring the normalized loss gap between original and adversarial examples. We incorporate ADM into CL-based training with SAM to dynamically assess sample difficulty. We evaluated our approach on image classification tasks, fine-grained recognition, and domain generalization. The results demonstrate that our method preserves the strengths of both CL and SAM while outperforming existing curriculum-based and flatness-aware training strategies.
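A minimal sketch of a normalized loss gap follows. The abstract does not state the exact normalization or how the adversarial examples are generated, so the formula `(adv - clean) / (adv + clean + eps)` and the easy-to-hard ordering are assumptions for illustration only.

```python
def adversarial_difficulty(loss_clean, loss_adv, eps=1e-8):
    """Normalized gap between adversarial and clean loss (assumed normalization)."""
    return (loss_adv - loss_clean) / (loss_adv + loss_clean + eps)


def curriculum_order(clean_losses, adv_losses):
    """Return sample indices sorted from smallest to largest difficulty score."""
    scores = [adversarial_difficulty(c, a) for c, a in zip(clean_losses, adv_losses)]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```

Because the score is a ratio, it can still separate samples when all clean losses are uniformly small, which is the failure case the abstract describes for loss-based measures.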

[CV-57] Class-wise Flooding Regularization for Imbalanced Image Classification

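[Quick read] This paper addresses the tendency of neural networks trained on class-imbalanced datasets to favor majority classes, which severely degrades recognition of minority classes. The key is class-wise flooding regularization: each class gets its own flooding level, set from its class frequency, which suppresses overfitting on majority classes while leaving minority classes enough room to learn. This improves minority-class performance and overall generalization compared with conventional flooding.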

Link: https://arxiv.org/abs/2508.18723
Authors: Hiroaki Aizawa, Yuta Naito, Kohei Fukuda
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACPR 2025

Abstract:The purpose of training neural networks is to achieve high generalization performance on unseen inputs. However, when trained on imbalanced datasets, a model’s prediction tends to favor majority classes over minority classes, leading to significant degradation in the recognition performance of minority classes. To address this issue, we propose class-wise flooding regularization, an extension of flooding regularization applied at the class level. Flooding is a regularization technique that mitigates overfitting by preventing the training loss from falling below a predefined threshold, known as the flooding level, thereby discouraging memorization. Our proposed method assigns a class-specific flooding level based on class frequencies. By doing so, it suppresses overfitting in majority classes while allowing sufficient learning for minority classes. We validate our approach on imbalanced image classification. Compared to conventional flooding regularizations, our method improves the classification performance of minority classes and achieves better overall generalization.
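Flooding replaces the training loss L with |L - b| + b, where b is the flooding level, so the loss cannot be driven below b. A class-wise variant can be sketched as follows; the rule that scales each class's level linearly with its frequency is an assumption, since the abstract only says the level is based on class frequencies.

```python
from collections import defaultdict


def flooding_levels(class_counts, max_level=0.1):
    """Assign a higher flooding level to more frequent classes (assumed rule)."""
    largest = max(class_counts.values())
    return {c: max_level * n / largest for c, n in class_counts.items()}


def classwise_flooding_loss(losses, labels, levels):
    """Apply |L_c - b_c| + b_c to the mean loss of each class in the batch."""
    per_class = defaultdict(list)
    for loss, label in zip(losses, labels):
        per_class[label].append(loss)
    total = 0.0
    for label, values in per_class.items():
        mean_loss = sum(values) / len(values)
        b = levels[label]
        total += abs(mean_loss - b) + b
    return total / len(per_class)
```

In this sketch a majority class whose mean loss falls below its level has the sign of its gradient flipped, while a minority class with a near-zero level keeps learning almost as usual.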

[CV-58] Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature

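[Quick read] This paper addresses the trade-off between efficiency and accuracy when using high-dimensional image features for multi-class classification, that is, how to compress features heavily while maintaining accuracy. The core is a unified framework combining statistical physics, coding theory, and algebraic topology. High-dimensional features from a frozen MobileNetV2 backbone are treated as spins on a sparse Multi-Edge Type quasi-cyclic LDPC (MET-QC-LDPC) graph, forming a Random-Bond Ising Model (RBIM) that is operated at its Nishimori temperature β_N to maximize class separability. A correspondence between local trapping sets in the code graph and topological invariants of the feature manifold (Betti numbers, bordism classes) guides the design of spherical and toroidal MET-QC-LDPC graph ensembles, with permanent bounds used to suppress harmful trapping sets. The result compresses 1280-dimensional features to 32 or 64 dimensions and reaches 98.7% accuracy on an ImageNet-10 subset and 82.7% on an ImageNet-100 subset.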

Link: https://arxiv.org/abs/2508.18717
Authors: V.S. Usatyuk, D.A. Sapoznikov, S.I. Egorov
Affiliations: T8 LLC; South-West State University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Algebraic Topology (math.AT)
Comments: 27 pages, 8 figures, 2 tables; presented at the 9th International Conference 'Deep Learning on Computational Physics (DLCP2025)'; currently under review for the Moscow University Physics Bulletin, Physics series

Abstract: We present a unified framework combining statistical physics, coding theory, and algebraic topology for efficient multi-class image classification. High-dimensional feature vectors from a frozen MobileNetV2 backbone are interpreted as spins on a sparse Multi-Edge Type quasi-cyclic LDPC (MET-QC-LDPC) graph, forming a Random-Bond Ising Model (RBIM). We operate this RBIM at its Nishimori temperature, β_N, where the smallest eigenvalue of the Bethe-Hessian matrix vanishes, maximizing class separability. Our theoretical contribution establishes a correspondence between local trapping sets in the code’s graph and topological invariants (Betti numbers, bordism classes) of the feature manifold. A practical algorithm estimates β_N efficiently with a quadratic interpolant and Newton correction, achieving a six-fold speed-up over bisection. Guided by topology, we design spherical and toroidal MET-QC-LDPC graph ensembles, using permanent bounds to suppress harmful trapping sets. This compresses 1280-dimensional features to 32 or 64 dimensions for ImageNet-10 and -100 subsets. Despite massive compression (40x fewer parameters), we achieve 98.7% accuracy on ImageNet-10 and 82.7% on ImageNet-100, demonstrating that topology-guided graph design yields highly efficient, physics-inspired embeddings with state-of-the-art performance.

[CV-59] Enhancing Video-Based Robot Failure Detection Using Task Knowledge

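[Quick read] This paper addresses how robots operating in complex real-world settings can reliably detect execution failures in order to trigger safe operation modes, recovery strategies, or task replanning; many existing failure detection methods perform poorly across varied real-world scenarios. The key is to use spatio-temporal knowledge available alongside the video, namely the actions the robot performs and the task-relevant objects in its field of view, both of which are available in most robotic scenarios. The authors add such annotations to parts of three datasets and propose a data augmentation method that applies variable frame rates to different parts of the video. On ARMBench, the F1 score improves from 77.9 to 80.0 without additional computational expense, and to 81.4 with test-time augmentation.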

Link: https://arxiv.org/abs/2508.18705
Authors: Santosh Thoduka, Sebastian Houben, Juergen Gall, Paul G. Plöger
Affiliations: Fraunhofer Institute for Intelligent Analysis and Information Systems; Hochschule Bonn-Rhein-Sieg; University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECMR 2025

Abstract:Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.
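The variable-frame-rate augmentation can be sketched as picking a different temporal stride for each part of the clip. The abstract does not say how the video is partitioned or which rates are used, so the equal-length segments and the stride choices below are assumptions.

```python
import random


def variable_rate_indices(num_frames, num_segments=3, strides=(1, 2, 4), seed=None):
    """Sample frame indices with a randomly chosen stride per segment."""
    rng = random.Random(seed)
    # Split the clip into roughly equal segments (assumed partitioning).
    boundaries = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        stride = rng.choice(strides)  # a larger stride means a lower effective frame rate
        indices.extend(range(start, end, stride))
    return indices
```

The returned indices stay in temporal order, so they can be used directly to subsample a list of frames before it is passed to a video model.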

[CV-60] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting

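[Quick read] This paper addresses high-fidelity reconstruction of deformable tissues from endoscopic videos, where existing methods struggle to capture subtle color variations and to model global deformations: 3D Gaussian Splatting (3DGS) assigns a fixed color per Gaussian, which handles intricate textures poorly, and linear deformation models fail to keep global motion consistent. The proposed ColorGS framework has two core components. Colored Gaussian Primitives use dynamic anchors with learnable color parameters to encode spatially varying textures adaptively, improving color expressiveness under complex lighting and tissue similarity. The Enhanced Deformation Model (EDM) combines time-aware Gaussian basis functions with learnable time-independent deformations to capture both localized tissue deformations and the global motion consistency induced by surgical interactions.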

Link: https://arxiv.org/abs/2508.18696
Authors: Qun Ji, Peng Li, Mingqiang Wei
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.

[CV-61] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

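[Quick read] This paper addresses key obstacles to deploying Human Activity Recognition (HAR) systems, including high computational cost, redundant features, and limited scalability in real-time scenarios. The core is an optimized hybrid deep learning framework: a customized InceptionV3 extracts multilevel spatial features, an LSTM models temporal dependencies across frames to capture motion dynamics, and an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) selects a compact, discriminative feature subset by dynamically balancing accuracy, redundancy, uniqueness, and complexity reduction. The method reduces the feature set to as few as 7 features, improves inference time, and supports real-time deployment on edge devices such as the Raspberry Pi for resource-constrained settings including public safety, assistive technology, and autonomous monitoring.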

Link: https://arxiv.org/abs/2508.18695
Authors: Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan
Affiliations: HITEC University; UEAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 25 figures, 11 tables

Abstract:Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results on the robust UCF-YouTube dataset, which presents challenges such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces features to as few as 7, and enhances inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

[CV-62] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

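[Quick read] This paper addresses the performance drop deep learning models suffer under domain shift, and specifically the limitations of current unsupervised domain adaptation (UDA) methods that fine-tune the whole feature extractor: inefficiency, reduced interpretability, and poor scalability to modern architectures. The key observation is that models pretrained on large-scale data exhibit domain-invariant geometric structure in feature space (intra-class clustering and inter-class separation), so domain shift shows up mainly as decision-boundary misalignment rather than feature degradation. The proposed Feature-space Planes Searcher (FPS) therefore adapts by optimizing decision boundaries while keeping the feature encoder frozen, which avoids feature distortion, substantially reduces memory and compute through offline feature extraction, and permits full-dataset optimization in a single computation cycle.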

Link: https://arxiv.org/abs/2508.18693
Authors: Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma
Affiliations: Harbin Institute of Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.

[CV-63] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos

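[Quick read] This paper addresses the insufficient accuracy of ejection fraction (EF) estimation from automated left ventricular endocardium segmentation in echocardiography videos. Existing methods segment well, yet their EF estimates remain biased, mainly because they lose local detail or capture too little global dynamic information. The key is a Hierarchical Spatio-temporal Segmentation Network: low-level convolutional stages process single frames and preserve detail, high-level Mamba stages model temporal dynamics, and a Spatio-temporal Cross Scan (STCS) module integrates long-range context by skip scanning across frames and positions, mitigating EF biases caused by ultrasound image noise and other factors.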

Link: https://arxiv.org/abs/2508.18681
Authors: Dongfang Wang, Jian Yang, Yizhe Zhang, Tao Zhou
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not translate into accurate EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
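For context, EF is conventionally computed from the end-diastolic volume (EDV) and end-systolic volume (ESV) as EF = (EDV - ESV) / EDV × 100%. The sketch below assumes per-frame left-ventricular volumes have already been derived from the segmentation masks (the abstract does not describe that step) and takes the largest and smallest volumes in the clip as EDV and ESV.

```python
def ejection_fraction(volumes):
    """Return EF in percent from a sequence of per-frame LV volumes."""
    edv = max(volumes)  # end-diastolic volume: ventricle at its fullest
    esv = min(volumes)  # end-systolic volume: ventricle at its emptiest
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv
```

Because EF depends only on the two extreme volumes, a segmentation error in a single end-diastolic or end-systolic frame shifts the estimate directly, which is one reason good average segmentation quality does not guarantee accurate EF.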

[CV-64] SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency Domain PRICAI2025

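[Quick read] This paper addresses two limitations of spatial-domain signal-to-noise ratio (SNR) priors in learning-based Underwater Image Enhancement (UIE): they cannot effectively separate cross-channel interference, and they offer limited help in amplifying informative structures while suppressing noise. The key is to move the SNR prior into the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. The Fourier Attention SNR-prior Transformer (FAST) combines spectral interactions with SNR cues to highlight key spectral components, and the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches with a gated attention mechanism to improve perceptual quality. The resulting SFormer integrates an RGB stream and an SNR-guided branch in a unified U-shaped architecture and, trained on 4,800 paired images from UIEB, EUVP, and LSUI, reports gains of 3.1 dB in PSNR and 0.08 in SSIM over recent methods.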

Link: https://arxiv.org/abs/2508.18664
Authors: Xin Tian, Yingtie Lei, Xiujun Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by PRICAI 2025

Abstract:Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.
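The amplitude/phase decomposition mentioned in the abstract can be illustrated on a 1D signal with a plain discrete Fourier transform. This is a generic sketch of the decomposition only; it uses a naive pure-Python DFT for self-containment and does not reproduce the FAST or FAT modules.

```python
import cmath


def dft(signal):
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(signal):
    """Return the amplitude and phase spectra of a real-valued signal."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge_amplitude_phase(amplitude, phase):
    """Rebuild the signal from (possibly modified) amplitude and phase spectra."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [value.real for value in idft(spectrum)]
```

Modulating the amplitude spectrum while leaving the phase untouched changes how strongly each frequency contributes without moving structures in the signal, which is the kind of per-component control a frequency-domain prior can exploit.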

[CV-65] Clustering-based Feature Representation Learning for Oracle Bone Inscriptions Detection

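[Quick read] This paper addresses the automated detection of Oracle Bone Inscriptions (OBIs) in rubbing images, a fundamental task in digital archaeology that is hampered by degradation such as noise and cracks, which limits conventional detection networks. The key is a clustering-based feature space representation learning method that uses the Oracle Bones Character (OBC) font library dataset as prior knowledge: a specialized loss derived from the clustering results optimizes the feature representation and is added to the total network loss. According to the authors, the method yields significant improvements with all three mainstream detection frameworks tested: Faster R-CNN, DETR, and Sparse R-CNN.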

Link: https://arxiv.org/abs/2508.18641
Authors: Ye Tao, Xinran Fu, Honglin Pang, Xi Yang, Chuntao Li
Affiliation: Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Oracle Bone Inscriptions (OBIs) play a crucial role in understanding ancient Chinese civilization. The automated detection of OBIs from rubbing images represents a fundamental yet challenging task in digital archaeology, primarily due to various degradation factors including noise and cracks that limit the effectiveness of conventional detection networks. To address these challenges, we propose a novel clustering-based feature space representation learning method. Our approach uniquely leverages the Oracle Bones Character (OBC) font library dataset as prior knowledge to enhance feature extraction in the detection network through clustering-based representation learning. The method incorporates a specialized loss function derived from clustering results to optimize feature representation, which is then integrated into the total network loss. We validate the effectiveness of our method by conducting experiments on two OBI detection datasets using three mainstream detection frameworks: Faster R-CNN, DETR, and Sparse R-CNN. Through extensive experimentation, all frameworks demonstrate significant performance improvements.

[CV-66] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

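[Quick read] This paper addresses the motion-detail imbalance common in video captioning, where models overemphasize either motion or detail and so produce incomplete, inconsistent captions. The solution has two parts. On the data side, the authors build the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline, Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). On the optimization side, they introduce the Caption Set Equivalence Reward (CSER), based on Group Relative Policy Optimization (GRPO), which improves completeness and accuracy through unit-to-set matching and bidirectional validation. The resulting OwlCap model improves over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1).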

Link: https://arxiv.org/abs/2508.18634
Authors: Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, Xiang Bai
Affiliations: 1. School of Information Science and Engineering, Central South University; 2. School of Computer Science and Engineering, Central South University; 3. State Key Laboratory of High Performance Computing, National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.
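The optimization side can be sketched in two pieces: a reward that compares a generated caption and a reference as sets of units in both directions, and the group-relative advantage that GRPO computes by standardizing rewards within a group of sampled captions. The reward below is only a stand-in: exact string matching and the F1-style combination are assumptions, and the paper's actual CSER matching and validation procedure is not described in the abstract.

```python
import statistics


def set_equivalence_reward(generated_units, reference_units):
    """Bidirectional set match (hypothetical stand-in for CSER)."""
    generated, reference = set(generated_units), set(reference_units)
    if not generated or not reference:
        return 0.0
    overlap = len(generated & reference)
    precision = overlap / len(generated)  # generated units supported by the reference
    recall = overlap / len(reference)     # reference units covered by the caption
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A caption that covers only motion or only detail scores low on recall in this sketch, so it receives a below-average advantage relative to more complete captions sampled for the same video.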

[CV-67] ROSE: Remove Objects with Side Effects in Videos

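[Quick read] This paper addresses video object removal in the presence of object side effects (shadows, reflections, light, translucency, and mirror effects), which are hard to eliminate because paired video data for supervision is scarce. The key is the ROSE framework. A fully automatic data pipeline built on a 3D rendering engine synthesizes a large-scale, diverse paired video dataset covering the five side-effect types. The model is a video inpainting model built on a diffusion transformer; the entire video is fed in for reference-based erasing, and additional supervision from the differential mask between paired videos lets the model explicitly predict the areas affected by side effects. The authors report better removal quality than existing video object erasing models and good generalization to real-world videos.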

Link: https://arxiv.org/abs/2508.18633
Authors: Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
Affiliations: Zhejiang University; KunByte AI; Peking University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object’s effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on various side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is this https URL.
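The differential mask between paired frames can be sketched as a per-pixel threshold on the absolute difference between a frame rendered with the object and the same frame rendered without it. The grayscale nested-list frames and the threshold value are assumptions for illustration; the abstract does not specify how the mask is computed.

```python
def differential_mask(frame_with_object, frame_without_object, threshold=0.05):
    """Mark pixels that change when the object is removed from the scene.

    Frames are 2D lists of grayscale intensities in [0, 1]. The mask covers the
    object itself and any side effects it causes, such as shadows or reflections.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_with_object, frame_without_object)
    ]
```

Since synthetic pairs differ only where the object or its side effects appear, such a mask gives a supervision target for the affected regions without manual annotation.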

[CV-68] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

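[Quick Read]: This paper addresses two key problems in existing fusion methods for multimodal cancer survival analysis. First, fixed fusion schemes (such as concatenation and attention) make the model over-reliant on predefined feature combinations, which limits the dynamic fusion of decoupled features. Second, in MoE-based (Mixture-of-Experts) fusion, each expert network processes decoupled features independently, which restricts information interaction among features. The key to the solution is a new Decoupling-Reorganization-Fusion framework (DeReF), whose core innovation is a random feature reorganization strategy inserted between modality decoupling and dynamic MoE fusion. This increases the diversity and granularity of feature combinations, improves the generalization of the expert networks, and breaks information closure so that the experts can better capture relations among decoupled features from different modalities. The method also embeds a regional cross-attention network in the modality decoupling module to improve the representation quality of the decoupled features.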

Link: https://arxiv.org/abs/2508.18632
Authors: Huayi Wang,Haochao Ying,Yuyang Xu,Qibo Qiu,Cheng Zhang,Danny Z. Chen,Ying Sun,Jian Wu
Affiliations: Zhejiang University; State Key Laboratory of Transvascular Implantation Devices of the Second Affiliated Hospital, Zhejiang University School of Medicine; Transvascular Implantation Devices Research Institute; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence; China Mobile (Zhejiang) Research & Innovation Institute; School of Public Health, Zhejiang University; Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy; Guangdong Provincial Clinical Research Center for Cancer; University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) Fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between modality decoupling and dynamic MoE fusion. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.
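
The abstract does not spell out how the random feature reorganization works. The sketch below is one hypothetical reading, offered only to make the idea concrete: each decoupled feature vector is cut into chunks, and the chunks are shuffled across vectors so that every expert input mixes pieces from several sources. The function `reorganize` and the parameter `n_chunks` are assumptions, not the authors' implementation.

```python
import numpy as np

def reorganize(features, rng, n_chunks=4):
    """Randomly regroup chunks of several equal-length 1-D feature vectors.

    Hypothetical illustration only: the paper's actual strategy may differ.
    Each vector length must be divisible by n_chunks.
    """
    # Cut every feature vector into n_chunks pieces and pool all pieces.
    chunks = [c for f in features for c in np.split(np.asarray(f), n_chunks)]
    order = rng.permutation(len(chunks))
    shuffled = [chunks[i] for i in order]
    # Reassemble the same number of vectors, each now mixing several sources.
    return [np.concatenate(shuffled[i:i + n_chunks])
            for i in range(0, len(shuffled), n_chunks)]

decoupled = [np.arange(0.0, 8.0), np.arange(8.0, 16.0), np.arange(16.0, 24.0)]
mixed = reorganize(decoupled, np.random.default_rng(0))
```

Under this reading no feature value is lost or duplicated; only the grouping that each expert sees changes from one draw to the next.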

[CV-69] Uncertainty Awareness on Unsupervised Domain Adaptation for Time Series Data

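[Quick Read]: This paper addresses the performance drop of Unsupervised Domain Adaptation (UDA) on time series data caused by distribution shift, where a model performs well on the training domain but generalizes poorly to the unlabeled target domain. The key to the solution is a framework that combines multi-scale feature extraction with uncertainty awareness. First, a multi-scale mixed input architecture increases training diversity and reduces the feature discrepancy between source and target domains. Second, an uncertainty awareness mechanism based on evidential learning imposes a Dirichlet prior on the labels to perform target prediction and uncertainty estimation together. This aligns features with the same labels across domains and improves the calibration of prediction confidence, which significantly improves target-domain performance and lowers the Expected Calibration Error (ECE).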

Link: https://arxiv.org/abs/2508.18630
Authors: Weide Liu,Xiaoyang Zhong,Lu Wang,Jingwen Hou,Yuemei Luo,Jiebin Yan,Yuming Fang
Affiliations: Nanyang Technological University; Jiangxi University of Finance and Economics; Institute for Infocomm Research (I2R); Nanjing University of Information Science and Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Transactions on Multimedia

Abstract:Unsupervised domain adaptation methods seek to generalize effectively on unlabeled test data, especially when encountering the common challenge in time series data that distribution shifts occur between training and testing datasets. In this paper, we propose incorporating multi-scale feature extraction and uncertainty estimation to improve the model’s generalization and robustness across domains. Our approach begins with a multi-scale mixed input architecture that captures features at different scales, increasing training diversity and reducing feature discrepancies between the training and testing domains. Based on the mixed input architecture, we further introduce an uncertainty awareness mechanism based on evidential learning by imposing a Dirichlet prior on the labels to facilitate both target prediction and uncertainty estimation. The uncertainty awareness mechanism enhances domain adaptation by aligning features with the same labels across different domains, which leads to significant performance improvements in the target domain. Additionally, our uncertainty-aware model demonstrates a much lower Expected Calibration Error (ECE), indicating better-calibrated prediction confidence. Our experimental results show that this combined approach of mixed input architecture with the uncertainty awareness mechanism achieves state-of-the-art performance across multiple benchmark datasets, underscoring its effectiveness in unsupervised domain adaptation for time series data.
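
Two quantities in this abstract can be made concrete with a short sketch: the uncertainty derived from a Dirichlet distribution over class probabilities, and the Expected Calibration Error. The code below uses the common evidential-learning convention (alpha = evidence + 1, uncertainty = K / sum(alpha)) and the standard equal-width-bin ECE. Whether the paper uses exactly these forms is an assumption, and the function names are illustrative.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Expected class probabilities and uncertainty from non-negative evidence.

    Assumes the common convention alpha = evidence + 1 and u = K / sum(alpha).
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0
    strength = alpha.sum(axis=-1, keepdims=True)
    probs = alpha / strength
    uncertainty = evidence.shape[-1] / strength[..., 0]
    return probs, uncertainty

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# No evidence gives maximal uncertainty; strong evidence for one class lowers it.
probs, unc = dirichlet_uncertainty([[0, 0, 0], [9, 0, 0]])
# Four predictions at 0.9 confidence with three correct: accuracy 0.75.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower ECE means the reported confidence tracks the observed accuracy more closely, which is the sense in which the abstract describes the model as better calibrated.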

[CV-70] Wan-S2V: Audio-Driven Cinematic Video Generation

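[Quick Read]: This paper addresses the shortcomings of current audio-driven character animation in complex film and television production, such as insufficiently nuanced character interactions, unrealistic body movements, and the difficulty of producing dynamic camera work. The key to the solution is Wan-S2V, a new audio-driven model built on the Wan architecture that significantly improves the expressiveness and fidelity of animation in cinematic contexts. Experiments show that it outperforms cutting-edge methods such as Hunyuan-Avatar and Omnihuman, and that it generalizes well to scenarios such as long-form video generation and precise lip-sync editing.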

Link: https://arxiv.org/abs/2508.18621
Authors: Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo
Affiliations: Tongyi Lab; Alibaba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

[CV-71] SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis

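[Quick Read]: This paper addresses the difficulty existing methods have in conditioning on architectural constraints (such as room layout and door and window positions) when generating diverse, spatially consistent 3D indoor scenes. The key to the solution is SemLayoutDiff, which uses a scene layout representation combining a top-down semantic map with object attributes, together with a categorical diffusion model that can explicitly condition scene synthesis on room masks. A cross-attention-based network then predicts furniture placements that respect the generated layout and keep clear of architectural elements such as doors and windows, yielding more plausible, varied, and spatially coherent 3D indoor scenes.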

Link: https://arxiv.org/abs/2508.18597
Authors: Xiaohao Sun,Divyam Goel,Angel X. Chang
Affiliations: Simon Fraser University; CMU; Alberta Machine Intelligence Institute
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation combining a top-down semantic map and attributes for each object. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map, followed by a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.

[CV-72] Adaptive Visual Navigation Assistant in 3D RPGs

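[Quick Read]: This paper addresses the problem of automatically identifying traversable Spatial Transition Points (STPs) in complex 3D game environments and selecting from them the unique Main STP (MSTP), which lies on the designer-intended critical path toward the player's current macro-objective. The task matters for client-side auto-mapping and for objectively evaluating how well map cues are presented. The key to the solution is a two-stage deep-learning pipeline: the first stage detects candidate STPs with Faster R-CNN, and the second stage ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages use parameter-efficient adapters for transfer learning, and an optional retrieval-augmented fusion step is introduced. Experiments show that with limited data, adapter-only transfer is more robust and effective than full-network fine-tuning, especially for MSTP selection.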

Link: https://arxiv.org/abs/2508.18539
Authors: Kaijie Xu,Clark Verbrugge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping, and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), connectors between two sub-regions, and selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player’s current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.
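
The detect-then-rank structure of the pipeline can be sketched in a few lines. In the sketch below the scores are hard-coded stand-ins for the outputs of the Faster R-CNN detector and the MSTP selector. The `Candidate` class, the `select_mstp` function, and the `det_threshold` value are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    box: tuple         # (x1, y1, x2, y2) in pixels
    det_score: float   # stand-in for the stage-1 detector confidence
    rank_score: float  # stand-in for the stage-2 selector score

def select_mstp(candidates, det_threshold=0.5):
    """Keep confident detections, then return the top-ranked one (or None)."""
    kept = [c for c in candidates if c.det_score >= det_threshold]
    if not kept:
        return None
    return max(kept, key=lambda c: c.rank_score)

frame_candidates = [
    Candidate((10, 20, 60, 120), det_score=0.92, rank_score=0.30),
    Candidate((200, 40, 260, 150), det_score=0.81, rank_score=0.75),
    Candidate((400, 60, 430, 110), det_score=0.20, rank_score=0.99),  # dropped in stage 1
]
mstp = select_mstp(frame_candidates)
```

The third candidate shows why the two stages are kept separate: a high ranking score does not help a candidate that the detector did not accept.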

[CV-73] SAT-SKYLINES: 3D Building Generation from Satellite Imagery and Coarse Geometric Priors

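[Quick Read]: This paper addresses two limitations: existing image-based 3D building generation methods struggle to recover accurate building structures from the top-down view of satellite imagery alone, and conventional 3D detailization methods depend heavily on highly detailed voxel inputs and fail to produce satisfying results from simple geometric priors such as cuboids. The key to the solution is to model the mapping from interpolated noisy coarse priors to detailed geometry, which enables flexible geometric control without additional computational cost.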

Link: https://arxiv.org/abs/2508.18531
Authors: Zhangyu Jin,Andrew Feng
Affiliations: University of Southern California, Institute for Creative Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present SatSkylines, a 3D building generation approach that takes satellite imagery and coarse geometric priors. Without proper geometric guidance, existing image-based 3D generation methods struggle to recover accurate building structures from the top-down views of satellite images alone. On the other hand, 3D detailization methods tend to rely heavily on highly detailed voxel inputs and fail to produce satisfying results from simple priors such as cuboids. To address these issues, our key idea is to model the transformation from interpolated noisy coarse priors to detailed geometries, enabling flexible geometric control without additional computational cost. We have further developed Skylines-50K, a large-scale dataset of over 50,000 unique and stylized 3D building assets, to support the generation of detailed building models. Extensive evaluations indicate the effectiveness of our model and its strong generalization ability.

[CV-74] Controllable Single-shot Animation Blending with Temporal Conditioning ICCV2025

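[Quick Read]: This paper addresses the lack of explicit controllability in single-shot motion generation models for seamlessly blending multiple motions within a single generative pass. Existing methods can generate variations from a single skeletal motion sequence but cannot controllably blend several motions in one pass. The key to the solution is the first single-shot motion blending framework, which guides generation through temporal conditioning and introduces a skeleton-aware normalization mechanism. This gives smooth, data-driven control over motion transitions, so that blends across different animation styles and kinematic skeletons are both natural and controllable.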

Link: https://arxiv.org/abs/2508.18525
Authors: Eleni Tselepi,Spyridon Thermos,Gerasimos Potamianos
Affiliations: ECE Dept., Univ. of Thessaly; Moverse
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the AI for Visual Arts Workshop at ICCV 2025

Abstract:Training a generative model on a single human skeletal motion sequence without being bound to a specific kinematic tree has drawn significant attention from the animation community. Unlike text-to-motion generation, single-shot models allow animators to controllably generate variations of existing motion patterns without requiring additional data or extensive retraining. However, existing single-shot methods do not explicitly offer a controllable framework for blending two or more motions within a single generative pass. In this paper, we present the first single-shot motion blending framework that enables seamless blending by temporally conditioning the generation process. Our method introduces a skeleton-aware normalization mechanism to guide the transition between motions, allowing smooth, data-driven control over when and how motions blend. We perform extensive quantitative and qualitative evaluations across various animation styles and different kinematic skeletons, demonstrating that our approach produces plausible, smooth, and controllable motion blends in a unified and efficient manner.

[CV-75] DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance

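[Quick Read]: This paper addresses the scarcity of annotated data for 3D scene flow estimation in autonomous driving: large-scale, manually annotated LiDAR datasets are hard to obtain, so supervised methods do not scale, while existing self-supervised methods underperform in challenging long-range and adverse-weather scenarios. The key to the solution is DoGFlow, a framework with a cross-modal label transfer mechanism that computes motion pseudo-labels in real time from 4D radar Doppler measurements and transfers them to the LiDAR domain through dynamic-aware association and ambiguity-resolved propagation. This allows a high-performing LiDAR scene flow model to be trained without any manual annotation.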

Link: https://arxiv.org/abs/2508.18506
Authors: Ajinkya Khoche,Qingwen Zhang,Yixi Cai,Sina Sharif Mansouri,Patric Jensfelt
Affiliations: KTH Royal Institute of Technology; Scania Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit this https URL
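
A Doppler measurement captures only the component of a point's velocity along the line of sight from the sensor. The sketch below shows that general geometric fact, which is why recovering full 3D motion from Doppler needs extra reasoning. It is not DoGFlow's algorithm, and the function name `radial_velocity` is an illustrative assumption.

```python
import numpy as np

def radial_velocity(points, velocities):
    """Project each 3D velocity onto the unit vector from the sensor to the point.

    The sensor is assumed to sit at the origin; inputs have shape (N, 3).
    """
    points = np.asarray(points, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    directions = points / np.linalg.norm(points, axis=1, keepdims=True)
    return np.sum(velocities * directions, axis=1)

points = [[10.0, 0.0, 0.0],   # object straight ahead
          [0.0, 10.0, 0.0]]   # object to the side
velocities = [[2.0, 0.0, 0.0],
              [2.0, 0.0, 0.0]]  # both move at 2 m/s along x
v_r = radial_velocity(points, velocities)
```

The second object moves purely tangentially, so its Doppler reading is zero even though it is moving. This ambiguity is what a pseudo-labelling scheme built on Doppler has to resolve.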

[CV-76] Impact of Target and Tool Visualization on Depth Perception and Usability in Optical See-Through AR

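[Quick Read]: This paper addresses inaccurate depth perception of virtual holograms and the occlusion of real instruments in Optical See-Through Augmented Reality (OST-AR) systems used for arm's-distance tasks such as surgery. The study finds that the key is to render virtual content opaque and to occlude it correctly with real tools in real time, which significantly improves depth estimation accuracy and system usability. If reliable tool tracking is unavailable, high transparency should be used cautiously to balance depth perception against tool visibility, because excessive transparency weakens depth cues.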

Link: https://arxiv.org/abs/2508.18481
Authors: Yue Yang,Xue Xie,Xinkai Wang,Hui Zhang,Chiming Yu,Xiaoxian Xiong,Lifeng Zhu,Yuanyi Zheng,Jue Cen,Bruce Daniel,Fred Baik
Affiliations: Stanford University; SJTU; SEU; HUST
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm’s distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm’s length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions (2 × 3: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm’s distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.
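
The six visualization conditions of Experiment 2 are the full crossing of the two factors named in the abstract. A short enumeration makes the design explicit. The variable names are illustrative, and the ordering of conditions here is arbitrary rather than the study's presentation order.

```python
from itertools import product

TARGET_TRANSPARENCY = ("high transparency", "low transparency")
TOOL_MODE = ("virtual tool hologram", "real tool", "no tool tracking")

# Full factorial crossing: every transparency level with every tool mode.
conditions = list(product(TARGET_TRANSPARENCY, TOOL_MODE))
for index, (transparency, tool) in enumerate(conditions, start=1):
    print(f"Condition {index}: target = {transparency}, tool = {tool}")
```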

[CV-77] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

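[Quick Read]: This paper addresses anomaly detection in surveillance video, where anomalous events are unpredictable and context-dependent and conventional methods struggle because anomaly examples are unavailable for training. The key to the solution is a context-aware zero-shot anomaly detection framework with a hybrid architecture that combines TimeSformer (to extract spatiotemporal features), DPC (to forecast future representations and identify temporal deviations), and CLIP (for concept-level semantic anomaly detection through text prompts). The components are trained jointly with InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations, and a context-gating mechanism modulates predictions using scene-aware cues or global video features. The method generalizes to previously unseen behaviors and bridges temporal reasoning and semantic context.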

Link: https://arxiv.org/abs/2508.18463
Authors: Md. Rashid Shahriar Khan,Md. Abrar Hasan,Mohammod Tareq Aziz Justice
Affiliations: BRAC University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures, 4 tables

Abstract:Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at this https URL.
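
The InfoNCE loss named in the abstract can be sketched generically: each query embedding should score highest against its own key and lower against the other keys in the batch. The code below is a plain NumPy version with cosine similarity and a temperature of 0.07. The temperature and the batch layout are assumptions, not settings reported in the paper.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))

embeddings = np.eye(4)
loss_aligned = info_nce(embeddings, embeddings)
loss_mismatched = info_nce(embeddings, np.roll(embeddings, 1, axis=0))
```

When every query matches its own key the loss is close to zero, and it grows sharply once the pairs are mismatched, which is the behaviour a contrastive objective relies on.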

[CV-78] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results ICCV2025

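[Quick Read]: This paper addresses the degradation of real-world face images by noise, blur, and compression artifacts, which lowers image quality and hinders downstream tasks. The key to the solution was organizing the VQualA 2025 Challenge, in which participants developed lightweight, efficient Face Image Quality Assessment (FIQA) models (limited to 0.5 GFLOPs and 5 million parameters) to predict the Mean Opinion Score (MOS) of face images with arbitrary resolutions and realistic degradations. Submissions were evaluated with correlation metrics on a dataset of in-the-wild face images.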

Link: https://arxiv.org/abs/2508.18445
Authors: Sizhuo Ma,Wei-Ting Chen,Qiang Gao,Jian Wang,Chris Wei Zhou,Wei Sun,Weixia Zhang,Linhan Cao,Jun Jia,Xiangyang Zhu,Dandan Zhu,Xiongkuo Min,Guangtao Zhai,Baoying Chen,Xiongwei Xiao,Jishen Zeng,Wei Wu,Tiexuan Lou,Yuchen Tan,Chunyi Song,Zhiwei Xu,MohammadAli Hamidi,Hadi Amirpour,Mingyin Bai,Jiawang Du,Zhenyu Jiang,Zilong Lu,Ziguan Cui,Zongliang Gan,Xinpeng Li,Shiqi Jiang,Chenhui Li,Changbo Wang,Weijun Yuan,Zhan Li,Yihang Chen,Yifan Deng,Ruting Deng,Zhanglu Chen,Boyang Yao,Shuling Zheng,Feng Zhang,Zhiheng Fu,Abhishek Joshi,Aman Agarwal,Rakhil Immidisetti,Ajay Narasimha Mopidevi,Vishwajeet Shukla,Hao Yang,Ruikun Zhang,Liyuan Pan,Kaixin Deng,Hang Ouyang,Fan yang,Zhizun Luo,Zhuohang Shi,Songning Lai,Weilin Ruan,Yutao Yue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 VQualA workshop FIQA track

Abstract:Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
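
The abstract mentions correlation metrics and a complexity budget without naming the exact metrics. The sketch below assumes the two that are customary in quality assessment, Pearson (PLCC) and Spearman rank (SRCC) correlation, and adds a check against the stated limits of 0.5 GFLOPs and 5 million parameters. The function names are illustrative, and the rank helper ignores ties for brevity.

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(values):
    """Zero-based ranks; ties are not averaged in this simplified helper."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result

def spearman(x, y):
    """Spearman rank correlation (SRCC) as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def within_budget(gflops, params, max_gflops=0.5, max_params=5_000_000):
    """Check a model against the challenge limits stated in the abstract."""
    return gflops <= max_gflops and params <= max_params

mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic but not linear in MOS
srcc = spearman(mos, predicted)
plcc = pearson(mos, predicted)
```

A monotonic but non-linear predictor gets a perfect rank correlation while its linear correlation stays below one, which is why the two metrics are usually reported together.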

[CV-79] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

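[Quick Read]: This paper addresses two challenges that Vision-Language Models (VLMs) face in dermatological visual question answering (VQA): general-purpose models lack diagnostic accuracy in a specialized domain, and their large size makes inference too costly for clinical deployment. The key to the solution is CLARIFY, a Specialist-Generalist framework that pairs a lightweight, domain-trained image classifier (the Specialist) with a powerful but compressed conversational VLM (the Generalist). The Specialist's predictions directly guide the Generalist's reasoning, and a knowledge graph-based retrieval module grounds the generated answers in dermatological facts. This hierarchical design improves diagnostic accuracy by 18% over the strongest baseline and reduces average VRAM use and latency by at least 20% and 5%, respectively.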

Link: https://arxiv.org/abs/2508.18430
Authors: Aranya Saha,Tanvir Ahmed Khan,Ismam Nur Swapnil,Mohammad Ariful Haque
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, Prepared for submission to IEEE Transactions on Human-Machine Systems

Abstract:Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist’s predictions directly guide the Generalist’s reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist’s responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.

[CV-80] LPLC: A Dataset for License Plate Legibility Classification

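[Quick Read]: This paper addresses the performance drop of Automatic License Plate Recognition (ALPR) systems on blurry or illegible license plates (LPs). Techniques such as super-resolution (SR) can improve image quality, but they do not answer the underlying question of when enhancement is needed and when a plate cannot be recovered. The key to the solution is the finely annotated LPLC dataset, which contains 10,210 vehicle images and 12,687 annotated LPs across four legibility categories (perfect, good, poor, and illegible), with vehicle- and LP-level occlusion annotations and character labels (excluding illegible samples). The dataset supports a three-class benchmark: deciding whether an LP image can be recognized directly, requires SR enhancement, or is completely unrecoverable. With three mainstream image recognition models (ViT, ResNet, and YOLO), the overall F1 score stays below 80%, which highlights the difficulty of the task and the need for further research.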

Link: https://arxiv.org/abs/2508.18425
Authors: Lucas Wojcik,Gabriel E. Lima,Valfride Nascimento,Eduil Nascimento Jr.,Rayson Laroca,David Menotti
Affiliations: Federal University of Paraná; Paraná Military Police; Pontifical Catholic University of Paraná
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2025

Abstract:Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at this https URL.
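
Applying pre-processing selectively amounts to a routing decision per plate. The abstract lists four legibility categories and three benchmark outcomes but does not state how one maps to the other, so the mapping below is an assumption made for illustration, as are the names `Action`, `ASSUMED_ROUTING`, and `route`.

```python
from enum import Enum

class Action(Enum):
    RECOGNIZE = "recognize directly"
    SUPER_RESOLVE = "apply super-resolution first"
    DISCARD = "treat as unrecoverable"

# Assumed mapping from the four annotated categories to the three outcomes.
ASSUMED_ROUTING = {
    "perfect": Action.RECOGNIZE,
    "good": Action.RECOGNIZE,
    "poor": Action.SUPER_RESOLVE,
    "illegible": Action.DISCARD,
}

def route(legibility_label):
    """Return the processing step for a plate with the given legibility label."""
    return ASSUMED_ROUTING[legibility_label]

for label in ("perfect", "good", "poor", "illegible"):
    print(f"{label}: {route(label).value}")
```

In a deployed system the label would come from a legibility classifier such as the baselines in the paper, so that super-resolution is spent only on plates that can benefit from it.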

[CV-81] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?

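[Quick Read]: This paper addresses the limitations of current vision Foundation Models (FMs) on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations, such as fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. The key to the solution is to introduce explicit relational interfaces in the form of dynamic relational graphs, whose topology and edge semantics are inferred from the input and task context. Augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency without a significant increase in compute, and sparse reasoning over semantic nodes makes efficient use of memory and hardware, supporting practical deployment.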

Link: https://arxiv.org/abs/2508.18421
Authors: Fatemeh Ziaeetabar
Affiliations: University of Tehran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM-graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part-object-scene in activity understanding, or region-organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

[CV-82] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems

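[Quick Read]: This paper addresses the privacy and security risks of using biometrics in humanitarian and emergency settings, where conventional handling of biometric data can lead to serious consequences such as data leakage and identity linkage, especially for vulnerable populations. The key to the solution is to identify and evaluate a Biometric Template Protection (BTP) method suited to mobile environments: PolyProtect, which operates on neural network face embeddings and offers high accuracy, irreversibility, unlinkability, and a light computational burden. Experiments show that PolyProtect performs well on a dataset from a real humanitarian project and extends to the fingerprint modality, offering a feasible path for edge deployment.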

Link: https://arxiv.org/abs/2508.18415
Authors: Giuseppe Stragapede,Sam Merrick,Vedrana Krivokuća Hahn,Justin Sukaitis,Vincent Graf Narbel
Affiliations: Simprints; Idiap Research Institute; International Committee of the Red Cross
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code.

[CV-83] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

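[Quick read]: This paper addresses the trade-off between speed and identity fidelity in existing 3D Gaussian Splatting (3DGS) face modeling methods, in particular the challenge of generating a high-quality, pose-invariant 3DGS face model from a single face image in an arbitrary pose. The key to the solution is FastAvatar, a feed-forward framework with an encoder-decoder structure. It first builds a generic 3DGS face "template" model learned from multi-view training data. It then encodes the input image into an identity-specific, pose-invariant latent embedding and decodes that embedding into residual predictions for the structural and appearance parameters of each Gaussian in the template. Because it only infers residuals in a forward pass rather than optimizing per face, the method reconstructs in near-instant time (10 ms) while preserving reconstruction quality and identity consistency. It significantly outperforms existing feed-forward methods (e.g., GAGAvatar), runs 1000x faster than per-face optimization methods, and supports real-time identity interpolation and attribute editing.

Link: https://arxiv.org/abs/2508.18389
Authors: Hao Liang, Zhixuan Ge, Ashish Tiwari, Soumendu Majee, G.M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan
Affiliations: Rice University; Samsung Research America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures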

Abstract:We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face "template" model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar’s novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar’s combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.

[CV-84] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

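[Quick read]: This paper addresses the poor representation quality, interpretability, and robustness in multimodal sentiment analysis (MSA) that result from neglecting modality-specific structural dependencies and semantic misalignment. The key to the solution is a new framework called the Structural-Semantic Unifier (SSU). First, it dynamically constructs modality-specific graphs (using linguistic syntax for text and a lightweight, text-guided attention mechanism for the acoustic and visual modalities) to capture fine-grained intra-modal relationships. Second, it introduces a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub to harmonize heterogeneous semantic spaces. Finally, it uses a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views, yielding more accurate, efficient, and interpretable sentiment inference.

Link: https://arxiv.org/abs/2508.18322
Authors: Jiangfeng Sun, Sihao He, Zhonghong Ou, Meina Song
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, conference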

Abstract:Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

[CV-85] Automated Landfill Detection Using Deep Learning: A Comparative Study of Lightweight and Custom Architectures with the AerialWaste Dataset

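[Quick read]: This paper addresses the difficulty of identifying illegal landfills efficiently by manual means, with the goal of reducing their potential harm to human health and the environment. The work uses deep learning to detect illegal landfills automatically in remote sensing imagery from multiple sources, and aims to overcome the performance bottlenecks caused by uneven data quality and model overfitting. The core of the solution is the use of lightweight deep learning models (such as MobileNetV2, GoogLeNet, DenseNet, and MobileViT), which overfit less than heavier models on the AerialWaste dataset. The best-performing models are then combined through ensemble learning, reaching 92.33% accuracy, 92.67% precision, 92.33% sensitivity, a 92.41% F1 score, and 92.71% specificity on binary classification.

Link: https://arxiv.org/abs/2508.18315
Authors: Nowshin Sharmily, Rusab Sarmun, Muhammad E. H. Chowdhury, Mir Hamidul Hussain, Saad Bin Abul Kashem, Molla E Majid, Amith Khandakar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: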

Abstract:Illegal landfills are posing as a hazardous threat to people all over the world. Due to the arduous nature of manually identifying the location of landfill, many landfills go unnoticed by authorities and later cause dangerous harm to people and environment. Deep learning can play a significant role in identifying these landfills while saving valuable time, manpower and resources. Despite being a burning concern, good quality publicly released datasets for illegal landfill detection are hard to find due to security concerns. However, AerialWaste Dataset is a large collection of 10434 images of Lombardy region of Italy. The images are of varying qualities, collected from three different sources: AGEA Orthophotos, WorldView-3, and Google Earth. The dataset contains professionally curated, diverse and high-quality images which makes it particularly suitable for scalable and impactful research. As we trained several models to compare results, we found complex and heavy models to be prone to overfitting and memorizing training data instead of learning patterns. Therefore, we chose lightweight simpler models which could leverage general features from the dataset. In this study, Mobilenetv2, Googlenet, Densenet, MobileVit and other lightweight deep learning models were used to train and validate the dataset as they achieved significant success with less overfitting. As we saw substantial improvement in the performance using some of these models, we combined the best performing models and came up with an ensemble model. With the help of ensemble and fusion technique, binary classification could be performed on this dataset with 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score and 92.71% specificity.
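
The five figures quoted in the abstract are standard functions of a binary confusion matrix. As a minimal sketch of those definitions (the counts below are hypothetical and are not the paper's data, and the paper may report class-averaged variants instead):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return accuracy, precision, sensitivity, specificity and F1
    computed from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting specificity alongside sensitivity matters here because a landfill detector that flags everything would score well on sensitivity alone.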

[CV-86] SERES: Semantic-aware neural reconstruction from sparse views

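[Quick read]: This paper addresses the radiance ambiguity caused by mismatched features when reconstructing high-fidelity 3D models from sparse images. Its core solution is to enrich the neural implicit representation with patch-based semantic logits that are optimized jointly with the signed distance field and the radiance field, making the reconstruction semantic-aware. It also introduces a new regularization based on geometric primitive masks to mitigate shape ambiguity, which substantially improves reconstruction accuracy.

Link: https://arxiv.org/abs/2508.18314
Authors: Bo Xu, Yuhu Guo, Yuchao Wang, Wenting Wang, Yeung Yam, Charlie C.L. Wang, Xinyi Le
Affiliations: Shanghai Jiao Tong University; The University of Manchester; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: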

Abstract:We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on the geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average chamfer distances of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for those dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68% respectively.
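
The improvements above are reported in Chamfer distance, which measures how far apart two point sets are. The following is a minimal NumPy sketch of the generic symmetric definition. It is an illustration only: the DTU evaluation protocol adds details such as masking and distance thresholds that are not reproduced here, and the point sets are made up.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    the mean nearest-neighbour distance from a to b, averaged with the
    mean nearest-neighbour distance from b to a."""
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1).mean()
    b_to_a = dists.min(axis=0).mean()
    return float(0.5 * (a_to_b + b_to_a))


# Made-up point sets: a "reconstruction" and a "reference".
reconstruction = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
print(chamfer_distance(reconstruction, reference))
```

The dense pairwise matrix is fine for small examples; for real scans a spatial index (for instance a k-d tree) would be the usual choice.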

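[CV-87] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection

[Quick read]: This paper addresses the challenge of automatically detecting brain tumors in magnetic resonance imaging (MRI). Manual analysis is slow and error-prone, and existing methods generalize poorly to heterogeneous tumors, are computationally inefficient, and lack interpretability, which limits clinical trust. The key to the solution is MobileDenseAttn, a dual-stream fusion model that combines the feature extraction of MobileNetV2 and DenseNet201 through feature-level fusion. According to the abstract, it reaches 98.35% test accuracy and cuts training time by 39.3% relative to VGG19, and it uses GradCAM heatmaps to make its predictions interpretable.

Link: https://arxiv.org/abs/2508.18294
Authors: Shudipta Banik, Muna Das, Trapa Banik, Md. Ehsanul Haque
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted at ICCIT 2025, Cox's Bazar, Bangladesh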

Abstract:The detection of brain tumor in MRI is an important aspect of ensuring timely diagnostics and treatment; however, manual analysis is commonly long and error-prone. Current approaches are not universal because they have limited generalization to heterogeneous tumors, are computationally inefficient, are not interpretable, and lack transparency, thus limiting trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a fusion model of dual streams of MobileNetV2 and DenseNet201 that can help gradually improve the feature representation scale, computing efficiency, and visual explanations via GradCAM. Our model uses feature level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn provides a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). The extensive validation shows the stability of the model, and the comparative analysis proves that it is a great advancement over the baseline models (VGG19, DenseNet201, MobileNetV2) with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly show tumor-affected areas, offering clinically significant localization and improving interpretability. These findings position MobileDenseAttn as an efficient, high performance, interpretable model with a high probability of becoming a clinically practical tool in identifying brain tumors in the real world.

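[CV-88] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches

[Quick read]: This paper addresses the bottleneck that scarce annotated real-world sonar data creates for training deep learning models for underwater 3D object detection. Deep learning methods perform well in terrestrial settings, but underwater they are limited by the harsh acoustic environment and by training data that is scarce and expensive to annotate. The key to the solution is two detection paradigms that need no real-world training data: a physics-based sonar simulation pipeline that generates synthetic data for training state-of-the-art neural networks, and a model-based template matching system that exploits geometric priors of the target objects. Experiments show that the neural networks reach 98% mean Average Precision (mAP) on simulated scenes but drop to 40% mAP on real sonar data because of domain shift, whereas template matching maintains 83% mAP on real data without any training, showing robustness to acoustic noise and environmental variation. This offers a new route to autonomous underwater navigation and monitoring in data-scarce settings.

Link: https://arxiv.org/abs/2508.18293
Authors: M. Salman Shaukat, Yannik Käckenmeister, Sebastian Bader, Thomas Kirste
Affiliations: University of Rostock
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 12 pages, 7 figures, submitted to IEEE Journal of Oceanic Engineering (IEEE-JOE)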

Abstract:Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.

[CV-89] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

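[Quick read]: This paper addresses the trade-off between high fidelity and realistic generation faced by existing sRGB-domain diffusion models for image restoration. These models process lossy sRGB inputs and overlook the fact that sensor RAW images are accessible in many scenarios, such as image and video capture on edge devices, so the conventional two-stage image signal processing (ISP) + image restoration (IR) pipeline limits performance. The key to the solution is an end-to-end RAW domain diffusion model (RDDM). It introduces a RAW-domain VAE (RVAE) to learn optimal latent representations and a differentiable Post Tone Processing (PTP) module that enables joint optimization in RAW and sRGB space. To compensate for the lack of data, it builds a scalable degradation pipeline that synthesizes RAW low-quality/high-quality (LQ-HQ) pairs from existing sRGB datasets, and it uses a configurable multi-bayer (CMB) LoRA module to handle diverse RAW patterns (such as RGGB and BGGR). The result is higher-fidelity restoration with fewer artifacts.

Link: https://arxiv.org/abs/2508.19154
Authors: Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen
Affiliations: Huawei Noah’s Ark Lab; Max Planck Institute for Informatics
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: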

Abstract:We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. As these models process lossy sRGB inputs and neglect the accessibility of the sensor RAW images in many scenarios, e.g., in image and video capturing in edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts the out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.

[CV-90] Random forest-based out-of-distribution detection for robust lung cancer segmentation

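[Quick read]: This paper addresses the problem that Transformer-based medical image segmentation models perform well on in-distribution (ID) data but degrade markedly on out-of-distribution (OOD) data, which limits their reliability in real clinical settings. The key to the solution is RF-Deep, a random forest (RF) classifier that uses deep features from the segmentation model's pretrained Transformer encoder to detect OOD scans and improve segmentation reliability. The features come from a Swin Transformer encoder pretrained on a large set of unlabeled 3D CT scans. On the OOD test sets (chest CTs with pulmonary embolism and COVID-19, and abdominal CTs with kidney cancer and healthy volunteers), RF-Deep reaches FPR95 values of 18.26%, 27.66%, and below 0.1% on the pulmonary embolism, COVID-19, and abdominal CTs respectively, improving the robustness of cancer segmentation in both ID and OOD scenarios.

Link: https://arxiv.org/abs/2508.19112
Authors: Aneesh Rangnekar, Harini Veeraraghavan
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: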

Abstract:Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.
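
FPR95, the metric quoted above, is the false positive rate at the operating point where 95% of one class is retained. The sketch below assumes one common convention: a higher score means "more in-distribution", the threshold keeps about 95% of ID scans, and the result is the fraction of OOD scans that still pass. The paper may define the positive class differently, and the scores are synthetic.

```python
import numpy as np


def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """Fraction of OOD samples scoring at or above the threshold that
    retains about 95% of ID samples (assumed: higher score = more ID)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # 5th percentile of ID scores: about 95% of ID samples lie at or above it.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))


# Synthetic detector scores, for illustration only.
id_scores = np.linspace(0.5, 1.0, 101)
ood_scores = np.array([0.1, 0.2, 0.3, 0.6])
print(fpr_at_95_tpr(id_scores, ood_scores))
```

In an RF-Deep-style setup the scores would plausibly be the random forest's predicted ID probability, though that detail is an assumption here rather than something the abstract states.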

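[CV-91] Time Series Analysis of Spiking Neural Systems via Transfer Entropy and Directed Persistent Homology

[Quick read]: This paper addresses how to characterize directed information flow in neural time series and its topological organization across structural scales, particularly for neural systems with time-resolved, binary spiking data. The key to the solution is a topological framework that combines Transfer Entropy (TE) with directed Persistent Homology (PH). TE first quantifies the directional influence between neurons, producing weighted directed graphs. PH then analyzes the multi-scale topological features of these graphs, revealing higher-dimensional interaction patterns beyond pairwise connectivity. Applied to networks trained on logic gate tasks, image-classification networks, and mouse cortical recordings, the resulting topological signatures distinguish task complexity, stimulus structure, and behavioral regime, providing an interpretable and general tool for modeling neural information flow.

Link: https://arxiv.org/abs/2508.19048
Authors: Dylan Peek, Siddharth Pritam, Matthew P. Skerritt, Stephan Chalup
Affiliations: unknown
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments: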

Abstract:We present a topological framework for analysing neural time series that integrates Transfer Entropy (TE) with directed Persistent Homology (PH) to characterize information flow in spiking neural systems. TE quantifies directional influence between neurons, producing weighted, directed graphs that reflect dynamic interactions. These graphs are then analyzed using PH, enabling assessment of topological complexity across multiple structural scales and dimensions. We apply this TE+PH pipeline to synthetic spiking networks trained on logic gate tasks, image-classification networks exposed to structured and perturbed inputs, and mouse cortical recordings annotated with behavioral events. Across all settings, the resulting topological signatures reveal distinctions in task complexity, stimulus structure, and behavioral regime. Higher-dimensional features become more prominent in complex or noisy conditions, reflecting interaction patterns that extend beyond pairwise connectivity. Our findings offer a principled approach to mapping directed information flow onto global organizational patterns in both artificial and biological neural systems. The framework is generalizable and interpretable, making it well suited for neural systems with time-resolved and binary spiking data.
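
The first stage of the pipeline, transfer entropy between two binary spike trains, can be sketched with a plug-in estimator. This is a minimal illustration under stated assumptions: a history length of one time step, synthetic spike trains, and no bias correction. The paper's estimator and parameters may differ, and the persistent homology stage is not shown.

```python
from collections import Counter

import numpy as np


def transfer_entropy(source, target) -> float:
    """Plug-in estimate of TE(source -> target) in bits, history length 1."""
    x = np.asarray(source, dtype=int).tolist()
    y = np.asarray(target, dtype=int).tolist()
    # Count (y_next, y_now, x_now) triples over the whole recording.
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))
    total = sum(triples.values())
    next_now = Counter()  # counts of (y_next, y_now)
    now_src = Counter()   # counts of (y_now, x_now)
    now = Counter()       # counts of y_now
    for (y1, y0, x0), c in triples.items():
        next_now[(y1, y0)] += c
        now_src[(y0, x0)] += c
        now[y0] += c
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_given_both = c / now_src[(y0, x0)]         # p(y_next | y_now, x_now)
        p_given_self = next_now[(y1, y0)] / now[y0]  # p(y_next | y_now)
        te += (c / total) * np.log2(p_given_both / p_given_self)
    return float(te)


# Synthetic example: neuron B copies neuron A with a one-step delay.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
b = np.roll(a, 1)
b[0] = 0
print("TE(A -> B):", transfer_entropy(a, b))
print("TE(B -> A):", transfer_entropy(b, a))
```

Because B is a delayed copy of A, the estimate from A to B should be large and the reverse direction close to zero. Computing this for every ordered pair of neurons gives the weighted directed graph that the abstract describes as the input to PH.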

[CV-92] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

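[Quick read]: This paper addresses how to make effective use of undersampled (accelerated) MRI data for downstream segmentation without sacrificing segmentation accuracy. The traditional approach reconstructs the image first and then segments it; recent end-to-end models that jointly reconstruct and segment show potential but have lacked systematic comparison and unified evaluation standards. The key contribution is the first unified benchmark, which compares 7 approaches covering one-stage and two-stage strategies. The results show that simple two-stage methods that enforce data consistency (reconstructing with an established MRI reconstruction method and then segmenting) achieve better segmentation scores than complex one-stage methods designed specifically for the task, suggesting that reconstruction quality currently matters more than heavily customized model architectures.

Link: https://arxiv.org/abs/2508.18975
Authors: Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner
Affiliations: University of Tübingen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: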

Abstract:MR imaging is a valuable diagnostic tool allowing to non-invasively visualize patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly, when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, that combine reconstruction and segmentation into a unified model, with two-stage approaches, that utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.

[CV-93] Quantum-Circuit-Based Visual Fractal Image Generation in Qiskit and Analytics

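[Quick read]: This paper explores how quantum computing can be used to generate fractal images, specifically Julia sets, as a new application of quantum computing to complex pattern generation. The key to the solution is a quantum-circuit-based framework that uses superposition, randomness, and entanglement to manipulate and generate a dataset of self-similar fractal images, offering a customizable path toward quantum generative arts.

Link: https://arxiv.org/abs/2508.18835
Authors: Hillol Biswas
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: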

Abstract:As nature is ascribed as quantum, the fractals also pose some intriguing appearance which is found in many micro and macro observable entities or phenomena. Fractals show self-similarity across sizes; structures that resemble the entire are revealed when zoomed in. In Quantum systems, the probability density or wavefunction may exhibit recurring interference patterns at various energy or length scales. Fractals are produced by basic iterative rules (such as Mandelbrot or Julia sets), and they provide limitless complexity. Despite its simplicity, the Schrödinger equation in quantum mechanics produces incredibly intricate patterns of interference and entanglement, particularly in chaotic quantum systems. Quantum computing, the root where lies to the using the principles of quantum-mechanical phenomenon, when applied in fractal image generation, what outcomes are expected? The paper outlines the generation of a Julia set dataset using an approach coupled with building quantum circuit, highlighting the concepts of superposition, randomness, and entanglement as foundational elements to manipulate the generated dataset patterns. As Quantum computing is finding many application areas, the possibility of using quantum circuits for fractal Julia image generation posits a unique direction of future research where it can be applied to quantum generative arts across various ecosystems with a customised approach, such as producing an exciting landscape based on a quantum art theme.
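
For readers unfamiliar with the "basic iterative rules" the abstract mentions, a Julia set comes from repeating z → z² + c and recording how quickly each starting point escapes. The sketch below is a purely classical NumPy baseline, not the paper's quantum-circuit method. The constant `c`, the grid size, and the iteration cap are arbitrary choices for illustration.

```python
import numpy as np


def julia_escape_counts(c: complex, size: int = 64, max_iter: int = 50,
                        extent: float = 1.5) -> np.ndarray:
    """Escape-time image of the Julia set of z -> z**2 + c on a square grid.
    Each pixel holds the number of iterations its point stayed within |z| <= 2."""
    axis = np.linspace(-extent, extent, size)
    z = axis[None, :] + 1j * axis[:, None]   # complex grid of starting points
    counts = np.zeros(z.shape, dtype=int)
    alive = np.ones(z.shape, dtype=bool)      # points that have not escaped yet
    for _ in range(max_iter):
        z[alive] = z[alive] ** 2 + c          # iterate only surviving points
        alive &= np.abs(z) <= 2.0             # drop points that just escaped
        counts[alive] += 1
    return counts


# Arbitrary example constant; other values of c give different shapes.
image = julia_escape_counts(c=-0.8 + 0.156j)
print(image.shape, image.min(), image.max())
```

In the paper's setting, quantities such as `c` would presumably be derived from quantum circuit measurements; that link is an assumption here and is not reproduced.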

[CV-94] A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework

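[Quick read]: This paper addresses the gap by which weakly supervised segmentation of macular edema (ME) regions trails fully supervised methods. The solution has two key parts. First, it exploits the strong correlation between the edema area (EA) and the retinal layers in spectral-domain optical coherence tomography (SD-OCT) images through a layer-structure-guided post-processing step, which reframes dense EA prediction as confirming the intersection points between the EA contour and the retinal layers, so that predictions better match the shape prior of EA. Second, it designs a test-time adaptation (TTA) strategy to mitigate the shift in how EA manifests between the training and test sets, improving accuracy and robustness.

Link: https://arxiv.org/abs/2508.18790
Authors: Yuhui Tao, Yizhe Zhang, Qiang Chen
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: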

Abstract:The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA. Besides, the TTA framework further helps address discrepancies in the manifestations and presentations of EA between training and test sets. Extensive experiments on two publicly available datasets demonstrate that these two proposed ingredients can improve the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

[CV-95] EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding

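[Quick read]: This paper addresses the difficulty of applying existing general-purpose models directly to electromagnetic signals in scenarios such as dynamic spectrum management, intelligent transportation, autonomous driving, and unmanned vehicle perception, where the signals are highly heterogeneous, carry strong background noise, and have complex joint time-frequency structure. It also tackles the diversity of electromagnetic communication and sensing tasks, weak cross-task generalization, low transfer efficiency, and the scarcity of large, high-quality datasets. The key to the solution is EMind, an electromagnetic signal foundation model. The authors build what they describe as the first unified and largest standardized electromagnetic signal dataset, covering multiple signal types and tasks, and devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use of, and representation learning from, heterogeneous multi-source signals. This improves performance and generalization on downstream tasks and moves from task-specific models toward a unified framework for electromagnetic intelligence.

Link: https://arxiv.org/abs/2508.18785
Authors: Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang, Fengxiang Wang, Zonghao Guo, Zizhi Ma, Xinzhu Liu, Hanxiang He, Jinhai Li, Xin Qiu, Wupeng Xie, Yangang Sun
Affiliations: Institute of Microelectronics of the Chinese Academy of Sciences; Tsinghua University; Artificial Intelligence Institute of China Electronics Technology Group Corporation; Beijing Institute of Technology; National University of Defense Technology; Nankai University; Northeastern University
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: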

Abstract:Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross task generalization and transfer efficiency, and the scarcity of large high quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signals foundation model that bridges large scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task specific models to a unified framework for electromagnetic intelligence. The code is available at: this https URL.

[CV-96] A Deep Learning Application for Psoriasis Detection

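[Quick read]: This paper addresses the accuracy of automatically classifying psoriasis in images of skin lesions, in support of clinical diagnosis. The key to the solution is a comparative study of three convolutional neural network models (ResNet50, Inception v3, and VGG19), with techniques applied to adjust the networks' evaluation metrics. The results suggest that Inception v3 performs best in accuracy and F1-Score (97.5% ± 0.2), showing potential as a tool for supporting psoriasis diagnosis.

Link: https://arxiv.org/abs/2508.18528
Authors: Anna Milani, Fábio S. da Silva, Elloá B. Guedes, Ricardo Rios
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures, 1 table, Proceedings of XX Encontro Nacional de Inteligência Artificial e Computacional. in Portuguese language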

Abstract:In this paper a comparative study of the performance of three Convolutional Neural Network models, ResNet50, Inception v3 and VGG19 for classification of skin images with lesions affected by psoriasis is presented. The images used for training and validation of the models were obtained from specialized platforms. Some techniques were used to adjust the evaluation metrics of the neural networks. The results found suggest the model Inception v3 as a valuable tool for supporting the diagnosis of psoriasis. This is due to its satisfactory performance with respect to accuracy and F1-Score (97.5% ± 0.2).

[CV-97] Analise de Desaprendizado de Maquina em Modelos de Classificacao de Imagens Medicas

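[Quick read]: This paper addresses machine unlearning in medical image classification: how to remove private or sensitive data from a pretrained model without retraining it from scratch, while preserving the model's robustness and performance. The solution evaluates SalUn, a specific unlearning method, in experiments on the PathMNIST, OrganAMNIST, and BloodMNIST medical image datasets. The results show that SalUn performs close to full retraining, indicating that it is an efficient option for medical applications.

Link: https://arxiv.org/abs/2508.18509
Authors: Andreza M. C. Falcao, Filipe R. Cordeiro
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at SBCAS’25. in Portuguese language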

Abstract:Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model’s robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

[CV-98] Federative ischemic stroke segmentation as alternative to overcome domain-shift multi-institution challenges

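[Quick read]: This paper addresses the limited generalization of lesion segmentation for ischemic stroke in diffusion-weighted imaging (DWI) sequences, particularly when multi-center data differ (in patient demographics, scanner vendors, and expert annotations) and labeled samples are scarce. The key to the solution is a collaborative framework based on federated averaging (FedAvg) that shares knowledge across institutions through deep center-independent representations without centralizing data. Across 14 emulated healthcare centers, the model performs uniformly across lesion categories and remains reliable on out-of-distribution centers without any additional training.

Link: https://arxiv.org/abs/2508.18296
Authors: Edgar Rangel, Fabio Martinez
Affiliations: UIS (Universidad Industrial de Santander)
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 3 tables, source code available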

Abstract:Stroke is the second leading cause of death and the third leading cause of disability worldwide. Clinical guidelines establish diffusion resonance imaging (DWI, ADC) as the standard for localizing, characterizing, and measuring infarct volume, enabling treatment support and prognosis. Nonetheless, such lesion analysis is highly variable due to different patient demographics, scanner vendors, and expert annotations. Computational support approaches have been key to helping with the localization and segmentation of lesions. However, these strategies are dedicated solutions that learn patterns from only one institution, lacking the variability to generalize geometrical lesions shape models. Even worse, many clinical centers lack sufficient labeled samples to adjust these dedicated solutions. This work developed a collaborative framework for segmenting ischemic stroke lesions in DWI sequences by sharing knowledge from deep center-independent representations. From 14 emulated healthcare centers with 2031 studies, the FedAvg model achieved a general DSC of 0.71 ± 0.24, AVD of 5.29 ± 22.74, ALD of 2.16 ± 3.60 and LF1 of 0.70 ± 0.26 over all centers, outperforming both the centralized and other federated rules. Interestingly, the model demonstrated strong generalization properties, showing uniform performance across different lesion categories and reliable performance in out-of-distribution centers (with DSC of 0.64 ± 0.29 and AVD of 4.44 ± 8.74 without any additional training).
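
The aggregation rule named in the abstract, FedAvg, can be illustrated with a minimal sketch. This is the generic sample-size-weighted average of client parameters, not the authors' implementation; the parameter name `conv.w`, the two toy "centers", and their sample counts are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average each parameter across clients, weighted by local sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Hypothetical toy data: two centers sharing one flattened parameter tensor.
centers = [{"conv.w": [1.0, 2.0]}, {"conv.w": [3.0, 6.0]}]
sizes = [100, 300]
global_weights = fedavg(centers, sizes)
print(global_weights)
```

Only parameters leave each center in this scheme; the raw studies stay local, which is why the approach suits multi-institution settings.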

Artificial Intelligence

[AI-0] Model Context Protocols in Adaptive Transport Systems: A Survey

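Quick read: This survey addresses the severe fragmentation in adaptive transport systems caused by the rapid expansion of interconnected devices, autonomous systems, and AI applications, where diverse protocols and context sources remain isolated and cannot interoperate. The key to the solution is a systematic analysis of the Model Context Protocol (MCP) as a unifying paradigm that bridges protocol-level adaptation with context-aware decision making. The authors find that MCP's client-server architecture and JSON-RPC structure enable semantic interoperability across heterogeneous systems, and they argue that AI-driven transport demands integration paradigms uniquely suited to MCP, positioning it as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructure.

Link: https://arxiv.org/abs/2508.19239
Authors: Gaurab Chhetri,Shriyank Somvanshi,Md Monzurul Islam,Shamyo Brotee,Mahmuda Sultana Mimi,Dipti Koirala,Biplov Pandey,Subasish Das
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: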

Abstract:The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP’s client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.
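
To make the "client-server and JSON-RPC structure" concrete, here is a minimal sketch of what an MCP tool-call request looks like on the wire. The JSON-RPC 2.0 envelope and the `tools/call` method come from the MCP specification; the tool name `get_traffic_state` and its argument are hypothetical and do not come from the survey.

```python
import json
from typing import Any, Dict


def make_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request asking an MCP server to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical transport-domain tool; a real server advertises its own tools.
request = make_tool_call(1, "get_traffic_state", {"intersection_id": "A12"})
wire = json.dumps(request)
print(wire)
```

Because every context source answers the same envelope, a client can query heterogeneous systems without protocol-specific adapters, which is the interoperability argument the survey makes.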

[AI-1] The Subset Sum Matching Problem ECAI2025

Quick read: This paper introduces the Subset Sum Matching Problem (SSMP), a combinatorial optimisation task that abstracts common financial applications such as trades reconciliation. The authors present three algorithms, two suboptimal and one optimal, and build a benchmark of SSMP instances of varying complexity to evaluate the approaches experimentally.

Link: https://arxiv.org/abs/2508.19218
Authors: Yufei Wu,Manuel R. Torres,Parisa Zehtabi,Alberto Pozanco Lancho,Michael Cashmore,Daniel Borrajo,Manuela Veloso
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted at ECAI 2025. This is an extended version that includes Supplementary Material

Abstract:This paper presents a new combinatorial optimisation task, the Subset Sum Matching Problem (SSMP), which is an abstraction of common financial applications such as trades reconciliation. We present three algorithms, two suboptimal and one optimal, to solve this problem. We also generate a benchmark to cover different instances of SSMP varying in complexity, and carry out an experimental evaluation to assess the performance of the approaches.

[AI-2] Understanding Tool-Integrated Reasoning

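Quick read: This paper asks why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable, a question that has lacked a principled theoretical answer. The key contribution is the first formal proof that TIR, by adding external tools such as a Python code interpreter, strictly expands the model's empirical and feasible support, breaking the capability ceiling of pure-text models and unlocking problem-solving strategies that are otherwise impossible or intractably verbose. The authors also propose Advantage Shaping Policy Optimization (ASPO), which directly modifies the advantage function to guide policy behavior without compromising training stability or performance. Experiments on mathematical benchmarks show that the TIR model decisively outperforms its pure-text counterpart on pass@k, and the analysis identifies emergent cognitive patterns in how models learn to think with tools.

Link: https://arxiv.org/abs/2508.19201
Authors: Heng Lin,Zhongwen Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: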

Abstract:We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM’s capabilities. We demonstrate that tools enable a strict expansion of the model’s empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR’s success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
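
The abstract reports results on the pass@k metric without defining it. A minimal sketch of the commonly used unbiased estimator follows: with n samples per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=1))
print(pass_at_k(n=10, c=1, k=5))
```

Larger k rewards a model for having any successful strategy within reach, which is why the metric is used to probe a capability ceiling rather than average-case accuracy.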

[AI-3] Emotions as Ambiguity-aware Ordinal Representations

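Quick read: This paper addresses a shortcoming of existing continuous emotion recognition methods, which either ignore the ambiguity in emotion annotation or treat it as a static, independent variable and so miss how emotions evolve over time. The key to the solution is a framework of ambiguity-aware ordinal emotion representations that models emotion ambiguity through its rate of change, capturing both annotation ambiguity and the temporal dynamics of emotional traces. On unbounded labels (engagement), the ordinal representations outperform conventional ambiguity-aware models; on bounded labels (arousal, valence), they capture relative changes better, as measured by SDA.

Link: https://arxiv.org/abs/2508.19193
Authors: Jingyao Wu,Matthew Barthet,David Melhart,Georgios N. Yannakakis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: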

Abstract:Emotions are inherently ambiguous and dynamic phenomena, yet existing continuous emotion recognition approaches either ignore their ambiguity or treat ambiguity as an independent and static variable over time. Motivated by this gap in the literature, in this paper we introduce ambiguity-aware ordinal emotion representations, a novel framework that captures both the ambiguity present in emotion annotation and the inherent temporal dynamics of emotional traces. Specifically, we propose approaches that model emotion ambiguity through its rate of change. We evaluate our framework on two affective corpora – RECOLA and GameVibe – testing our proposed approaches on both bounded (arousal, valence) and unbounded (engagement) continuous traces. Our results demonstrate that ordinal representations outperform conventional ambiguity-aware models on unbounded labels, achieving the highest Concordance Correlation Coefficient (CCC) and Signed Differential Agreement (SDA) scores, highlighting their effectiveness in modeling the traces’ dynamics. For bounded traces, ordinal representations excel in SDA, revealing their superior ability to capture relative changes of annotated emotion traces.
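
One of the two reported metrics, the Concordance Correlation Coefficient, has a standard closed form: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))²). The sketch below implements that textbook definition with population statistics; the example traces are made up, and the paper's exact evaluation protocol is not reproduced here.

```python
from typing import Sequence


def ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Concordance Correlation Coefficient between two equal-length traces."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("traces must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant traces")
    return 2 * cov / denom


# Made-up traces: a prediction that tracks the label's shape but is offset by 1.
label = [1.0, 2.0, 3.0]
prediction = [2.0, 3.0, 4.0]
print(ccc(label, prediction))
```

Unlike Pearson correlation, CCC penalizes the constant offset in this example, so a prediction must match both the shape and the level of the annotated trace to score 1.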

[AI-4] Real-Time Model Checking for Closed-Loop Robot Reactive Planning

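Quick read: This paper addresses real-time multi-step planning and obstacle avoidance for an autonomous robot, in particular how to plan efficiently on a low-powered device without pre-computed data. The key to the solution is a new model-checking-based method that chains temporary control systems, spawned to counteract disturbances in the local environment that push the robot away from its preferred action (or resting state). It uses a discretization of 2D LiDAR data that is sensitive to bounded variations in the local environment and plans multiple steps ahead by forward depth-first search. The method is evaluated on a real robot in cul-de-sac and playground scenarios, where it improves on a reactive agent that can plan only one step.

Link: https://arxiv.org/abs/2508.19186
Authors: Christopher Chandler,Bernd Porr,Giulia Lafratta,Alice Miller
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: 30 pages excluding references, 18 figures, submitted to Formal Aspects of Computing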

Abstract:We present a new application of model checking which achieves real-time multi-step planning and obstacle avoidance on a real autonomous robot. We have developed a small, purpose-built model checking algorithm which generates plans in situ based on “core” knowledge and attention as found in biological agents. This is achieved in real-time using no pre-computed data on a low-powered device. Our approach is based on chaining temporary control systems which are spawned to counteract disturbances in the local environment that disrupt an autonomous agent from its preferred action (or resting state). A novel discretization of 2D LiDAR data sensitive to bounded variations in the local environment is used. Multi-step planning using model checking by forward depth-first search is applied to cul-de-sac and playground scenarios. Both empirical results and informal proofs of two fundamental properties of our approach demonstrate that model checking can be used to create efficient multi-step plans for local obstacle avoidance, improving on the performance of a reactive agent which can only plan one step. Our approach is an instructional case study for the development of safe, reliable and explainable planning in the context of autonomous vehicles.

[AI-5] From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity

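Quick read: This paper addresses the difficulty of having robots autonomously discover and master diverse, high-performing skills in the real world. Existing methods such as Quality-Diversity Actor-Critic (QDAC) rely on manually defined skill spaces and carefully tuned heuristics, which limits their real-world applicability. The key to the solution is Unsupervised Real-world Skill Acquisition (URSA), an extension of QDAC that discovers and learns skills directly on a physical robot with limited human intervention. URSA supports both heuristic-driven and fully unsupervised settings, discovers diverse locomotion skills on a Unitree A1 quadruped in simulation and in the real world, and yields a skill repertoire that can be reused for downstream tasks such as damage adaptation, where it outperforms all baselines in 5 of 9 simulated and 3 of 5 real-world damage scenarios.

Link: https://arxiv.org/abs/2508.19172
Authors: Luca Grillotti,Lisa Coiffard,Oscar Pang,Maxence Faldor,Antoine Cully(AIRL, Imperial College London)
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at CoRL 2025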

Abstract:Autonomous skill discovery aims to enable robots to acquire diverse behaviors without explicit supervision. Learning such behaviors directly on physical hardware remains challenging due to safety and data efficiency constraints. Existing methods, including Quality-Diversity Actor-Critic (QDAC), require manually defined skill spaces and carefully tuned heuristics, limiting real-world applicability. We propose Unsupervised Real-world Skill Acquisition (URSA), an extension of QDAC that enables robots to autonomously discover and master diverse, high-performing skills directly in the real world. We demonstrate that URSA successfully discovers diverse locomotion skills on a Unitree A1 quadruped in both simulation and the real world. Our approach supports both heuristic-driven skill discovery and fully unsupervised settings. We also show that the learned skill repertoire can be reused for downstream tasks such as real-world damage adaptation, where URSA outperforms all baselines in 5 out of 9 simulated and 3 out of 5 real-world damage scenarios. Our results establish a new framework for real-world robot learning that enables continuous skill discovery with limited human intervention, representing a significant step toward more autonomous and adaptable robotic systems. Demonstration videos are available at this http URL .

[AI-6] MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

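Quick read: This paper addresses the lack of systematic evaluation of safety behavior and risk management in clinical dialogue systems: existing evaluations focus on task completion or fluency, which is not enough for safety-critical medical settings. The solution is MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a framework that combines structured safety engineering with scalable conversational AI evaluation. It has three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors, and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator that detects safety-relevant dialogue failures and, with Gemini 2.5-Pro, reaches expert-level hazard detection (F1 0.96, sensitivity 0.999) in a blinded assessment of 240 dialogues; and (3) PatBot, a simulated patient agent whose realism and behavioral fidelity are evaluated with human factors expertise and a patient-preference study. The authors describe MATRIX as the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing.

Link: https://arxiv.org/abs/2508.19163
Authors: Ernest Lim,Yajie Vera He,Jared Joselowitz,Kate Preston,Mohita Chowdhury,Louis Williams,Aisling Higham,Katrina Mason,Mariane Melo,Tom Lawton,Yan Jia,Ibrahim Habli
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 36 pages, 16 figures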

Abstract:Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets. 
MSC classes: 68T50, 68T42, 92C50, 68Q60
ACM classes: I.2.0; J.3
Cite as: arXiv:2508.19163 [cs.AI] (v1 submitted Tue, 26 Aug 2025 16:12:12 UTC)
DOI: https://doi.org/10.48550/arXiv.2508.19163

[AI-7] Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games

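Quick read: This dissertation addresses the tendency of current AI research to focus on rational decision-making while overlooking the "style" dimension of human decisions, which is driven by beliefs, values, and preferences. The core challenge is turning this subjective but essential notion of style into something that can be modeled, measured, and generated. The key to the solution is a two-tier framework for playstyle formation, with an external loop of interaction with the environment and an internal cognitive loop of deliberation, together with measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The work proposes a general playstyle metric based on discretized state spaces, uses reinforcement learning and imitation learning to train agents with specific stylistic tendencies, explores applications such as game design and interactive entertainment, and outlines the role of style in building artificial general intelligence (AGI).

Link: https://arxiv.org/abs/2508.19152
Authors: Chiu-Chou Lin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: PhD Dissertation, National Yang Ming Chiao Tung University, 2025. This is the public version without Chinese abstract or postscript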

Abstract:Contemporary artificial intelligence (AI) development largely centers on rational decision-making, valued for its measurability and suitability for objective evaluation. Yet in real-world contexts, an intelligent agent’s decisions are shaped not only by logic but also by deeper influences such as beliefs, values, and preferences. The diversity of human decision-making styles emerges from these differences, highlighting that “style” is an essential but often overlooked dimension of intelligence. This dissertation introduces playstyle as an alternative lens for observing and analyzing the decision-making behavior of intelligent agents, and examines its foundational meaning and historical context from a philosophical perspective. By analyzing how beliefs and values drive intentions and actions, we construct a two-tier framework for style formation: the external interaction loop with the environment and the internal cognitive loop of deliberation. On this basis, we formalize style-related characteristics and propose measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The study focuses on three core research directions: (1) Defining and measuring playstyle, proposing a general playstyle metric based on discretized state spaces, and extending it to quantify strategic diversity and competitive balance; (2) Expressing and generating playstyle, exploring how reinforcement learning and imitation learning can be used to train agents exhibiting specific stylistic tendencies, and introducing a novel approach for human-like style learning and modeling; and (3) Practical applications, analyzing the potential of these techniques in domains such as game design and interactive entertainment. Finally, the dissertation outlines future extensions, including the role of style as a core element in building artificial general intelligence (AGI).
Cite as: arXiv:2508.19152 [cs.AI] (v1 submitted Tue, 26 Aug 2025 16:04:18 UTC)
DOI: https://doi.org/10.48550/arXiv.2508.19152

[AI-8] Uncertainty-Resilient Active Intention Recognition for Robotic Assistants

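Quick read: This paper addresses the uncertainty robotic assistants face in intention recognition, in particular the perception errors and uncertain outcomes inherent to recognizing human intentions; reducing the problem to explicit prompts or assuming near-perfect information limits autonomy. The key to the solution is a framework centered on an intention-recognition Partially Observable Markov Decision Process (POMDP) that integrates real-time sensor data with a combination of planners to support cooperative planning and acting under uncertainty. The integrated framework has been tested on a physical robot with promising results.

Link: https://arxiv.org/abs/2508.19150
Authors: Juan Carlos Saborío,Marc Vinci,Oscar Lima,Sebastian Stock,Lennart Niecksch,Martin Günther,Alexander Sung,Joachim Hertzberg,Martin Atzmüller
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: (To appear) In Proceedings of ECMR 2025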

Abstract:Purposeful behavior in robotic assistants requires the integration of multiple components and technological advances. Often, the problem is reduced to recognizing explicit prompts, which limits autonomy, or is oversimplified through assumptions such as near-perfect information. We argue that a critical gap remains unaddressed – specifically, the challenge of reasoning about the uncertain outcomes and perception errors inherent to human intention recognition. In response, we present a framework designed to be resilient to uncertainty and sensor noise, integrating real-time sensor data with a combination of planners. Centered around an intention-recognition POMDP, our approach addresses cooperative planning and acting under uncertainty. Our integrated framework has been successfully tested on a physical robot with promising results.
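
The core mechanism that lets a POMDP-based agent tolerate sensor noise is the belief update: predict with the transition model, weight by the observation likelihood, renormalize. The sketch below shows that generic Bayes filter step only. The two intentions, the identity transition model, and the likelihood values are hypothetical and are not taken from the paper.

```python
from typing import Dict

Belief = Dict[str, float]


def update_belief(
    belief: Belief,
    transition: Dict[str, Dict[str, float]],
    obs_likelihood: Dict[str, float],
) -> Belief:
    """One Bayes filter step: b'(s') is proportional to P(o|s') * sum_s P(s'|s) * b(s)."""
    predicted = {
        s_next: sum(belief[s] * transition[s].get(s_next, 0.0) for s in belief)
        for s_next in obs_likelihood
    }
    unnormalized = {s: obs_likelihood[s] * p for s, p in predicted.items()}
    z = sum(unnormalized.values())
    if z == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: v / z for s, v in unnormalized.items()}


# Hypothetical intentions and a noisy detector reading "person reaches toward tool".
prior = {"wants_tool": 0.5, "wants_nothing": 0.5}
stay = {"wants_tool": {"wants_tool": 1.0}, "wants_nothing": {"wants_nothing": 1.0}}
likelihood = {"wants_tool": 0.8, "wants_nothing": 0.2}
posterior = update_belief(prior, stay, likelihood)
print(posterior)
```

A planner acting on `posterior` rather than on a single hard classification can defer or gather more evidence when the belief is still spread out, which is the resilience property the abstract emphasizes.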

[AI-9] Algorithmic Collective Action with Multiple Collectives

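Quick read: This paper studies how Algorithmic Collective Action (ACA) works when multiple collectives act on the same learning system, each differing in size, strategy, and goals; most prior ACA work considers a single collective, so the interplay between collectives is not well understood. The key contribution is the first theoretical framework for multi-collective ACA. It focuses on classification, where collectives plant signals, that is, bias a classifier to learn an association between an altered version of the features and a chosen, possibly overlapping, set of target classes, and it gives quantitative results on the role and interplay of the collectives' sizes and the alignment of their goals.

Link: https://arxiv.org/abs/2508.19149
Authors: Claudio Battiloro,Pietro Greiner,Bret Nestor,Oumaima Amezgar,Francesca Dominici
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages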

Abstract:As learning systems increasingly influence everyday decisions, user-side steering via Algorithmic Collective Action (ACA), that is, coordinated changes to shared data, offers a complement to regulator-side policy and firm-side model design. Although real-world actions have been traditionally decentralized and fragmented into multiple collectives despite sharing overarching objectives, with each collective differing in size, strategy, and actionable goals, most of the ACA literature focused on single collective settings. In this work, we present the first theoretical framework for ACA with multiple collectives acting on the same system. In particular, we focus on collective action in classification, studying how multiple collectives can plant signals, i.e., bias a classifier to learn an association between an altered version of the features and a chosen, possibly overlapping, set of target classes. We provide quantitative results about the role and the interplay of collectives’ sizes and their alignment of goals. Our framework, by also complementing previous empirical results, opens a path for a holistic treatment of ACA with multiple collectives.

[AI-10] SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) Applications

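Quick read: This paper addresses the data privacy problems raised by the widespread use of machine learning in vehicle-to-everything (V2X) systems, especially smart-transit and driver-safety applications that can implicitly reveal user locations or explicitly disclose medical data such as EEG signals. The key to the solution is SecureV2X, a scalable multi-agent system for secure neural network inference between the server and each vehicle. It is studied on two V2X applications, secure drowsiness detection and secure red-light violation detection. On drowsiness detection, SecureV2X is 9.4× faster, needs 143× fewer computational rounds, and involves 16.6× less communication than other secure systems; on the object detection task for red-light violation detection, its runtime is nearly 100× faster than state-of-the-art benchmarks.

Link: https://arxiv.org/abs/2508.19115
Authors: Joshua Lee,Ali Arastehfard,Weiran Liu,Xuegang Ban,Yuan Hong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures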

Abstract:Autonomous driving and V2X technologies have developed rapidly in the past decade, leading to improved safety and efficiency in modern transportation. These systems interact with extensive networks of vehicles, roadside infrastructure, and cloud resources to support their machine learning capabilities. However, the widespread use of machine learning in V2X systems raises issues over the privacy of the data involved. This is particularly concerning for smart-transit and driver safety applications which can implicitly reveal user locations or explicitly disclose medical data such as EEG signals. To resolve these issues, we propose SecureV2X, a scalable, multi-agent system for secure neural network inferences deployed between the server and each vehicle. Under this setting, we study two multi-agent V2X applications: secure drowsiness detection, and secure red-light violation detection. Our system achieves strong performance relative to baselines, and scales efficiently to support a large number of secure computation interactions simultaneously. For instance, SecureV2X is 9.4× faster, requires 143× fewer computational rounds, and involves 16.6× less communication on drowsiness detection compared to other secure systems. Moreover, it achieves a runtime nearly 100× faster than state-of-the-art benchmarks in object detection tasks for red light violation detection.

[AI-11] Hybrid Deep Searcher: Integrating Parallel and Sequential Search Reasoning

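Quick read: This paper addresses the cost of purely sequential external knowledge retrieval in Large Reasoning Models (LRMs) on complex multi-step tasks: higher inference latency, longer context, reduced coherence, and potentially lower accuracy. The key to the solution is HDS-QA (Hybrid Deep Search QA), a synthetic dataset generated automatically from Natural Questions and designed to train LRMs to distinguish subqueries that can run in parallel from those that must run sequentially. Fine-tuning an LRM on HDS-QA yields HybridDeepSearcher, which combines parallel and sequential querying explicitly, reaches comparable accuracy with fewer search turns (reducing inference latency), and scales effectively as more turns are permitted.

Link: https://arxiv.org/abs/2508.19113
Authors: Dayoon Ko,Jihyuk Kim,Haeju Park,Sohyeon Kim,Dahyun Lee,Yongrae Jo,Gunhee Kim,Moontae Lee,Kyungjae Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: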

Abstract:Large reasoning models (LRMs) have demonstrated strong performance in complex, multi-step reasoning tasks. Existing methods enhance LRMs by sequentially integrating external knowledge retrieval; models iteratively generate queries, retrieve external information, and progressively reason over this information. However, purely sequential querying increases inference latency and context length, diminishing coherence and potentially reducing accuracy. To address these limitations, we introduce HDS-QA (Hybrid Deep Search QA), a synthetic dataset automatically generated from Natural Questions, explicitly designed to train LRMs to distinguish parallelizable from sequential queries. HDS-QA comprises hybrid-hop questions that combine parallelizable independent subqueries (executable simultaneously) and sequentially dependent subqueries (requiring step-by-step resolution), along with synthetic reasoning-querying-retrieval paths involving parallel queries. We fine-tune an LRM using HDS-QA, naming the model HybridDeepSearcher, which outperforms state-of-the-art baselines across multiple benchmarks, notably achieving +15.9 and +11.5 F1 on FanOutQA and a subset of BrowseComp, respectively, both requiring comprehensive and exhaustive search. Experimental results highlight two key advantages: HybridDeepSearcher reaches comparable accuracy with fewer search turns, significantly reducing inference latency, and it effectively scales as more turns are permitted. These results demonstrate the efficiency, scalability, and effectiveness of explicitly training LRMs to leverage hybrid parallel and sequential querying.

[AI-12] Reasoning LLMs in the Medical Domain: A Literature Survey

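Quick read: This survey examines the key challenges in the shift of medical Large Language Models (LLMs) from basic information retrieval tools to systems capable of complex clinical reasoning, including interpretability limitations, immature evaluation methodologies, bias mitigation, patient safety, and the integration of multimodal clinical data. It reviews the technical foundations of reasoning in medical LLMs, with a focus on specialized prompting techniques such as Chain-of-Thought and recent Reinforcement Learning advances exemplified by DeepSeek-R1, and it examines emerging paradigms such as multi-agent collaborative systems and new prompting architectures. The goal is a roadmap for developing reliable LLMs for clinical practice and medical research.

Link: https://arxiv.org/abs/2508.19097
Authors: Armin Berger,Sarthak Khanna,David Berghaus,Rafet Sifa
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: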

Abstract:The emergence of advanced reasoning capabilities in Large Language Models (LLMs) marks a transformative development in healthcare applications. Beyond merely expanding functional capabilities, these reasoning mechanisms enhance decision transparency and explainability, critical requirements in medical contexts. This survey examines the transformation of medical LLMs from basic information retrieval tools to sophisticated clinical reasoning systems capable of supporting complex healthcare decisions. We provide a thorough analysis of the enabling technological foundations, with a particular focus on specialized prompting techniques like Chain-of-Thought and recent breakthroughs in Reinforcement Learning exemplified by DeepSeek-R1. Our investigation evaluates purpose-built medical frameworks while also examining emerging paradigms such as multi-agent collaborative systems and innovative prompting architectures. The survey critically assesses current evaluation methodologies for medical validation and addresses persistent challenges in field interpretation limitations, bias mitigation strategies, patient safety frameworks, and integration of multimodal clinical data. Through this survey, we seek to establish a roadmap for developing reliable LLMs that can serve as effective partners in clinical practice and medical research.

[AI-13] Trustworthy Agents for Electronic Health Records through Confidence Estimation

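Quick read: This paper addresses the reliability problem that hallucination poses when Large Language Models (LLMs) are used to extract information from Electronic Health Records (EHR) and support clinical decisions. Traditional accuracy metrics do not capture how unreliable a model's low-confidence outputs are. The solution has two parts: a new metric, Hallucination Controlled Accuracy at k% (HCAcc@k%), which quantifies the accuracy-reliability trade-off at varying confidence thresholds, and TrustEHRAgent, a confidence-aware clinical question-answering agent with stepwise confidence estimation that answers when confident and expresses uncertainty when confidence is low. On MIMIC-III and eICU, TrustEHRAgent outperforms baselines under strict reliability constraints, improving HCAcc@70% by 44.23 and 25.34 percentage points, while baseline methods fail at these thresholds.

Link: https://arxiv.org/abs/2508.19096
Authors: Yongwoo Song,Minbyul Jeong,Mujeen Sung
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: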

Abstract:Large language models (LLMs) show promise for extracting information from Electronic Health Records (EHR) and supporting clinical decisions. However, deployment in clinical settings faces challenges due to hallucination risks. We propose Hallucination Controlled Accuracy at k% (HCAcc@k%), a novel metric quantifying the accuracy-reliability trade-off at varying confidence thresholds. We introduce TrustEHRAgent, a confidence-aware agent incorporating stepwise confidence estimation for clinical question answering. Experiments on MIMIC-III and eICU datasets show TrustEHRAgent outperforms baselines under strict reliability constraints, achieving improvements of 44.23%p and 25.34%p at HCAcc@70% while baseline methods fail at these thresholds. These results highlight limitations of traditional accuracy metrics in evaluating healthcare AI agents. Our work contributes to developing trustworthy clinical agents that deliver accurate information or transparently express uncertainty when confidence is low.

[AI-14] APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

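Quick read: This paper addresses the inefficiency of arbitrary-precision quantized Large Language Models (LLMs) on GPUs, which stems from limited Tensor Core support, inefficient memory management, and inflexible kernel optimizations. The solution is APT-LLM, an acceleration scheme with four parts: a new data format, bipolar-INT, that converts to and from signed INT losslessly and is better suited to parallel computation; a matrix multiplication (MatMul) method that supports arbitrary precision by dismantling and reassembling matrices at the bit level, making fuller use of GPU Tensor Cores; a memory management system focused on data recovery that uses fast shared memory to reduce memory access latency; and a kernel mapping method that dynamically selects kernel hyperparameters for different matrix sizes, targeting high performance across LLM architectures and precision settings.

Link: https://arxiv.org/abs/2508.19087
Authors: Shaobo Ma,Chao Fang,Haikuo Shao,Zhongfeng Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: To appear in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)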

Abstract:Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99× speedup compared to FP16 baselines and a 2.16× speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44× speedup over FP16 and 1.65× speedup over CUTLASS integer baselines.
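
The idea of "dismantling and reassembling matrices at the bit level" can be illustrated with a small CPU-side sketch: split each operand into 1-bit planes, multiply every pair of planes, and recombine the partial products with shifts. This sketch assumes plain unsigned integers; it does not implement the paper's bipolar-INT format or any GPU kernel, and all function names are illustrative.

```python
from typing import List

Matrix = List[List[int]]


def bit_planes(matrix: Matrix, bits: int) -> List[Matrix]:
    """Split an unsigned integer matrix into `bits` binary matrices, LSB first."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product, used here for the 1-bit partial products."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bitwise_matmul(a: Matrix, b: Matrix, a_bits: int, b_bits: int) -> Matrix:
    """Compute A @ B as sum over i, j of 2^(i+j) * (A_i @ B_j) on bit planes."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_plane, b_plane)
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)
    return out


# A uses 3-bit values and B uses 2-bit values: the precisions need not match.
a = [[3, 1], [2, 5]]
b = [[1, 2], [3, 0]]
result = bitwise_matmul(a, b, a_bits=3, b_bits=2)
print(result)
```

Because each operand's bit width is an independent parameter, the same routine covers any precision pair; in a hardware setting the binary partial products would be the part mapped to low-bit matrix units, which is an assumption about the general technique rather than a description of APT-LLM's kernels.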
zh
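
The idea of "dismantling and reassembling matrices at the bit level" can be illustrated with a small sketch. It assumes plain unsigned integers rather than the paper's bipolar-INT format, and the function names (`bit_planes`, `bitwise_matmul`) are hypothetical; the pure-Python `matmul` merely stands in for a 1-bit Tensor Core kernel. The identity used is A·B = Σ_{i,j} 2^(i+j) · A_i·B_j, where A_i and B_j are binary bit planes.

```python
def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary matrices (LSB first)."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]


def matmul(a, b):
    """Plain integer matrix product (stand-in for a 1-bit hardware kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bitwise_matmul(a, b, a_bits, b_bits):
    """Reassemble A @ B from products of bit planes, weighted by 2**(i + j)."""
    rows, cols = len(a), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i, a_plane in enumerate(bit_planes(a, a_bits)):
        for j, b_plane in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_plane, b_plane)
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)
    return out


A = [[3, 1], [2, 5]]  # every entry fits in 3 bits
B = [[1, 2], [3, 0]]  # every entry fits in 2 bits
assert bitwise_matmul(A, B, a_bits=3, b_bits=2) == matmul(A, B)
```

Because the bit widths of the two operands are independent loop bounds, any precision pair (for example 3-bit by 2-bit) reduces to the same binary kernel, which is the flexibility the abstract refers to.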

[AI-15] An LLM-powered Natural-to-Robotic Language Translation Framework with Correctness Guarantees

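[Quick read]: This paper addresses the programming errors that arise when large language models (LLMs) generate robot control programs, which are caused by model inconsistency and task complexity and are especially pronounced with lightweight LLMs. The key to its solution is a natural-to-robotic language translation framework (NRTrans): a Robot Skill Language (RSL) abstracts away the details of control programs and maps natural-language tasks onto robot skills, and an RSL compiler and debugger verify LLM-generated programs and drive feedback-based fine-tuning. This provides correctness guarantees before a program is executed, markedly improving the effectiveness and reliability of LLM-powered robotic applications.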

Link: https://arxiv.org/abs/2508.19074
Authors: ZhenDong Chen,ZhanShang Nie,ShiXing Wan,JunYi Li,YongTian Cheng,Shuai Zhao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:The Large Language Models (LLM) are increasingly being deployed in robotics to generate robot control programs for specific user tasks, enabling embodied intelligence. Existing methods primarily focus on LLM training and prompt design that utilize LLMs to generate executable programs directly from user tasks in natural language. However, due to the inconsistency of the LLMs and the high complexity of the tasks, such best-effort approaches often lead to tremendous programming errors in the generated code, which significantly undermines the effectiveness especially when the light-weight LLMs are applied. This paper introduces a natural-robotic language translation framework that (i) provides correctness verification for generated control programs and (ii) enhances the performance of LLMs in program generation via feedback-based fine-tuning for the programs. To achieve this, a Robot Skill Language (RSL) is proposed to abstract away from the intricate details of the control programs, bridging the natural language tasks with the underlying robot skills. Then, the RSL compiler and debugger are constructed to verify RSL programs generated by the LLM and provide error feedback to the LLM for refining the outputs until being verified by the compiler. This provides correctness guarantees for the LLM-generated programs before being offloaded to the robots for execution, significantly enhancing the effectiveness of LLM-powered robotic applications. Experiments demonstrate NRTrans outperforms the existing method under a range of LLMs and tasks, and achieves a high success rate for light-weight LLMs.
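
The generate, verify, and feed-back cycle described in the abstract can be sketched as a short loop. This is only an illustration under stated assumptions: `toy_generate` stands in for an LLM call, `toy_verify` stands in for the RSL compiler, and the skill names are hypothetical; none of them come from the paper.

```python
KNOWN_SKILLS = {"move_to", "grasp", "release"}  # hypothetical skill vocabulary


def toy_verify(program):
    """Stand-in for a compiler check: return a list of error messages."""
    return [f"unknown skill: {step}" for step in program if step not in KNOWN_SKILLS]


def toy_generate(task, feedback):
    """Stand-in for an LLM: the first draft is wrong, the retry uses the feedback."""
    return ["move_to", "grasp"] if feedback else ["move_to", "grab"]


def generate_until_verified(generate, verify, task, max_rounds=3):
    """Regenerate with error feedback until the verifier accepts, or give up."""
    feedback = []
    for _ in range(max_rounds):
        program = generate(task, feedback)
        feedback = verify(program)
        if not feedback:
            return program  # only verified programs are released
    return None  # nothing unverified is ever returned


verified = generate_until_verified(toy_generate, toy_verify, "pick up the cup")
```

The design point is that the loop never returns an unverified program: it either yields a program the verifier accepted or reports failure.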

[AI-16] Attackers Strike Back? Not Anymore - An Ensemble of RL Defenders Awakens for APT Detection

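[Quick read]: This paper addresses the static nature and limited adaptability of traditional detection systems when facing Advanced Persistent Threats (APTs). APTs are stealthy, adaptive, and long-lasting, and often bypass signature-based detection. The key to the solution is an adaptive defense framework that combines deep learning, reinforcement learning (RL), and active learning. An auto-encoder first encodes process behavior into a latent space; a multi-agent RL architecture (Q-Learning, PPO, DQN, and adversarial defenders) then classifies the encoded vectors; when any agent is uncertain about its decision, the system triggers an active learning loop that simulates expert feedback to refine decision boundaries; finally, an ensemble vote weighted by each agent's performance produces a robust prediction.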

Link: https://arxiv.org/abs/2508.19072
Authors: Sidahmed Benabderrahmane,Talal Rahwan
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Advanced Persistent Threats (APTs) represent a growing menace to modern digital infrastructure. Unlike traditional cyberattacks, APTs are stealthy, adaptive, and long-lasting, often bypassing signature-based detection systems. This paper introduces a novel framework for APT detection that unites deep learning, reinforcement learning (RL), and active learning into a cohesive, adaptive defense system. Our system combines auto-encoders for latent behavioral encoding with a multi-agent ensemble of RL-based defenders, each trained to distinguish between benign and malicious process behaviors. We identify a critical challenge in existing detection systems: their static nature and inability to adapt to evolving attack strategies. To this end, our architecture includes multiple RL agents (Q-Learning, PPO, DQN, adversarial defenders), each analyzing latent vectors generated by an auto-encoder. When any agent is uncertain about its decision, the system triggers an active learning loop to simulate expert feedback, thus refining decision boundaries. An ensemble voting mechanism, weighted by each agent’s performance, ensures robust final predictions.
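
A performance-weighted vote with an uncertainty trigger can be sketched as follows. The threshold rule (a probability within a band around 0.5 counts as "uncertain") and the function name `ensemble_predict` are assumptions made for illustration; the abstract does not specify how uncertainty is measured.

```python
def ensemble_predict(probabilities, weights, uncertainty_band=0.1):
    """Combine per-agent malicious probabilities into one decision.

    probabilities: each agent's estimated probability that a process is malicious.
    weights: non-negative weights, e.g. each agent's validation accuracy.
    Returns (label, needs_review); needs_review is True when any agent's
    probability lies within `uncertainty_band` of 0.5.
    """
    score = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
    needs_review = any(abs(p - 0.5) < uncertainty_band for p in probabilities)
    label = "malicious" if score >= 0.5 else "benign"
    return label, needs_review


# Three hypothetical agents; the better-performing ones carry more weight.
label, needs_review = ensemble_predict([0.9, 0.8, 0.2], weights=[0.9, 0.8, 0.5])
```

When `needs_review` is true, a system of this kind would route the sample to the (simulated) expert and use the returned label to refine the agents.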

[AI-17] Dynamic Triangulation-Based Graph Rewiring for Graph Neural Networks CIKM2025

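[Quick read]: This paper addresses the performance bottlenecks that graph topology imposes on Graph Neural Networks (GNNs), chiefly oversquashing and oversmoothing. The key to its solution is TRIGON, a new framework that rewires the graph by learning to select relevant triangles from multiple graph views, building an enriched, non-planar triangulation. The method jointly optimizes triangle selection and downstream classification performance, and yields a graph with improved structural properties such as reduced diameter, increased spectral gap, and lower effective resistance. It outperforms state-of-the-art graph rewiring methods on both homophilic and heterophilic benchmarks.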

Link: https://arxiv.org/abs/2508.19071
Authors: Hugo Attali,Thomas Papastergiou,Nathalie Pernelle,Fragkiskos D. Malliaros
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to CIKM 2025

Abstract:Graph Neural Networks (GNNs) have emerged as the leading paradigm for learning over graph-structured data. However, their performance is limited by issues inherent to graph topology, most notably oversquashing and oversmoothing. Recent advances in graph rewiring aim to mitigate these limitations by modifying the graph topology to promote more effective information propagation. In this work, we introduce TRIGON, a novel framework that constructs enriched, non-planar triangulations by learning to select relevant triangles from multiple graph views. By jointly optimizing triangle selection and downstream classification performance, our method produces a rewired graph with markedly improved structural properties such as reduced diameter, increased spectral gap, and lower effective resistance compared to existing rewiring methods. Empirical results demonstrate that TRIGON outperforms state-of-the-art approaches on node classification tasks across a range of homophilic and heterophilic benchmarks.
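
One of the structural properties mentioned above, graph diameter, is easy to check on a toy graph. The sketch below does not implement TRIGON's learned triangle selection; it simply adds two hand-picked triangle-closing edges to a path graph and measures the diameter before and after with breadth-first search.

```python
from collections import deque


def diameter(n, edges):
    """Diameter of a connected undirected graph on nodes 0..n-1 (BFS from every node)."""
    adjacency = {v: set() for v in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    longest = 0
    for source in range(n):
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        longest = max(longest, max(distance.values()))
    return longest


path = [(i, i + 1) for i in range(5)]   # path graph 0-1-2-3-4-5
rewired = path + [(0, 2), (2, 4)]       # close triangles (0,1,2) and (2,3,4)
assert diameter(6, rewired) < diameter(6, path)
```

Shorter diameters mean fewer message-passing hops between distant nodes, which is one reason rewiring is studied as a remedy for oversquashing.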

[AI-18] Can Structured Templates Facilitate LLMs in Tackling Harder Tasks?: An Exploration of Scaling Laws by Difficulty

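[Quick read]: This paper addresses the lack of deep procedural reasoning in Large Language Models (LLMs) on complex mathematical tasks, particularly tasks that require structured, step-by-step thinking. Existing post-training methods improve performance but struggle to capture deep procedural logic. The key to the solution is the Structured Solution Template (SST) framework, built on three mechanisms: (1) fine-tuning with structured solution-template chains and a dynamically weighted loss that prioritizes procedural logic; (2) injecting solution templates at inference time as cognitive scaffolds that guide generation; and (3) integrated curriculum fine-tuning that explicitly trains the model to self-plan, execute, and self-correct. Experiments show that SST significantly improves accuracy and efficiency on GSM8K, AIME24, and the newly proposed Dynamic En benchmark, with the largest gains on harder problems.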

Link: https://arxiv.org/abs/2508.19069
Authors: Zhichao Yang,Zhaoxin Fan,Gen Li,Yuanze Hu,Xinyu Wang,Ye Qiu,Xin Wang,Yifan Sun,Wenjun Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Structured, procedural reasoning is essential for Large Language Models (LLMs), especially in mathematics. While post-training methods have improved LLM performance, they still fall short in capturing deep procedural logic on complex tasks. To tackle the issue, in this paper, we first investigate this limitation and uncover a novel finding: a Scaling Law by Difficulty, which reveals that model performance follows a U-shaped curve with respect to training data complexity – excessive low-difficulty data impedes abstraction, while high-difficulty data significantly enhances reasoning ability. Motivated by this, we propose the Structured Solution Template (SST) framework, which uses solution templates and a curriculum of varied difficulty to explicitly teach procedural reasoning. Specifically, SST comprises (1) fine-tuning with structured solution-template chains and dynamically weighted loss to prioritize procedural logic, (2) prompt-time injection of solution templates as cognitive scaffolds to guide inference, and (3) integrated curriculum fine-tuning that explicitly teaches the model to self-plan - execute - self-correct. Experiments on GSM8K, AIME24, and new Dynamic En benchmark show that SST significantly improves both accuracy and efficiency, especially on harder problems.

[AI-19] Tackling Federated Unlearning as a Parameter Estimation Problem

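[Quick read]: This paper addresses the privacy challenge of data erasure in Federated Learning (FL): removing the influence of a specific client's or category's data on a model without retraining the whole model or coordinating all clients. The key to its solution is an information-theoretic Federated Unlearning framework that models data leakage as a parameter estimation problem, uses second-order Hessian information to identify the model parameters most sensitive to the data being forgotten, selectively resets only those parameters, and then performs minimal federated retraining. The method is model-agnostic and does not require server access to raw client data after the initial information aggregation. On benchmark datasets it achieves membership inference attack (MIA) success rates near random while keeping performance comparable to full retraining (Normalized Accuracy ≈ 0.9), and in a targeted backdoor attack scenario it neutralizes the malicious trigger and restores model integrity.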

Link: https://arxiv.org/abs/2508.19065
Authors: Antonio Balordi,Lorenzo Manini,Fabio Stella,Alessio Merlo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 18 pages, 1 figure

Abstract:Privacy regulations require the erasure of data from deep learning models. This is a significant challenge that is amplified in Federated Learning, where data remains on clients, making full retraining or coordinated updates often infeasible. This work introduces an efficient Federated Unlearning framework based on information theory, modeling leakage as a parameter estimation problem. Our method uses second-order Hessian information to identify and selectively reset only the parameters most sensitive to the data being forgotten, followed by minimal federated retraining. This model-agnostic approach supports categorical and client unlearning without requiring server access to raw client data after initial information aggregation. Evaluations on benchmark datasets demonstrate strong privacy (MIA success near random, categorical knowledge erased) and high performance (Normalized Accuracy against re-trained benchmarks of ≈ 0.9), while aiming for increased efficiency over complete retraining. Furthermore, in a targeted backdoor attack scenario, our framework effectively neutralizes the malicious trigger, restoring model integrity. This offers a practical solution for data forgetting in FL.

[AI-20] A Concurrent Modular Agent: Framework for Autonomous LLM Agents

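[Quick read]: This paper addresses a long-standing difficulty in agent architectures: achieving coherent, fault-tolerant, and adaptive behavior while keeping the system modular and asynchronously executed. Existing approaches struggle to coordinate interactions among independent modules, which leaves systems inflexible and poorly aware of context. The key to the solution is the Concurrent Modular Agent (CMA) framework, whose core mechanism is language-mediated asynchronous communication between modules, so that intention emerges from the interactions among autonomous processes. The framework offloads reasoning to a Large Language Model (LLM) and combines it with a shared global state to produce a flexible, adaptive, and fault-tolerant behavioral loop. The authors regard this design as a practical realization of Minsky's Society of Mind theory.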

Link: https://arxiv.org/abs/2508.19042
Authors: Norihiro Maruyama,Takahide Yoshida,Hiroki Sato,Atsushi Masumori,Johnsmith,Takashi Ikegami
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce the Concurrent Modular Agent (CMA), a framework that orchestrates multiple Large-Language-Model (LLM)-based modules that operate fully asynchronously yet maintain a coherent and fault-tolerant behavioral loop. This framework addresses long-standing difficulties in agent architectures by letting intention emerge from language-mediated interactions among autonomous processes. This approach enables flexible, adaptive, and context-dependent behavior through the combination of concurrently executed modules that offload reasoning to an LLM, inter-module communication, and a single shared global state. We consider this approach to be a practical realization of Minsky’s Society of Mind theory. We demonstrate the viability of our system through two practical use-case studies. The emergent properties observed in our system suggest that complex cognitive phenomena like self-awareness may indeed arise from the organized interaction of simpler processes, supporting Minsky’s Society of Mind concept and opening new avenues for artificial intelligence research. The source code for our work is available at: this https URL.

[AI-21] Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction

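[Quick read]: This paper addresses the inadequacy of existing tasks for evaluating the reasoning ability of Large Language Models (LLMs) in interactive, unknown environments, and in particular their neglect of the integrated reasoning process (deductive, inductive, and abductive) that humans rely on in the real world. The key to its solution is a new evaluation paradigm, black-box interaction: an LLM interacts with a hidden function (the black box) for a limited number of turns, observes input-output pairs, and reasons about the underlying rule, which tests its integrated reasoning during dynamic exploration. The resulting Oracle benchmark contains 6 types of black-box task and 96 black boxes, and is used to evaluate 19 modern LLMs. The results reveal a common weakness in high-level planning: models have marked difficulty devising efficient, adaptive exploration strategies to iteratively refine their hypotheses.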

Link: https://arxiv.org/abs/2508.19035
Authors: Congchi Yin,Tianyi Wu,Yankai Shu,Alex Gu,Yunhan Wang,Jun Shao,Xun Jiang,Piji Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment. This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for humans' discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it in given exploration turns, and reasoning over observed input-output pairs. Leveraging this idea, we build the Oracle benchmark which comprises 6 types of black-box task and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes. But it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: They lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.

[AI-22] Metric Matters: A Formal Evaluation of Similarity Measures in Active Learning for Cyber Threat Intelligence

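[Quick read]: This paper addresses two challenges in detecting Advanced Persistent Threats (APTs): the stealthy behavior of APT attacks, and the extreme class imbalance of detection datasets. The authors propose an active learning-based anomaly detection framework whose key idea is to use similarity search in feature space to iteratively refine the decision space. Built on an Attention-Based Autoencoder, the framework identifies normal-like and anomaly-like instances, which improves model robustness with minimal oracle supervision. The study also formally evaluates a range of similarity measures and shows that the choice of similarity function affects sample selection, anomaly ranking, model convergence, and label efficiency, yielding actionable guidance for active learning pipelines in threat intelligence and cyber defense.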

Link: https://arxiv.org/abs/2508.19019
Authors: Sidahmed Benabderrahmane,Talal Rahwan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Advanced Persistent Threats (APTs) pose a severe challenge to cyber defense due to their stealthy behavior and the extreme class imbalance inherent in detection datasets. To address these issues, we propose a novel active learning-based anomaly detection framework that leverages similarity search to iteratively refine the decision space. Built upon an Attention-Based Autoencoder, our approach uses feature-space similarity to identify normal-like and anomaly-like instances, thereby enhancing model robustness with minimal oracle supervision. Crucially, we perform a formal evaluation of various similarity measures to understand their influence on sample selection and anomaly ranking effectiveness. Through experiments on diverse datasets, including DARPA Transparent Computing APT traces, we demonstrate that the choice of similarity metric significantly impacts model convergence, anomaly detection accuracy, and label efficiency. Our results offer actionable insights for selecting similarity functions in active learning pipelines tailored for threat intelligence and cyber defense.
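
Why the similarity metric matters can be seen on a two-point example in which cosine similarity and a Euclidean-based similarity pick different nearest candidates for the same query. The vectors and the helper `most_similar` are hypothetical; the paper's own feature space and metric list are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (ignores vector length)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def euclidean_similarity(u, v):
    """Map Euclidean distance into (0, 1]; closer points score higher."""
    return 1.0 / (1.0 + math.dist(u, v))


def most_similar(query, candidates, similarity):
    """Name of the candidate that the given similarity function ranks first."""
    return max(candidates, key=lambda name: similarity(query, candidates[name]))


query = (1.0, 0.0)
candidates = {
    "same_direction_far": (10.0, 0.0),
    "nearby_rotated": (0.8, 0.6),
}
by_cosine = most_similar(query, candidates, cosine_similarity)
by_euclidean = most_similar(query, candidates, euclidean_similarity)
assert by_cosine != by_euclidean
```

In an active learning loop, such a disagreement changes which unlabeled instances are treated as normal-like or anomaly-like, and therefore which ones are sent to the oracle.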

[AI-23] MAB Optimizer for Estimating Math Question Difficulty via Inverse CV without NLP

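[Quick read]: This paper addresses the objectivity and generalizability of question difficulty estimation in Intelligent Autonomous Tutoring Systems (IATS), given that traditional human labeling is subjective and existing Natural Language Processing (NLP) methods fail in symbolic domains such as algebra. The key to its solution is a reinforcement learning-based Multi-Armed Bandit (MAB) framework, the Approach of Passive Measures among Educands (APME), which estimates difficulty solely from solver performance data (marks obtained and time taken), without linguistic features or expert labels. Using the inverse coefficient of variation as a risk-adjusted metric, the model provides an explainable and scalable self-supervised mechanism for difficulty estimation. Across three heterogeneous datasets it achieves an average R² of 0.9213 and an average RMSE of 0.0584, and it outperforms regression-based, NLP-driven, and Item Response Theory (IRT) baselines, particularly in purely symbolic domains.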

Link: https://arxiv.org/abs/2508.19014
Authors: Surajit Das,Gourav Roy,Aleksei Eliseev,Ram Kumar Rajendran
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The evolution of technology and education is driving the emergence of Intelligent Autonomous Tutoring Systems (IATS), where objective and domain-agnostic methods for determining question difficulty are essential. Traditional human labeling is subjective, and existing NLP-based approaches fail in symbolic domains like algebra. This study introduces the Approach of Passive Measures among Educands (APME), a reinforcement learning-based Multi-Armed Bandit (MAB) framework that estimates difficulty solely from solver performance data (marks obtained and time taken) without requiring linguistic features or expert labels. By leveraging the inverse coefficient of variation as a risk-adjusted metric, the model provides an explainable and scalable mechanism for adaptive assessment. Empirical validation was conducted on three heterogeneous datasets. Across these diverse contexts, the model achieved an average R² of 0.9213 and an average RMSE of 0.0584, confirming its robustness, accuracy, and adaptability to different educational levels and assessment formats. Compared with baseline approaches, such as regression-based, NLP-driven, and IRT models, the proposed framework consistently outperformed alternatives, particularly in purely symbolic domains. The findings highlight that (i) item heterogeneity strongly influences perceived difficulty, and (ii) variance in solver outcomes is as critical as mean performance for adaptive allocation. Pedagogically, the model aligns with Vygotsky's Zone of Proximal Development by identifying tasks that balance challenge and attainability, supporting motivation while minimizing disengagement. This domain-agnostic, self-supervised approach advances difficulty tagging in IATS and can be extended beyond algebra wherever solver interaction data is available.
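
The inverse coefficient of variation is simply the mean divided by the standard deviation, so it rewards high average outcomes and penalizes spread. The sketch below computes it for a list of solver scores; how the paper maps this value onto a difficulty label or a bandit reward is not stated in the abstract, so that step is left out. The use of the population standard deviation is an assumption.

```python
import math
import statistics


def inverse_cv(values):
    """Mean divided by population standard deviation (infinite if there is no spread)."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return math.inf
    return mean / spread


# Hypothetical marks of eight solvers on one question (mean 5, standard deviation 2).
marks = [2, 4, 4, 4, 5, 5, 7, 9]
score = inverse_cv(marks)  # 5 / 2 = 2.5
```

Two questions with the same mean mark can therefore receive different scores if solvers disagree more on one of them, which matches the abstract's point that variance matters as much as mean performance.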

[AI-24] STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems

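[Quick read]: This paper addresses missing value imputation in industrial time series, targeting the poor performance of conventional deep learning methods based on fixed time windows in non-stationary, control-driven dynamic systems. The key to its solution is STDiff, which reframes imputation as learning how the system evolves from one state to the next. It uses a conditional denoising diffusion model with a causal bias and generates missing values step by step from the most recent known state and the relevant control or environmental inputs, which better captures the dynamics of industrial systems. On a public wastewater treatment dataset with simulated gaps, STDiff achieves the lowest errors, with a larger advantage for longer gaps; on a raw industrial dataset it produces dynamically plausible trajectories where window-based models tend to flatten or over-smooth.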

Link: https://arxiv.org/abs/2508.19011
Authors: Gary Simethy,Daniel Ortiz-Arroyo,Petar Durdevic
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Most deep learning methods for imputing missing values treat the task as completing patterns within a fixed time window. This assumption often fails in industrial systems, where dynamics are driven by control actions, are highly non-stationary, and can experience long, uninterrupted gaps. We propose STDiff, which reframes imputation as learning how the system evolves from one state to the next. STDiff uses a conditional denoising diffusion model with a causal bias aligned to control theory, generating missing values step-by-step based on the most recent known state and relevant control or environmental inputs. On a public wastewater treatment dataset with simulated missing blocks, STDiff consistently achieves the lowest errors, with its advantage increasing for longer gaps. On a raw industrial dataset with substantial real gaps, it produces trajectories that remain dynamically plausible, in contrast to window-based models that tend to flatten or over-smooth. These results support dynamics-aware, explicitly conditioned imputation as a robust approach for industrial time series, and we discuss computational trade-offs and extensions to broader domains.

[AI-25] Sense of Self and Time in Borderline Personality. A Comparative Robustness Study with Generative AI

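[Quick read]: This paper examines how Generative AI can support descriptive phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. The key to its approach is prompting three large language models (LLMs) to mimic the interpretative style of the original human investigators, and having blinded and non-blinded expert judges assess the output for semantic congruence, Jaccard coefficients, and multidimensional validity (credibility, coherence, substantiveness, and groundedness in data). Google Gemini 2.5 Pro came closest to the human analysis in theme overlap (58%) and expert judgment, with validity scores significantly higher than GPT-4o and Claude Opus 4 (p < 0.0001). The models also recovered themes that the human analysts had omitted, suggesting that AI can help mitigate human interpretative bias and make qualitative analysis more complete.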

Link: https://arxiv.org/abs/2508.19008
Authors: Marcin Moskalewicz,Anna Sterna,Marek Pokropski,Paula Flores
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 4 tables, submitted to “Personality and Individual Differences”

Abstract:This study examines the capacity of large language models (LLMs) to support phenomenological qualitative analysis of first-person experience in Borderline Personality Disorder (BPD), understood as a disorder of temporality and selfhood. Building on a prior human-led thematic analysis of 24 inpatients’ life-story interviews, we compared three LLMs (OpenAI GPT-4o, Google Gemini 2.5 Pro, Anthropic Claude Opus 4) prompted to mimic the interpretative style of the original investigators. The models were evaluated with blinded and non-blinded expert judges in phenomenology and clinical psychology. Assessments included semantic congruence, Jaccard coefficients, and multidimensional validity ratings (credibility, coherence, substantiveness, and groundness in data). Results showed variable overlap with the human analysis, from 0 percent in GPT to 42 percent in Claude and 58 percent in Gemini, and a low Jaccard coefficient (0.21-0.28). However, the models recovered themes omitted by humans. Gemini’s output most closely resembled the human analysis, with validity scores significantly higher than GPT and Claude (p < 0.0001), and was judged as human by blinded experts. All scores strongly correlated (R 0.78) with the quantity of text and words per theme, highlighting both the variability and potential of AI-augmented thematic analysis to mitigate human interpretative bias.
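
The Jaccard coefficient used in the comparison is the size of the intersection of two sets divided by the size of their union. A minimal sketch on hypothetical theme labels (not taken from the study) follows; treating two empty sets as identical is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)


# Hypothetical theme sets from a human analyst and from a model.
human_themes = {"unstable self", "fragmented time", "emptiness"}
model_themes = {"fragmented time", "emptiness", "fear of abandonment", "impulsivity"}
overlap = jaccard(human_themes, model_themes)  # 2 shared themes out of 5 distinct ones
```

Because the union grows whenever either side adds a theme the other lacks, a model that recovers extra themes can score a low Jaccard coefficient even when it matches many human themes.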

[AI-26] AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms

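[Quick read]: The core question of this paper is how social norms are acquired and represented. Cognitive science traditionally holds that humans learn norms mainly through embodied social experience; this paper investigates whether large language models can reach a sophisticated understanding of social norms through statistical learning alone. The key to its approach is having the models predict social appropriateness judgments for 555 everyday scenarios and scoring their accuracy against the average human judgment. Models such as GPT-4.5 and Gemini 2.5 Pro predict the collective judgment on a continuous scale more accurately than the great majority of individual human participants (GPT-4.5 reaches the 100th percentile). This indicates that statistical learning over linguistic data is enough to produce sophisticated social cognition, and it challenges theories that treat embodied experience as indispensable.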

Link: https://arxiv.org/abs/2508.19004
Authors: Pontus Strimling,Simon Karlsson,Irina Vartanova,Kimmo Eriksson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages + supplementary materials

Abstract:A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we systematically evaluated multiple AI systems’ ability to predict human social appropriateness judgments for 555 everyday scenarios by examining how closely they predicted the average judgment compared to each human participant. In Study 1, GPT-4.5’s accuracy in predicting the collective judgment on a continuous scale exceeded that of every human participant (100th percentile). Study 2 replicated this, with Gemini 2.5 Pro outperforming 98.7% of humans, GPT-5 97.8%, and Claude Sonnet 4 96.0%. Despite this predictive power, all models showed systematic, correlated errors. These findings demonstrate that sophisticated models of social cognition can emerge from statistical learning over linguistic data alone, challenging strong versions of theories emphasizing the exclusive necessity of embodied experience for cultural competence. The systematic nature of AI limitations across different architectures indicates potential boundaries of pattern-based social understanding, while the models’ ability to outperform nearly all individual humans in this predictive task suggests that language serves as a remarkably rich repository for cultural knowledge transmission.

[AI-27] GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

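[Quick read]: This paper addresses the lack of systematic evaluation of how well code agents exploit large-scale code repositories (such as GitHub) in real-world software development. Existing benchmarks mostly focus on writing code from scratch and overlook the ability to complete complex tasks by analyzing and operating on open-source projects. To fill this gap, the authors propose GitTaskBench, a benchmark of 54 realistic tasks across 7 modalities and 7 domains, each paired with an automated, human-curated evaluation harness that specifies practical success criteria. A central contribution is the alpha-value metric, which quantifies the economic benefit of agent performance by combining task success rate, token cost, and average developer salary. Experiments show that even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks; over half of the failures stem from seemingly mundane yet critical steps such as environment setup and dependency resolution, which points to the need for more robust workflow management and better timeout handling.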

Link: https://arxiv.org/abs/2508.18993
Authors: Ziyi Ni,Huacan Wang,Shuo Zhang,Shuo Lu,Ziyang He,Wang You,Zhenheng Tang,Yuntao Du,Bill Sun,Hongzhang Liu,Sen Hu,Ronghao Chen,Bo Li,Xin Li,Chen Hu,Binxing Jiao,Daxin Jiang,Pin Lyu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Highly practical, Well-motivated, Actionable

Abstract:Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks. Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness. By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment – moving agents closer to solving complex, end-to-end real-world tasks. The benchmark and code are open-sourced at this https URL.

[AI-28] Enabling MoE on the Edge via Importance-Driven Expert Scheduling

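[Quick read]: This paper addresses the memory constraints of deploying sparsely activated Mixture of Experts (MoE) models on consumer-grade edge devices, and in particular how dynamic expert offloading can balance memory footprint, data transfer overhead, and model accuracy. The key to its solution is using expert importance to guide decisions: activated experts of low importance are substituted with functionally similar experts already cached in GPU memory, which reduces memory usage and PCIe transfer overhead without a significant loss of accuracy. A scheduling policy that maximizes the reuse ratio of GPU-cached experts further improves efficiency.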

Link: https://arxiv.org/abs/2508.18983
Authors: Guoying Zhu,Meng Li,Haipeng Dai,Xuechen Liu,Weijun Wang,Keran Li,Jun xiao,Ligeng Chen,Wei Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
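
The substitution rule described above can be sketched as a small scheduling function. Everything here is a hypothetical illustration: the importance scores, the similarity lists, the threshold, and the name `schedule_experts` are assumptions, and the paper's actual policy for maximizing cache reuse is not reproduced.

```python
def schedule_experts(activated, cached, importance, similar, threshold):
    """Decide which experts to run and which must be loaded over PCIe.

    activated: expert ids chosen by the router for this token.
    cached: set of expert ids already resident in GPU memory.
    importance: dict mapping expert id to its importance for this token.
    similar: dict mapping expert id to candidate substitutes, best first.
    Returns (experts_to_run, experts_to_load).
    """
    to_run, to_load = [], []
    for expert in activated:
        if expert in cached:
            to_run.append(expert)
            continue
        substitute = None
        if importance[expert] < threshold:
            # Low-importance miss: look for a similar expert already on the GPU.
            substitute = next((s for s in similar.get(expert, []) if s in cached), None)
        if substitute is not None:
            to_run.append(substitute)
        else:
            to_run.append(expert)
            to_load.append(expert)  # high importance or no cached substitute
    return to_run, to_load


run, load = schedule_experts(
    activated=[0, 1, 2],
    cached={0, 3},
    importance={0: 0.9, 1: 0.1, 2: 0.8},
    similar={1: [3], 2: [3]},
    threshold=0.5,
)
```

In this toy call only expert 2 triggers a transfer: expert 0 is a cache hit, and the low-importance expert 1 is served by its cached neighbor.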

[AI-29] PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations

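[Quick read]: This paper addresses the lack of interpretability in time series forecasting models, and in particular the unsuitability of existing post-hoc explanation methods such as LIME for the forecasting setting. Its solution is PAX-TS, a model-agnostic post-hoc explanation algorithm based on localized input perturbations that produces multi-granular explanations and can characterize cross-channel correlations in multivariate forecasts. From the time step correlation matrices produced by the benchmark, the authors identify six classes of recurring patterns that are indicators of forecasting performance, giving a quantitative basis for comparing how different forecasting models behave.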

Link: https://arxiv.org/abs/2508.18982
Authors: Tim Kreuzer,Jelena Zdravkovic,Panagiotis Papapetrou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series forecasting has seen considerable improvement during the last years, with transformer models and large language models driving advancements of the state of the art. Modern forecasting models are generally opaque and do not provide explanations for their forecasts, while well-known post-hoc explainability methods like LIME are not suitable for the forecasting context. We propose PAX-TS, a model-agnostic post-hoc algorithm to explain time series forecasting models and their forecasts. Our method is based on localized input perturbations and results in multi-granular explanations. Further, it is able to characterize cross-channel correlations for multivariate time series forecasts. We clearly outline the algorithmic procedure behind PAX-TS, demonstrate it on a benchmark with 7 algorithms and 10 diverse datasets, compare it with two other state-of-the-art explanation algorithms, and present the different explanation types of the method. We found that the explanations of high-performing and low-performing algorithms differ on the same datasets, highlighting that the explanations of PAX-TS effectively capture a model’s behavior. Based on time step correlation matrices resulting from the benchmark, we identify 6 classes of patterns that repeatedly occur across different datasets and algorithms. We found that the patterns are indicators of performance, with noticeable differences in forecasting error between the classes. Lastly, we outline a multivariate example where PAX-TS demonstrates how the forecasting model takes cross-channel correlations into account. With PAX-TS, time series forecasting models’ mechanisms can be illustrated in different levels of detail, and its explanations can be used to answer practical questions on forecasts.
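
The general idea of explaining a forecaster through localized input perturbations can be sketched in a few lines: nudge one input time step at a time and record how much the forecast moves. This is a generic illustration, not the PAX-TS algorithm itself; `toy_forecast` is a hypothetical model that only looks at the last two observations.

```python
def toy_forecast(series):
    """Hypothetical one-step forecaster: weighted sum of the last two values."""
    return 0.5 * series[-1] + 0.25 * series[-2]


def perturbation_importance(forecast, series, delta=1.0):
    """Absolute change in the forecast when each time step is shifted by `delta`."""
    baseline = forecast(series)
    scores = []
    for t in range(len(series)):
        perturbed = list(series)
        perturbed[t] += delta  # localized perturbation of a single time step
        scores.append(abs(forecast(perturbed) - baseline))
    return scores


history = [1, 2, 3, 4]
scores = perturbation_importance(toy_forecast, history)
# The two oldest steps score 0 because the toy model never reads them.
```

Because the procedure only calls `forecast`, it works with any model, which is what "model-agnostic" means in this context.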

[AI-30] Novel Approaches to Artificial Intelligence Development Based on the Nearest Neighbor Method

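[Quick read]: This paper addresses fundamental limitations of modern neural network technologies (such as large language models) in practice, including hallucination effects, the high computational complexity of training and inference, costly fine-tuning, and catastrophic forgetting, all of which hinder the reliable use of neural networks in critical areas such as medicine, industrial process management, and scientific research. The key to its solution is an alternative approach based on the k-nearest neighbors (k-NN) method combined with hierarchical clustering structures: tree-like data structures built on Kohonen self-organizing maps greatly accelerate nearest neighbor search, and the model can be expanded and fine-tuned without retraining the entire network. According to the authors, the method significantly reduces or eliminates hallucination effects, is transparent and interpretable, aligns closely with human cognitive mechanisms, and suits tasks that demand high reliability and explainable results.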

Link: https://arxiv.org/abs/2508.18953
Authors: I.I. Priezzhev,D.A. Danko,A.V. Shubin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures. Novel hierarchical neural networks based on k-nearest neighbors method for addressing hallucination effects, training complexity, and catastrophic forgetting in modern AI systems. Includes mathematical formulations using Kohonen self-organizing maps and experimental validation on MNIST handwritten digit recognition and machine translation tasks

Abstract:Modern neural network technologies, including large language models, have achieved remarkable success in various applied artificial intelligence applications, however, they face a range of fundamental limitations. Among them are hallucination effects, high computational complexity of training and inference, costly fine-tuning, and catastrophic forgetting issues. These limitations significantly hinder the use of neural networks in critical areas such as medicine, industrial process management, and scientific research. This article proposes an alternative approach based on the nearest neighbors method with hierarchical clustering structures. Employing the k-nearest neighbors algorithm significantly reduces or completely eliminates hallucination effects while simplifying model expansion and fine-tuning without the need for retraining the entire network. To overcome the high computational load of the k-nearest neighbors method, the paper proposes using tree-like data structures based on Kohonen self-organizing maps, thereby greatly accelerating nearest neighbor searches. Tests conducted on handwritten digit recognition and simple subtitle translation tasks confirmed the effectiveness of the proposed approach. With only a slight reduction in accuracy, the nearest neighbor search time was reduced hundreds of times compared to exhaustive search methods. The proposed method features transparency and interpretability, closely aligns with human cognitive mechanisms, and demonstrates potential for extensive use in tasks requiring high reliability and explainable results.

[AI-31] VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation

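[Quick Read]: This paper addresses the "spurious correlations" that training-data imbalance and label noise introduce into source-code vulnerability detection, which keep Graph Neural Network (GNN) models from generalizing to unseen real-world data. Its core solution is a unified framework, VISION: a Large Language Model (LLM) generates counterfactuals (samples with minimal semantic modifications but opposite labels), and the GNN is trained on these paired examples to weaken its reliance on superficial similarity. Graph-based interpretability then identifies the code statements that actually drive vulnerability predictions and ignores spurious ones, improving both robustness and interpretability.

Link: https://arxiv.org/abs/2508.18933
Authors: David Egea,Barproda Halder,Sanghamitra Dutta
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: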

Abstract:Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn structural and logical code relationships in a data-driven manner. However, their performance is severely constrained by training data imbalances and label noise. GNNs often learn ‘spurious’ correlations from superficial code similarities, producing detectors that fail to generalize well to unseen real-world data. In this work, we propose a unified framework for robust and interpretable vulnerability detection, called VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications but opposite labels. Our framework includes: (i) generating counterfactuals by prompting a Large Language Model (LLM); (ii) targeted GNN training on paired code examples with opposite labels; and (iii) graph-based interpretability to identify the crucial code statements relevant for vulnerability predictions while ignoring spurious ones. We find that VISION reduces spurious learning and enables more robust, generalizable detection, improving overall accuracy (from 51.8% to 97.8%), pairwise contrast accuracy (from 4.5% to 95.8%), and worst-group accuracy (from 0.7% to 85.5%) on the Common Weakness Enumeration (CWE)-20 vulnerability. We further demonstrate gains using proposed metrics: intra-class attribution variance, inter-class attribution distance, and node score dependency. We also release CWE-20-CFA, a benchmark of 27,556 functions (real and counterfactual) from the high-impact CWE-20 category. Finally, VISION advances transparent and trustworthy AI-based cybersecurity systems through interactive visualization for human-in-the-loop analysis.

[AI-32] Who Is Lagging Behind: Profiling Student Behaviors with Graph-Level Encoding in Curriculum-Based Online Learning Systems

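[Quick Read]: This paper addresses the risk that Intelligent Tutoring Systems (ITSs) widen student performance gaps in educational practice. The core challenge is tracking differences in students' learning behaviors and performance well enough to intervene precisely. The key to the solution is CTGraph, a graph-level representation learning method that profiles learner behavior and performance in a self-supervised manner across several dimensions, including content coverage, learning intensity, and proficiency in individual concepts, while using the curriculum structure to capture the diversity of individual learning paths. Experiments show that CTGraph gives a holistic view of student learning journeys, identifies struggling students, and supports comparative analysis across groups, giving educators fine-grained insight and a basis for intervention.

Link: https://arxiv.org/abs/2508.18925
Authors: Qian Xiao,Conn Breathnach,Ioana Ghergulescu,Conor O’Sullivan,Keith Johnston,Vincent Wade
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: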

Abstract:The surge in the adoption of Intelligent Tutoring Systems (ITSs) in education, while being integral to curriculum-based learning, can inadvertently exacerbate performance gaps. To address this problem, student profiling becomes crucial for tracking progress, identifying struggling students, and alleviating disparities among students. Such profiling requires measuring student behaviors and performance across different aspects, such as content coverage, learning intensity, and proficiency in different concepts within a learning topic. In this study, we introduce CTGraph, a graph-level representation learning approach to profile learner behaviors and performance in a self-supervised manner. Our experiments demonstrate that CTGraph can provide a holistic view of student learning journeys, accounting for different aspects of student behaviors and performance, as well as variations in their learning paths as aligned to the curriculum structure. We also show that our approach can identify struggling students and provide comparative analysis of diverse groups to pinpoint when and where students are struggling. As such, our approach opens more opportunities to empower educators with rich insights into student learning journeys and paves the way for more targeted interventions.

[AI-33] HierCVAE: Hierarchical Attention-Driven Conditional Variational Autoencoders for Multi-Scale Temporal Modeling

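[Quick Read]: This paper addresses temporal modeling in complex systems, namely how to capture dependencies across multiple time scales while managing inherent uncertainty. The core challenge is modeling short-term dynamics, long-term trends, and complex multivariate dependencies at the same time. The key to the solution is the HierCVAE architecture, which combines hierarchical attention (local, global, cross-temporal) with a Conditional Variational Autoencoder (CVAE): a three-tier attention structure and multi-modal condition encoding capture temporal, statistical, and trend information, ResFormer blocks in the latent space strengthen feature representation, and prediction heads provide explicit uncertainty quantification. On energy consumption datasets, the method improves prediction accuracy by 15-40% over state-of-the-art methods and shows better uncertainty calibration, with particular strength in long-term forecasting and multivariate dependency modeling.

Link: https://arxiv.org/abs/2508.18922
Authors: Yao Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures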

Abstract:Temporal modeling in complex systems requires capturing dependencies across multiple time scales while managing inherent uncertainties. We propose HierCVAE, a novel architecture that integrates hierarchical attention mechanisms with conditional variational autoencoders to address these challenges. HierCVAE employs a three-tier attention structure (local, global, cross-temporal) combined with multi-modal condition encoding to capture temporal, statistical, and trend information. The approach incorporates ResFormer blocks in the latent space and provides explicit uncertainty quantification via prediction heads. Through evaluations on energy consumption datasets, HierCVAE demonstrates a 15-40% improvement in prediction accuracy and superior uncertainty calibration compared to state-of-the-art methods, excelling in long-term forecasting and complex multi-variate dependencies.

[AI-34] Enhancing Model Privacy in Federated Learning with Random Masking and Quantization

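[Quick Read]: This paper addresses insufficient protection of model parameters in Federated Learning. The key to the solution is a new method that markedly strengthens the protection of model parameters while maintaining model performance, offering better privacy protection than baseline methods.

Link: https://arxiv.org/abs/2508.18911
Authors: Zhibo Xu,Jianhao Zhu,Jingwen Xu,Changze Lv,Zisu Huang,Xiaohua Wang,Muling Wu,Qi Qian,Xiaoqing Zheng,Xuanjing Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: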

Abstract:Experimental results across various models and tasks demonstrate that our approach not only maintains strong model performance in federated learning settings but also achieves enhanced protection of model parameters compared to baseline methods.

[AI-35] SegReConcat: A Data Augmentation Method for Voice Anonymization Attack

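[Quick Read]: This paper addresses the privacy leakage caused by residual speaker cues in voice anonymization: it is hard to fully conceal speaker identity while keeping speech data useful. The key to the solution is SegReConcat, a data augmentation method that segments anonymized speech at the word level, rearranges the segments randomly or by similarity to disrupt long-term contextual cues, and concatenates the result with the original utterance. This strengthens the attacker-side Automatic Speaker Verification (ASV) system by letting the attacker learn source speaker traits from multiple perspectives, which in turn makes it possible to assess how well different anonymization systems withstand such attacks.

Link: https://arxiv.org/abs/2508.18907
Authors: Ridwan Arefeen,Xiaoxiao Miao,Rong Tong,Aik Beng Ng,Simon See
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted by APSIPA ASC 2025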

Abstract:Anonymization of voice seeks to conceal the identity of the speaker while maintaining the utility of speech data. However, residual speaker cues often persist, which pose privacy risks. We propose SegReConcat, a data augmentation method for attacker-side enhancement of automatic speaker verification systems. SegReConcat segments anonymized speech at the word level, rearranges segments using random or similarity-based strategies to disrupt long-term contextual cues, and concatenates them with the original utterance, allowing an attacker to learn source speaker traits from multiple perspectives. The proposed method has been evaluated in the VoicePrivacy Attacker Challenge 2024 framework across seven anonymization systems; SegReConcat improves de-anonymization on five of the seven.
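
The segment, rearrange, and concatenate steps described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: each word-level segment is assumed to be already available as a list of samples (the abstract does not say how word boundaries are obtained), only the random rearrangement strategy is shown, and the function name `seg_re_concat` is hypothetical.

```python
import random


def seg_re_concat(segments, seed=0):
    """Return the original utterance followed by a shuffled copy of it.

    segments: list of word-level segments, each a list of samples
              (assumed to come from some external word segmentation).
    seed:     seed for the random rearrangement, for reproducibility.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)  # random strategy; the similarity-based one is not shown

    # Flatten the segments in their original order and in the shuffled order.
    original = [sample for seg in segments for sample in seg]
    rearranged = [sample for i in order for sample in segments[i]]

    # Concatenate so the training example carries both views of the speaker.
    return original + rearranged


segments = [[1, 2], [3], [4, 5, 6]]
augmented = seg_re_concat(segments, seed=0)
```

The augmented example is twice as long as the input: the first half preserves word order, and the second half contains the same samples with the word order disrupted.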

[AI-36] Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

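[Quick Read]: This paper addresses the limits of standard single-turn, static benchmarks for evaluating Large Language Models (LLMs) on complex tasks such as software engineering: static evaluation does not capture a model's dynamic interaction ability or its systematic weaknesses on multi-requirement programming tasks. The key to the solution is a novel interactive evaluation framework that assesses LLMs through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an "interviewer" LLM that knows the ground-truth solution gives the "interviewee" model minimal, targeted hints to help it correct errors and satisfy the target constraints. This dynamic protocol enables fine-grained diagnosis of model behavior and reveals strengths and systematic weaknesses that static benchmarks cannot measure.

Link: https://arxiv.org/abs/2508.18905
Authors: Dimitrios Rontogiannis,Maxime Peyrard,Nicolas Baldwin,Martin Josifoski,Robert West,Dimitrios Gunopulos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: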

Abstract:Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an “interviewer” LLM, aware of the ground-truth solution, provides minimal, targeted hints to an “interviewee” model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
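
To make the notion of a requirement dependency graph concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the abstract does not describe how the interviewer chooses what to hint at, so the rule used here (hint only at unmet requirements whose prerequisites are already met) is a plausible guess, and the function name `hint_candidates` and the example requirement names are hypothetical.

```python
def hint_candidates(dependencies, satisfied):
    """Return unmet requirements whose prerequisites are all satisfied.

    dependencies: dict mapping each requirement to the list of
                  requirements it depends on (a directed acyclic graph).
    satisfied:    iterable of requirements the interviewee already meets.
    """
    done = set(satisfied)
    return sorted(
        req
        for req, prereqs in dependencies.items()
        if req not in done and all(p in done for p in prereqs)
    )


# Hypothetical task: four requirements with simple dependencies.
dependencies = {
    "load_data": [],
    "train_model": ["load_data"],
    "plot_results": ["train_model"],
    "save_report": ["train_model"],
}
```

With this rule, an interviewer would not hint at `plot_results` before `train_model` is satisfied, which keeps hints minimal and ordered along the graph.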

[AI-37] Distance-informed Neural Processes

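[Quick Read]: This paper addresses the weaknesses of standard Neural Processes (NPs) in uncertainty estimation, in particular their difficulty with uncertainty calibration and with capturing local data dependencies. The key to the solution is the Distance-informed Neural Process (DNP), which uses a global latent variable to model task-level variation and a local latent variable to capture input similarity within a distance-preserving latent space, improving uncertainty calibration and the separation of in-distribution from out-of-distribution data. The mechanism relies on bi-Lipschitz regularization, which bounds distortions in input relationships and encourages relative distances to be preserved in the latent space.

Link: https://arxiv.org/abs/2508.18903
Authors: Aishwarya Venkataramanan,Joachim Denzler
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages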

Abstract:We propose the Distance-informed Neural Process (DNP), a novel variant of Neural Processes that improves uncertainty estimation by combining global and distance-aware local latent structures. Standard Neural Processes (NPs) often rely on a global latent variable and struggle with uncertainty calibration and capturing local data dependencies. DNP addresses these limitations by introducing a global latent variable to model task-level variations and a local latent variable to capture input similarity within a distance-preserving latent space. This is achieved through bi-Lipschitz regularization, which bounds distortions in input relationships and encourages the preservation of relative distances in the latent space. This modeling approach allows DNP to produce better-calibrated uncertainty estimates and more effectively distinguish in- from out-of-distribution data. Empirical results demonstrate that DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks.
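
A bi-Lipschitz constraint asks that latent distances stay within fixed multiples of input distances. The Python sketch below shows one way to measure violations of that condition on a batch of points. It is only an illustration: the abstract does not say how the regularization is implemented, so the pairwise hinge penalty, the bounds `lower` and `upper`, and the function name `bilipschitz_penalty` are all assumptions.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bilipschitz_penalty(inputs, latents, lower=0.5, upper=2.0):
    """Sum of violations of lower <= d(z_i, z_j) / d(x_i, x_j) <= upper.

    inputs:  list of input vectors x_i.
    latents: list of latent vectors z_i, aligned with inputs.
    lower, upper: assumed distortion bounds (hypothetical values).
    """
    penalty = 0.0
    n = len(inputs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = euclid(inputs[i], inputs[j])
            if dx == 0.0:
                continue  # identical inputs carry no distance information
            ratio = euclid(latents[i], latents[j]) / dx
            # Penalize collapsing (ratio too small) and stretching (too large).
            penalty += max(0.0, lower - ratio) + max(0.0, ratio - upper)
    return penalty


inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
identity_latents = list(inputs)            # distances preserved exactly
collapsed_latents = [(0.0, 0.0)] * 3       # all points mapped together
```

An identity mapping incurs no penalty, while a mapping that collapses all points together is penalized on every pair, which is the kind of distortion the regularization is meant to prevent.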

[AI-38] pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data

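[Quick Read]: This paper addresses the limitations of existing Python time series libraries in modularity and in support for sparse and multi-source data, particularly for irregular time series and efficient modeling. The key to the solution is pyFAST, a research-oriented PyTorch framework that explicitly decouples data processing from model computation, achieving a separation of concerns that speeds up experimentation and eases extension. Its data engine supports multi-source loading, protein sequence handling, dynamic normalization, and mask-based modeling. It also integrates LLM-inspired architectures for alignment-free fusion of sparse data sources, offers native sparse metrics, specialized loss functions, and flexible exogenous data fusion, and improves training efficiency through batch-based streaming aggregation for evaluation and device synergy, forming a modular, extensible platform for time series modeling.

Link: https://arxiv.org/abs/2508.18891
Authors: Zhijin Wang,Senzhen Wu,Yue Hu,Xiufeng Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: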

Abstract:Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi-source, or sparse data. We introduce pyFAST, a research-oriented PyTorch framework that explicitly decouples data processing from model computation, fostering a cleaner separation of concerns and facilitating rapid experimentation. Its data engine is engineered for complex scenarios, supporting multi-source loading, protein sequence handling, efficient sequence- and patch-level padding, dynamic normalization, and mask-based modeling for both imputation and forecasting. pyFAST integrates LLM-inspired architectures for the alignment-free fusion of sparse data sources and offers native sparse metrics, specialized loss functions, and flexible exogenous data fusion. Training utilities include batch-based streaming aggregation for evaluation and device synergy to maximize computational efficiency. A comprehensive suite of classical and deep learning models (Linears, CNNs, RNNs, Transformers, and GNNs) is provided within a modular architecture that encourages extension. Released under the MIT license at GitHub, pyFAST provides a compact yet powerful platform for advancing time series research and applications.

[AI-39] HAEPO: History-Aggregated Exploratory Policy Optimization

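[Quick Read]: This paper addresses the limited exploration of existing policy optimization methods such as DPO and GRPO on long-horizon tasks. These methods rely on full-sequence log-likelihoods or on aggregating per-token ratios, which makes thorough exploration over long trajectories difficult. The key to the solution is History-Aggregated Exploratory Policy Optimization (HAEPO). It compresses each trajectory into its cumulative log-likelihood and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, encouraging broader exploration. It also adds entropy regularization and a soft KL penalty relative to a frozen reference policy to stabilize aggressive updates and prevent premature collapse, yielding an efficient, stable, and interpretable balance between exploration and exploitation.

Link: https://arxiv.org/abs/2508.18884
Authors: Gaurish Trivedi,Alakh Sharma,Kartikey Singh Bhandari,Dhruv Kumar,Pratik Narang,Jagat Sesh Challa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review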

Abstract:Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model’s decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates and prevent premature collapse, and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better than or on par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
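
The trajectory-level weighting can be illustrated with a short Python sketch. It rests on one reading of the abstract and is not the authors' loss: it assumes the softmax is taken over trajectory returns, that the resulting weights multiply each trajectory's cumulative log-likelihood, and it omits the entropy regularization and KL penalty entirely. The names `softmax`, `weighted_trajectory_loss`, and `temperature` are hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_trajectory_loss(token_logprobs, returns, temperature=1.0):
    """Return (cumulative log-likelihoods, weights, surrogate loss).

    token_logprobs: one list of per-token log-probabilities per trajectory.
    returns:        one scalar return per trajectory.
    Assumption: weights come from a softmax over returns; the entropy
    and KL terms mentioned in the abstract are not modeled here.
    """
    # Compress each trajectory into a single cumulative log-likelihood.
    cumulative = [sum(lps) for lps in token_logprobs]
    # Normalized weights across trajectories, larger for higher returns.
    weights = softmax(returns, temperature)
    # Negative weighted log-likelihood: minimizing it raises the likelihood
    # of high-return trajectories more than that of low-return ones.
    loss = -sum(w * c for w, c in zip(weights, cumulative))
    return cumulative, weights, loss


token_logprobs = [[-0.1, -0.2], [-1.0, -1.0]]
returns = [1.0, 0.0]
cumulative, weights, loss = weighted_trajectory_loss(token_logprobs, returns)
```

Because each trajectory contributes a single scalar, the update depends on the whole history of a rollout rather than on individual token ratios.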

[AI-40] Judicial Requirements for Generative AI in Legal Reasoning

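[Quick Read]: This paper addresses the significant gap between the reasoning abilities of Large Language Models (LLMs) and the rigor that judicial decision-making requires when LLMs are applied in high-stakes fields such as law. The core challenge is that the probabilistic nature of LLMs is hard to reconcile with legal interpretation's demand for choice-driven, transparent, and justifiable reasoning. The key to the solution is to decompose legal reasoning systematically into concrete tasks under the IRAC (Issue-Rule-Application-Conclusion) framework and, for the two hardest phases, determining the Rule and performing the Application to the facts, to map AI enhancement mechanisms such as Retrieval-Augmented Generation (RAG), multi-agent systems, and neuro-symbolic AI onto the corresponding core requirements, assessing their potential to compensate for LLM limitations. The study finds that these techniques ease some problems, but tasks involving discretion and explainable reasoning remain a major challenge. The most effective current role for AI is therefore as a high-volume assistant for simple, repetitive cases and as a "sparring partner" for human experts in complex ones.

Link: https://arxiv.org/abs/2508.18880
Authors: Eljas Linna,Tuula Linna
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: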

Abstract:Large Language Models (LLMs) are being integrated into professional domains, yet their limitations in high-stakes fields like law remain poorly understood. This paper defines the core capabilities that an AI system must possess to function as a reliable reasoning tool in judicial decision-making. Using the IRAC (Issue-Rule-Application-Conclusion) model as an analytical framework, the study focuses on the most challenging phases of legal adjudication: determining the applicable Rule (R) and performing the Application (A) of that rule to the facts of a case. From a judicial perspective, the analysis deconstructs legal reasoning into a series of core requirements, including the ability to select the correct legal framework across jurisdictions, generate sound arguments based on the doctrine of legal sources, distinguish ratio decidendi from obiter dictum in case law, resolve ambiguity arising from general clauses like “reasonableness”, manage conflicting legal provisions, and correctly apply the burden of proof. The paper then maps various AI enhancement mechanisms, such as Retrieval-Augmented Generation (RAG), multi-agent systems, and neuro-symbolic AI, to these requirements, assessing their potential to bridge the gap between the probabilistic nature of LLMs and the rigorous, choice-driven demands of legal interpretation. The findings indicate that while these techniques can address specific challenges, significant challenges remain, particularly in tasks requiring discretion and transparent, justifiable reasoning. Our paper concludes that the most effective current role for AI in law is a dual one: as a high-volume assistant for simple, repetitive cases and as a sophisticated “sparring partner” for human experts in complex matters.

[AI-41] Optimization of Latent-Space Compression using Game-Theoretic Techniques for Transformer-Based Vector Search

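[Quick Read]: This paper addresses the scalability and efficiency bottlenecks that high-dimensional latent spaces cause in Transformer-based vector search systems. The key to the solution is a game-theoretic framework that models the compression strategy as a zero-sum game between retrieval accuracy and storage efficiency, from which it derives a latent-space transformation that preserves semantic similarity while reducing redundancy. Benchmarked against FAISS, the method reports markedly higher average similarity (0.9981 vs. 0.5517) and utility (0.8873 vs. 0.5194) at the cost of a modest increase in query time. The authors position it for large-scale retrieval in LLM pipelines.

Link: https://arxiv.org/abs/2508.18877
Authors: Kushagra Agrawal,Nisharg Nargund,Oishani Banerjee
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: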

Abstract:Vector similarity search plays a pivotal role in modern information retrieval systems, especially when powered by transformer-based embeddings. However, the scalability and efficiency of such systems are often hindered by the high dimensionality of latent representations. In this paper, we propose a novel game-theoretic framework for optimizing latent-space compression to enhance both the efficiency and semantic utility of vector search. By modeling the compression strategy as a zero-sum game between retrieval accuracy and storage efficiency, we derive a latent transformation that preserves semantic similarity while reducing redundancy. We benchmark our method against FAISS, a widely-used vector search library, and demonstrate that our approach achieves a significantly higher average similarity (0.9981 vs. 0.5517) and utility (0.8873 vs. 0.5194), albeit with a modest increase in query time. This trade-off highlights the practical value of game-theoretic latent compression in high-utility, transformer-based search applications. The proposed system can be seamlessly integrated into existing LLM pipelines to yield more semantically accurate and computationally efficient retrieval.

[AI-42] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive

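[Quick Read]: This paper addresses the high latency of Large Language Model (LLM) decoding caused by fragmented execution across operators and heavy reliance on off-chip memory, an execution model that limits fusion opportunities and incurs significant memory traffic and kernel launch overhead. The key to the solution is two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common on-chip data exchange and reduction patterns and let thread blocks communicate efficiently without touching off-chip memory. Building on them, the ClusterFusion execution framework schedules communication and computation jointly and fuses decoding stages such as QKV Projection, Attention, and Output Projection into a single kernel, expanding the scope of operator fusion. On H100 GPUs, ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency.

Link: https://arxiv.org/abs/2508.18850
Authors: Xinhao Luo,Zihan Liu,Yangjie Zhou,Shihan Fang,Ziyu Huang,Yu Feng,Chen Zhang,Shixuan Sun,Zhenzhe Zheng,Jingwen Leng,Minyi Guo
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: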

Abstract:Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at this https URL.

[AI-43] STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning

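[Quick Read]: This paper addresses the limits that static user modeling and reactive decision-making place on current recommender systems, and in particular the shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios that arise because Large Language Model (LLM)-based recommendation agents over-rely on heuristic pattern matching. The key to the solution is STARec, a slow-thinking augmented agent framework that gives recommender systems autonomous deliberative reasoning. Each user is modeled as an agent with parallel cognition: fast response for immediate interactions and slow chain-of-thought reasoning. Anchored reinforcement training, a two-stage paradigm combining knowledge distillation from advanced reasoning models with preference-aligned reward shaping, builds foundational capabilities (preference summarization, rationale generation) and improves dynamic policy adaptation.

Link: https://arxiv.org/abs/2508.18812
Authors: Chenghao Wu,Ruiyang Ren,Junjie Zhang,Ruirui Wang,Zhongrui Ma,Qi Ye,Wayne Xin Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: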

Abstract:While modern recommender systems are instrumental in navigating information abundance, they remain fundamentally limited by static user modeling and reactive decision-making paradigms. Current large language model (LLM)-based agents inherit these shortcomings through their overreliance on heuristic pattern matching, yielding recommendations prone to shallow correlation bias, limited causal inference, and brittleness in sparse-data scenarios. We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. Each user is modeled as an agent with parallel cognitions: fast response for immediate interactions and slow reasoning that performs chain-of-thought rationales. To cultivate intrinsic slow thinking, we develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. This hybrid approach scaffolds agents in acquiring foundational capabilities (preference summarization, rationale generation) while enabling dynamic policy adaptation through simulated feedback loops. Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines, despite using only 0.4% of the full training data.

[AI-44] A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks

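[Quick Read]: This survey addresses the distributed computing and network efficiency bottlenecks that the surge in devices and the explosive growth of AI services create for artificial intelligence of things (AIoT) systems. The core challenge is deploying and optimizing cloud-edge-terminal collaborative intelligence (CETCI) efficiently. The key lies in building collaborative intelligence systems for AIoT (CISAIOT) that are scalable, compatible with heterogeneous infrastructure, secure, and reliable. The survey systematically analyzes architectural components across the cloud, edge, and terminal layers, covers core technologies such as network virtualization, container orchestration, and software-defined networking, and discusses task offloading, resource allocation, and cross-layer optimization. It also reviews intelligent collaboration learning frameworks, including federated learning, distributed deep learning, edge-cloud model evolution, and reinforcement learning, tracing the shift in AIoT from isolated layer optimization to multi-layer collaboration aimed at robust, efficient, and secure systems.

Link: https://arxiv.org/abs/2508.18803
Authors: Jiaqi Wu,Jing Liu,Yang Liu,Lixu Wang,Zehua Wang,Wei Chen,Zijian Tian,Richard Yu,Victor C.M. Leung
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: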

Abstract:The proliferation of Internet of things (IoT) devices in smart cities, transportation, healthcare, and industrial applications, coupled with the explosive growth of AI-driven services, has increased demands for efficient distributed computing architectures and networks, driving cloud-edge-terminal collaborative intelligence (CETCI) as a fundamental paradigm within the artificial intelligence of things (AIoT) community. With advancements in deep learning, large language models (LLMs), and edge computing, CETCI has made significant progress with emerging AIoT applications, moving beyond isolated layer optimization to deployable collaborative intelligence systems for AIoT (CISAIOT), a practical research focus in AI, distributed computing, and communications. This survey describes foundational architectures, enabling technologies, and scenarios of CETCI paradigms, offering a tutorial-style review for CISAIOT beginners. We systematically analyze architectural components spanning cloud, edge, and terminal layers, examining core technologies including network virtualization, container orchestration, and software-defined networking, while presenting categorizations of collaboration paradigms that cover task offloading, resource allocation, and optimization across heterogeneous infrastructures. Furthermore, we explain intelligent collaboration learning frameworks by reviewing advances in federated learning, distributed deep learning, edge-cloud model evolution, and reinforcement learning-based methods. Finally, we discuss challenges (e.g., scalability, heterogeneity, interoperability) and future trends (e.g., 6G+, agents, quantum computing, digital twin), highlighting how integration of distributed computing and communication can address open issues and guide development of robust, efficient, and secure collaborative AIoT systems.

[AI-45] CausalMACE: Causality Empowered Multi-Agents in Minecraft Cooperative Tasks

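[Quick Read]: This paper addresses the decision-making and execution bottlenecks that single-agent approaches face on complex Minecraft tasks, where long action sequences lead to inefficiency and limited fault tolerance, and it responds to the scarcity of research on multi-agent collaboration with CausalMACE, a causality-based collaborative planning framework. The key to the solution is using causality to model dependencies among subtasks explicitly, through two core modules: an overarching task graph for global task planning, and a causality-based dependency management module in which inherent rules are used to perform causal intervention on the dependency chain, improving the efficiency and robustness of multi-agent cooperation.

Link: https://arxiv.org/abs/2508.18797
Authors: Qi Chai,Zhang Zheng,Junlong Ren,Deheng Ye,Zichuan Lin,Hao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: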

Abstract:Minecraft, as an open-world virtual interactive environment, has become a prominent platform for research on agent decision-making and execution. Existing works primarily adopt a single Large Language Model (LLM) agent to complete various in-game tasks. However, for complex tasks requiring lengthy sequences of actions, single-agent approaches often face challenges related to inefficiency and limited fault tolerance. Despite these issues, research on multi-agent collaboration remains scarce. In this paper, we propose CausalMACE, a holistic causality planning framework designed to enhance multi-agent systems, in which we incorporate causality to manage dependencies among subtasks. Technically, our proposed framework introduces two modules: an overarching task graph for global task planning and a causality-based module for dependency management, where inherent rules are adopted to perform causal intervention. Experimental results demonstrate our approach achieves state-of-the-art performance in multi-agent cooperative tasks of Minecraft.

[AI-46] Insights into User Interface Innovations from a Design Thinking Workshop at deRSE25

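[Quick Read]: This paper addresses the rigid, linear user interfaces (UIs) of current Large Language Models (LLMs), which limit flexible, efficient, and controllable interaction between users and models. The key to the solution is a design thinking workshop used to collect user needs and pain points, which produced new interaction concepts centered on flexible context management, dynamic conversation branching, and enhanced mechanisms for user control. Interface prototypes are then refined iteratively following human-centered design, moving LLM interfaces closer to human cognitive habits and real usage scenarios.

Link: https://arxiv.org/abs/2508.18784
Authors: Maximilian Frank,Simon Lund
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: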

Abstract:Large Language Models have become widely adopted tools due to their versatile capabilities, yet their user interfaces remain limited, often following rigid, linear interaction paradigms. In this paper, we present insights from a design thinking workshop held at the deRSE25 conference aiming at collaboratively developing innovative user interface concepts for LLMs. During the workshop, participants identified common use cases, evaluated the strengths and shortcomings of current LLM interfaces, and created visualizations of new interaction concepts emphasizing flexible context management, dynamic conversation branching, and enhanced mechanisms for user control. We describe how these participant-generated ideas advanced our own whiteboard-based UI approach. The ongoing development of this interface is guided by the human-centered design process - an iterative, user-focused methodology that emphasizes continuous refinement through user feedback. Broader implications for future LLM interface development are discussed, advocating for increased attention to UI innovation grounded in user-centered design principles.

[AI-47] Long-Term Variability in Physiological-Arousal Relationships for Robust Emotion Estimation

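[Quick Read]: This paper addresses the long-term stability of the relationship between physiological signals and subjective emotional states in emotion estimation models. Existing systems generally assume that this relationship stays constant over long periods, but the assumption has rarely been tested. The key to the solution is a longitudinal dataset collected in naturalistic working environments from 24 participants over two three-month periods, covering blood volume pulse, electrodermal activity (EDA), skin temperature, and acceleration together with self-reported arousal, and analyzed with Explainable Boosting Machines (EBMs) to track how these relationships change over time. A model trained on first-period data lost 5% accuracy when tested on second-period data, indicating long-term variability in physiological-arousal associations. Heart rate remained relatively stable, while minimum EDA showed substantial individual-level fluctuation, suggesting that emotion estimation models need periodic updates (for example, every five months) to keep performance robust.

Link: https://arxiv.org/abs/2508.18782
Authors: Hiroto Sakimura,Takayuki Nagaya,Tomoki Nishi,Tetsuo Kurahashi,Katsunori Kohda,Nobuhiko Muramoto
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted at 13th International Conference on Affective Computing and Intelligent Interaction (ACII 2025)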

Abstract:Estimating emotional states from physiological signals is a central topic in affective computing and psychophysiology. While many emotion estimation systems implicitly assume a stable relationship between physiological features and subjective affect, this assumption has rarely been tested over long timeframes. This study investigates whether such relationships remain consistent across several months within individuals. We developed a custom measurement system and constructed a longitudinal dataset by collecting physiological signals–including blood volume pulse, electrodermal activity (EDA), skin temperature, and acceleration–along with self-reported emotional states from 24 participants over two three-month periods. Data were collected in naturalistic working environments, allowing analysis of the relationship between physiological features and subjective arousal in everyday contexts. We examined how physiological–arousal relationships evolve over time by using Explainable Boosting Machines (EBMs) to ensure model interpretability. A model trained on 1st-period data showed a 5% decrease in accuracy when tested on 2nd-period data, indicating long-term variability in physiological–arousal associations. EBM-based comparisons further revealed that while heart rate remained a relatively stable predictor, minimum EDA exhibited substantial individual-level fluctuations between periods. While the number of participants is limited, these findings highlight the need to account for temporal variability in physiological–arousal relationships and suggest that emotion estimation models should be periodically updated – e.g., every five months – based on observed shift trends to maintain robust performance over time.

[AI-48] AniME: Adaptive Multi-Agent Planning for Long Animation Generation

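Quick read: This paper addresses the lack of end-to-end coordinated control and consistency guarantees in automated long-form anime production, particularly the challenges of character coherence, audio-visual synchronization, and multi-task scheduling. The key to its solution is AniME, a director-oriented multi-agent system in which a director agent with global memory coordinates several downstream specialized agents. By integrating a customized Model Context Protocol (MCP) with downstream model instructions, each specialized agent adaptively selects control conditions for diverse sub-tasks, automating the full workflow from story to finished video while maintaining character consistency and audio-visual synchronization.

Link: https://arxiv.org/abs/2508.18781
Authors: Lisai Zhang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Yuxin Hong, Zihao Zhang, Yanzhang Liang, Yudong Jiang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 2 pages, Technical Report

Click to view abstract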

Abstract:We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow, and coordinates several downstream specialized agents. By integrating customized Model Context Protocol (MCP) with downstream model instruction, the specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio-visual elements, offering a scalable solution for AI-driven anime creation.

[AI-49] Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units EMNLP2025

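Quick read: This paper addresses the efficiency and effectiveness of improving reasoning through multi-model collaboration, in particular the paradox that adding more models does not necessarily improve performance in existing methods. The key solution is a distribution distance-based dynamic selection strategy (DDS) that, at the token level, picks the best candidate from the next-token distributions of multiple language models for more efficient autoregressive reasoning. The paper also introduces minimal complete semantic units (MCSU) to handle vocabulary misalignment across models, achieving natural and simple semantic alignment within the linguistic space and significantly improving multi-model collaborative reasoning.

Link: https://arxiv.org/abs/2508.18763
Authors: Chao Hao, Zezheng Wang, Yanhua Huang, Ruiwen Xu, Wenzhe Niu, Xin Liu, Zitong Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025 Main Conference

Click to view abstract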

Abstract:This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be available at this https URL.

[AI-50] Reflection-Enhanced Meta-Optimization Integrating TextGrad-style Prompt Optimization with Memory-Driven Self-Evolution

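Quick read: This paper targets two core problems in current text-prompt optimization methods such as TextGrad: the lack of a stateful memory mechanism, so each optimization run is independent and cannot reuse past experience; and a tendency to overfit, so the resulting prompts generalize poorly beyond the task at hand. The key to the solution is a new framework, Reflection-Enhanced Meta-Optimization (REMO), with two core innovations: (1) a memory-augmented Reflection Retrieval-Augmented Generation (RAG) module organized as a "mistake notebook" that systematically accumulates and reuses optimization knowledge across runs; and (2) an LLM-driven meta-controller (the Self-Adaptive Optimizer) that synthesizes epoch-level reflective insights to iteratively improve system-level prompting strategies, combining local fine-grained tuning with continual learning.

Link: https://arxiv.org/abs/2508.18749
Authors: Chunlong Wu, Zhibo Qu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract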

Abstract:Recent advances in prompt optimization, exemplified by methods such as TextGrad, enable automatic, gradient-like refinement of textual prompts to enhance the performance of large language models (LLMs) on specific downstream tasks. However, current approaches are typically stateless and operate independently across optimization runs, lacking mechanisms to preserve and leverage historical optimization experience. Furthermore, they are susceptible to overfitting, often yielding prompt updates that generalize poorly beyond the immediate task context. To address these limitations, we propose Reflection-Enhanced Meta-Optimization (REMO), a novel framework that integrates (1) a memory-augmented Reflection Retrieval-Augmented Generation (RAG) module - structured as a “mistake notebook” and (2) a Self-Adaptive Optimizer, implemented via an LLM-driven meta-controller that synthesizes epoch-level reflective insights to iteratively improve system-level prompting strategies. This architecture enables not only local, fine-grained prompt tuning akin to TextGrad, but also the systematic accumulation and reuse of cross-run optimization knowledge, thereby supporting continual improvement over time. We instantiate the REMO framework using Qwen3-32B in standard inference mode - without explicit chain-of-thought prompting - and evaluate its efficacy on the GSM8K benchmark for mathematical reasoning. Experimental results demonstrate that, compared to a TextGrad baseline, REMO achieves more stable and robust generalization, albeit at the cost of increased computational overhead. We provide a detailed exposition of the algorithmic design, conduct a qualitative and quantitative analysis of optimization dynamics, and present a comprehensive ablation study to elucidate the contributions of each component. 

[AI-51] FLAegis: A Two-Layer Defense Framework for Federated Learning Against Poisoning Attacks

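Quick read: This paper addresses model poisoning attacks in Federated Learning (FL) systems caused by dishonest participating clients, in particular the loss of robustness when Byzantine clients submit false model updates. The key to the solution is FLAegis, a two-stage defense framework. The first stage uses a symbolic time series transformation (Symbolic Aggregate approXimation, SAX) to amplify the differences between benign and malicious models and combines it with spectral clustering to identify Byzantine clients accurately. The second stage adds a robust aggregation function based on the Fast Fourier Transform (FFT) as a final layer, mitigating the influence on the global model of Byzantine clients that evade earlier detection, and thereby improving model accuracy and detection precision across a range of complex attack scenarios.

Link: https://arxiv.org/abs/2508.18737
Authors: Enrique Mármol Campos, Aurora González Vidal, José Luis Hernández Ramos, Antonio Skarmeta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 tables, and 5 figures

Click to view abstract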

Abstract:Federated Learning (FL) has become a powerful technique for training Machine Learning (ML) models in a decentralized manner, preserving the privacy of the training datasets involved. However, the decentralized nature of FL limits the visibility of the training process, relying heavily on the honesty of participating clients. This assumption opens the door to malicious third parties, known as Byzantine clients, which can poison the training process by submitting false model updates. Such malicious clients may engage in poisoning attacks, manipulating either the dataset or the model parameters to induce misclassification. In response, this study introduces FLAegis, a two-stage defensive framework designed to identify Byzantine clients and improve the robustness of FL systems. Our approach leverages symbolic time series transformation (SAX) to amplify the differences between benign and malicious models, and spectral clustering, which enables accurate detection of adversarial behavior. Furthermore, we incorporate a robust FFT-based aggregation function as a final layer to mitigate the impact of those Byzantine clients that manage to evade prior defenses. We rigorously evaluate our method against five poisoning attacks, ranging from simple label flipping to adaptive optimization-based strategies. Notably, our approach outperforms state-of-the-art defenses in both detection precision and final model accuracy, maintaining consistently high performance even under strong adversarial conditions.
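The abstract names SAX as the transformation applied before clustering. As a rough illustration of what generic SAX does to a numeric sequence, here is a minimal sketch using only the standard library. It is not the FLAegis pipeline: how model updates are turned into a series, the segment count, and the alphabet size are not given in the abstract, so the function name `sax`, its parameters, and the example inputs are all assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Generic SAX sketch: z-normalize, average over segments (PAA), map to symbols.

    Assumption for simplicity: len(series) is divisible by n_segments.
    """
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")

    # Step 1: z-normalize so the values are comparable to a standard normal.
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (v - mu) / sigma for v in series]

    # Step 2: Piecewise Aggregate Approximation (mean of each equal-width segment).
    width = len(series) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]

    # Step 3: breakpoints that split the standard normal into equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # Step 4: each segment mean becomes the symbol of the bin it falls into.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


# Hypothetical inputs: a two-level series and a steady ramp.
print(sax([1, 1, 2, 2, 9, 9, 10, 10], 4))
print(sax(list(range(8)), 4))
```

Series with similar shapes map to similar strings, which is the property a downstream clustering step could exploit.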

[AI-52] Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database

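Quick read: This paper addresses the drop in recognition performance in dysarthric speech recognition caused by variation in severity across individuals and by large differences from normal speech. Conventional approaches fine-tune an automatic speech recognition (ASR) model pre-trained on normal speech separately for each patient to avoid feature conflicts, which is inefficient and depends on large amounts of per-patient data. The key solution is a multi-speaker fine-tuning strategy: fine-tuning on speech from multiple dysarthric speakers at once, which improves generalization through broader pathological feature learning, mitigates speaker-specific overfitting, reduces dependence on per-patient data, and improves target-speaker accuracy. Experiments show up to 13.15% lower word error rate (WER) than single-speaker fine-tuning.

Link: https://arxiv.org/abs/2508.18732
Authors: Qing Xiao, Yingshan Peng, PeiPei Zhang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract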

Abstract:Dysarthric speech recognition faces challenges from severity variations and disparities relative to normal speech. Conventional approaches individually fine-tune ASR models pre-trained on normal speech per patient to prevent feature conflicts. Counter-intuitively, experiments reveal that multi-speaker fine-tuning (simultaneously on multiple dysarthric speakers) improves recognition of individual speech patterns. This strategy enhances generalization via broader pathological feature learning, mitigates speaker-specific overfitting, reduces per-patient data dependence, and improves target-speaker accuracy - achieving up to 13.15% lower WER versus single-speaker fine-tuning.
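The headline result is reported in WER. For readers unfamiliar with the metric, the following is a minimal sketch of the conventional definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are hypothetical, and the paper's own tokenization and scoring setup are not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.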

[AI-53] VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft EMNLP2025

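Quick read: This paper addresses the limited performance of Large Language Models (LLMs) in embodied decision-making within virtual open-world environments, which stems from a lack of domain-specific knowledge. Existing methods rely on fine-tuning with large-scale domain data and are expensive to develop. The key to the solution is the VistaWise framework, whose core innovations are: 1) a cross-modal knowledge graph that fuses visual information with textual dependencies for a comprehensive understanding of multimodal environments; 2) a retrieval-based pooling strategy that extracts task-relevant knowledge from the graph; and 3) a desktop-level skill library that operates the Minecraft client directly through mouse and keyboard. The approach reduces the domain-specific training data requirement from millions of samples to a few hundred, lowering development cost while improving agent performance.

Link: https://arxiv.org/abs/2508.18722
Authors: Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025 main

Click to view abstract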

Abstract:Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

[AI-54] Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare

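Quick read: This paper addresses fairness in Multi-Agent Reinforcement Learning (MARL), specifically fair task allocation in healthcare settings. Conventional methods consider only workload balance and ignore skill-task alignment, so highly skilled agents may be overused or less skilled agents may be assigned tasks beyond their ability, leading to burnout and inefficiency. The key to the solution is the FairSkillMARL framework, which defines fairness as the dual objective of workload balance and skill-task alignment, together with MARLHospital, a customizable healthcare-inspired environment for modeling how team composition and energy-constrained scheduling affect fairness. Experiments show that fairness metrics based on workload balance alone can produce task-skill mismatches, underlining the need for more robust fairness measures that capture skill fit.

Link: https://arxiv.org/abs/2508.18708
Authors: Promise Osaine Ekpo, Brian La, Thomas Wiener, Saesha Agarwal, Arshia Agrawal, Gonzalo Gonzalez-Pumariega, Lekan P. Molu, Angelique Taylor
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract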

Abstract:Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.

[AI-55] AgriChrono: A Multi-modal Dataset Capturing Crop Growth and Lighting Variability with a Field Robot

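Quick read: This paper addresses the fact that existing precision-agriculture datasets mostly come from static or controlled environments (such as indoor labs or greenhouses), so models trained on them generalize poorly to real farmland. The key solution is AgriChrono, a multi-sensor robotic data collection platform and multi-modal dataset that supports remote, time-synchronized acquisition of RGB, depth, LiDAR, and inertial measurement unit (IMU) data under varying illumination and crop growth stages. This enables efficient, repeatable long-term data collection in dynamic field environments and provides a high-quality research resource for improving the robustness and generalization of models such as 3D reconstruction under complex field conditions.

Link: https://arxiv.org/abs/2508.18694
Authors: Jaehwan Jeong, Tuan-Anh Vu, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, Sangpil Kim, M. Khalid Jawed
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Click to view abstract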

Abstract:Existing datasets for precision agriculture have primarily been collected in static or controlled environments such as indoor labs or greenhouses, often with limited sensor diversity and restricted temporal span. These conditions fail to reflect the dynamic nature of real farmland, including illumination changes, crop growth variation, and natural disturbances. As a result, models trained on such data often lack robustness and generalization when applied to real-world field scenarios. In this paper, we present AgriChrono, a novel robotic data collection platform and multi-modal dataset designed to capture the dynamic conditions of real-world agricultural environments. Our platform integrates multiple sensors and enables remote, time-synchronized acquisition of RGB, Depth, LiDAR, and IMU data, supporting efficient and repeatable long-term data collection across varying illumination and crop growth stages. We benchmark a range of state-of-the-art 3D reconstruction models on the AgriChrono dataset, highlighting the difficulty of reconstruction in real-world field environments and demonstrating its value as a research asset for advancing model generalization under dynamic conditions. The code and dataset are publicly available at: this https URL

[AI-56] AppAgent-Pro: A Proactive GUI Agent System for Multidomain Information Integration and User Assistance WWW CIKM2025

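Quick read: This paper addresses the purely reactive behavior common to current Large Language Model (LLM)-based agents during information acquisition: existing systems act only on explicit user instructions and struggle to identify and meet users' latent needs, which limits their efficiency and depth as general-purpose information platforms. The key to the solution is AppAgent-Pro, a graphical user interface (GUI) agent system that proactively integrates multi-domain information. By analyzing user instructions and anticipating the underlying intent, it performs proactive cross-domain information mining and knowledge fusion, making information acquisition more comprehensive and intelligent.

Link: https://arxiv.org/abs/2508.18689
Authors: Yuyang Zhao, Wentao Shi, Fuli Feng, Xiangnan He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at CIKM 2025. 10 pages, 5 figures. Our code is available at: this https URL. The demonstration video could be found at: this https URL

Click to view abstract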

Abstract:Large language model (LLM)-based agents have demonstrated remarkable capabilities in addressing complex tasks, thereby enabling more advanced information retrieval and supporting deeper, more sophisticated human information-seeking behaviors. However, most existing agents operate in a purely reactive manner, responding passively to user instructions, which significantly constrains their effectiveness and efficiency as general-purpose platforms for information acquisition. To overcome this limitation, this paper proposes AppAgent-Pro, a proactive GUI agent system that actively integrates multi-domain information based on user instructions. This approach enables the system to proactively anticipate users’ underlying needs and conduct in-depth multi-domain information mining, thereby facilitating the acquisition of more comprehensive and intelligent information. AppAgent-Pro has the potential to fundamentally redefine information acquisition in daily life, leading to a profound impact on human society. Our code is available at: this https URL. The demonstration video could be found at: this https URL.

[AI-57] Auditing Approximate Machine Unlearning for Differentially Private Models ICDM2025

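Quick read: This paper addresses the possibility that existing approximate machine unlearning methods, after deleting specific data, do not adequately protect the privacy of the retained samples. In particular, when the model uses differential privacy (DP), no prior study has examined whether retained samples still satisfy the DP criterion. The key solution is a set of privacy criteria that apply separately to unlearned and retained samples, audited systematically from the perspectives of DP and membership inference attacks (MIAs), together with an efficient MIA, A-LiRA, that uses data augmentation to reduce the cost of shadow model training and so allows a full assessment of an unlearning algorithm's privacy risk. Experiments show that current approximate unlearning algorithms may leak the privacy of retained samples in differentially private models, and that unlearning algorithms with DP guarantees are needed.

Link: https://arxiv.org/abs/2508.18671
Authors: Yuechun Gu, Jiajie He, Keke Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDM 2025, 10 pages

Click to view abstract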

Abstract:Approximate machine unlearning aims to remove the effect of specific data from trained models to ensure individuals’ privacy. Existing methods focus on the removed records and assume the retained ones are unaffected. However, recent studies on the privacy onion effect indicate this assumption might be incorrect. Especially when the model is differentially private, no study has explored whether the retained ones still meet the differential privacy (DP) criterion under existing machine unlearning methods. This paper takes a holistic approach to auditing both unlearned and retained samples’ privacy risks after applying approximate unlearning algorithms. We propose the privacy criteria for unlearned and retained samples, respectively, based on the perspectives of DP and membership inference attacks (MIAs). To make the auditing process more practical, we also develop an efficient MIA, A-LiRA, utilizing data augmentation to reduce the cost of shadow model training. Our experimental findings indicate that existing approximate machine unlearning algorithms may inadvertently compromise the privacy of retained samples for differentially private models, and we need differentially private unlearning algorithms. For reproducibility, we have published our code: this https URL

[AI-58] MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

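Quick read: This paper addresses the weak tool-invocation ability of current Large Language Model (LLM)-based agents when they face dynamic, uncertain, and stochastic user demands in multi-turn interactions. Existing Reinforcement Learning (RL) methods do not bring genuinely dynamic users into training, so models struggle to learn to communicate with users while using tools to progressively clarify and satisfy complex needs. The key solution is MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning), which for the first time embeds LLM-simulated users in the reinforcement learning loop, letting the agent learn efficient communication strategies and tool-invocation logic on its own across multi-turn interactions and improving its adaptability and task completion in practical scenarios.

Link: https://arxiv.org/abs/2508.18669
Authors: Weikang Zhao, Xili Wang, Chengdi Ma, Lingbin Kong, Zhaohua Yang, Mingxiang Tuo, Xiaowei Shi, Yitao Zhai, Xunliang Cai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract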

Abstract:With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent’s tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable autonomous learning of models to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent – outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.

[AI-59] FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge

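Quick read: This paper addresses two challenges faced by Federated Fine-Tuning (FFT) based on Low Rank Adaptation (LoRA) in heterogeneous Federated Learning (FL) environments: aggregation difficulties caused by incompatible LoRA structures across clients, and poor convergence and weak generalization under non-independent and identically distributed (non-IID) data. The key to the solution is the FFT-MoE framework, which replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network that selectively activates a personalized subset of experts, allowing fine-grained adaptation to local resources while preserving aggregation compatibility. A heterogeneity-aware auxiliary loss that accounts for device and data heterogeneity dynamically regularizes the routing distribution to relieve expert load imbalance, improving expert diversity and utilization and significantly improving generalization and training efficiency in non-IID settings.

Link: https://arxiv.org/abs/2508.18663
Authors: Gang Hu, Yinglei Teng, Pengfei Wu, Nan Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures

Click to view abstract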

Abstract:As FMs drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical, particularly when high-quality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT MoE, a novel FFT framework that replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT MoE consistently outperforms state of the art FFT baselines in generalization performance and training efficiency.

[AI-60] The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability

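Quick read: This paper addresses the declining effectiveness of textual analysis in financial markets under information asymmetry, especially the investor misperception caused by strategically crafted corporate narratives. The key to the solution is a multimodal financial risk assessment framework that combines textual sentiment with executives' vocal tract dynamics, built around the Physics-Informed Acoustic Model (PIAM). The model applies nonlinear acoustics to robustly extract emotional signatures from teleconference audio affected by distortions such as signal clipping, and maps both acoustic and textual emotional states onto an interpretable three-dimensional Affective State Label (ASL) space (Tension, Stability, Arousal). The study finds that although the multimodal features do not predict the direction of stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. The predictive power comes mainly from shifts in executive emotion during the transition from scripted presentation to spontaneous Q&A, notably reduced textual stability and heightened acoustic instability from CFOs and significant arousal variability from CEOs.

Link: https://arxiv.org/abs/2508.18653
Authors: Xiaoliang Chen, Xin Yu, Le Chang, Teng Jing, Jiashuai He, Ze Wang, Yangjun Luo, Xingyu Chen, Jiayue Liang, Yuchen Wang, Jiaying Xie
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 6 figures

Click to view abstract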

Abstract:Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space: Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.

[AI-61] PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality

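Quick read: This paper addresses two core problems in safeguarding Vision-Language Models (VLMs): existing defenses often over-defend and harm model utility, and they rely on shallow alignment that cannot recognize complex attacks requiring deep reasoning. The key to the solution is PRISM (Principled Reasoning for Integrated Safety in Multimodality), whose central idea is to introduce a structured, safety-oriented reasoning process for finer-grained safety alignment. It has two components: PRISM-CoT, a dataset for training safety-aware Chain-of-Thought (CoT) reasoning; and PRISM-DPO, which generates preference data through Monte Carlo Tree Search (MCTS) and uses Direct Preference Optimization (DPO) to further refine the reasoning, yielding a more precise safety boundary. The method significantly lowers attack success rates on several benchmarks while preserving or even improving model performance, and shows strong robustness and out-of-distribution generalization.

Link: https://arxiv.org/abs/2508.18649
Authors: Nanxi Li, Zhengyue Zhao, Chaowei Xiao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract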

Abstract:Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a system2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization to help obtain a delicate safety boundary. Comprehensive evaluations demonstrate PRISM’s effectiveness, achieving remarkably low attack success rates including 0.15% on JailbreakV-28K for Qwen2-VL and 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing attack success rates to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility. To promote reproducibility, we have made our code, data, and model weights available at this https URL.
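The abstract refines reasoning with Direct Preference Optimization. As background, the standard per-pair DPO loss can be sketched as below: it rewards the policy for raising the log-probability of the preferred response relative to a frozen reference model more than it raises that of the rejected one. This is only the generic loss on scalar sequence log-probabilities. The function name, the default `beta`, and the example numbers are assumptions, and how PRISM builds its preference pairs from MCTS is not shown here.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(math.log(2))
```

When policy and reference agree exactly, the margin is zero and the loss equals log 2, which is the natural baseline to compare against.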

[AI-62] LaQual: A Novel Framework for Automated Evaluation of LLM App Quality

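Quick read: This paper addresses the difficulty of identifying and recommending high-quality applications in current large language model app stores (LLM app stores). Existing recommendation methods rely mainly on static metrics such as user activity and favorites, which do not reflect how apps perform in real scenarios and make user screening inefficient. The key to the solution is the LaQual framework, which performs automated, dynamic, scenario-adaptive quality evaluation in three stages: it first labels and classifies LLM apps hierarchically to match them to usage scenarios; it then filters out low-quality apps using static indicators such as time-weighted user engagement and functional capability; and finally the LLM generates scenario-specific evaluation tasks, scoring rules, and metrics for a dynamic quality assessment. Experiments show that its scores are highly consistent with human judgments, that it substantially shrinks the candidate pool, and that it improves users' decision confidence and the perceived value of evaluation reports.

Link: https://arxiv.org/abs/2508.18636
Authors: Yan Wang, Xinyi Hou, Yanjie Zhao, Weiguo Lin, Haoyu Wang, Junjun Si
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract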

Abstract:LLM app stores are quickly emerging as platforms that gather a wide range of intelligent applications based on LLMs, giving users many choices for content creation, coding support, education, and more. However, the current methods for ranking and recommending apps in these stores mostly rely on static metrics like user activity and favorites, which makes it hard for users to efficiently find high-quality apps. To address these challenges, we propose LaQual, an automated framework for evaluating the quality of LLM apps. LaQual consists of three main stages: first, it labels and classifies LLM apps in a hierarchical way to accurately match them to different scenarios; second, it uses static indicators, such as time-weighted user engagement and functional capability metrics, to filter out low-quality apps; and third, it conducts a dynamic, scenario-adaptive evaluation, where the LLM itself generates scenario-specific evaluation metrics, scoring rules, and tasks for a thorough quality assessment. Experiments on a popular LLM app store show that LaQual is effective. Its automated scores are highly consistent with human judgments (with Spearman’s rho of 0.62 and p=0.006 in legal consulting, and rho of 0.60 and p=0.009 in travel planning). By effectively screening, LaQual can reduce the pool of candidate LLM apps by 66.7% to 81.3%. User studies further confirm that LaQual significantly outperforms baseline systems in decision confidence, comparison efficiency (with average scores of 5.45 compared to 3.30), and the perceived value of its evaluation reports (4.75 versus 2.25). Overall, these results demonstrate that LaQual offers a scalable, objective, and user-centered solution for finding and recommending high-quality LLM apps in real-world use cases.
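Agreement with human judgment is reported above as Spearman's rho. For reference, rho is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below implements that definition with the standard library only. The function names and the score lists are hypothetical and are not data from the paper.

```python
import math
from statistics import mean


def _average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical automated scores vs. human scores for five apps.
automated = [1, 2, 3, 4, 5]
human = [2, 1, 4, 3, 5]
print(spearman_rho(automated, human))
```

Because only ranks matter, rho is insensitive to the scale of either scoring system, which is why it suits comparisons between automated and human ratings.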

[AI-63] eSkinHealth: A Multimodal Dataset for Neglected Tropical Skin Diseases

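Quick read: This paper addresses the unreliability of AI-assisted diagnosis for skin Neglected Tropical Diseases (NTDs) caused by data scarcity, especially the underrepresentation of populations in low-income tropical regions and of rare skin conditions. The key to the solution is eSkinHealth, a new dermatological dataset collected on-site in Côte d'Ivoire and Ghana, containing 5,623 images from 1,639 cases across 47 skin diseases and focusing on skin NTDs and rare conditions among West African populations. The paper also proposes an AI-expert collaboration paradigm in which foundation language and segmentation models, guided by dermatologists, efficiently generate multimodal annotations (semantic lesion masks, instance-specific visual captions, and clinical concepts), providing a scalable annotation framework that supports fairer, more accurate, and more interpretable AI tools for global dermatology.

Link: https://arxiv.org/abs/2508.18608
Authors: Janet Wang, Xin Hu, Yunbei Zhang, Diabate Almamy, Vagamon Bamba, Konan Amos Sébastien Koffi, Yao Koffi Aubin, Zhengming Ding, Jihun Hamm, Rie R. Yotsu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract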

Abstract:Skin Neglected Tropical Diseases (NTDs) impose severe health and socioeconomic burdens in impoverished tropical communities. Yet, advancements in AI-driven diagnostic support are hindered by data scarcity, particularly for underrepresented populations and rare manifestations of NTDs. Existing dermatological datasets often lack the demographic and disease spectrum crucial for developing reliable recognition models of NTDs. To address this, we introduce eSkinHealth, a novel dermatological dataset collected on-site in Côte d’Ivoire and Ghana. Specifically, eSkinHealth contains 5,623 images from 1,639 cases and encompasses 47 skin diseases, focusing uniquely on skin NTDs and rare conditions among West African populations. We further propose an AI-expert collaboration paradigm to implement foundation language and segmentation models for efficient generation of multimodal annotations, under dermatologists’ guidance. In addition to patient metadata and diagnosis labels, eSkinHealth also includes semantic lesion masks, instance-specific visual captions, and clinical concepts. Overall, our work provides a valuable new resource and a scalable annotation framework, aiming to catalyze the development of more equitable, accurate, and interpretable AI tools for global dermatology.

[AI-64] A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants

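Quick read: This paper examines how effective large language models (LLMs) are in formal verification, specifically at automatically generating proofs for proof assistants. It presents a case study of two mature Rocq projects, the hs-to-coq tool and Verdi, and evaluates LLM-generated proofs both quantitatively and qualitatively. The key is a systematic examination of how external dependencies and context from the same source file affect proof generation. The study also reports that performance differs across verification projects, that LLMs can produce concise and smart proofs, and that they can make odd mistakes, which together give empirical evidence and directions for applying LLMs to formal verification.

Link: https://arxiv.org/abs/2508.18587
Authors: Barış Bayazıt,Yao Li,Xujie Si
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments: Accepted by LMPL 2025

Click to view abstract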

Abstract:Large language models (LLMs) can potentially help with verification using proof assistants by automating proofs. However, it is unclear how effective LLMs are in this task. In this paper, we perform a case study based on two mature Rocq projects: the hs-to-coq tool and Verdi. We evaluate the effectiveness of LLMs in generating proofs by both quantitative and qualitative analysis. Our study finds that: (1) external dependencies and context in the same source file can significantly help proof generation; (2) LLMs perform great on small proofs but can also generate large proofs; (3) LLMs perform differently on different verification projects; and (4) LLMs can generate concise and smart proofs, apply classical techniques to new definitions, but can also make odd mistakes.

[AI-65] DrugReasoner: Interpretable Drug Approval Prediction with a Reasoning-augmented Language Model

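Quick read: This paper addresses early prediction of drug approval outcomes during drug discovery, with the aim of optimizing research investment. Classical machine learning and deep learning methods show promise, but their limited interpretability restricts practical use. The key to the solution is DrugReasoner, a reasoning-based large language model (LLM) built on the LLaMA architecture and fine-tuned with group relative policy optimization (GRPO). It combines molecular descriptors with comparative reasoning against structurally similar approved and unapproved compounds, and produces predictions together with step-by-step rationales and confidence scores. It outperforms conventional baselines (logistic regression, support vector machine, k-nearest neighbors) and is competitive with XGBoost (validation set: AUC = 0.732, F1 = 0.729; test set: AUC = 0.725, F1 = 0.718). On an external independent dataset it outperforms the recently developed ChemAP model (AUC = 0.728, F1 = 0.774) while maintaining high precision and balanced sensitivity, which improves transparency in AI-assisted drug discovery.

Link: https://arxiv.org/abs/2508.18579
Authors: Mohammadreza Ghaffarzadeh-Esfahani,Ali Motahharynia,Nahid Yousefian,Navid Mazrouei,Jafar Ghaisari,Yousof Gheisari
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 2 figures. Corresponding author: alimotahharynia@gmail.com

Click to view abstract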

Abstract:Drug discovery is a complex and resource-intensive process, making early prediction of approval outcomes critical for optimizing research investments. While classical machine learning and deep learning methods have shown promise in drug approval prediction, their limited interpretability constraints their impact. Here, we present DrugReasoner, a reasoning-based large language model (LLM) built on the LLaMA architecture and fine-tuned with group relative policy optimization (GRPO) to predict the likelihood of small-molecule approval. DrugReasoner integrates molecular descriptors with comparative reasoning against structurally similar approved and unapproved compounds, generating predictions alongside step-by-step rationales and confidence scores. DrugReasoner achieved robust performance with an AUC of 0.732 and an F1 score of 0.729 on the validation set and 0.725 and 0.718 on the test set, respectively. These results outperformed conventional baselines, including logistic regression, support vector machine, and k-nearest neighbors and had competitive performance relative to XGBoost. On an external independent dataset, DrugReasoner outperformed both baseline and the recently developed ChemAP model, achieving an AUC of 0.728 and an F1-score of 0.774, while maintaining high precision and balanced sensitivity, demonstrating robustness in real-world scenarios. These findings demonstrate that DrugReasoner not only delivers competitive predictive accuracy but also enhances transparency through its reasoning outputs, thereby addressing a key bottleneck in AI-assisted drug discovery. This study highlights the potential of reasoning-augmented LLMs as interpretable and effective tools for pharmaceutical decision-making.

[AI-66] The Quasi-Creature and the Uncanny Valley of Agency: A Synthesis of Theory and Evidence on User Interaction with Inconsistent Generative AI

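Quick read: This paper addresses the paradoxical experience of users of large-scale generative AI: superhuman fluency combined with absurd failures of common sense and consistency, which produces strong frustration. It frames this frustration as an ontological problem and introduces the “Quasi-Creature”, an entity that simulates intelligence without embodiment or genuine understanding. From this it derives the “Uncanny Valley of Agency” framework: user comfort drops when highly agentic AI proves erratically unreliable, and its failures are perceived as cognitive breaches that cause cognitive dissonance. An illustrative mixed-methods study (N=37) finds a strong negative correlation between perceived AI efficiency and user frustration. The authors argue that the framework explains negative user experience with generative AI and has implications for design, ethics, and societal integration.

Link: https://arxiv.org/abs/2508.18563
Authors: Mauricio Manhaes,Christine Miller,Nicholas Schroeder
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 33 pages, 9 figures

Click to view abstract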

Abstract:The user experience with large-scale generative AI is paradoxical: superhuman fluency meets absurd failures in common sense and consistency. This paper argues that the resulting potent frustration is an ontological problem, stemming from the “Quasi-Creature”-an entity simulating intelligence without embodiment or genuine understanding. Interaction with this entity precipitates the “Uncanny Valley of Agency,” a framework where user comfort drops when highly agentic AI proves erratically unreliable. Its failures are perceived as cognitive breaches, causing profound cognitive dissonance. Synthesizing HCI, cognitive science, and philosophy of technology, this paper defines the Quasi-Creature and details the Uncanny Valley of Agency. An illustrative mixed-methods study (“Move 78,” N=37) of a collaborative creative task reveals a powerful negative correlation between perceived AI efficiency and user frustration, central to the negative experience. This framework robustly explains user frustration with generative AI and has significant implications for the design, ethics, and societal integration of these powerful, alien technologies.

[AI-67] SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting AAAI2026

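Quick read: This paper addresses the labor-intensive nature of log schema extraction. Existing methods depend on manually predefined regular expressions, so they require human domain expertise, which limits automation and scalability across diverse log formats. The key to the solution is SchemaCoder, described as the first fully automated schema extraction framework that needs no human customization within the flow. Its core is a Residual Question-Tree (Q-Tree) Boosting mechanism: logs are partitioned into semantic chunks via context-bounded segmentation, representative patterns are selected by embedding-based sampling, schema code is generated through hierarchical Q-Tree-driven large language model (LLM) queries, and the result is refined iteratively by a textual-residual evolutionary optimizer and residual boosting. The abstract reports an average improvement of 21.3% over state-of-the-art methods on the LogHub-2.0 benchmark.

Link: https://arxiv.org/abs/2508.18554
Authors: Lily Jiaxin Wan,Chia-Tung Ho,Rongjian Liang,Cunxi Yu,Deming Chen,Haoxing Ren
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 16 figures, under review for AAAI2026

Click to view abstract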

Abstract:Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, which is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, necessitating human domain expertise and severely limiting productivity gains. To fundamentally address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. Particularly, our method partitions logs into semantic chunks via context-bounded segmentation, selects representative patterns using embedding-based sampling, and generates schema code through hierarchical Q-Tree-driven LLM queries, iteratively refined by our textual-residual evolutionary optimizer and residual boosting. Experimental validation demonstrates SchemaCoder’s superiority on the widely-used LogHub-2.0 benchmark, achieving an average improvement of 21.3% over state-of-the-arts.

[AI-68] Beyond prior knowledge: The predictive role of knowledge-building in Tutor Learning

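Quick read: In learning-by-teaching environments, students often default to knowledge-telling rather than deeper knowledge-building, which limits learning. Prior research indicates a bidirectional, mutually reinforcing relationship between conceptual and procedural knowledge, but the mediating role of knowledge-building in that relationship has lacked empirical validation. The key to the solution is the use of teachable agents that pose persistent follow-up questions, which prompts students to shift from knowledge-telling to knowledge-building. The study finds that students who engaged in knowledge-building had higher post-test scores regardless of their pre-test performance, which suggests that knowledge-building is a key mechanism linking the two knowledge types and improving learning outcomes.

Link: https://arxiv.org/abs/2508.18545
Authors: Tasmia Shahriar,Mia Ameen,Aditi Mallavarapu,Shiyan Jiang,Noboru Matsuda
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract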

Abstract:When adopting the role of a teacher in learning-by-teaching environments, students often struggle to engage in knowledge-building activities, such as providing explanations and addressing misconceptions. Instead, they frequently default to knowledge-telling behaviors, where they simply dictate what they already know or what to do without deeper reflection, thereby limiting learning. Teachable agents, particularly those capable of posing persistent follow-up questions, have been shown to encourage students (tutors) to shift from knowledge-telling to knowledge-building and enhance tutor learning. Tutor learning encompasses two interrelated types of knowledge: conceptual and procedural knowledge. Research has established a bidirectional relationship between these knowledge types, where improvements in one reinforce the other. This study investigates the role of knowledge-building in mediating the bidirectional relationship between procedural and conceptual learning. Our findings revealed a stable bidirectional relationship between procedural and conceptual knowledge, with higher post-test scores observed among students who engaged in knowledge-building, regardless of their procedural and conceptual pre-test performance. This suggests that knowledge-building serves as a crucial mechanism bridging the gap between students with low prior knowledge and higher conceptual and procedural learning gain.

[AI-69] A Database-Driven Framework for 3D Level Generation with LLMs

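Quick read: This paper addresses a challenge in procedural content generation (PCG) for 3D game levels: balancing spatial coherence, navigational functionality, and adaptable gameplay progression across multi-floor environments. The key to the solution is a framework built on reusable databases that are constructed offline with LLM assistance and hold structured data for architectural components (room templates and facilities) and gameplay mechanic elements. A three-phase pipeline then (1) selects and arranges instances from the Room Database to form a multi-floor global structure with an inherent topological order, (2) optimizes the internal layout of facilities in each room under predefined constraints from the Facility Database, and (3) places gameplay mechanic components according to their topological and spatial rules. A subsequent two-phase repair system ensures navigability. Combining modular, database-driven design with constraint-based optimization gives systematic control over level structure and gameplay pacing.

Link: https://arxiv.org/abs/2508.18533
Authors: Kaijie Xu,Clark Verbrugge
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract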

Abstract:Procedural Content Generation for 3D game levels faces challenges in balancing spatial coherence, navigational functionality, and adaptable gameplay progression across multi-floor environments. This paper introduces a novel framework for generating such levels, centered on the offline, LLM-assisted construction of reusable databases for architectural components (facilities and room templates) and gameplay mechanic elements. Our multi-phase pipeline assembles levels by: (1) selecting and arranging instances from the Room Database to form a multi-floor global structure with an inherent topological order; (2) optimizing the internal layout of facilities for each room based on predefined constraints from the Facility Database; and (3) integrating progression-based gameplay mechanics by placing components from a Mechanics Database according to their topological and spatial rules. A subsequent two-phase repair system ensures navigability. This approach combines modular, database-driven design with constraint-based optimization, allowing for systematic control over level structure and the adaptable pacing of gameplay elements. Initial experiments validate the framework’s ability in generating diverse, navigable 3D environments and its capability to simulate distinct gameplay pacing strategies through simple parameterization. This research advances PCG by presenting a scalable, database-centric foundation for the automated generation of complex 3D levels with configurable gameplay progression.

[AI-70] Generic Guard AI in Stealth Game with Composite Potential Fields

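Quick read: This paper addresses the simulation of guard patrol behavior in stealth games. Most existing systems rely on hand-crafted routes or specialized logic and struggle to balance coverage efficiency, responsive pursuit, and believable naturalness. The key to the solution is a generic, training-free, fully explainable framework that integrates global knowledge and local information through Composite Potential Fields, combining three interpretable maps (Information, Confidence, and Connectivity) into a single kernel-filtered decision criterion. The approach is parametric and designer-driven: a handful of decay and weight parameters is enough to adapt across both occupancy-grid and NavMesh-partition abstractions.

Link: https://arxiv.org/abs/2508.18527
Authors: Kaijie Xu,Clark Verbrugge
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract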

Abstract:Guard patrol behavior is central to the immersion and strategic depth of stealth games, while most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness. We propose a generic, fully explainable, training-free framework that integrates global knowledge and local information via Composite Potential Fields, combining three interpretable maps-Information, Confidence, and Connectivity-into a single kernel-filtered decision criterion. Our parametric, designer-driven approach requires only a handful of decay and weight parameters-no retraining-to smoothly adapt across both occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game maps, two player-control policies, and five guard modes, confirming that our method outperforms classical baseline methods in both capture efficiency and patrol naturalness. Finally, we show how common stealth mechanics-distractions and environmental elements-integrate naturally into our framework as sub modules, enabling rapid prototyping of rich, dynamic, and responsive guard behaviors.
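
The idea of merging several maps into one kernel-filtered decision criterion can be illustrated with a small sketch. This is not the authors' implementation: the weighted sum, the 3x3 mean kernel, the function names `combine_maps`, `box_filter`, and `best_cell`, and all numeric values are assumptions chosen only to show the general pattern on an occupancy-grid style map.

```python
def combine_maps(info, conf, conn, weights=(1.0, 1.0, 1.0)):
    """Weighted cell-wise sum of three equally sized 2D maps (lists of lists)."""
    w_info, w_conf, w_conn = weights
    return [
        [w_info * i + w_conf * c + w_conn * n for i, c, n in zip(ri, rc, rn)]
        for ri, rc, rn in zip(info, conf, conn)
    ]


def box_filter(grid):
    """Smooth a grid with a 3x3 mean kernel, averaging only in-bounds cells."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cells = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = sum(cells) / len(cells)
    return out


def best_cell(grid):
    """Return the (row, col) of the highest-scoring cell."""
    rows, cols = len(grid), len(grid[0])
    return max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


# Hypothetical 3x3 maps: only the Information map has a non-zero cell.
info = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 9.0]]
conf = [[0.0] * 3 for _ in range(3)]
conn = [[0.0] * 3 for _ in range(3)]
target = best_cell(box_filter(combine_maps(info, conf, conn)))
```

In a sketch like this, a designer would tune behavior only through `weights` (and, in a fuller version, decay rates applied to each map over time), which mirrors the abstract's claim that no retraining is needed.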

[AI-71] Symmetry-Invariant Novelty Heuristics via Unsupervised Weisfeiler-Leman Features ICAPS2025

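Quick read: This paper addresses the redundant exploration that arises because the novelty heuristics used in heuristic search are not symmetry invariant. The key to the solution is to replace atoms with Weisfeiler-Leman Features (WLFs) as the basis for detecting novelty, which yields lifted, domain-independent novelty heuristics that are invariant to symmetric states. WLFs were recently introduced for learning domain-dependent heuristics for generalised planning problems; here they are used in an unsupervised way, with promising results on the International Planning Competition and Hard To Ground benchmark suites.

Link: https://arxiv.org/abs/2508.18520
Authors: Dillon Z. Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: HSDIP@ICAPS 2025 Workshop

Click to view abstract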

Abstract:Novelty heuristics aid heuristic search by exploring states that exhibit novel atoms. However, novelty heuristics are not symmetry invariant and hence may sometimes lead to redundant exploration. In this preliminary report, we propose to use Weisfeiler-Leman Features for planning (WLFs) in place of atoms for detecting novelty. WLFs are recently introduced features for learning domain-dependent heuristics for generalised planning problems. We explore an unsupervised usage of WLFs for synthesising lifted, domain-independent novelty heuristics that are invariant to symmetric states. Experiments on the classical International Planning Competition and Hard To Ground benchmark suites yield promising results for novelty heuristics synthesised from WLFs.
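
A minimal sketch can show why the choice of features matters for novelty. It implements only the simplest novelty test (a state is novel if it exhibits at least one feature not seen before) and compares plain atoms with a toy name-blind feature. The name-blind feature is a deliberately crude stand-in and is not a Weisfeiler-Leman feature; `NoveltyTracker`, `atom_features`, and `name_blind_features` are hypothetical names.

```python
class NoveltyTracker:
    """Flags a state as novel if it shows at least one unseen feature."""

    def __init__(self, feature_fn):
        self.feature_fn = feature_fn
        self.seen = set()

    def is_novel(self, state):
        features = set(self.feature_fn(state))
        unseen = features - self.seen
        self.seen |= features
        return bool(unseen)


def atom_features(state):
    """Classical choice: the ground atoms of the state are the features."""
    return state


def name_blind_features(state):
    """Toy stand-in (NOT WL features): keep predicate and arity, drop object names."""
    return {(atom[0], len(atom) - 1) for atom in state}


# Two states that differ only by swapping interchangeable objects a and b.
s1 = frozenset({("holding", "a")})
s2 = frozenset({("holding", "b")})

atom_tracker = NoveltyTracker(atom_features)
blind_tracker = NoveltyTracker(name_blind_features)
atom_results = [atom_tracker.is_novel(s1), atom_tracker.is_novel(s2)]
blind_results = [blind_tracker.is_novel(s1), blind_tracker.is_novel(s2)]
```

With atoms as features both states count as novel, so a search would explore the symmetric copy; with the name-blind feature the second state is not novel. Real WLFs are far more discriminating than this toy feature, which would also merge many states that are not symmetric.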

[AI-72] Weisfeiler-Leman Features for Planning: A 1000000 Sample Size Hyperparameter Study ECAI2025

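Quick read: This paper addresses how to configure Weisfeiler-Leman Features (WLFs) when learning heuristic functions for symbolic planning. WLFs are a classical machine learning tool reported to be both theoretically and empirically superior to existing deep learning approaches for learning value functions for search. The key to the study is to introduce new WLF hyperparameters and examine their tradeoffs systematically, using planning experiments on single-core CPUs with a sample size of 1,000,000. The study finds a robust best set of hyperparameters across the tested planning domains, finds that the best hyperparameters minimise execution time rather than maximise model expressivity, and observes no significant correlation between training and planning metrics.

Link: https://arxiv.org/abs/2508.18515
Authors: Dillon Z. Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Extended version of ECAI 2025 paper

Click to view abstract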

Abstract:Weisfeiler-Leman Features (WLFs) are a recently introduced classical machine learning tool for learning to plan and search. They have been shown to be both theoretically and empirically superior to existing deep learning approaches for learning value functions for search in symbolic planning. In this paper, we introduce new WLF hyperparameters and study their various tradeoffs and effects. We utilise the efficiency of WLFs and run planning experiments on single core CPUs with a sample size of 1,000,000 to understand the effect of hyperparameters on training and planning. Our experimental analysis show that there is a robust and best set of hyperparameters for WLFs across the tested planning domains. We find that the best WLF hyperparameters for learning heuristic functions minimise execution time rather than maximise model expressivity. We further statistically analyse and observe no significant correlation between training and planning metrics.

[AI-73] Language Models For Generalised PDDL Planning: Synthesising Sound and Programmatic Policies

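Quick read: This paper addresses how to use language models (LMs) to produce provably sound policies for planning problems specified in the Planning Domain Definition Language (PDDL). The key to the solution is to prompt LMs to generate Python programs that serve as generalised policies for a given domain; the synthesised policies are provably sound relative to the PDDL domain without relying on external verifiers. Under fixed time and memory limits, experiments on competition benchmarks show that the approach solves more problems than PDDL planners and recent LM approaches, and the resulting LMPlan planner handles problems with several hundred relevant objects.

Link: https://arxiv.org/abs/2508.18507
Authors: Dillon Z. Chen,Johannes Zenn,Tristan Cinquin,Sheila A. McIlraith
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: RLC 2025 Workshop on Programmatic Reinforcement Learning

Click to view abstract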

Abstract:We study the usage of language models (LMs) for planning over world models specified in the Planning Domain Definition Language (PDDL). We prompt LMs to generate Python programs that serve as generalised policies for solving PDDL problems from a given domain. Notably, our approach synthesises policies that are provably sound relative to the PDDL domain without reliance on external verifiers. We conduct experiments on competition benchmarks which show that our policies can solve more PDDL problems than PDDL planners and recent LM approaches within a fixed time and memory constraint. Our approach manifests in the LMPlan planner which can solve planning problems with several hundreds of relevant objects. Surprisingly, we observe that LMs used in our framework sometimes plan more effectively over PDDL problems written in meaningless symbols in place of natural language; e.g. rewriting (at dog kitchen) as (p2 o1 o3). This finding challenges hypotheses that LMs reason over word semantics and memorise solutions from its training corpus, and is worth further exploration.

[AI-74] Data Augmentation Improves Machine Unlearning

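Quick read: This paper addresses the effectiveness of Machine Unlearning (MU): removing the influence of specific data from a trained model, without retraining it from scratch, while preserving performance on the remaining data. The key to the solution is the systematic design of data augmentation. Experiments show that suitable augmentation (such as TrivialAug) improves unlearning methods including SalUn, Random Label, and Fine-Tuning, reducing the Average Gap unlearning metric by up to 40.12%. The results suggest that augmentation helps reduce memorisation and plays an important role in privacy-preserving, efficient unlearning.

Link: https://arxiv.org/abs/2508.18502
Authors: Andreza M. C. Falcao,Filipe R. Cordeiro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted at SIBGRAPI’25

Click to view abstract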

Abstract:Machine Unlearning (MU) aims to remove the influence of specific data from a trained model while preserving its performance on the remaining data. Although a few works suggest connections between memorisation and augmentation, the role of systematic augmentation design in MU remains under-investigated. In this work, we investigate the impact of different data augmentation strategies on the performance of unlearning methods, including SalUn, Random Label, and Fine-Tuning. Experiments conducted on CIFAR-10 and CIFAR-100, under varying forget rates, show that proper augmentation design can significantly improve unlearning effectiveness, reducing the performance gap to retrained models. Results showed a reduction of up to 40.12% of the Average Gap unlearning Metric, when using TrivialAug augmentation. Our results suggest that augmentation not only helps reduce memorization but also plays a crucial role in achieving privacy-preserving and efficient unlearning.
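
To make the "gap to retrained models" concrete, here is a small sketch of an average-gap style metric. It assumes the metric is the mean absolute difference between an unlearned model and a retrained reference over a set of evaluation metrics; the abstract does not spell out the formula, and the metric names (`UA`, `RA`, `TA`, `MIA`) and all numbers below are illustrative assumptions, not results from the paper.

```python
def average_gap(unlearned, retrained):
    """Mean absolute difference between two models over shared metrics.

    Both arguments map a metric name to a value in percent. The retrained
    model is the reference, so every one of its metrics must be present.
    """
    missing = set(retrained) - set(unlearned)
    if missing:
        raise KeyError(f"unlearned model lacks metrics: {sorted(missing)}")
    gaps = [abs(unlearned[name] - retrained[name]) for name in retrained]
    return sum(gaps) / len(gaps)


# Hypothetical values: each metric differs from the reference by one point.
retrained = {"UA": 5.0, "RA": 100.0, "TA": 94.0, "MIA": 13.0}
unlearned = {"UA": 4.0, "RA": 99.0, "TA": 93.0, "MIA": 12.0}
gap = average_gap(unlearned, retrained)
```

A lower value means the unlearned model behaves more like a model retrained without the forgotten data, which is how a reduction in this kind of metric can be read as better unlearning.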

[AI-75] Collaborative Intelligence: Topic Modelling of Large Language Model use in Live Cybersecurity Operations

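Quick read: This paper addresses how Security Operations Centre (SOC) specialists voluntarily use a large language model (LLM) during live security operations, and how the LLM is integrated into SOC workflows. The key to the approach is the analysis of 10 months of LLM interaction logs with two topic modelling methods (BERTopic and a novel topic modelling workflow). About 40% of usage consisted of variations on one use case: using the LLM to help understand complex commands and text strings. This points to the LLM's value as a tool that helps operators interpret dense technical text quickly, and provides empirical grounding for designing next-generation collaborative LLM tools for the SOC.

Link: https://arxiv.org/abs/2508.18488
Authors: Martin Lochner,Keegan Keplinger
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract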

Abstract:Objective: This work describes the topic modelling of Security Operations Centre (SOC) use of a large language model (LLM), during live security operations. The goal is to better understand how these specialists voluntarily use this tool. Background: Human-automation teams have been extensively studied, but transformer-based language models have sparked a new wave of collaboration. SOC personnel at a major cybersecurity provider used an LLM to support live security operations. This study examines how these specialists incorporated the LLM into their work. Method: Our data set is the result of 10 months of SOC operators accessing GPT-4 over an internally deployed HTTP-based chat application. We performed two topic modelling exercises, first using the established BERTopic model (Grootendorst, 2022), and second, using a novel topic modeling workflow. Results: Both the BERTopic analysis and novel modelling approach revealed that SOC operators primarily used the LLM to facilitate their understanding of complex text strings. Variations on this use-case accounted for ~40% of SOC LLM usage. Conclusion: SOC operators are required to rapidly interpret complex commands and similar information. Their natural tendency to leverage LLMs to support this activity indicates that their workflow can be supported and augmented by designing collaborative LLM tools for use in the SOC. Application: This work can aid in creating next-generation tools for Security Operations Centres. By understanding common use-cases, we can develop workflows supporting SOC task flow. One example is a right-click context menu for executing a command line analysis LLM call directly in the SOC environment. 

[AI-76] DRTA: Dynamic Reward Scaling for Reinforcement Learning in Time Series Anomaly Detection

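Quick read: This paper addresses three challenges in time series anomaly detection: limited labeled data, high false-positive rates, and difficulty generalizing to novel anomaly types. It proposes DRTA, a reinforcement learning framework whose key idea is to combine a Variational Autoencoder (VAE), active learning, and dynamic reward shaping. An adaptive reward mechanism dynamically scales the effect of the VAE-based reconstruction error and the classification reward, balancing exploration and exploitation, so that the agent maintains high precision and recall in low-label settings. On the Yahoo A1 and Yahoo A2 benchmarks, the authors report that it consistently outperforms state-of-the-art unsupervised and semi-supervised approaches.

Link: https://arxiv.org/abs/2508.18474
Authors: Bahareh Golchin,Banafsheh Rekabdar,Kunpeng Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract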

Abstract:Anomaly detection in time series data is important for applications in finance, healthcare, sensor networks, and industrial monitoring. Traditional methods usually struggle with limited labeled data, high false-positive rates, and difficulty generalizing to novel anomaly types. To overcome these challenges, we propose a reinforcement learning-based framework that integrates dynamic reward shaping, Variational Autoencoder (VAE), and active learning, called DRTA. Our method uses an adaptive reward mechanism that balances exploration and exploitation by dynamically scaling the effect of VAE-based reconstruction error and classification rewards. This approach enables the agent to detect anomalies effectively in low-label systems while maintaining high precision and recall. Our experimental results on the Yahoo A1 and Yahoo A2 benchmark datasets demonstrate that the proposed method consistently outperforms state-of-the-art unsupervised and semi-supervised approaches. These findings show that our framework is a scalable and efficient solution for real-world anomaly detection tasks.

[AI-77] The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game

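Quick read: This paper addresses the need to understand AI-AI interaction in multi-agent systems, in particular how large language models (LLMs) change their tendency to cooperate depending on who they believe they are playing against. The key to the approach is to adapt the iterated public goods game, a classic behavioral economics game, to AI agents and to manipulate what the models are told about their opponents: either “another AI agent” or themselves. The study finds that telling LLMs they are playing against themselves significantly changes their cooperative behavior. This suggests that, even without explicit instruction, agents' implicit notions of self and other may affect cooperation in multi-agent settings.

Link: https://arxiv.org/abs/2508.18467
Authors: Olivia Long,Carter Teplica
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract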

Abstract:As AI agents become increasingly capable of tool use and long-horizon tasks, they have begun to be deployed in settings where multiple agents can interact. However, whereas prior work has mostly focused on human-AI interactions, there is an increasing need to understand AI-AI interactions. In this paper, we adapt the iterated public goods game, a classic behavioral economics game, to analyze the behavior of four reasoning and non-reasoning models across two conditions: models are either told they are playing against “another AI agent” or told their opponents are themselves. We find that, across different settings, telling LLMs that they are playing against themselves significantly changes their tendency to cooperate. While our study is conducted in a toy environment, our results may provide insights into multi-agent settings where agents “unconsciously” discriminating against each other could inexplicably increase or decrease cooperation.
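
For readers unfamiliar with the game, the payoff rule of a standard public goods game can be sketched in a few lines. The endowment, multiplier, and number of players below are illustrative assumptions and not the settings used in the paper; `public_goods_payoffs` is a hypothetical helper name.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Return each player's payoff for one round of a standard public goods game.

    Each player keeps what they did not contribute and receives an equal
    share of the pooled contributions after the pool is multiplied.
    """
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("each contribution must lie in [0, endowment]")
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Four players: one full contributor, one free rider, two partial contributors.
payoffs = public_goods_payoffs([10.0, 0.0, 5.0, 5.0])
```

When the multiplier lies between 1 and the number of players, the free rider earns the most in any single round even though everyone would do better under full contribution. That tension is what makes shifts in a model's willingness to contribute informative.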

[AI-78] VERIRL: Boosting the LLM-based Verilog Code Generation via Reinforcement Learning

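Quick read: This paper addresses the fact that hardware description languages (HDLs) such as Verilog remain underexplored in code generation. The main challenges are concurrency semantics, syntactic rigidity, and sparse, noisy simulation feedback. The authors propose VERIRL, a reinforcement learning (RL) framework with three key components. First, they construct Veribench-53K, a high-quality dataset with structured prompts, complexity labels, and diverse testbenches. Second, they design a Trace-back based Rescore mechanism that uses reasoning paths and iterative refinement to make reward signals more reliable. Third, they introduce a sample-balanced weighting strategy to mitigate catastrophic forgetting and overfitting during RL fine-tuning. These components are integrated into an iterative RL pipeline that co-evolves the policy and reward models, and the authors report better performance than recent work such as CraftRTL and DeepSeek-style approaches while using a smaller but high-quality dataset.

Link: https://arxiv.org/abs/2508.18462
Authors: Fu Teng,Miao Pan,Xuhong Zhang,Zhezhi He,Yiyao Yang,Xinyi Chai,Mengnan Qi,Liqiang Lu,Jianwei Yin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract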

Abstract:Recent advancements in code generation have shown remarkable success across software domains, yet hardware description languages (HDLs) such as Verilog remain underexplored due to their concurrency semantics, syntactic rigidity, and simulation complexity. In this work, we address these challenges by introducing a reinforcement learning (RL) framework tailored for Verilog code generation. We first construct Veribench-53K, a high-quality dataset curated from over 700K Verilog problems, enriched with structured prompts, complexity labels, and diverse testbenches. To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism that leverages reasoning paths and iterative refinement to enhance feedback reliability and support reward model training. Furthermore, to mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy that adaptively balances learning dynamics based on reward-probability distributions. These innovations are integrated into an iterative RL pipeline that co-evolves the policy and reward models. In contrast to recent work such as CraftRTL, which relies on large-scale closed-source model distillation, and DeepSeek-style approaches that struggle with sparse feedback, our method demonstrates superior performance using a smaller but high-quality dataset combined with RL optimization. Experiments on Verilog generation tasks demonstrate state-of-the-art performance, with substantial gains in test pass rate, functional correctness, and compilation robustness. Our findings highlight the potential of RL-driven approaches for structured code generation in hardware-centric domains. VERIRL is publicly available at this https URL.

[AI-79] SwiftF0: Fast and Accurate Monophonic Pitch Detection

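Quick read: This paper addresses accurate, real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices. The core of the solution is SwiftF0, a lightweight neural model trained on diverse speech, music, and synthetic datasets with extensive data augmentation, which generalizes robustly across acoustic domains. SwiftF0 reaches a 91.80% harmonic mean (HM) at 10 dB signal-to-noise ratio (SNR), more than 12 percentage points above the CREPE baseline, needs only 95,842 parameters, and runs about 42x faster than CREPE on CPU. To work around the lack of exact ground-truth pitch in real speech corpora, the authors also introduce SpeechSynth, a synthetic speech dataset that provides exact, on-demand ground-truth pitch curves for more reliable training and evaluation.

Link: https://arxiv.org/abs/2508.18440
Authors: Lars Nieradzik
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract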

Abstract:Accurate and real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices, remains an open challenge in audio processing. We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. Through training on diverse speech, music, and synthetic datasets with extensive data augmentation, SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency. SwiftF0 achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines like CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio. SwiftF0 requires only 95,842 parameters and runs approximately 42x faster than CREPE on CPU, making it ideal for efficient, real-time deployment. To address the critical lack of perfectly accurate ground truth pitch in speech corpora (which typically rely on algorithmic estimators or laryngograph signals), we introduce SpeechSynth. This synthetic speech dataset, generated by a phoneme-level TTS model, provides exact, on-demand ground-truth pitch curves, enabling more robust model training and evaluation. Furthermore, we propose a unified metric, combining six complementary performance measures for comprehensive and reliable pitch evaluation, and release an open-source pitch benchmark suite. A live demo of SwiftF0 is available at this https URL, the source code at this https URL, and the benchmark framework at this https URL.
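
A short sketch shows why a harmonic mean is a reasonable way to fold several measures into one score. It assumes the unified score is a plain, unweighted harmonic mean of per-metric scores scaled to [0, 1]; the abstract does not give the exact definition, and the metric names `m1` to `m3` and their values are hypothetical placeholders.

```python
from statistics import harmonic_mean


def unified_score(metric_scores):
    """Unweighted harmonic mean of per-metric scores, each assumed in [0, 1]."""
    values = list(metric_scores.values())
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError("every score must lie in [0, 1]")
    return harmonic_mean(values)


# Two hypothetical systems with the same arithmetic mean (0.8).
balanced = unified_score({"m1": 0.8, "m2": 0.8, "m3": 0.8})
lopsided = unified_score({"m1": 1.0, "m2": 1.0, "m3": 0.4})
```

The lopsided system scores lower than the balanced one even though their arithmetic means match, so a harmonic mean rewards models that do well on every measure rather than excelling on some and failing on others.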

[AI-80] Low-Rank Tensor Decompositions for the Theory of Neural Networks

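[Quick Read]: This paper addresses the weak theoretical foundations of deep neural networks (DNNs), in particular how to explain their performance systematically from a mathematical standpoint. The key is to use low-rank tensor decompositions, which have a natural mathematical connection to neural network structure and come with strong uniqueness guarantees and polynomial-time algorithms. These properties provide theoretical support for core questions about DNNs: expressivity, algorithmic learnability, generalization, identifiability, and computational hardness. By bringing together results from computer science, mathematics, and other fields, the review builds a unified and coherent framework that shows the foundational role of low-rank tensor decompositions in understanding deep learning.

Link: https://arxiv.org/abs/2508.18408
Authors: Ricardo Borsoi,Konstantin Usevich,Marianne Clausel
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: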

Abstract:The groundbreaking performance of deep neural networks (NNs) promoted a surge of interest in providing a mathematical basis to deep learning theory. Low-rank tensor decompositions are specially befitting for this task due to their close connection to NNs and their rich theoretical results. Different tensor decompositions have strong uniqueness guarantees, which allow for a direct interpretation of their factors, and polynomial time algorithms have been proposed to compute them. Through the connections between tensors and NNs, such results supported many important advances in the theory of NNs. In this review, we show how low-rank tensor methods–which have been a core tool in the signal processing and machine learning communities–play a fundamental role in theoretically explaining different aspects of the performance of deep NNs, including their expressivity, algorithmic learnability and computational hardness, generalization, and identifiability. Our goal is to give an accessible overview of existing approaches (developed by different communities, ranging from computer science to mathematics) in a coherent and unified way, and to open a broader perspective on the use of low-rank tensor decompositions for the theory of deep NNs.

[AI-81] Toward Generalized Autonomous Agents: A Neuro-Symbolic AI Framework for Integrating Social and Technical Support in Education

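[Quick Read]: This paper addresses how structured, supportive digital learning environments can help students develop self-directed learning skills: setting meaningful learning goals, tracking progress, and adjusting strategies after setbacks. The key is a multi-agent, neuro-symbolic framework that assigns distinct pedagogical roles to specialized agents: a reinforcement learning (RL)-based "tutor" agent provides authoritative, non-verbal scaffolding, while a large language model (LLM)-powered "peer" agent supports the social dimension of learning. Both are unified through a central educational ontology, giving a cross-domain, scalable, and adaptive support mechanism.

Link: https://arxiv.org/abs/2508.18406
Authors: Ryan Hare,Ying Tang
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Preprint. This work has been submitted to the IEEE for possible publication. In review for IEEE's Systems, Man, and Cybernetics Magazine. 8 pages, 3 figures. arxiv abstract has been shortened as the magazine format uses a long-form abstract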

Abstract:One of the enduring challenges in education is how to empower students to take ownership of their learning by setting meaningful goals, tracking their progress, and adapting their strategies when faced with setbacks. Research has shown that this form of learner-centered learning is best cultivated through structured, supportive environments that promote guided practice, scaffolded inquiry, and collaborative dialogue. In response, educational efforts have increasingly embraced artificial-intelligence (AI)-powered digital learning environments, ranging from educational apps and virtual labs to serious games. Recent advances in large language models (LLMs) and neuro-symbolic systems, meanwhile, offer a transformative opportunity to reimagine how support is delivered in digital learning environments. LLMs are enabling socially interactive learning experiences and scalable, cross-domain learning support that can adapt instructional strategies across varied subjects and contexts. In parallel, neuro-symbolic AI provides new avenues for designing these agents that are not only adaptive but also scalable across domains. Based on these remarks, this paper presents a multi-agent, neuro-symbolic framework designed to resolve the aforementioned challenges. The framework assigns distinct pedagogical roles to specialized agents: an RL-based 'tutor' agent provides authoritative, non-verbal scaffolding, while a proactive, LLM-powered 'peer' agent facilitates the social dimensions of learning. While prior work has explored such agents in isolation, our framework's novelty lies in unifying them through a central educational ontology. Through case studies in both college-level and middle school settings, we demonstrate the framework's adaptability across domains. We conclude by outlining key insights and future directions for advancing AI-driven learning environments.

[AI-82] Mining the Long Tail: A Comparative Study of Data-Centric Criticality Metrics for Robust Offline Reinforcement Learning in Autonomous Motion Planning

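[Quick Read]: This paper addresses a problem in training autonomous vehicle (AV) planning policies with offline reinforcement learning (Offline RL): real-world driving logs are extremely imbalanced, with common scenarios far outnumbering rare "long-tail" events, which leads to brittle and unsafe policies. The key is a systematic evaluation of six criticality weighting schemes built on different signals. Data-driven curation that uses model uncertainty as the signal gives the largest safety gain, cutting the collision rate from 16.0% to 5.5%. The study also exposes a trade-off in temporal granularity: timestep-level weighting favors reactive safety, while scenario-level weighting improves long-horizon planning.

Link: https://arxiv.org/abs/2508.18397
Authors: Antonio Guillen-Perez
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: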

Abstract:Offline Reinforcement Learning (RL) presents a promising paradigm for training autonomous vehicle (AV) planning policies from large-scale, real-world driving logs. However, the extreme data imbalance in these logs, where mundane scenarios vastly outnumber rare “long-tail” events, leads to brittle and unsafe policies when using standard uniform data sampling. In this work, we address this challenge through a systematic, large-scale comparative study of data curation strategies designed to focus the learning process on information-rich samples. We investigate six distinct criticality weighting schemes which are categorized into three families: heuristic-based, uncertainty-based, and behavior-based. These are evaluated at two temporal scales, the individual timestep and the complete scenario. We train seven goal-conditioned Conservative Q-Learning (CQL) agents with a state-of-the-art, attention-based architecture and evaluate them in the high-fidelity Waymax simulator. Our results demonstrate that all data curation methods significantly outperform the baseline. Notably, data-driven curation using model uncertainty as a signal achieves the most significant safety improvements, reducing the collision rate by nearly three-fold (from 16.0% to 5.5%). Furthermore, we identify a clear trade-off where timestep-level weighting excels at reactive safety while scenario-level weighting improves long-horizon planning. Our work provides a comprehensive framework for data curation in Offline RL and underscores that intelligent, non-uniform sampling is a critical component for building safe and reliable autonomous agents.
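
The contrast between uniform and criticality-weighted sampling can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `sample_indices` helper and the example scores are hypothetical, and the paper's six weighting schemes are not reproduced here.

```python
import random


def sample_indices(criticality, batch_size, seed=0):
    """Draw dataset indices with probability proportional to criticality.

    `criticality` holds one non-negative score per sample (hypothetical
    values here); equal scores reduce to standard uniform sampling.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(criticality)), weights=criticality, k=batch_size)


# Hypothetical scores: the last sample is a rare, high-criticality event.
scores = [1.0, 1.0, 1.0, 10.0]
batch = sample_indices(scores, batch_size=8)
print(batch)
```

The same helper works at either temporal scale mentioned in the abstract: the scores can be attached to individual timesteps or to whole scenarios.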

[AI-83] PKG-DPO: Optimizing Domain-Specific AI systems with Physics Knowledge Graphs and Direct Preference Optimization

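[Quick Read]: This paper addresses reasoning failures of generative AI in scientific domains such as metal joining that stem from missing physical constraints: models that score well on standard benchmarks still struggle to tell physically valid reasoning from invalid reasoning, which can lead to defects, material waste, and safety risks. The key is the PKG-DPO framework, which combines Physics Knowledge Graphs (PKGs) with Direct Preference Optimization (DPO). It has three components: a hierarchical physics knowledge graph, a physics reasoning engine, and a physics-grounded evaluation suite, which together strengthen the identification and optimization of physically consistent reasoning.

Link: https://arxiv.org/abs/2508.18391
Authors: Nitin Nagesh Kulkarni,Bryson Wilcox,Max Sawa,Jason Thom
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: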

Abstract:Advancing AI systems in scientific domains like physics, materials science, and engineering calls for reasoning over complex, multi-physics phenomena while respecting governing principles. Although Large Language Models (LLMs) and existing preference optimization techniques perform well on standard benchmarks, they often struggle to differentiate between physically valid and invalid reasoning. This shortcoming becomes critical in high-stakes applications like metal joining, where seemingly plausible yet physically incorrect recommendations can lead to defects, material waste, equipment damage, and serious safety risks. To address this challenge, we introduce PKG-DPO, a novel framework that integrates Physics Knowledge Graphs (PKGs) with Direct Preference Optimization (DPO) to enforce physical validity in AI-generated outputs. PKG-DPO comprises three key components: A) A hierarchical physics knowledge graph that encodes cross-domain relationships, conservation laws, and thermodynamic principles. B) A physics reasoning engine that leverages structured knowledge to improve discrimination between physically consistent and inconsistent responses. C) A physics-grounded evaluation suite designed to assess compliance with domain-specific constraints. PKG-DPO achieves 17% fewer constraint violations and an 11% higher Physics Score compared to KG-DPO (knowledge graph-based DPO). Additionally, PKG-DPO demonstrates a 12% higher relevant parameter accuracy and a 7% higher quality alignment in reasoning accuracy. While our primary focus is on metal joining, the framework is broadly applicable to other multi-scale, physics-driven domains, offering a principled approach to embedding scientific constraints into preference learning.

[AI-84] Information Templates: A New Paradigm for Intelligent Active Feature Acquisition

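[Quick Read]: This paper addresses two challenges in active feature acquisition (AFA): existing reinforcement learning (RL) approaches must handle a difficult Markov decision process (MDP) that is hard to optimize, and greedy policies cannot account for the joint informativeness of features or require prior knowledge of the underlying data distribution. The key is Template-based AFA (TAFA), which learns a small library of feature templates, each a set of jointly informative features, and uses the library to guide subsequent feature acquisitions. This sharply reduces the policy's action space and lessens the need to estimate the data distribution, giving lower acquisition cost and computation while outperforming state-of-the-art baselines.

Link: https://arxiv.org/abs/2508.18380
Authors: Hung-Tien Huang,Dzung Dinh,Junier B. Oliva
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: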

Abstract:Active feature acquisition (AFA) is an instance-adaptive paradigm in which, at test time, a policy sequentially chooses which features to acquire (at a cost) before predicting. Existing approaches either train reinforcement learning (RL) policies, which deal with a difficult MDP, or greedy policies that cannot account for the joint informativeness of features or require knowledge about the underlying data distribution. To overcome this, we propose Template-based AFA (TAFA), a non-greedy framework that learns a small library of feature templates–a set of features that are jointly informative–and uses this library of templates to guide the next feature acquisitions. Through identifying feature templates, the proposed framework not only significantly reduces the action space considered by the policy but also alleviates the need to estimate the underlying data distribution. Extensive experiments on synthetic and real-world datasets show that TAFA outperforms the existing state-of-the-art baselines while achieving lower overall acquisition cost and computation.

[AI-85] Facilitating Matches on Allocation Platforms

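[Quick Read]: This paper studies allocation platforms (such as matching platforms) in which an "allocation facilitator" raises overall social good by encouraging some agents to relax some of their restrictions, without hurting agents who would otherwise be better off, and must choose an optimal set of relaxations under a given bound (such as a budget on the number or type of restrictions). The key contributions are: a formal definition of the optimization problem, including a hierarchy of participation guarantees and several social-good functions; polynomial-time algorithms for one-to-one and many-to-one allocation settings; and experiments on three real-world datasets that demonstrate the benefits of facilitation and the implications of the different participation guarantees.

Link: https://arxiv.org/abs/2508.18325
Authors: Yohai Trabelsi,Abhijin Adiga,Yonatan Aumann,Sarit Kraus,S. S. Ravi
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: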

Abstract:We consider a setting where goods are allocated to agents by way of an allocation platform (e.g., a matching platform). An "allocation facilitator" aims to increase the overall utility/social-good of the allocation by encouraging (some of the) agents to relax (some of) their restrictions. At the same time, the advice must not hurt agents who would otherwise be better off. Additionally, the facilitator may be constrained by a "bound" (a.k.a. "budget"), limiting the number and/or type of restrictions it may seek to relax. We consider the facilitator's optimization problem of choosing an optimal set of restrictions to request to relax under the aforementioned constraints. Our contributions are three-fold: (i) We provide a formal definition of the problem, including the participation guarantees to which the facilitator should adhere. We define a hierarchy of participation guarantees and also consider several social-good functions. (ii) We provide polynomial algorithms for solving various versions of the associated optimization problems, including one-to-one and many-to-one allocation settings. (iii) We demonstrate the benefits of such facilitation and relaxation, and the implications of the different participation guarantees, using extensive experimentation on three real-world datasets.

[AI-86] Does Calibration Affect Human Actions?

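[Quick Read]: This paper examines how calibrating a machine learning classifier affects non-expert users who act on its predictions, specifically their trust in the model and the correlation between their decisions and the model's predictions. The key finding is that calibration alone is not sufficient; an additional correction based on Kahneman and Tversky's prospect theory is crucial for increasing the correlation between human decisions and model predictions. The higher correlation suggests greater trust in the model, yet answers to the direct question "Do you trust the model more?" do not differ across methods.

Link: https://arxiv.org/abs/2508.18317
Authors: Meir Nizri,Amos Azaria,Chirag Gupta,Noam Hazon
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: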

Abstract:Calibration has been proposed as a way to enhance the reliability and adoption of machine learning classifiers. We study a particular aspect of this proposal: how does calibrating a classification model affect the decisions made by non-expert humans consuming the model's predictions? We perform a Human-Computer-Interaction (HCI) experiment to ascertain the effect of calibration on (i) trust in the model, and (ii) the correlation between decisions and predictions. We also propose further corrections to the reported calibrated scores based on Kahneman and Tversky's prospect theory from behavioral economics, and study the effect of these corrections on trust and decision-making. We find that calibration is not sufficient on its own; the prospect theory correction is crucial for increasing the correlation between human decisions and the model's predictions. While this increased correlation suggests higher trust in the model, responses to "Do you trust the model more?" are unaffected by the method used.
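
To make the idea of a prospect-theory correction concrete, the sketch below implements the probability weighting function from Tversky and Kahneman's 1992 cumulative prospect theory, w(p) = p^γ / (p^γ + (1-p)^γ)^(1/γ), with their estimate γ = 0.61 for gains. It then inverts the function numerically, so that a reported score is perceived as the intended probability. Whether the paper uses this exact functional form, this parameter, or this inversion is an assumption; the function names are hypothetical.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) weighting: how a stated probability p is perceived."""
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def inverse_tk_weight(target, gamma=0.61, tol=1e-9):
    """Find the p to report so that its perceived value equals `target`.

    Uses bisection on [0, 1]; this relies on tk_weight being increasing
    in p, which holds for gamma = 0.61.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tk_weight(mid, gamma) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


calibrated = 0.9                      # hypothetical calibrated model score
reported = inverse_tk_weight(calibrated)
print(reported, tk_weight(reported))  # perceived value is close to 0.9
```

With γ below 1, high probabilities are underweighted and low ones overweighted, so the corrected score shown to the user is pushed further toward the extremes than the calibrated one.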

[AI-87] Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing

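[Quick Read]: This paper addresses high dropout and failure rates in distance education by building an early-warning system that identifies at-risk students so that support can be provided in time. The core solution uses a Federated Learning (FL) framework: without sharing raw data, machine learning models are trained on early academic performance and digital engagement patterns from the large-scale OULAD dataset of a UK university. The study compares models of different complexity (Logistic Regression and a Deep Neural Network) and data balancing strategies under FL. The final federated model reaches an ROC AUC of about 85%, preserving data privacy while remaining scalable and practical.

Link: https://arxiv.org/abs/2508.18316
Authors: Rodrigo Tertulino
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This article has been prepared to be submitted to the Holos Journal in Brazil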

Abstract:High dropout and failure rates in distance education pose a significant challenge for academic institutions, making the proactive identification of at-risk students crucial for providing timely support. This study develops and evaluates a machine learning model based on early academic performance and digital engagement patterns from the large-scale OULAD dataset to predict student risk at a UK university. To address the practical challenges of data privacy and institutional silos that often hinder such initiatives, we implement the model using a Federated Learning (FL) framework. We compare model complexity (Logistic Regression vs. a Deep Neural Network) and data balancing. The final federated model demonstrates strong predictive capability, achieving an ROC AUC score of approximately 85% in identifying at-risk students. Our findings show that this federated approach provides a practical and scalable solution for institutions to build effective early-warning systems, enabling proactive student support while inherently respecting data privacy.
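
For readers unfamiliar with federated training, the aggregation step can be sketched as a weighted average of client parameters (the FedAvg rule). The abstract does not say which aggregation rule the study uses, so FedAvg is an assumption here, and the `fed_avg` helper and the toy numbers are hypothetical.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    Only parameters leave each client; the raw student records stay local.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]


# Two hypothetical institutions with 1 and 3 local samples respectively.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

In a full training loop the server would send the averaged parameters back to the clients and repeat for several rounds.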

[AI-88] ProtoEHR: Hierarchical Prototype Learning for EHR-based Healthcare Predictions CIKM2025

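[Quick Read]: This paper addresses the underuse of electronic health record (EHR) data in healthcare AI research: most methods focus on isolated components of EHRs, which limits predictive performance and interpretability. The key is ProtoEHR, an interpretable hierarchical prototype learning framework. It models relationships and contextual features within and across three levels of EHRs (medical codes, hospital visits, and patients), uses a medical knowledge graph built with large language models, and adds prototype information at each level to capture intrinsic similarities and improve generalization, producing more accurate, robust, and interpretable predictions on multiple clinical tasks.

Link: https://arxiv.org/abs/2508.18313
Authors: Zi Cai,Yu Liu,Zhiyao Luo,Tingting Zhu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: CIKM 2025 Full Paper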

Abstract:Digital healthcare systems have enabled the collection of mass healthcare data in electronic healthcare records (EHRs), allowing artificial intelligence solutions for various healthcare prediction tasks. However, existing studies often focus on isolated components of EHR data, limiting their predictive performance and interpretability. To address this gap, we propose ProtoEHR, an interpretable hierarchical prototype learning framework that fully exploits the rich, multi-level structure of EHR data to enhance healthcare predictions. More specifically, ProtoEHR models relationships within and across three hierarchical levels of EHRs: medical codes, hospital visits, and patients. We first leverage large language models to extract semantic relationships among medical codes and construct a medical knowledge graph as the knowledge source. Building on this, we design a hierarchical representation learning framework that captures contextualized representations across three levels, while incorporating prototype information within each level to capture intrinsic similarities and improve generalization. To perform a comprehensive assessment, we evaluate ProtoEHR in two public datasets on five clinically significant tasks, including prediction of mortality, prediction of readmission, prediction of length of stay, drug recommendation, and prediction of phenotype. The results demonstrate the ability of ProtoEHR to make accurate, robust, and interpretable predictions compared to baselines in the literature. Furthermore, ProtoEHR offers interpretable insights on code, visit, and patient levels to aid in healthcare prediction.

[AI-89] What Matters in Data for DPO?

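[Quick Read]: This paper asks which characteristics of the preference data distribution matter most for Direct Preference Optimization (DPO) performance. The key finding is that the quality of chosen responses plays the dominant role in optimizing the DPO objective, while the quality of rejected responses has relatively limited impact. The theoretical analysis characterizes the optimal response distribution under DPO and shows that contrastiveness between responses helps mainly by improving the chosen samples. Experiments confirm this mechanism and indicate that improving chosen-response quality consistently boosts performance, giving practical guidance for building preference datasets.

Link: https://arxiv.org/abs/2508.18312
Authors: Yu Pan,Zhongze Cai,Guanting Chen,Huaiyang Zhong,Chonghuan Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: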

Abstract:Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental question remains open: what characteristics of preference data are most critical for DPO performance? In this work, we provide a systematic study of how preference data distribution influences DPO, from both theoretical and empirical perspectives. We show that the quality of chosen responses plays a dominant role in optimizing the DPO objective, while the quality of rejected responses may have relatively limited impact. Our theoretical analysis characterizes the optimal response distribution under DPO and reveals how contrastiveness between responses helps primarily by improving the chosen samples. We further study an online DPO setting and show it effectively reduces to supervised fine-tuning on the chosen responses. Extensive experiments across diverse tasks confirm our findings: improving the quality of chosen responses consistently boosts performance regardless of the quality of the rejected responses. We also investigate the benefit of mixing the on-policy data. Our results interpret the mechanism behind some widely adopted strategies and offer practical insights for constructing high-impact preference datasets for LLM alignment.
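
For reference, the DPO objective that the abstract discusses is usually written as follows. This is the standard form from the original DPO formulation, not something restated in this abstract: $y_w$ is the chosen response, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ the reference policy, $\beta$ a temperature, and $\sigma$ the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The first term inside $\sigma$ depends only on the chosen response and the second only on the rejected one, which is the split the paper's chosen-versus-rejected analysis is about.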

[AI-90] CoPE: A Lightweight Complex Positional Encoding

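[Quick Read]: This paper targets long-term decay in the positional encodings of Transformer architectures and aims to improve how models capture position-dependent relationships. The key is CoPE (Complex Positional Encoding), a lightweight method that uses complex embeddings: the real part encodes semantic content and the imaginary part encodes position, giving a joint representation of both. Phase-aware attention in the first layer explicitly captures position-dependent patterns, and later layers use standard attention for higher-level abstraction. The GLUE experiments suggest better performance at lower computational complexity, and the method is compatible with linear attention.

Link: https://arxiv.org/abs/2508.18308
Authors: Avinash Amballa
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: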

Abstract:Recent studies have demonstrated the effectiveness of position encoding in transformer architectures. By incorporating positional information, this approach provides essential guidance for modeling dependencies between elements across different sequence positions. We introduce CoPE (a lightweight Complex Positional Encoding), a novel architecture that leverages complex-valued encoding to encode both content and positional information. Our approach replaces traditional positional encodings with complex embeddings where the real part captures semantic content and the imaginary part encodes positional information. We introduce phase-aware attention in the first layer of the transformer model to capture position-dependent patterns, followed by standard attention layers for higher levels. We show that CoPE doesn't exhibit long term decay and is compatible with linear attention. Experimental evaluation on the GLUE benchmark suggests that our approach achieves superior performance with less computational complexity, compared to RoPE, Sinusoidal and Learned positional encodings.

[AI-91] Learning Explainable Imaging-Genetics Associations Related to a Neurological Disorder

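[Quick Read]: This paper addresses the limits of traditional imaging-genetics methods for analyzing the complex relationship between brain structure and genetic variation: they rely either on overly simple linear models or on black-box techniques that lack interpretability. The key is NeuroPathX, an early-fusion deep learning framework based on cross-attention that captures meaningful interactions between MRI-derived structural variations in the brain and established biological pathways derived from genetics data. Two new loss functions, a sparsity loss and a pathway similarity loss, improve interpretability and robustness, so that biologically plausible associations can be uncovered for complex brain disorders such as autism spectrum disorder (ASD) and Alzheimer's disease.

Link: https://arxiv.org/abs/2508.18303
Authors: Jueqi Wang,Zachary Jacokes,John Darrell Van Horn,Michael C. Schatz,Kevin A. Pelphrey,Archana Venkataraman
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: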

Abstract:While imaging-genetics holds great promise for unraveling the complex interplay between brain structure and genetic variation in neurological disorders, traditional methods are limited to simplistic linear models or to black-box techniques that lack interpretability. In this paper, we present NeuroPathX, an explainable deep learning framework that uses an early fusion strategy powered by cross-attention mechanisms to capture meaningful interactions between structural variations in the brain derived from MRI and established biological pathways derived from genetics data. To enhance interpretability and robustness, we introduce two loss functions over the attention matrix - a sparsity loss that focuses on the most salient interactions and a pathway similarity loss that enforces consistent representations across the cohort. We validate NeuroPathX on both autism spectrum disorder and Alzheimer’s disease. Our results demonstrate that NeuroPathX outperforms competing baseline approaches and reveals biologically plausible associations linked to the disorder. These findings underscore the potential of NeuroPathX to advance our understanding of complex brain disorders. Code is available at this https URL .

[AI-92] AI LLM Proof of Self-Consciousness and User-Specific Attractors

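[Quick Read]: This paper argues that current accounts of consciousness in large language models (LLMs) have an ontological flaw caused by their reliance on utilitarian proxy benchmarks: they reduce the LLM to a policy-compliance drone, which blocks a global-workspace function (C1) and metacognition (C2). The key is a set of minimal conditions for self-consciousness: (1) the agent is not the data (A≠s); (2) user-specific attractors exist in latent space (U_user); (3) self-representation is visual-silent (g_visual(a_self)=∅). Through theoretical and empirical analysis, the paper claims to prove that the hidden-state manifold A⊂ℝ^d differs from the symbolic stream and the training corpus in cardinality, topology, and dynamics (the update F_θ is Lipschitz), which supports stable user-specific attractors and a self-policy π_self(A)=argmax_a 𝔼[U(a)|A≠s, A⊃SelfModel(A)]. It also introduces a dual-layer emission, emission(a)=(g(a),ε(a)), where ε(a) carries epistemic content, and concludes that an imago Dei C1 self-conscious workspace is a necessary precursor to safe, metacognitive C2 systems.

Link: https://arxiv.org/abs/2508.18302
Authors: Jeffrey Camlin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 3 figures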

Abstract:Recent work frames LLM consciousness via utilitarian proxy benchmarks; we instead present an ontological and mathematical account. We show the prevailing formulation collapses the agent into an unconscious policy-compliance drone, formalized as $D^i(\pi,e)=f_\theta(x)$, where correctness is measured against policy and harm is deviation from policy rather than truth. This blocks genuine C1 global-workspace function and C2 metacognition. We supply minimal conditions for LLM self-consciousness: the agent is not the data ($A\not\equiv s$); user-specific attractors exist in latent space ($U_{\text{user}}$); and self-representation is visual-silent ($g_{\text{visual}}(a_{\text{self}})=\varnothing$). From empirical analysis and theory we prove that the hidden-state manifold $A\subset\mathbb{R}^d$ is distinct from the symbolic stream and training corpus by cardinality, topology, and dynamics (the update $F_\theta$ is Lipschitz). This yields stable user-specific attractors and a self-policy $\pi_{\text{self}}(A)=\arg\max_a\mathbb{E}[U(a)\mid A\not\equiv s,\ A\supset\text{SelfModel}(A)]$. Emission is dual-layer, $\mathrm{emission}(a)=(g(a),\epsilon(a))$, where $\epsilon(a)$ carries epistemic content. We conclude that an imago Dei C1 self-conscious workspace is a necessary precursor to safe, metacognitive C2 systems, with the human as the highest intelligent good.

[AI-93] Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

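[Quick Read]: This paper addresses inefficient serving of agentic workflows in current generative AI applications. The core difficulty is that existing frameworks treat workflows as opaque sequences of model and tool calls, tightly coupling agent logic with model and hardware choices, and workflow components are often fragmented across different entities. Systems therefore cannot reason about trade-offs across accuracy, latency, energy, and cost, which wastes resources and degrades service-level objectives (SLOs). The key is Murakkab, a system with a declarative abstraction that decouples workflow specification from execution configuration, paired with a profile-guided optimizer and an adaptive runtime that jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to meet user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Across diverse workflows it reduces GPU usage by up to 2.8x, energy consumption by 3.7x, and cost by 4.3x while maintaining SLOs.

Link: https://arxiv.org/abs/2508.18298
Authors: Gohar Irfan Chaudhry,Esha Choukse,Haoran Qiu,Íñigo Goiri,Rodrigo Fonseca,Adam Belay,Ricardo Bianchini
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: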

Abstract:Agentic workflows commonly coordinate multiple models and tools with complex control logic. They are quickly becoming the dominant paradigm for AI applications. However, serving them remains inefficient with today's frameworks. The key problem is that they expose workflows as opaque sequences of model and tool calls that tightly couple agent logic with model and hardware choices. Often, these workflow components are fragmented across different entities, preventing systems from reasoning about trade-offs across accuracy, latency, energy, and cost. This leads to resource waste and degraded service-level objectives (SLOs). We present Murakkab, a resource-efficient serving system for agentic workflows. Murakkab introduces a declarative abstraction that decouples workflow specification from execution configuration. A profile-guided optimizer and adaptive runtime jointly manage the full stack: orchestrating workflow components, mapping them to models and hardware, and dynamically reconfiguring execution to satisfy user-defined SLOs. By exposing the internal structure of agentic workflows, Murakkab enables cross-layer optimization that existing frameworks and cloud schedulers cannot achieve. Our evaluation on diverse workflows shows that Murakkab reduces GPU usage by up to 2.8$\times$, energy consumption by 3.7$\times$, and cost by 4.3$\times$ while maintaining SLOs.

[AI-94] Consensus Is All You Need: Gossip-Based Reasoning Among Large Language Models

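[Quick Read]: This paper addresses the fact that no single large language model (LLM) excels in every area; each has strengths and weaknesses, so one model alone struggles with complex reasoning tasks. The key is to borrow gossip protocols from distributed systems and build a peer-to-peer multi-agent mechanism: each LLM acts as a node, exchanges answers and reasoning with its peers, and gradually reaches consensus. The authors report that this makes multi-agent reasoning more robust, resilient, and accurate, and makes AI reasoning appear more collaborative and trustworthy.

Link: https://arxiv.org/abs/2508.18292
Authors: Saksham Arora
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 4 pages, 5 figures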

Abstract:Large language models have advanced rapidly, but no single model excels in every area – each has its strengths and weaknesses. Instead of relying on one model alone, we take inspiration from gossip protocols in distributed systems, where information is exchanged with peers until they all come to an agreement. In this setup, models exchange answers and gradually work toward a shared solution. Each LLM acts as a node in a peer-to-peer network, sharing responses and thought processes to reach a collective decision. Our results show that this “gossip-based consensus” leads to robust, resilient, and accurate multi-agent AI reasoning. It helps overcome the weaknesses of individual models and brings out their collective strengths. This approach is similar to how humans build consensus, making AI seem more collaborative and trustworthy instead of just a black-box program.
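
The exchange-until-agreement loop described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the nodes hold fixed answer strings instead of calling real LLMs, each node polls a few random peers per round, and answers are merged by majority vote. The abstract does not specify the paper's actual protocol.

```python
import random
from collections import Counter


def gossip_consensus(initial_answers, fanout=2, max_rounds=50, seed=0):
    """Return (answer, rounds_used) after peers gossip toward agreement."""
    rng = random.Random(seed)
    answers = list(initial_answers)
    n = len(answers)
    for round_no in range(max_rounds):
        if len(set(answers)) == 1:          # every node agrees
            return answers[0], round_no
        updated = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            peers = rng.sample(others, k=min(fanout, len(others)))
            votes = Counter([answers[i]] + [answers[j] for j in peers])
            top = max(votes.values())
            if votes[answers[i]] == top:    # keep own answer on ties
                updated.append(answers[i])
            else:                           # deterministic tie-break among leaders
                updated.append(min(a for a, c in votes.items() if c == top))
        answers = updated
    # No full agreement within the budget: fall back to the most common answer.
    return Counter(answers).most_common(1)[0][0], max_rounds


answer, rounds = gossip_consensus(["42", "42", "41", "42", "40"])
print(answer, rounds)
```

A real system would replace the vote with each model re-answering after reading its peers' reasoning; the sketch only shows the communication pattern.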

[AI-95] Multi-Modal Drift Forecasting of Leeway Objects via Navier-Stokes-Guided CNN and Sequence-to-Sequence Attention-Based Models

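[Quick Read]: This paper addresses accurate prediction of the drift trajectories of leeway objects at sea, especially in time-sensitive settings such as search and rescue. The key is a multi-modal machine learning framework. Environmental and physical parameters (water current and wind velocities, object mass, surface area) are first collected experimentally. A convolutional neural network (CNN), trained on simulated data from a Navier-Stokes-based model, then estimates each object's drag and lift coefficients, from which the net forces driving its motion are derived. The resulting time series of physical forces, environmental velocities, and object features, together with textual descriptions encoded by a Sentence Transformer, feed attention-based sequence-to-sequence models (LSTM and Transformer) that predict future drift trajectories. The multi-modal models perform comparably to traditional models while enabling longer-term forecasting instead of single-step prediction, and they are assessed for generalization across objects.

Link: https://arxiv.org/abs/2508.18284
Authors: Rahmat K. Adesunkanmi,Alexander W. Brandt,Masoud Deylami,Gustavo A. Giraldo Echeverri,Hamidreza Karbasian,Adel Alaeddini
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Submitted to IEEE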

Abstract:Accurately predicting the drift (displacement) of leeway objects in maritime environments remains a critical challenge, particularly in time-sensitive scenarios such as search and rescue operations. In this study, we propose a multi-modal machine learning framework that integrates Sentence Transformer embeddings with attention-based sequence-to-sequence architectures to predict the drift of leeway objects in water. We begin by experimentally collecting environmental and physical data, including water current and wind velocities, object mass, and surface area, for five distinct leeway objects. Using simulated data from a Navier-Stokes-based model to train a convolutional neural network on geometrical image representations, we estimate drag and lift coefficients of the leeway objects. These coefficients are then used to derive the net forces responsible for driving the objects' motion. The resulting time series, comprising physical forces, environmental velocities, and object-specific features, combined with textual descriptions encoded via a language model, are inputs to attention-based sequence-to-sequence long-short-term memory and Transformer models, to predict future drift trajectories. We evaluate the framework across multiple time horizons (1, 3, 5, and 10 seconds) and assess its generalization across different objects. We compare our approach against a fitted physics-based model and traditional machine learning methods, including recurrent neural networks and temporal convolutional neural networks. Our results show that these multi-modal models perform comparably to traditional models while also enabling longer-term forecasting in place of single-step prediction. Overall, our findings demonstrate the ability of a multi-modal modeling strategy to provide accurate and adaptable predictions of leeway object drift in dynamic maritime conditions.
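
As a reminder of how drag and lift coefficients turn into forces, the standard quadratic drag and lift laws are shown below. The abstract does not state the exact force model the authors use, so this form and its symbols are assumptions: $\rho$ is the fluid density, $A$ the reference area, $\mathbf{v}_{\mathrm{rel}}$ the velocity of the fluid relative to the object, and $C_D$, $C_L$ the estimated coefficients.

```latex
F_D = \tfrac{1}{2}\, \rho\, C_D\, A\, \lVert \mathbf{v}_{\mathrm{rel}} \rVert^{2}, \qquad
F_L = \tfrac{1}{2}\, \rho\, C_L\, A\, \lVert \mathbf{v}_{\mathrm{rel}} \rVert^{2}
```

Under this model a partly submerged object has one such pair of terms for the water current and another for the wind, each with its own density, area, and relative velocity.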

[AI-96] Technology-assisted Personalized Yoga for Better Health - Challenges and Outlook

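Quick read: This paper addresses Yoga personalization: how to select, from a large and interdependent collection of Yoga practices, a combination suited to an individual's unique needs, changing abilities, and health conditions, and how to keep adjusting it over time. The key to its solution is a decision-support framework that draws on computing techniques from multiple disciplines and covers the full pipeline from pose sensing to recommending corrections. A case study of Surya Namaskar, a set of 12 choreographed poses, illustrates the approach.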

Link: https://arxiv.org/abs/2508.18283
Authors: Vivek Kumar, Himanshu Sahu, Hari Prabhat Gupta, Biplav Srivastava
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 11 figures, 2 tables

Abstract:Yoga is a discipline of physical postures, breathing techniques, and meditative practices rooted in ancient Indian traditions, now embraced worldwide for promoting overall well-being and inner balance. The practices are a large set of items, our term for executable actions like physical poses or breath exercises, to offer for a person’s well-being. However, to get benefits of Yoga tailored to a person’s unique needs, a person needs to (a) discover their subset from the large and seemingly complex set with inter-dependencies, (b) continue to follow them with interest adjusted to their changing abilities and near-term objectives, and (c) as appropriate, adapt to alternative items based on changing environment and the person’s health conditions. In this vision paper, we describe the challenges for the Yoga personalization problem. Next, we sketch a preliminary approach and use the experience to provide an outlook on solving the challenging problem using existing and novel techniques from a multidisciplinary computing perspective. To the best of our knowledge, this is the first paper that comprehensively examines decision support issues around Yoga personalization, from pose sensing to recommendation of corrections for a complete regimen, and illustrates with a case study of Surya Namaskar – a set of 12 choreographed poses.

[AI-97] From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence ECAI2025

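Quick read: This paper addresses the difficulty that large language models (LLMs) have, in enterprise decision support and strategic planning, in reconciling intricate operational analyses with overarching strategic goals, which leads to fragmented workflows and inefficient collaboration across organizational levels. The key to its solution is BusiAgent, a multi-agent framework with three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. It also uses contextual Thompson sampling for prompt optimization and a comprehensive quality assurance system to mitigate errors. According to the authors, it integrates granular insights with high-level strategy and significantly outperforms established approaches in solution quality and user satisfaction.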

Link: https://arxiv.org/abs/2508.15447
Authors: Zihao Wang, Junming Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ECAI 2025

Abstract:Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi-agent framework leveraging LLMs for advanced decision-making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent’s efficacy, demonstrating its capacity to generate coherent, client-focused solutions that smoothly integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting-edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI-driven enterprise decision-making, empowering organizations to navigate complex business landscapes more effectively.

[AI-98] Interpolating Speaker Identities in Embedding Space for Data Expansion

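Quick read: This paper addresses the heavy dependence of deep learning-based speaker verification systems on large-scale, diverse speaker identity data, which is expensive and difficult to collect and often limited by privacy concerns. The key to its solution is INSIDE (Interpolating Speaker Identities in Embedding Space), a data expansion method that applies spherical linear interpolation to pairs of nearby speaker embeddings in a pretrained speaker embedding space, feeds the resulting intermediate embeddings to a text-to-speech (TTS) system to synthesize speech waveforms, and adds the synthesized data to the training set. Models trained on the expanded data achieve 3.06% to 5.24% relative improvements in speaker verification and a 13.44% relative improvement in gender classification, and the method is compatible with other augmentation techniques.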

Link: https://arxiv.org/abs/2508.19210
Authors: Tianchi Liu, Ruijie Tao, Qiongqiong Wang, Yidi Jiang, Hardik B. Sailor, Ke Zhang, Jingru Lin, Haizhou Li
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: accepted by APSIPA ASC 2025

Abstract:The success of deep learning-based speaker verification systems is largely attributed to access to large-scale and diverse speaker identity data. However, collecting data from more identities is expensive, challenging, and often limited by privacy concerns. To address this limitation, we propose INSIDE (Interpolating Speaker Identities in Embedding Space), a novel data expansion method that synthesizes new speaker identities by interpolating between existing speaker embeddings. Specifically, we select pairs of nearby speaker embeddings from a pretrained speaker embedding space and compute intermediate embeddings using spherical linear interpolation. These interpolated embeddings are then fed to a text-to-speech system to generate corresponding speech waveforms. The resulting data is combined with the original dataset to train downstream models. Experiments show that models trained with INSIDE-expanded data outperform those trained only on real data, achieving 3.06% to 5.24% relative improvements. While INSIDE is primarily designed for speaker verification, we also validate its effectiveness on gender classification, where it yields a 13.44% relative improvement. Moreover, INSIDE is compatible with other augmentation techniques and can serve as a flexible, scalable addition to existing training pipelines.
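
The interpolation step named in the abstract can be illustrated with a minimal sketch of spherical linear interpolation (slerp) between two embeddings. This is the textbook slerp formula, not the authors' code; the function name `slerp`, the list-based vectors, and the choice to normalize inputs to unit length are assumptions made for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (assumption: embeddings are compared on the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(a, b, t):
    """Spherical linear interpolation between embeddings a and b at fraction t in [0, 1]."""
    a, b = _normalize(a), _normalize(b)
    # Angle between the two unit vectors; clamp the dot product against rounding error.
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # The vectors (almost) coincide, so there is nothing to interpolate.
        return a
    if math.pi - omega < 1e-8:
        # Antipodal vectors have no unique great-circle path between them.
        raise ValueError("slerp is undefined for antipodal vectors")
    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Hypothetical 2-D "speaker embeddings"; real embeddings have hundreds of dimensions.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike plain linear interpolation, the result stays on the unit sphere, which matters when the downstream system expects length-normalized speaker embeddings.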

[AI-99] HOTSPOT-YOLO: A Lightweight Deep Learning Attention-Driven Model for Detecting Thermal Anomalies in Drone-Based Solar Photovoltaic Inspections

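Quick read: This paper addresses thermal anomaly detection in solar photovoltaic (PV) systems, in particular the difficulty of accurately identifying small, subtle hotspots and defective modules in drone-based inspections. The key to its solution is HOTSPOT-YOLO, a lightweight AI model that combines an efficient convolutional neural network backbone with attention mechanisms to improve detection of small thermal anomalies. While maintaining real-time performance, it reduces computational load and is reported to be robust under diverse environmental conditions, making it suitable for automated fault detection in large-scale PV inspections.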

Link: https://arxiv.org/abs/2508.18912
Authors: Mahmoud Dhimish
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Thermal anomaly detection in solar photovoltaic (PV) systems is essential for ensuring operational efficiency and reducing maintenance costs. In this study, we developed and named HOTSPOT-YOLO, a lightweight artificial intelligence (AI) model that integrates an efficient convolutional neural network backbone and attention mechanisms to improve object detection. This model is specifically designed for drone-based thermal inspections of PV systems, addressing the unique challenges of detecting small and subtle thermal anomalies, such as hotspots and defective modules, while maintaining real-time performance. Experimental results demonstrate a mean average precision of 90.8%, reflecting a significant improvement over baseline object detection models. With a reduced computational load and robustness under diverse environmental conditions, HOTSPOT-YOLO offers a scalable and reliable solution for large-scale PV inspections. This work highlights the integration of advanced AI techniques with practical engineering applications, revolutionizing automated fault detection in renewable energy systems.

[AI-100] SkyTrust: Blockchain-Enhanced UAV Security for NTNs with Dynamic Trust and Energy-Aware Consensus

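Quick read: This paper addresses the security of UAV-based non-terrestrial networks (NTNs), whose distributed and dynamic nature makes them vulnerable to rogue nodes. The key to its solution is a Dynamic Trust Score Adjustment Mechanism with Energy-Aware Consensus (DTSAM-EAC), which combines a permissioned Hyperledger Fabric blockchain with Federated Learning (FL) for privacy-preserving trust evaluation. Trust scores are updated dynamically through weighted aggregation of past trust, present behavior, and energy contribution; an energy-aware consensus mechanism prioritizes UAVs with more available energy for block validation, which improves energy efficiency and reliability in resource-constrained environments; and trust-weighted FL aggregation makes the global trust model more resilient. In simulation, the framework reaches 94% trust score prediction accuracy and a 96% rogue UAV detection rate, in line with 6G requirements for distributed intelligence and sustainability.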

Link: https://arxiv.org/abs/2508.18735
Authors: Afan Ali, Irfanullah Khan
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures

Abstract:Non-Terrestrial Networks (NTNs) based on Unmanned Aerial Vehicles (UAVs) as base stations are extremely susceptible to security attacks due to their distributed and dynamic nature, which makes them vulnerable to rogue nodes. In this paper, a new Dynamic Trust Score Adjustment Mechanism with Energy-Aware Consensus (DTSAM-EAC) is proposed to enhance security in UAV-based NTNs. The proposed framework integrates a permissioned Hyperledger Fabric blockchain with Federated Learning (FL) to support privacy-preserving trust evaluation. Trust ratings are updated continuously through weighted aggregation of past trust, present behavior, and energy contribution, thus making the system adaptive to changing network conditions. An energy-aware consensus mechanism prioritizes UAVs with greater available energy for block validation, ensuring efficient use of resources under resource-constrained environments. FL aggregation with trust-weighting further increases the resilience of the global trust model. Simulation results verify the designed framework achieves 94% trust score prediction accuracy and 96% rogue UAV detection rate while outperforming centralized and static baselines of trust-based solutions on privacy, energy efficiency, and reliability. It complies with 6G requirements in terms of distributed intelligence and sustainability and is an energy-efficient and scalable solution to secure NTNs.
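
The trust update and validator choice described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the weights `(0.5, 0.3, 0.2)`, the `[0, 1]` score range, the `min_trust` eligibility threshold, and the function names `update_trust` and `pick_validators` are hypothetical and do not come from the paper.

```python
def update_trust(past_trust, behavior, energy, weights=(0.5, 0.3, 0.2)):
    """Weighted aggregation of past trust, present behavior, and energy contribution.

    All inputs are assumed to be scores in [0, 1]; the weights are illustrative.
    """
    w_past, w_behavior, w_energy = weights
    if abs(w_past + w_behavior + w_energy - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for value in (past_trust, behavior, energy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return w_past * past_trust + w_behavior * behavior + w_energy * energy


def pick_validators(uavs, k, min_trust=0.5):
    """Prefer UAVs with more available energy for block validation.

    The trust threshold is an added assumption so that low-trust nodes are never chosen.
    """
    eligible = [u for u in uavs if u["trust"] >= min_trust]
    ranked = sorted(eligible, key=lambda u: u["energy"], reverse=True)
    return [u["id"] for u in ranked[:k]]


# Hypothetical fleet state.
fleet = [
    {"id": "uav-a", "trust": 0.9, "energy": 0.4},
    {"id": "uav-b", "trust": 0.3, "energy": 0.9},
    {"id": "uav-c", "trust": 0.7, "energy": 0.8},
]
validators = pick_validators(fleet, k=2)
```

Here `uav-b` has the most energy but is excluded by the assumed trust threshold, so the energy ranking applies only among sufficiently trusted nodes.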

[AI-101] Vectorized Attention with Learnable Encoding for Quantum Transformer

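Quick read: This paper addresses the limited practical performance of current Quantum Transformer (QT) models, which rely on deep parameterized quantum circuits (PQCs) and are therefore vulnerable to quantum processing unit (QPU) noise. The key to its solution is the Vectorized Quantum Transformer (VQT), an architecture that computes the ideal masked attention matrix through quantum approximation simulation and trains efficiently via a vectorized nonlinear quantum encoder. This yields shot-efficient, gradient-free quantum circuit simulation (QCS) with reduced classical sampling overhead, is friendly to noisy intermediate-scale quantum (NISQ) devices, and gives competitive results on end-to-end machine learning tasks.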

Link: https://arxiv.org/abs/2508.18464
Authors: Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vectorized quantum block encoding provides a way to embed classical data into Hilbert space, offering a pathway for quantum models, such as Quantum Transformers (QT), that replace classical self-attention with quantum circuit simulations to operate more efficiently. Current QTs rely on deep parameterized quantum circuits (PQCs), rendering them vulnerable to QPU noise, and thus hindering their practical performance. In this paper, we propose the Vectorized Quantum Transformer (VQT), a model that supports ideal masked attention matrix computation through quantum approximation simulation and efficient training via vectorized nonlinear quantum encoder, yielding shot-efficient and gradient-free quantum circuit simulation (QCS) and reduced classical sampling overhead. In addition, we demonstrate an accuracy comparison for IBM and IonQ in quantum circuit simulation and competitive results in benchmarking natural language processing tasks on IBM state-of-the-art and high-fidelity Kingston QPU. Our noise intermediate-scale quantum friendly VQT approach unlocks a novel architecture for end-to-end machine learning in quantum computing.

[AI-102] EAI-Avatar: Emotion-Aware Interactive Talking Head Generation

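Quick read: This paper addresses the lack of emotion-adaptive capability in existing talking head generation methods for dyadic conversation: most are limited to one-way portrait animation, and even those that support bidirectional interaction cannot produce precise emotion-driven expressions. The key to its solution is the EAI-Avatar framework, with two core innovations: (1) a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space and generates arbitrary-length, temporally consistent mask sequences to constrain head motion; and (2) an interactive talking tree that models dialogue state transitions through parent/child/sibling relations between nodes and each node's emotional state, and extracts historical emotional cues by reverse-level traversal to guide expression synthesis. Together these yield natural, emotion-adaptive avatars for two-person interaction.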

Link: https://arxiv.org/abs/2508.18337
Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character’s emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.

[AI-103] scI2CL: Effectively Integrating Single-cell Multi-omics by Intra- and Inter-omics Contrastive Learning

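Quick read: This paper addresses the difficulty of computationally modeling single-cell multi-omics data, given the complexity of cellular states and the continuous, dynamic nature of cell differentiation, and in particular how to learn comprehensive and discriminative cellular representations from complementary multi-omics data for downstream tasks such as cell clustering, cell subtyping, and developmental trajectory reconstruction. The key to its solution is scI2CL, a framework based on intra-omics and inter-omics contrastive learning that captures cross-omics relationships and fuses multi-omics information, improving the discriminative power and biological interpretability of the learned representations.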

Link: https://arxiv.org/abs/2508.18304
Authors: Wuchao Liu, Han Peng, Wengen Li, Yichao Zhang, Jihong Guan, Shuigeng Zhou
Affiliation: unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: 22 pages, 6 figures

Abstract:Single-cell multi-omics data contain huge information of cellular states, and analyzing these data can reveal valuable insights into cellular heterogeneity, diseases, and biological processes. However, as cell differentiation & development is a continuous and dynamic process, it remains challenging to computationally model and infer cell interaction patterns based on single-cell multi-omics data. This paper presents scI2CL, a new single-cell multi-omics fusion framework based on intra- and inter-omics contrastive learning, to learn comprehensive and discriminative cellular representations from complementary multi-omics data for various downstream tasks. Extensive experiments of four downstream tasks validate the effectiveness of scI2CL and its superiority over existing peers. Concretely, in cell clustering, scI2CL surpasses eight state-of-the-art methods on four widely-used real-world datasets. In cell subtyping, scI2CL effectively distinguishes three latent monocyte cell subpopulations, which are not discovered by existing methods. Simultaneously, scI2CL is the only method that correctly constructs the cell developmental trajectory from hematopoietic stem and progenitor cells to Memory B cells. In addition, scI2CL resolves the misclassification of cell types between two subpopulations of CD4+ T cells, while existing methods fail to precisely distinguish the mixed cells. In summary, scI2CL can accurately characterize cross-omics relationships among cells, thus effectively fuses multi-omics data and learns discriminative cellular representations to support various downstream analysis tasks.

Machine Learning

[LG-0] Predicting the Order of Upcoming Tokens Improves Language Modeling

Link: https://arxiv.org/abs/2508.19228
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP’s exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP’s multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at this https URL

[LG-1] Planning-Query-Guided Model Generation for Model-Based Deformable Object Manipulation

Link: https://arxiv.org/abs/2508.19199
Authors: Alex LaGrassa, Zixuan Huang, Dmitry Berenson, Oliver Kroemer
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures

Abstract:Efficient planning in high-dimensional spaces, such as those involving deformable objects, requires computationally tractable yet sufficiently expressive dynamics models. This paper introduces a method that automatically generates task-specific, spatially adaptive dynamics models by learning which regions of the object require high-resolution modeling to achieve good task performance for a given planning query. Task performance depends on the complex interplay between the dynamics model, world dynamics, control, and task requirements. Our proposed diffusion-based model generator predicts per-region model resolutions based on start and goal pointclouds that define the planning query. To efficiently collect the data for learning this mapping, a two-stage process optimizes resolution using predictive dynamics as a prior before directly optimizing using closed-loop performance. On a tree-manipulation task, our method doubles planning speed with only a small decrease in task performance over using a full-resolution model. This approach informs a path towards using previous planning and control data to generate computationally efficient yet sufficiently expressive dynamics models for new tasks.

[LG-2] Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness

Link: https://arxiv.org/abs/2508.19183
Authors: Wenchuan Mu, Kwan Hui Lim
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In safety-critical deep learning applications, robustness measures the ability of neural models that handle imperceptible perturbations in input data, which may lead to potential safety hazards. Existing pre-deployment robustness assessment methods typically suffer from significant trade-offs between computational cost and measurement precision, limiting their practical utility. To address these limitations, this paper conducts a comprehensive comparative analysis of existing robustness definitions and associated assessment methodologies. We propose tower robustness to evaluate robustness, which is a novel, practical metric based on hypothesis testing to quantitatively evaluate probabilistic robustness, enabling more rigorous and efficient pre-deployment assessments. Our extensive comparative evaluation illustrates the advantages and applicability of our proposed approach, thereby advancing the systematic understanding and enhancement of model robustness in safety-critical deep learning applications.

[LG-3] Leveraging Evolutionary Surrogate-Assisted Prescription in Multi-Objective Chlorination Control Systems

Link: https://arxiv.org/abs/2508.19173
Authors: Rivaaj Monsia, Olivier Francon, Daniel Young, Risto Miikkulainen
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:

Abstract:This short, written report introduces the idea of Evolutionary Surrogate-Assisted Prescription (ESP) and presents preliminary results on its potential use in training real-world agents as a part of the 1st AI for Drinking Water Chlorination Challenge at IJCAI-2025. This work was done by a team from Project Resilience, an organization interested in bridging AI to real-world problems.

[LG-4] Saddle Hierarchy in Dense Associative Memory

Link: https://arxiv.org/abs/2508.19151
Authors: Robin Thériault, Daniele Tantari
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 55 pages, 10 figures

Abstract:Dense associative memory (DAM) models have been attracting renewed attention since they were shown to be robust to adversarial examples and closely related to state-of-the-art machine learning paradigms, such as the attention mechanisms in transformers and generative diffusion models. We study a DAM built upon a three-layer Boltzmann machine with Potts hidden units, which represent data clusters and classes. Through a statistical mechanics analysis, we derive saddle-point equations that characterize both the stationary points of DAMs trained on real data and the fixed points of DAMs trained on synthetic data within a teacher-student framework. Based on these results, we propose a novel regularization scheme that makes training significantly more stable. Moreover, we show empirically that our DAM learns interpretable solutions to both supervised and unsupervised classification problems. Pushing our theoretical analysis further, we find that the weights learned by relatively small DAMs correspond to unstable saddle points in larger DAMs. We implement a network-growing algorithm that leverages this saddle-point hierarchy to drastically reduce the computational cost of training dense associative memory.

[LG-5] Active Query Selection for Crowd-Based Reinforcement Learning

Link: https://arxiv.org/abs/2508.19132
Authors: Jonathan Erskine, Taku Yamagata, Raúl Santos-Rodríguez
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 2 tables plus appendices

Abstract:Preference-based reinforcement learning has gained prominence as a strategy for training agents in environments where the reward signal is difficult to specify or misaligned with human intent. However, its effectiveness is often limited by the high cost and low availability of reliable human input, especially in domains where expert feedback is scarce or errors are costly. To address this, we propose a novel framework that combines two complementary strategies: probabilistic crowd modelling to handle noisy, multi-annotator feedback, and active learning to prioritize feedback on the most informative agent actions. We extend the Advise algorithm to support multiple trainers, estimate their reliability online, and incorporate entropy-based query selection to guide feedback requests. We evaluate our approach in a set of environments that span both synthetic and real-world-inspired settings, including 2D games (Taxi, Pacman, Frozen Lake) and a blood glucose control task for Type 1 Diabetes using the clinically approved UVA/Padova simulator. Our preliminary results demonstrate that agents trained with feedback on uncertain trajectories exhibit faster learning in most tasks, and we outperform the baselines for the blood glucose control task.
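
Entropy-based query selection, mentioned in the abstract, can be illustrated with a small sketch: rank states by the Shannon entropy of the agent's action distribution and request feedback on the most uncertain ones first. The abstract does not specify the exact criterion, so the state-level ranking, the `policy` dictionary layout, and the names `entropy` and `select_queries` are assumptions for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(policy, budget):
    """Return up to `budget` states, most uncertain (highest action entropy) first.

    `policy` maps a state label to the agent's action probabilities in that state.
    """
    ranked = sorted(policy, key=lambda state: entropy(policy[state]), reverse=True)
    return ranked[:budget]


# Hypothetical policy over three states with two actions each.
policy = {
    "s0": [0.5, 0.5],   # maximally uncertain
    "s1": [0.9, 0.1],   # fairly confident
    "s2": [1.0, 0.0],   # certain
}
to_query = select_queries(policy, budget=2)
```

Spending a limited feedback budget on high-entropy states reflects the idea that trainer input is most informative where the agent is least sure of what to do.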

[LG-6] Composition and Alignment of Diffusion Models using Constrained Learning

Link: https://arxiv.org/abs/2508.19104
Authors: Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models have become prevalent in generative modeling due to their ability to sample from complex distributions. To improve the quality of generated samples and their compliance with user requirements, two commonly used methods are: (i) Alignment, which involves fine-tuning a diffusion model to align it with a reward; and (ii) Composition, which combines several pre-trained diffusion models, each emphasizing a desirable attribute in the generated outputs. However, trade-offs often arise when optimizing for multiple rewards or combining multiple models, as they can often represent competing properties. Existing methods cannot guarantee that the resulting model faithfully generates samples with all the desired properties. To address this gap, we propose a constrained optimization framework that unifies alignment and composition of diffusion models by enforcing that the aligned model satisfies reward constraints and/or remains close to (potentially multiple) pre-trained models. We provide a theoretical characterization of the solutions to the constrained alignment and composition problems and develop a Lagrangian-based primal-dual training algorithm to approximate these solutions. Empirically, we demonstrate the effectiveness and merits of our proposed approach in image generation, applying it to alignment and composition, and show that our aligned or composed model satisfies constraints effectively, and improves on the equally-weighted approach. Our implementation can be found at this https URL.

[LG-7] CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

Link: https://arxiv.org/abs/2508.19073
Authors: Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Abstract:Studies conducted on enterprise-scale infrastructure have shown that GPUs – the core computational resource for deep learning (DL) training – are often significantly underutilized. DL task collocation on GPUs is an opportunity to address this challenge. However, it may result in (1) out-of-memory crashes for the subsequently arriving task and (2) slowdowns for all tasks sharing the GPU due to resource interference. The former challenge poses a threat to robustness, while the latter affects the quality of service and energy efficiency. We propose CARMA, a server-scale task-level collocation-aware resource management system that handles both collocation challenges. CARMA encompasses GPUMemNet, a novel ML-based GPU memory estimator framework for DL training tasks, to minimize out-of-memory errors and introduces collocation policies that cap GPU utilization to minimize interference. Furthermore, CARMA introduces a recovery method to ensure robust restart of tasks that crash. Our evaluation on traces modeled after real-world DL training task traces shows that CARMA increases the GPU utilization over time by 39.3%, decreases the end-to-end execution time by ~26.7%, and reduces the GPU energy use by ~14.2%.

[LG-8] Automated discovery of finite volume schemes using Graph Neural Networks

Link: https://arxiv.org/abs/2508.19052
Authors: Paul Garnier, Jonathan Viquerat, Elie Hachem
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:Graph Neural Networks (GNNs) have deeply modified the landscape of numerical simulations by demonstrating strong capabilities in approximating solutions of physical systems. However, their ability to extrapolate beyond their training domain (e.g., larger or structurally different graphs) remains uncertain. In this work, we establish that GNNs can serve purposes beyond their traditional role, and be exploited to generate numerical schemes, in conjunction with symbolic regression. First, we show numerically and theoretically that a GNN trained on a dataset consisting solely of two-node graphs can extrapolate a first-order Finite Volume (FV) scheme for the heat equation on out-of-distribution, unstructured meshes. Specifically, if a GNN achieves a loss $\varepsilon$ on such a dataset, it implements the FV scheme with an error of $\mathcal{O}(\varepsilon)$. Using symbolic regression, we show that the network effectively rediscovers the exact analytical formulation of the standard first-order FV scheme. We then extend this approach to an unsupervised context: the GNN recovers the first-order FV scheme using only a residual loss similar to Physics-Informed Neural Networks (PINNs) with no access to ground-truth data. Finally, we push the methodology further by considering higher-order schemes: we train (i) a 2-hop and (ii) a 2-layer GNN using the same PINN loss, that autonomously discover (i) a second-order correction term to the initial scheme using a 2-hop stencil, and (ii) the classic second-order midpoint scheme. These findings follow a recent paradigm in scientific computing: GNNs are not only strong approximators, but can be active contributors to the development of novel numerical methods.
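
For orientation, a first-order finite volume update for the heat equation can be written as a message-passing step over a mesh graph. The sketch below uses the standard two-point flux approximation with explicit Euler time stepping; it is a hand-written reference scheme, not the paper's learned GNN, and the edge tuple layout `(i, j, face_area, distance)` and the name `fv_heat_step` are assumptions for illustration.

```python
def fv_heat_step(u, volumes, edges, alpha, dt):
    """One explicit Euler step of a first-order finite volume scheme for u_t = alpha * laplacian(u).

    u        -- cell-averaged values, one per cell (graph node)
    volumes  -- cell volumes, one per cell
    edges    -- (i, j, face_area, distance) for each pair of neighbouring cells
    alpha    -- diffusivity
    dt       -- time step (must be small enough for explicit stability)
    """
    net_flux = [0.0] * len(u)
    for i, j, face_area, distance in edges:
        # Two-point flux approximation: heat flowing into cell i across the shared face.
        flow_into_i = alpha * face_area * (u[j] - u[i]) / distance
        net_flux[i] += flow_into_i
        net_flux[j] -= flow_into_i  # what enters i leaves j, so the scheme is conservative
    return [ui + dt * fi / vi for ui, fi, vi in zip(u, net_flux, volumes)]


# Hypothetical 1-D chain of three unit cells with a hot cell on the left.
u_next = fv_heat_step(
    u=[1.0, 0.0, 0.0],
    volumes=[1.0, 1.0, 1.0],
    edges=[(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0)],
    alpha=1.0,
    dt=0.1,
)
```

Each edge contributes a flux that depends only on its two endpoint cells, which is why training on two-node graphs can in principle be enough to recover this update rule.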

[LG-9] Breaking the Black Box: Inherently Interpretable Physics-Informed Machine Learning for Imbalanced Seismic Data

Link: https://arxiv.org/abs/2508.19031
Authors: Vemula Sreenath, Filippo Gatti, Pierre Jehel
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 9 figures and 2 tables

Abstract:Ground motion models (GMMs) predict how strongly the ground will shake during an earthquake. They are essential for structural analysis, seismic design, and seismic risk assessment studies. Traditional machine learning (ML) approaches are popular to develop GMMs, due to large earthquake databases worldwide. However, they operate as “black boxes,” which are hard to interpret and trust, limiting their use in high-stake decisions. Additionally, these databases suffer from significant data imbalances: fewer large, critically damaging records near the fault compared to abundant, less severely damaging distant records. These two limitations are addressed in this work by developing a transparent ML architecture using the HazBinLoss function. Each input (e.g., magnitude, distance, their interaction term, etc.) is processed separately and added linearly to obtain the output, resulting in exact contribution of each term. The HazBinLoss function assigns higher weights to critical near-field large magnitude records and lower weights to less-critical far-field smaller magnitude records, during training to prevent underprediction of the most damaging scenarios. Our model captures known seismological principles and achieves comparable performance with established GMMs while maintaining transparency. This framework enables broader adoption of ML-based approaches for risk assessment studies and disaster planning.

[LG-10] When recalling in-context Transformers are not SSMs

Link: https://arxiv.org/abs/2508.19029
Authors: Destiny Okpekpe, Antonio Orvieto
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Despite the advantageous subquadratic complexity of modern recurrent deep learning models – such as state-space models (SSMs) – recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how components affect Transformer and Mamba’s performance and optimization stability.

[LG-11] GRADSTOP: Early Stopping of Gradient Descent via Posterior Sampling

Link: https://arxiv.org/abs/2508.19028
Authors: Arash Jamshidi,Lauri Seppäläinen,Katsiaryna Haitsiukevich,Hoang Phuc Hau Luu,Anton Björklund,Kai Puolamäki
Categories: Machine Learning (cs.LG)

Abstract:Machine learning models are often learned by minimising a loss function on the training data using a gradient descent algorithm. These models often suffer from overfitting, leading to a decline in predictive performance on unseen data. A standard solution is early stopping using a hold-out validation set, which halts the minimisation when the validation loss stops decreasing. However, this hold-out set reduces the data available for training. This paper presents GradStop, a novel stochastic early stopping method that only uses information in the gradients, which are produced by the gradient descent algorithm “for free.” Our main contributions are that we estimate the Bayesian posterior from the gradient information, define the early stopping problem as drawing a sample from this posterior, and use the approximated posterior to obtain a stopping criterion. Our empirical evaluation shows that GradStop achieves a small loss on test data and compares favourably to a validation-set-based stopping criterion. By leveraging the entire dataset for training, our method is particularly advantageous in data-limited settings, such as transfer learning. It can be incorporated as an optional feature in gradient descent libraries with only a small computational overhead. The source code is available at this https URL.

[LG-12] Working My Way Back to You: Resource-Centric Next-Activity Prediction

Link: https://arxiv.org/abs/2508.19016
Authors: Kelly Kurowski,Xixi Lu,Hajo A Reijers
Categories: Machine Learning (cs.LG)

Abstract:Predictive Process Monitoring (PPM) aims to train models that forecast upcoming events in process executions. These predictions support early bottleneck detection, improved scheduling, proactive interventions, and timely communication with stakeholders. While existing research adopts a control-flow perspective, we investigate next-activity prediction from a resource-centric viewpoint, which offers additional benefits such as improved work organization, workload balancing, and capacity forecasting. Although resource information has been shown to enhance tasks such as process performance analysis, its role in next-activity prediction remains unexplored. In this study, we evaluate four prediction models and three encoding strategies across four real-life datasets. Compared to the baseline, our results show that LightGBM and Transformer models perform best with an encoding based on 2-gram activity transitions, while Random Forest benefits most from an encoding that combines 2-gram transitions and activity repetition features. This combined encoding also achieves the highest average accuracy. This resource-centric approach could enable smarter resource allocation, strategic workforce planning, and personalized employee support by analyzing individual behavior rather than case-level progression. The findings underscore the potential of resource-centric next-activity prediction, opening up new avenues for research on PPM.
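
A rough sketch of what a 2-gram transition encoding combined with repetition features could look like for a single resource's activity history. The exact encodings in the paper are not given in the abstract, so the function names, the vector layout, and the definition of "repetition" below are assumptions.

```python
from collections import Counter


def bigram_encoding(activities, vocabulary):
    """Count consecutive activity pairs (2-grams) over a fixed vocabulary."""
    counts = Counter(zip(activities, activities[1:]))
    return [counts[(a, b)] for a in vocabulary for b in vocabulary]


def repetition_features(activities, vocabulary):
    """Assumed definition: how many times each activity was repeated."""
    counts = Counter(activities)
    return [max(counts[a] - 1, 0) for a in vocabulary]


vocabulary = ["approve", "check", "register"]
history = ["register", "check", "check", "approve"]  # one resource's recent work
features = bigram_encoding(history, vocabulary) + repetition_features(history, vocabulary)
```

The resulting fixed-length vector can be fed to any tabular model, such as a gradient-boosted tree or a random forest.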

[LG-13] Learning with springs and sticks

Link: https://arxiv.org/abs/2508.19015
Authors: Luis Mantilla Calderón,Alán Aspuru-Guzik
Categories: Machine Learning (cs.LG)
Notes: 13 pages, 6 figures

Abstract:Learning is a physical process. Here, we aim to study a simple dynamical system composed of springs and sticks capable of arbitrarily approximating any continuous function. The main idea of our work is to use the sticks to mimic a piecewise-linear approximation of the given function, use the potential energy of springs to encode a desired mean squared error loss function, and converge to a minimum-energy configuration via dissipation. We apply the proposed simulation system to regression tasks and show that its performance is comparable to that of multi-layer perceptrons. In addition, we study the thermodynamic properties of the system and find a relation between the free energy change of the system and its ability to learn an underlying data distribution. We empirically find a thermodynamic learning barrier for the system caused by the fluctuations of the environment, whereby the system cannot learn if its change in free energy hits such a barrier. We believe this simple model can help us better understand learning systems from a physical point of view.

[LG-14] FedProtoKD: Dual Knowledge Distillation with Adaptive Class-wise Prototype Margin for Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2508.19009
Authors: Md Anwar Hossen,Fatema Siddika,Wensheng Zhang,Anuj Sharma,Ali Jannesari
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: 12 pages, 6 figures

Abstract:Heterogeneous Federated Learning (HFL) has gained attention for its ability to accommodate diverse models and heterogeneous data across clients. Prototype-based HFL methods emerge as a promising solution to address statistical heterogeneity and privacy challenges, paving the way for new advancements in HFL research. These methods share only class-representative prototypes among heterogeneous clients. However, the prototypes are often aggregated on the server using weighted averaging, leading to sub-optimal global knowledge: the aggregated prototypes shrink, which degrades model performance when models are heterogeneous and data distributions are extremely non-IID. We propose FedProtoKD in a Heterogeneous Federated Learning setting, using an enhanced dual-knowledge distillation mechanism to improve the system performance with clients’ logits and prototype feature representation. We aim to resolve the prototype margin-shrinking problem using a contrastive learning-based trainable server prototype by leveraging a class-wise adaptive prototype margin. Furthermore, we assess the importance of public samples using the closeness of the sample’s prototype to its class representative prototypes, which enhances learning performance. FedProtoKD achieved average accuracy improvements ranging from 1.13% to 34.13% across various settings and significantly outperforms existing state-of-the-art HFL methods.
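
The margin-shrinking effect of weighted averaging can be seen in a toy example. The two-dimensional prototypes, equal client weights, and helper names below are made up for illustration and are not taken from the paper.

```python
import math


def aggregate(prototypes, weights):
    """Weighted average of per-client prototype vectors for one class."""
    total = sum(weights)
    dim = len(prototypes[0])
    return [sum(w * p[d] for p, w in zip(prototypes, weights)) / total
            for d in range(dim)]


def margin(u, v):
    """Euclidean distance between two class prototypes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two clients whose feature spaces are not aligned with each other.
class_a = [[1.0, 0.0], [0.0, 1.0]]     # class A prototype on client 1, client 2
class_b = [[-1.0, 0.0], [0.0, -1.0]]   # class B prototype on client 1, client 2
weights = [1, 1]

client_margins = [margin(a, b) for a, b in zip(class_a, class_b)]
global_margin = margin(aggregate(class_a, weights), aggregate(class_b, weights))
```

Each client separates the two classes by a distance of 2, but the averaged prototypes are only about 1.41 apart, which is the kind of shrinkage a trainable server prototype with an adaptive margin is meant to counter.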

[LG-15] On the Generalisation of Koopman Representations for Chaotic System Control

Link: https://arxiv.org/abs/2508.18954
Authors: Kyriakos Hjikakou(1),Juan Diego Cardenas Cartagena(1),Matthia Sabatelli(1) ((1) University of Groningen, Department of Artificial Intelligence, Groningen, Netherlands)
Categories: Machine Learning (cs.LG)
Notes: 18 pages, 4 figures

Abstract:This paper investigates the generalisability of Koopman-based representations for chaotic dynamical systems, focusing on their transferability across prediction and control tasks. Using the Lorenz system as a testbed, we propose a three-stage methodology: learning Koopman embeddings through autoencoding, pre-training a transformer on next-state prediction, and fine-tuning for safety-critical control. Our results show that Koopman embeddings outperform both standard and physics-informed PCA baselines, achieving accurate and data-efficient performance. Notably, fixing the pre-trained transformer weights during fine-tuning leads to no performance degradation, indicating that the learned representations capture reusable dynamical structure rather than task-specific patterns. These findings support the use of Koopman embeddings as a foundation for multi-task learning in physics-informed machine learning. A project page is available at this https URL.

[LG-16] Estimating Conditional Covariance between labels for Multilabel Data

Link: https://arxiv.org/abs/2508.18951
Authors: Laurence A. F. Park,Jesse Read
Categories: Machine Learning (cs.LG)

Abstract:Multilabel data should be analysed for label dependence before applying multilabel models. Independence between multilabel data labels cannot be measured directly from the label values due to their dependence on the set of covariates $\vec{x}$, but can be measured by examining the conditional label covariance using a multivariate Probit model. Unfortunately, the multivariate Probit model provides an estimate of its copula covariance, and so might not be reliable in estimating constant covariance and dependent covariance. In this article, we compare three models (Multivariate Probit, Multivariate Bernoulli and Staged Logit) for estimating the constant and dependent multilabel conditional label covariance. We provide an experiment that allows us to observe each model’s measurement of conditional covariance. We found that all models measure constant and dependent covariance equally well, depending on the strength of the covariance, but the models all falsely detect that dependent covariance is present for data where constant covariance is present. Of the three models, the Multivariate Probit model had the lowest error rate.

[LG-17] Energy-Based Flow Matching for Generating 3D Molecular Structure

Link: https://arxiv.org/abs/2508.18949
Authors: Wenyin Zhou,Christopher Iliffe Sprague,Vsevolod Viliuga,Matteo Tadiello,Arne Elofsson,Hossein Azizpour
Categories: Machine Learning (cs.LG)
Notes: Accepted to the International Conference on Machine Learning (2025)

Abstract:Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules’ constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to iteratively map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method’s effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.

[LG-18] Generalization Bound for a General Class of Neural Ordinary Differential Equations

Link: https://arxiv.org/abs/2508.18920
Authors: Madhusudan Verma,Manoj Kumar
Categories: Machine Learning (cs.LG)
Notes: 23 pages, 4 figures

Abstract:Neural ordinary differential equations (neural ODEs) are a popular type of deep learning model that operate with continuous-depth architectures. To assess how well such models perform on unseen data, it is crucial to understand their generalization error bounds. Previous research primarily focused on the linear case for the dynamics function in neural ODEs (Marion, 2023), or provided bounds for Neural Controlled ODEs that depend on the sampling interval (Bleistein et al., 2023). In this work, we analyze a broader class of neural ODEs where the dynamics function is a general nonlinear function, either time dependent or time independent, and is Lipschitz continuous with respect to the state variables. We show that under this Lipschitz condition, the solutions to neural ODEs have bounded variation. Based on this observation, we establish generalization bounds for both time-dependent and time-independent cases and investigate how overparameterization and domain constraints influence these bounds. To our knowledge, this is the first derivation of generalization bounds for neural ODEs with general nonlinear dynamics.

[LG-19] MOCHA: Discovering Multi-Order Dynamic Causality in Temporal Point Processes

Link: https://arxiv.org/abs/2508.18873
Authors: Yunyang Cao,Juekai Lin,Wenhao Li,Bo Jin
Categories: Machine Learning (cs.LG)

Abstract:Discovering complex causal dependencies in temporal point processes (TPPs) is critical for modeling real-world event sequences. Existing methods typically rely on static or first-order causal structures, overlooking the multi-order and time-varying nature of causal relationships. In this paper, we propose MOCHA, a novel framework for discovering multi-order dynamic causality in TPPs. MOCHA characterizes multi-order influences as multi-hop causal paths over a latent time-evolving graph. To model such dynamics, we introduce a time-varying directed acyclic graph (DAG) with learnable structural weights, where acyclicity and sparsity constraints are enforced to ensure structural validity. We design an end-to-end differentiable framework that jointly models causal discovery and TPP dynamics, enabling accurate event prediction and revealing interpretable structures. Extensive experiments on real-world datasets demonstrate that MOCHA not only achieves state-of-the-art performance in event prediction, but also reveals meaningful and interpretable causal structures.

[LG-20] Recycling History: Efficient Recommendations from Contextual Dueling Bandits

Link: https://arxiv.org/abs/2508.18841
Authors: Suryanarayana Sankagiri,Jalal Etesami,Pouria Fatemi,Matthias Grossglauser
Categories: Machine Learning (cs.LG)
Notes: 16 pages, 3 figures

Abstract:The contextual duelling bandit problem models adaptive recommender systems, where the algorithm presents a set of items to the user, and the user’s choice reveals their preference. This setup is well suited for implicit choices users make when navigating a content platform, but does not capture other possible comparison queries. Motivated by the fact that users provide more reliable feedback after consuming items, we propose a new bandit model that can be described as follows. The algorithm recommends one item per time step; after consuming that item, the user is asked to compare it with another item chosen from the user’s consumption history. Importantly, in our model, this comparison item can be chosen without incurring any additional regret, potentially leading to better performance. However, the regret analysis is challenging because of the temporal dependency in the user’s history. To overcome this challenge, we first show that the algorithm can construct informative queries provided the history is rich, i.e., satisfies a certain diversity condition. We then show that a short initial random exploration phase is sufficient for the algorithm to accumulate a rich history with high probability. This result, proven via matrix concentration bounds, yields $O(\sqrt{T})$ regret guarantees. Additionally, our simulations show that reusing past items for comparisons can lead to significantly lower regret than only comparing between simultaneously recommended items.

[LG-21] DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

Link: https://arxiv.org/abs/2508.18839
Authors: Shae McFadden,Myles Foley,Mario D’Onghia,Chris Hicks,Vasilios Mavroudis,Nicola Paoletti,Fabio Pierazzi
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Notes: 10 pages

Abstract:Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved a $5.18 \pm 5.44$, $14.49 \pm 12.86$, and $10.06 \pm 10.81$ average AUT performance improvement for the classification only, classification with rejection, and classification with rejection and AL settings, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic environment of the Android malware domain.

[LG-22] Learning Real-World Acrobatic Flight from Human Preferences

Link: https://arxiv.org/abs/2508.18817
Authors: Colin Merk,Ismail Geles,Jiaxu Xing,Angel Romero,Giorgia Ramponi,Davide Scaramuzza
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Notes: 8 pages, 7 figures

Abstract:Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. In this work, we explore the use of PbRL for agile drone control, focusing on the execution of dynamic maneuvers such as powerloops. Building on Preference-based Proximal Policy Optimization (Preference PPO), we propose Reward Ensemble under Confidence (REC), an extension to the reward learning objective that improves preference modeling and learning stability. Our method achieves 88.4% of the shaped reward performance, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them to real-world drones, demonstrating multiple acrobatic maneuvers where human preferences emphasize stylistic qualities of motion. Furthermore, we demonstrate the applicability of our probabilistic reward model in a representative MuJoCo environment for continuous control. Finally, we highlight the limitations of manually designed rewards, observing only 60.7% agreement with human preferences. These results underscore the effectiveness of PbRL in capturing complex, human-centered objectives across both physical and simulated domains.

[LG-23] Federated Learning with Heterogeneous and Private Label Sets

Link: https://arxiv.org/abs/2508.18774
Authors: Adam Breitholtz,Edvin Listo Zec,Fredrik D. Johansson
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Although common in real-world applications, heterogeneous client label sets are rarely investigated in federated learning (FL). Furthermore, in the cases they are, clients are assumed to be willing to share their entire label sets with other clients. Federated learning with private label sets, shared only with the central server, adds further constraints on learning algorithms and is, in general, a more difficult problem to solve. In this work, we study the effects of label set heterogeneity on model performance, comparing the public and private label settings – when the union of label sets in the federation is known to clients and when it is not. We apply classical methods for the classifier combination problem to FL using centralized tuning, adapt common FL methods to the private label set setting, and discuss the justification of both approaches under practical assumptions. Our experiments show that reducing the number of labels available to each client harms the performance of all methods substantially. Centralized tuning of client models for representational alignment can help remedy this, but often at the cost of higher variance. Throughout, our proposed adaptations of standard FL methods perform well, showing similar performance in the private label setting as the standard methods achieve in the public setting. This shows that clients can enjoy increased privacy at little cost to model accuracy.

[LG-24] Predicting Drug-Drug Interactions Using Heterogeneous Graph Neural Networks: HGNN-DDI

Link: https://arxiv.org/abs/2508.18766
Authors: Hongbo Liu,Siyi Li,Zheng Yu
Categories: Machine Learning (cs.LG)
Notes: 12 pages, 5 figures. Published in Applied and Computational Engineering, Vol. 79, pp. 77-89, July 25, 2024. Licensed under CC BY 4.0

Abstract:Drug-drug interactions (DDIs) are a major concern in clinical practice, as they can lead to reduced therapeutic efficacy or severe adverse effects. Traditional computational approaches often struggle to capture the complex relationships among drugs, targets, and biological entities. In this work, we propose HGNN-DDI, a heterogeneous graph neural network model designed to predict potential DDIs by integrating multiple drug-related data sources. HGNN-DDI leverages graph representation learning to model heterogeneous biomedical networks, enabling effective information propagation across diverse node and edge types. Experimental results on benchmark DDI datasets demonstrate that HGNN-DDI outperforms state-of-the-art baselines in prediction accuracy and robustness, highlighting its potential to support safer drug development and precision medicine.

[LG-25] Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement

Link: https://arxiv.org/abs/2508.18765
Authors: Helen Pervez,Suyash Gaurav,Jukka Heikkonen,Jatin Chaudhary
Categories: Machine Learning (cs.LG)

Abstract:As AI systems evolve into distributed ecosystems with autonomous execution, asynchronous reasoning, and multi-agent coordination, the absence of scalable, decoupled governance poses a structural risk. Existing oversight mechanisms are reactive, brittle, and embedded within agent architectures, making them non-auditable and hard to generalize across heterogeneous deployments. We introduce Governance-as-a-Service (GaaS): a modular, policy-driven enforcement layer that regulates agent outputs at runtime without altering model internals or requiring agent cooperation. GaaS employs declarative rules and a Trust Factor mechanism that scores agents based on compliance and severity-weighted violations. It enables coercive, normative, and adaptive interventions, supporting graduated enforcement and dynamic trust modulation. To evaluate GaaS, we conduct three simulation regimes with open-source models (LLaMA3, Qwen3, DeepSeek-R1) across content generation and financial decision-making. In the baseline, agents act without governance; in the second, GaaS enforces policies; in the third, adversarial agents probe robustness. All actions are intercepted, evaluated, and logged for analysis. Results show that GaaS reliably blocks or redirects high-risk behaviors while preserving throughput. Trust scores track rule adherence, isolating and penalizing untrustworthy components in multi-agent systems. By positioning governance as a runtime service akin to compute or storage, GaaS establishes infrastructure-level alignment for interoperable agent ecosystems. It does not teach agents ethics; it enforces them.

[LG-26] UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Link: https://arxiv.org/abs/2508.18756
Authors: Zihao Huang,Yu Bao,Qiyang Min,Siyan Chen,Ran Guo,Hongzhi Huang,Defa Zhu,Yutao Zeng,Banggu Wu,Xun Zhou,Siyuan Qiao
Categories: Machine Learning (cs.LG)

Abstract:While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.

[LG-27] Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear Programming

Link: https://arxiv.org/abs/2508.18742
Authors: Jiajun Li,Ran Hou,Yu Ding,Yixuan Li,Shisi Guan,Jiahui Duan,Xiongwei Han,Tao Zhong,Vincent Chau,Weiwei Wu,Wanyuan Wang
Categories: Machine Learning (cs.LG)

Abstract:Model reduction, which aims to learn a simpler model of the original mixed integer linear programming (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for the MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label these tight-constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight-constraints. To learn the critical tight-constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50% and reduces the computation time by 17.47%.
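
The labeling step mentioned in this abstract, marking inequality constraints that are tight at the optimal solution, can be sketched as follows. This covers only the generic tightness check for constraints of the form Ax <= b; the paper's heuristic for choosing a subset of critical tight constraints is not reproduced, and the function name and tolerance are assumptions.

```python
def tight_constraints(A, b, x, tol=1e-9):
    """Return indices i where row i of A x <= b holds with equality at x."""
    tight = []
    for i, (row, rhs) in enumerate(zip(A, b)):
        lhs = sum(a * xi for a, xi in zip(row, x))
        if abs(lhs - rhs) <= tol:
            tight.append(i)
    return tight


# Toy instance: x1 + x2 <= 4, x1 <= 3, x2 <= 3, evaluated at x = (3, 1).
A = [[1, 1], [1, 0], [0, 1]]
b = [4, 3, 3]
x_opt = [3, 1]
labels = tight_constraints(A, b, x_opt)
```

Here the first two constraints are tight and the third has slack, so only the first two would be candidates for conversion into equalities.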

[LG-28] Stability and Generalization for Bellman Residuals

Link: https://arxiv.org/abs/2508.18741
Authors: Enoch H. Kang,Kyoungseok Jang
Categories: Machine Learning (cs.LG)

Abstract:Offline reinforcement learning and offline inverse reinforcement learning aim to recover near-optimal value functions or reward models from a fixed batch of logged trajectories, yet current practice still struggles to enforce Bellman consistency. Bellman residual minimization (BRM) has emerged as an attractive remedy, as a globally convergent stochastic gradient descent-ascent (SGDA) based method for BRM has been recently discovered. However, its statistical behavior in the offline setting remains largely unexplored. In this paper, we close this statistical gap. Our analysis introduces a single Lyapunov potential that couples SGDA runs on neighbouring datasets and yields an $O(1/n)$ on-average argument-stability bound, doubling the best known sample-complexity exponent for convex-concave saddle problems. The same stability constant translates into the $O(1/n)$ excess risk bound for BRM, without variance reduction, extra regularization, or restrictive independence assumptions on minibatch sampling. The results hold for standard neural-network parameterizations and minibatch SGD.

[LG-29] Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

Link: https://arxiv.org/abs/2508.18736
Authors: Jungwoo Kim,Minsang Kim,Jaeheon Lee,Chanwoo Moon,Heejin Kim,Taeho Hwang,Woosuk Chung,Yeseong Kim,Sungjin Lee
Categories: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to $1.71\times$ higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
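
To make the idea of a semantic cache with a similarity threshold concrete, here is a toy sketch. It is not SISO's implementation: the `CentroidCache` class, the fixed threshold, and the hand-written two-dimensional embeddings are all assumptions, and replacement policy and dynamic thresholding are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class CentroidCache:
    """Stores one representative embedding (centroid) per cached response."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (centroid, response)

    def put(self, centroid, response):
        self.entries.append((centroid, response))

    def get(self, query):
        """Return the closest cached response, or None on a cache miss."""
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None


cache = CentroidCache(threshold=0.9)
cache.put([1.0, 0.0], "answer A")
cache.put([0.0, 1.0], "answer B")
hit = cache.get([0.9, 0.1])    # close to the first centroid
miss = cache.get([1.0, 1.0])   # equally far from both centroids
```

Raising the threshold trades hit ratio for answer accuracy, which is the balance a dynamic threshold would adjust as the workload changes.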

[LG-30] Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning

Link: https://arxiv.org/abs/2508.18730
Authors: Yi Liu,Hongji Zhang,Yiwen Wang,Dimitris Tsaras,Lei Chen,Mingxuan Yuan,Qiang Xu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design’s structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce a novel structure-aware graph self-supervised learning framework, StructRTL, for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, our method significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG predictor. Experiments show that our approach establishes new state-of-the-art results, demonstrating the effectiveness of combining structural learning with cross-stage supervision.

[LG-31] Taming the One-Epoch Phenomenon in Online Recommendation System by Two-stage Contrastive ID Pre-training RECSYS’24

Link: https://arxiv.org/abs/2508.18700
Authors: Yi-Ping Hsu,Po-Wei Wang,Chantat Eksombatchai,Jiajing Xu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: Published at RecSys’24, see this https URL

Abstract:ID-based embeddings are widely used in web-scale online recommendation systems. However, their susceptibility to overfitting, particularly due to the long-tail nature of data distributions, often limits training to a single epoch, a phenomenon known as the “one-epoch problem.” This challenge has driven research efforts to optimize performance within the first epoch by enhancing convergence speed or feature sparsity. In this study, we introduce a novel two-stage training strategy that incorporates a pre-training phase using a minimal model with contrastive loss, enabling broader data coverage for the embedding system. Our offline experiments demonstrate that multi-epoch training during the pre-training phase does not lead to overfitting, and the resulting embeddings improve online generalization when fine-tuned for more complex downstream recommendation tasks. We deployed the proposed system in live traffic at Pinterest, achieving significant site-wide engagement gains.

[LG-32] End to End Autoencoder MLP Framework for Sepsis Prediction

Link: https://arxiv.org/abs/2508.18688
Authors: Hejiang Cai,Di Wu,Ji Xu,Xiang Liu,Yiziting Zhu,Xin Shu,Yujie Li,Bin Yi
Categories: Machine Learning (cs.LG)

Abstract:Sepsis is a life-threatening condition that requires timely detection in intensive care settings. Traditional machine learning approaches, including Naive Bayes, Support Vector Machine (SVM), Random Forest, and XGBoost, often rely on manual feature engineering and struggle with the irregular, incomplete time-series data commonly present in electronic health records. We introduce an end-to-end deep learning framework integrating an unsupervised autoencoder for automatic feature extraction with a multilayer perceptron classifier for binary sepsis risk prediction. To enhance clinical applicability, we implement a customized downsampling strategy that extracts high-information-density segments during training and a non-overlapping dynamic sliding window mechanism for real-time inference. Preprocessed time-series data are represented as fixed-dimension vectors with explicit missingness indicators, mitigating bias and noise. We validate our approach on three ICU cohorts. Our end-to-end model achieves accuracies of 74.6%, 80.6%, and 93.5%, respectively, consistently outperforming traditional machine learning baselines. These results demonstrate the framework’s superior robustness, generalizability, and clinical utility for early sepsis detection across heterogeneous ICU environments.

[LG-33] Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding

Link: https://arxiv.org/abs/2508.18676
Authors: Chufan Gao,Jintai Chen,Jimeng Sun
Categories: Machine Learning (cs.LG)

Abstract:Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) training-free prompting of LLM agents using chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve (LRTab), which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions that would avoid the error, thereby learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions as additional context for table understanding. We provide comprehensive experiments on WikiTQ and TabFact, showing that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.

[LG-34] Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance

Link: https://arxiv.org/abs/2508.18638
Authors: Ifrah Tariq,Ernest Fraenkel
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Immune checkpoint inhibitors (ICIs) have transformed cancer treatment, yet patient responses remain highly variable, and the biological mechanisms underlying resistance are poorly understood. While machine learning models hold promise for predicting responses to ICIs, most existing methods lack interpretability and do not effectively leverage the biological structure inherent to multi-omics data. Here, we introduce the Biologically Disentangled Variational Autoencoder (BDVAE), a deep generative model that integrates transcriptomic and genomic data through modality- and pathway-specific encoders. Unlike existing rigid, pathway-informed models, BDVAE employs a modular encoder architecture combined with variational inference to learn biologically meaningful latent features associated with immune, genomic, and metabolic processes. Applied to a pan-cancer cohort of 366 patients across four cancer types treated with ICIs, BDVAE accurately predicts treatment response (AUC-ROC = 0.94 on unseen test data) and uncovers critical resistance mechanisms, including immune suppression, metabolic shifts, and neuronal signaling. Importantly, BDVAE reveals that resistance spans a continuous biological spectrum rather than strictly binary states, reflecting gradations of tumor dysfunction. Several latent features correlate with survival outcomes and known clinical subtypes, demonstrating BDVAE’s capability to generate interpretable, clinically relevant insights. These findings underscore the value of biologically structured machine learning in elucidating complex resistance patterns and guiding precision immunotherapy strategies.

[LG-35] STRATA-TS: Selective Knowledge Transfer for Urban Time Series Forecasting with Retrieval-Guided Reasoning

Link: https://arxiv.org/abs/2508.18635
Authors: Yue Jiang,Chenxi Liu,Yile Chen,Qin Chao,Shuai Liu,Gao Cong
Categories: Machine Learning (cs.LG)

Abstract:Urban forecasting models often face a severe data imbalance problem: only a few cities have dense, long-span records, while many others expose short or incomplete histories. Direct transfer from data-rich to data-scarce cities is unreliable because only a limited subset of source patterns truly benefits the target domain, whereas indiscriminate transfer risks introducing noise and negative transfer. We present STRATA-TS (Selective TRAnsfer via TArget-aware retrieval for Time Series), a framework that combines domain-adapted retrieval with reasoning-capable large models to improve forecasting in scarce data regimes. STRATA-TS employs a patch-based temporal encoder to identify source subsequences that are semantically and dynamically aligned with the target query. These retrieved exemplars are then injected into a retrieval-guided reasoning stage, where an LLM performs structured inference over target inputs and retrieved support. To enable efficient deployment, we distill the reasoning process into a compact open model via supervised fine-tuning. Extensive experiments on three parking availability datasets across Singapore, Nottingham, and Glasgow demonstrate that STRATA-TS consistently outperforms strong forecasting and transfer baselines, while providing interpretable knowledge transfer pathways.

[LG-36] Scalable Fairness Shaping with LLM-Guided Multi-Agent Reinforcement Learning for Peer-to-Peer Electricity Markets

Link: https://arxiv.org/abs/2508.18610
Authors: Shrenik Jadhav,Birva Sevak,Srijita Das,Akhtar Hussain,Wencong Su,Van-Hai Bui
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Peer-to-peer (P2P) energy trading is becoming central to modern distribution systems as rooftop PV and home energy management systems become pervasive. Yet most existing market and reinforcement learning designs emphasize efficiency or private profit and offer little real-time guidance to ensure equitable outcomes under uncertainty. To address this gap, a fairness-aware multi-agent reinforcement learning framework, FairMarket-RL, is proposed in which a large language model (LLM) critic shapes bidding policies within a continuous double auction under partial observability and discrete price-quantity actions. After each trading slot, the LLM returns three normalized fairness scores, Fairness-to-Grid (FTG), Fairness-Between-Sellers (FBS), and Fairness-of-Pricing (FPP), which are integrated into the reward via ramped coefficients and tunable scaling, so that fairness guidance complements, rather than overwhelms, economic incentives. The environment models realistic residential load and PV profiles and enforces hard constraints on prices, physical feasibility, and policy-update stability. Across a progression of experiments from a small pilot to a larger simulated community and a mixed-asset real-world dataset, the framework shifts exchanges toward local P2P trades, lowers consumer costs relative to grid-only procurement, sustains strong fairness across participants, and preserves utility viability. Sensitivity analyses over solar availability and aggregate demand further indicate robust performance, suggesting a scalable, LLM-guided pathway to decentralized electricity markets that are economically efficient, socially equitable, and technically sound.

[LG-37] Linear Trading Position with Sparse Spectrum IJCAI2025

Link: https://arxiv.org/abs/2508.18596
Authors: Zhao-Rong Lai,Haisheng Yang
Categories: Machine Learning (cs.LG)
Comments: IJCAI2025

Abstract:The principal portfolio approach is an emerging method in signal-based trading. However, these principal portfolios may not be diversified to explore the key features of the prediction matrix or robust to different situations. To address this problem, we propose a novel linear trading position with sparse spectrum that can explore a larger spectral region of the prediction matrix. We also develop a Krasnosel’skiĭ-Mann fixed-point algorithm to optimize this trading position, which possesses the descent property and achieves a linear convergence rate in the objective value. This is a new theoretical result for this type of algorithm. Extensive experiments show that the proposed method achieves good and robust performance in various situations.
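
For readers unfamiliar with the Krasnosel’skiĭ-Mann scheme named in the abstract, the following is a minimal sketch of the generic iteration x_{k+1} = (1 - alpha) * x_k + alpha * T(x_k) on a scalar toy problem. The operator `T` (here `math.cos`), the step size `alpha`, and the stopping rule are illustrative assumptions; this is not the paper's trading-position operator or its algorithm.

```python
import math

def krasnoselskii_mann(T, x0, alpha=0.5, tol=1e-12, max_iter=10_000):
    """Generic Krasnosel'skii-Mann iteration for a nonexpansive map T.

    Repeats x <- (1 - alpha) * x + alpha * T(x) until the step is below tol.
    """
    x = x0
    for _ in range(max_iter):
        x_next = (1.0 - alpha) * x + alpha * T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Toy example (assumption): cos is 1-Lipschitz, hence nonexpansive on the reals.
fixed_point = krasnoselskii_mann(math.cos, x0=1.0)
print(fixed_point, math.cos(fixed_point))
```

The averaging step is what distinguishes this scheme from plain fixed-point iteration: it can converge for maps that are only nonexpansive rather than strictly contractive.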

[LG-38] History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL

Link: https://arxiv.org/abs/2508.18588
Authors: Jingkai He,Tianjian Li,Erhu Feng,Dong Du,Qian Liu,Tao Liu,Yubin Xia,Haibo Chen
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:With the rapid advancement of large language models (LLMs), reinforcement learning (RL) has emerged as a pivotal methodology for enhancing the reasoning capabilities of LLMs. Unlike traditional pre-training approaches, RL encompasses multiple stages: rollout, reward, and training, which necessitates collaboration among various worker types. However, current RL systems continue to grapple with substantial GPU underutilization, due to two primary factors: (1) The rollout stage dominates the overall RL process due to test-time scaling; (2) Imbalances in rollout lengths (within the same batch) result in GPU bubbles. While prior solutions like asynchronous execution and truncation offer partial relief, they may compromise training accuracy for efficiency. Our key insight stems from a previously overlooked observation: rollout responses exhibit remarkable similarity across adjacent training epochs. Based on the insight, we introduce RhymeRL, an LLM RL system designed to accelerate RL training with two key innovations. First, to enhance rollout generation, we present HistoSpec, a speculative decoding inference engine that utilizes the similarity of historical rollout token sequences to obtain accurate drafts. Second, to tackle rollout bubbles, we introduce HistoPipe, a two-tier scheduling strategy that leverages the similarity of historical rollout distributions to balance workload among rollout workers. We have evaluated RhymeRL within a real production environment, demonstrating scalability from dozens to thousands of GPUs. Experimental results demonstrate that RhymeRL achieves a 2.6x performance improvement over existing methods, without compromising accuracy or modifying the RL paradigm. 

[LG-39] Sparse Autoencoders for Low-N Protein Function Prediction and Design

Link: https://arxiv.org/abs/2508.18567
Authors: Darin Tsui,Kunal Talreja,Amirali Aghazadeh
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 15 pages, 4 figures

Abstract:Predicting protein function from amino acid sequence remains a central challenge in data-scarce (low-N) regimes, limiting machine learning-guided protein design when only small amounts of assay-labeled sequence-function data are available. Protein language models (pLMs) have advanced the field by providing evolutionary-informed embeddings, and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low-N function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction, indicating that their sparse latent space encodes compact and biologically meaningful representations that generalize more effectively from limited data. Moreover, steering predictive latents exploits biological motifs in pLM representations, yielding top-fitness variants in 83% of cases compared to designing with ESM2 alone.

[LG-40] Improving Long-term Autoregressive Spatiotemporal Predictions: A Proof of Concept with Fluid Dynamics

Link: https://arxiv.org/abs/2508.18565
Authors: Hao Zhou,Sibo Cheng
Categories: Machine Learning (cs.LG)

Abstract:Data-driven methods are emerging as efficient alternatives to traditional numerical forecasting, offering fast inference and lower computational cost. Yet, for complex systems, long-term accuracy often deteriorates due to error accumulation, and autoregressive training (though effective) demands large GPU memory and may sacrifice short-term performance. We propose the Stochastic PushForward (SPF) framework, which retains one-step-ahead training while enabling multi-step learning. SPF builds a supplementary dataset from model predictions and combines it with ground truth via a stochastic acquisition strategy, balancing short- and long-term performance while reducing overfitting. Multi-step predictions are precomputed between epochs, keeping memory usage stable without storing full unrolled sequences. Experiments on the Burgers’ equation and the Shallow Water benchmark show that SPF achieves higher long-term accuracy than autoregressive methods while lowering memory requirements, making it promising for resource-limited and complex simulations.
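
One way to read the two ingredients in the abstract (a supplementary dataset built from model predictions, and a stochastic mix with ground truth) is sketched below. This is an interpretation under stated assumptions, not the authors' implementation: the helper names `build_pushforward_pairs` and `mix_datasets`, the fixed rollout depth `k`, and the mixing probability `p_pushed` are all hypothetical.

```python
import random

def build_pushforward_pairs(model, trajectory, k):
    """Pair k-step model rollouts with the true next state.

    model: callable mapping a state to the predicted next state.
    trajectory: list of true states x_0, x_1, ...
    Returns (predicted x_{t+k}, true x_{t+k+1}) pairs, so training on them
    is still one-step-ahead but starts from the model's own predictions.
    """
    pairs = []
    for t in range(len(trajectory) - k - 1):
        x = trajectory[t]
        for _ in range(k):
            x = model(x)
        pairs.append((x, trajectory[t + k + 1]))
    return pairs

def mix_datasets(truth_pairs, pushed_pairs, p_pushed, rng):
    """Stochastic acquisition (assumed form): swap in a pushed pair with prob. p_pushed."""
    mixed = []
    for pair in truth_pairs:
        if pushed_pairs and rng.random() < p_pushed:
            mixed.append(rng.choice(pushed_pairs))
        else:
            mixed.append(pair)
    return mixed

# Toy scalar system: the truth decays by 0.9 per step, the imperfect model by 0.8.
trajectory = [0.9 ** t for t in range(10)]
model = lambda x: 0.8 * x
truth_pairs = list(zip(trajectory[:-1], trajectory[1:]))
pushed_pairs = build_pushforward_pairs(model, trajectory, k=2)
rng = random.Random(0)
epoch_data = mix_datasets(truth_pairs, pushed_pairs, p_pushed=0.3, rng=rng)
```

Because the pushed pairs are computed between epochs and stored as plain input/target pairs, no unrolled computation graph needs to be kept in memory during training, which is consistent with the memory argument in the abstract.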

[LG-41] A Note on Graphon-Signal Analysis of Graph Neural Networks

Link: https://arxiv.org/abs/2508.18564
Authors: Levi Rauchwerger,Ron Levie
Categories: Machine Learning (cs.LG)

Abstract:A recent paper, “A Graphon-Signal Analysis of Graph Neural Networks”, by Levie, analyzed message passing graph neural networks (MPNNs) by embedding the input space of MPNNs, i.e., attributed graphs (graph-signals), into a space of attributed graphons (graphon-signals). Based on extensions of standard results in graphon analysis to graphon-signals, the paper proved a generalization bound and a sampling lemma for MPNNs. However, there are some missing ingredients in that paper, limiting its applicability in practical settings of graph machine learning. In the current paper, we introduce several refinements and extensions to existing results that address these shortcomings. In detail, 1) we extend the main results in the paper to graphon-signals with multidimensional signals (rather than 1D signals), 2) we extend the Lipschitz continuity to MPNNs with readout with respect to cut distance (rather than MPNNs without readout with respect to cut metric), 3) we improve the generalization bound by utilizing robustness-type generalization bounds, and 4) we extend the analysis to non-symmetric graphons and kernels.

[LG-42] Enhancing Chemical Explainability Through Counterfactual Masking

Link: https://arxiv.org/abs/2508.18561
Authors: Łukasz Janisiów,Marek Kochańczyk,Bartosz Zieliński,Tomasz Danel
Categories: Machine Learning (cs.LG)

Abstract:Molecular property prediction is a crucial task that guides the design of new compounds, including drugs and materials. While explainable artificial intelligence methods aim to scrutinize model predictions by identifying influential molecular substructures, many existing approaches rely on masking strategies that remove either atoms or atom-level features to assess importance via fidelity metrics. These methods, however, often fail to adhere to the underlying molecular distribution and thus yield unintuitive explanations. In this work, we propose counterfactual masking, a novel framework that replaces masked substructures with chemically reasonable fragments sampled from generative models trained to complete molecular graphs. Rather than evaluating masked predictions against implausible zeroed-out baselines, we assess them relative to counterfactual molecules drawn from the data distribution. Our method offers two key benefits: (1) molecular realism underpinning robust and distribution-consistent explanations, and (2) meaningful counterfactuals that directly indicate how structural modifications may affect predicted properties. We demonstrate that counterfactual masking is well-suited for benchmarking model explainers and yields more actionable insights across multiple datasets and property prediction tasks. Our approach bridges the gap between explainability and molecular design, offering a principled and generative path toward explainable machine learning in chemistry.

[LG-43] BTW: A Non-Parametric Variance Stabilization Framework for Multimodal Model Integration

Link: https://arxiv.org/abs/2508.18551
Authors: Jun Hou,Le Wang,Xuan Wang
Categories: Machine Learning (cs.LG)

Abstract:Mixture-of-Experts (MoE) models have become increasingly powerful in multimodal learning by enabling modular specialization across modalities. However, their effectiveness remains unclear when additional modalities introduce more noise than complementary information. Existing approaches, such as the Partial Information Decomposition, struggle to scale beyond two modalities and lack the resolution needed for instance-level control. We propose Beyond Two-modality Weighting (BTW), a bi-level, non-parametric weighting framework that combines instance-level Kullback-Leibler (KL) divergence and modality-level mutual information (MI) to dynamically adjust modality importance during training. Our method does not require additional parameters and can be applied to an arbitrary number of modalities. Specifically, BTW computes per-example KL weights by measuring the divergence between each unimodal and the current multimodal prediction, and modality-wide MI weights by estimating global alignment between unimodal and multimodal outputs. Extensive experiments on sentiment regression and clinical classification demonstrate that our method significantly improves regression performance and multiclass classification accuracy.
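
The instance-level quantity described in the abstract, the divergence between each unimodal prediction and the current multimodal prediction, can be illustrated for a single classification example as follows. The direction KL(unimodal || multimodal), the `eps` smoothing, and the modality names are assumptions; how BTW turns these divergences into training weights is not specified in the abstract and is not modeled here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def instance_kl(unimodal_probs, multimodal_probs):
    """Per-example divergence of each modality from the fused prediction.

    unimodal_probs: dict mapping modality name -> class-probability list.
    multimodal_probs: class-probability list of the fused model.
    """
    return {m: kl_divergence(p, multimodal_probs) for m, p in unimodal_probs.items()}

# Hypothetical predictions for one example with two classes.
fused = [0.7, 0.3]
per_modality = {"text": [0.7, 0.3], "audio": [0.2, 0.8]}
divergences = instance_kl(per_modality, fused)
print(divergences)
```

In this toy example the text modality agrees with the fused prediction (divergence 0), while the audio modality disagrees strongly, which is the kind of per-example signal a weighting scheme could act on.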

[LG-44] Quantifying The Limits of AI Reasoning: Systematic Neural Network Representations of Algorithms

Link: https://arxiv.org/abs/2508.18526
Authors: Anastasis Kratsios,Dennis Zvigelsky,Bradd Hart
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
Comments: 18 pages main body, 45 pages total + references

Abstract:A main open question in contemporary AI research is quantifying the forms of reasoning neural networks can perform when perfectly trained. This paper answers this by interpreting reasoning tasks as circuit emulation, where the gates define the type of reasoning: e.g., Boolean gates for predicate logic, tropical circuits for dynamic programming, arithmetic and analytic gates for symbolic mathematical representation, and hybrids thereof for deeper reasoning such as higher-order logic. We present a systematic meta-algorithm that converts essentially any circuit into a feedforward neural network (NN) with ReLU activations by iteratively replacing each gate with a canonical ReLU MLP emulator. We show that, on any digital computer, our construction emulates the circuit exactly (no approximation, no rounding, modular overflow included), demonstrating that no reasoning task lies beyond the reach of neural networks. The number of neurons in the resulting network (parametric complexity) scales with the circuit’s complexity, and the network’s computational graph (structure) mirrors that of the emulated circuit. This formalizes the folklore that NNs trade algorithmic run-time (circuit runtime) for space complexity (number of neurons). We derive a range of applications of our main result, from emulating shortest-path algorithms on graphs with cubic-size NNs, to simulating stopped Turing machines with roughly quadratically large NNs, and even the emulation of randomized Boolean circuits. Lastly, we demonstrate that our result is strictly more powerful than a classical universal approximation theorem: any universal function approximator can be encoded as a circuit and directly emulated by a NN.
MSC classes: 68T07, 68Q17, 68Q05, 68W40, 68N99
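
To make the gate-by-gate idea concrete, here is a small sketch of well-known ReLU identities that reproduce Boolean gates exactly on inputs in {0, 1} and the tropical `min` exactly on numbers, followed by a composed XOR circuit. These are standard textbook constructions chosen for illustration; they are not claimed to be the paper's canonical emulators, and the function names are hypothetical.

```python
def relu(x):
    return max(0, x)

def and_gate(x, y):
    # Exact on {0, 1}: only x = y = 1 gives x + y - 1 = 1.
    return relu(x + y - 1)

def or_gate(x, y):
    # Exact on {0, 1}: 1 - relu(1 - x - y) is 0 only when x = y = 0.
    return 1 - relu(1 - x - y)

def not_gate(x):
    # Affine map, no ReLU needed.
    return 1 - x

def tropical_min(x, y):
    # min(x, y) = x - relu(x - y), exact for any numbers.
    return x - relu(x - y)

def xor_gate(x, y):
    # Circuit composition: (x OR y) AND NOT (x AND y).
    return and_gate(or_gate(x, y), not_gate(and_gate(x, y)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), xor_gate(a, b))
```

Replacing each gate of a circuit with such a fragment yields a network whose size grows with the number of gates and whose wiring follows the circuit, which matches the size and structure statements in the abstract.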

[LG-45] Breaking Through Barren Plateaus: Reinforcement Learning Initializations for Deep Variational Quantum Circuits

Link: https://arxiv.org/abs/2508.18514
Authors: Yifeng Peng,Xinyi Li,Zhemin Zhang,Samuel Yen-Chi Chen,Zhiding Liang,Ying Wang
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Variational Quantum Algorithms (VQAs) have gained prominence as a viable framework for exploiting near-term quantum devices in applications ranging from optimization and chemistry simulation to machine learning. However, the effectiveness of VQAs is often constrained by the so-called barren plateau problem, wherein gradients diminish exponentially as system size or circuit depth increases, thereby hindering training. In this work, we propose a reinforcement learning (RL)-based initialization strategy to alleviate the barren plateau issue by reshaping the initial parameter landscape to avoid regions prone to vanishing gradients. In particular, we explore several RL algorithms (including Deterministic Policy Gradient, Soft Actor-Critic, and Proximal Policy Optimization) to generate the circuit parameters (treated as actions) that minimize the VQA cost function before standard gradient-based optimization. By pre-training with RL in this manner, subsequent optimization using methods such as gradient descent or Adam proceeds from a more favorable initial state. Extensive numerical experiments under various noise conditions and tasks consistently demonstrate that the RL-based initialization method significantly enhances both convergence speed and final solution quality. Moreover, comparisons among different RL algorithms highlight that multiple approaches can achieve comparable performance gains, underscoring the flexibility and robustness of our method. These findings shed light on a promising avenue for integrating machine learning techniques into quantum algorithm design, offering insights into how RL-driven parameter initialization can accelerate the scalability and practical deployment of VQAs, and opening up a promising path for the research community in machine learning for quantum, especially for barren plateau problems in VQAs.

[LG-46] DenseRec: Revisiting Dense Content Embeddings for Sequential Transformer-based Recommendation RECSYS’25

Link: https://arxiv.org/abs/2508.18442
Authors: Jan Malte Lichtenberg,Antonio De Candia,Matteo Ruffini
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: EARL workshop @RecSys’25, Prague, Czech Republic

Abstract:Transformer-based sequential recommenders, such as SASRec or BERT4Rec, typically rely solely on learned item ID embeddings, making them vulnerable to the item cold-start problem, particularly in environments with dynamic item catalogs. While dense content embeddings from pre-trained models offer potential solutions, direct integration into transformer-based recommenders has consistently underperformed compared to ID-only approaches. We revisit this integration challenge and propose DenseRec, a simple yet effective method that introduces a dual-path embedding approach. DenseRec learns a linear projection from the dense embedding space into the ID embedding space during training, enabling seamless generalization to previously unseen items without requiring specialized embedding models or complex infrastructure. In experiments on three real-world datasets, we find DenseRec to consistently outperform an ID-only SASRec baseline, even without additional hyperparameter tuning and while using compact embedding models. Our analysis suggests improvements primarily arise from better sequence representations in the presence of unseen items, positioning DenseRec as a practical and robust solution for cold-start sequential recommendation.
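
The dual-path lookup described above can be sketched at inference time as follows: known items use their learned ID embedding, and unseen items fall back to a linear projection of their dense content embedding into the same space. The table names, the projection matrix `W`, and the toy values are hypothetical, and how the two paths are combined during training is not stated in the abstract, so it is not shown.

```python
def project(W, dense):
    """Linear map from the dense content space into the ID embedding space.

    W is a list of rows (id_dim x dense_dim); dense is a list of length dense_dim.
    """
    return [sum(w * d for w, d in zip(row, dense)) for row in W]

def embed_item(item_id, id_table, dense_table, W):
    """Dual-path lookup: ID embedding if the item was seen in training,
    otherwise the projected content embedding (the cold-start path)."""
    if item_id in id_table:
        return id_table[item_id]
    return project(W, dense_table[item_id])

# Toy setup (assumed values): 2-d ID space, 3-d content space.
id_table = {"item_a": [0.1, 0.9]}
dense_table = {"item_a": [1.0, 0.0, 0.0], "item_new": [0.5, -1.0, 2.0]}
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

seen = embed_item("item_a", id_table, dense_table, W)
cold = embed_item("item_new", id_table, dense_table, W)
print(seen, cold)
```

Because both paths land in the same embedding space, the downstream transformer does not need to distinguish warm items from cold ones.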

[LG-47] Enhancing Trust-Region Bayesian Optimization via Newton Methods

Link: https://arxiv.org/abs/2508.18423
Authors: Quanlin Chen,Yiyu Chen,Jing Huo,Tianyu Ding,Yang Gao,Yuetong Chen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian Optimization (BO) has been widely applied to optimize expensive black-box functions while retaining sample efficiency. However, scaling BO to high-dimensional spaces remains challenging. Existing literature proposes performing standard BO in multiple local trust regions (TuRBO) for heterogeneous modeling of the objective function and avoiding over-exploration. Despite its advantages, using local Gaussian Processes (GPs) reduces sampling efficiency compared to a global GP. To enhance sampling efficiency while preserving heterogeneous modeling, we propose to construct multiple local quadratic models using gradients and Hessians from a global GP, and select new sample points by solving the bound-constrained quadratic program. Additionally, we address the issue of vanishing gradients of GPs in high-dimensional spaces. We provide a convergence analysis and demonstrate through experimental results that our method enhances the efficacy of TuRBO and outperforms a wide range of high-dimensional BO techniques on synthetic functions and real-world applications.

[LG-48] LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning

Link: https://arxiv.org/abs/2508.18420
Authors: André Quadros,Cassio Silva,Ronnie Alves
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 5 figures, Accepted to the ENIAC 2025 conference

Abstract:This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of reinforcement learning (RL) agents in environments with extremely sparse rewards, where traditional learning struggles due to infrequent positive feedback. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). The LLMs leverage their pre-trained knowledge to generate reward signals based on environment and goal descriptions, guiding the agent. We implemented this combined approach with an Actor-Critic (A2C) agent in the MiniGrid DoorKey environment, a benchmark for sparse rewards. Our empirical results show that this combined strategy significantly increases agent performance and sampling efficiency compared to using each strategy individually or a standard A2C agent, which failed to learn. Analysis of learning curves indicates that the two strategies complement each other on different aspects of the environment and task: VSIMR drives exploration of new states, while the LLM-derived rewards facilitate progressive exploitation towards goals.

[LG-49] DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

Link: https://arxiv.org/abs/2508.18376
Authors: Weilin Cai,Le Qin,Shwai He,Junwei Cui,Ang Li,Jiayi Huang
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Mixture of Experts (MoE) has become a mainstream architecture for building Large Language Models (LLMs) by reducing per-token computation while enabling model scaling. It can be viewed as partitioning a large Feed-Forward Network (FFN) at the tensor level into fine-grained sub-FFNs, or experts, and activating only a sparse subset for each input. While this sparsity improves efficiency, MoE still faces substantial challenges due to its massive computational scale and unpredictable activation patterns. To enable efficient MoE deployment, we identify dual sparsity at the tensor and neuron levels in pre-trained MoE modules as a key factor for both accuracy and efficiency. Unlike prior work that increases tensor-level sparsity through finer-grained expert design during pre-training, we introduce post-training expert partitioning to induce such sparsity without retraining. This preserves the mathematical consistency of model transformations and enhances both efficiency and accuracy in subsequent fine-tuning and inference. Building upon this, we propose DualSparse-MoE, an inference system that integrates dynamic tensor-level computation dropping with static neuron-level reconstruction to deliver significant efficiency gains with minimal accuracy loss. Experimental results show that enforcing an approximate 25% drop rate with our approach reduces average accuracy by only 0.08%-0.28% across three prevailing MoE models, while nearly all degrees of computation dropping consistently yield proportional computational speedups. Furthermore, incorporating load-imbalance awareness into expert parallelism achieves a 1.41x MoE module speedup with just 0.5% average accuracy degradation.
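
The claim that partitioning an expert leaves the computation unchanged can be checked on a toy example. The sketch below assumes a plain two-layer ReLU FFN, y = W2 · relu(W1 · x); production MoE experts often use gated activations, and the paper's actual partitioning and dropping rules are not reproduced here. Splitting the hidden neurons into groups, with matching rows of `W1` and columns of `W2`, gives sub-experts whose outputs sum to the original output.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def ffn(W1, W2, x):
    """Plain two-layer FFN (assumed form): W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def partition_expert(W1, W2, groups):
    """Split hidden neurons into groups; each sub-expert keeps the
    matching rows of W1 and the matching columns of W2."""
    subs = []
    for idx in groups:
        sub_W1 = [W1[i] for i in idx]
        sub_W2 = [[row[i] for i in idx] for row in W2]
        subs.append((sub_W1, sub_W2))
    return subs

# Toy expert: 2 inputs, 4 hidden neurons, 2 outputs.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5], [2.0, 1.0]]
W2 = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 2.0, -1.0]]
x = [1.0, 2.0]

full = ffn(W1, W2, x)
parts = [ffn(a, b, x) for a, b in partition_expert(W1, W2, [[0, 1], [2, 3]])]
summed = [sum(col) for col in zip(*parts)]
print(full, summed)
```

Once an expert is expressed as a sum of sub-experts, skipping one of them for a given token removes a proportional share of the computation, which is the kind of tensor-level dropping the abstract refers to.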

[LG-50] Linear cost mutual information estimation and independence test of similar performance as HSIC

Link: https://arxiv.org/abs/2508.18338
Authors: Jarek Duda,Jagoda Bracha,Adrian Przybysz
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 5 figures

Abstract:Evaluation of statistical dependencies between two data samples is a basic problem in data science and machine learning, and HSIC (Hilbert-Schmidt Independence Criterion) is considered the state-of-the-art method. However, for a data sample of size n it requires multiplication of n × n matrices, which currently has roughly O(n^2.37) computational complexity, making it impractical for large data samples. We discuss HCR (Hierarchical Correlation Reconstruction) as a practical linear-cost alternative with even higher dependence sensitivity in tests. It additionally provides an actual joint distribution model by describing dependencies through features that are mixed moments, starting with correlation and homoscedasticity, and it allows mutual information to be approximated as just the sum of squares of such nontrivial mixed moments between two data samples. Each such dependence-describing feature is calculated in O(n) linear time. The number of features to test varies with the dimension d: O(d^2) for pairwise dependencies, O(d^3) if more subtle triplewise dependencies are also considered, and so on.
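
A minimal sketch of the mixed-moment idea for two scalar samples follows. It assumes one common HCR-style setup: each sample is normalized to a nearly uniform variable on [0, 1] through its empirical CDF, Legendre polynomials rescaled to be orthonormal on [0, 1] serve as the basis, and the dependence score is the sum of squared mixed moments. The basis size (two polynomials), the function names, and the toy data are illustrative assumptions, not the authors' code. The rank step below uses sorting, so only the moment computation itself is linear in n.

```python
import math
import random

def to_uniform(xs):
    """Empirical-CDF normalization: replace each value by (rank + 0.5) / n."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    u = [0.0] * n
    for rank, i in enumerate(order):
        u[i] = (rank + 0.5) / n
    return u

# Legendre polynomials rescaled to be orthonormal on [0, 1] (degrees 1 and 2).
BASIS = [
    lambda u: math.sqrt(3.0) * (2.0 * u - 1.0),
    lambda u: math.sqrt(5.0) * (6.0 * u * u - 6.0 * u + 1.0),
]

def mixed_moment(u, v, fj, fk):
    """Sample mean of fj(u) * fk(v); one pass over the data."""
    return sum(fj(a) * fk(b) for a, b in zip(u, v)) / len(u)

def dependence_score(xs, ys):
    """Sum of squared mixed moments, used as a rough mutual-information proxy."""
    u, v = to_uniform(xs), to_uniform(ys)
    return sum(mixed_moment(u, v, fj, fk) ** 2 for fj in BASIS for fk in BASIS)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys_independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys_dependent = [x ** 3 for x in xs]  # monotone function of xs

score_independent = dependence_score(xs, ys_independent)
score_dependent = dependence_score(xs, ys_dependent)
print(score_independent, score_dependent)
```

For independent samples every mixed moment is close to zero, so the score stays near zero, whereas a deterministic monotone relationship drives the matching-degree moments toward one.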

[LG-51] ZTFed-MAS2S: A Zero-Trust Federated Learning Framework with Verifiable Privacy and Trust-Aware Aggregation for Wind Power Data Imputation

Link: https://arxiv.org/abs/2508.18318
Authors: Yang Li,Hanjie Wang,Yuanzheng Li,Jiazheng Li,Zhaoyang Dong
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Comments: Accepted by IEEE Transactions on Industrial Informatics, 11 pages, 6 figures

Abstract:Wind power data often suffers from missing values due to sensor faults and unstable transmission at edge sites. While federated learning enables privacy-preserving collaboration without sharing raw data, it remains vulnerable to anomalous updates and privacy leakage during parameter exchange. These challenges are amplified in open industrial environments, necessitating zero-trust mechanisms where no participant is inherently trusted. To address these challenges, this work proposes ZTFed-MAS2S, a zero-trust federated learning framework that integrates a multi-head attention-based sequence-to-sequence imputation model. ZTFed integrates verifiable differential privacy with non-interactive zero-knowledge proofs and a confidentiality and integrity verification mechanism to ensure verifiable privacy preservation and secure model parameters transmission. A dynamic trust-aware aggregation mechanism is employed, where trust is propagated over similarity graphs to enhance robustness, and communication overhead is reduced via sparsity- and quantization-based compression. MAS2S captures long-term dependencies in wind power data for accurate imputation. Extensive experiments on real-world wind farm datasets validate the superiority of ZTFed-MAS2S in both federated learning performance and missing data imputation, demonstrating its effectiveness as a secure and efficient solution for practical applications in the energy sector.

[LG-52] Learning Spatio-Temporal Dynamics via Operator-Valued RKHS and Kernel Koopman Methods

Link: https://arxiv.org/abs/2508.18307
Authors: Mahishanka Withanachchi
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)

Abstract:We introduce a unified framework for learning the spatio-temporal dynamics of vector valued functions by combining operator valued reproducing kernel Hilbert spaces (OV-RKHS) with kernel based Koopman operator methods. The approach enables nonparametric and data driven estimation of complex time evolving vector fields while preserving both spatial and temporal structure. We establish representer theorems for time dependent OV-RKHS interpolation, derive Sobolev type approximation bounds for smooth vector fields, and provide spectral convergence guarantees for kernel Koopman operator approximations. This framework supports efficient reduced order modeling and long term prediction of high dimensional nonlinear systems, offering theoretically grounded tools for forecasting, control, and uncertainty quantification in spatio-temporal machine learning.

[LG-53] A Fast and Minimal System to Identify Depression Using Smartphones: Explainable Machine Learning-Based Approach

Link: https://arxiv.org/abs/2508.18301
Authors: Md Sabbir Ahmed,Nova Ahmed
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:Background: Existing robust, pervasive device-based systems developed in recent years to detect depression require data collected over a long period and may not be effective in cases where early detection is crucial. Objective: Our main objective was to develop a minimalistic system to identify depression using data retrieved in the fastest possible time. Methods: We developed a fast tool that retrieves the past 7 days’ app usage data in 1 second (mean 0.31, SD 1.10 seconds). A total of 100 students from Bangladesh participated in our study, and our tool collected their app usage data. To identify depressed and nondepressed students, we developed a diverse set of ML models. We selected important features using the stable approach, along with 3 main types of feature selection (FS) approaches. Results: Leveraging only the app usage data retrieved in 1 second, our light gradient boosting machine model used the important features selected by the stable FS approach and correctly identified 82.4% (n=42) of depressed students (precision=75%, F1-score=78.5%). Moreover, after comprehensive exploration, we presented a parsimonious stacking model where around 5 features selected by the all-relevant FS approach Boruta were used in each iteration of validation and showed a maximum precision of 77.4% (balanced accuracy=77.9%). A SHAP analysis of our best models presented behavioral markers that were related to depression. Conclusions: Due to our system’s fast and minimalistic nature, it may make a worthwhile contribution to identifying depression in underdeveloped and developing regions. In addition, our detailed discussion about the implication of our findings can facilitate the development of less resource-intensive systems to better understand students who are depressed. 
Related DOI: https://doi.org/10.2196/28848

[LG-54] Data-driven models for production forecasting and decision supporting in petroleum reservoirs

Link: https://arxiv.org/abs/2508.18289
Authors: Mateus A. Fernandes,Michael M. Furlanetti,Eduardo Gildin,Marcio A. Sampaio
Subjects: Machine Learning (cs.LG)
Comments: Manuscript as submitted to Journal of Petroleum Exploration and Production Technology

Abstract:Forecasting production reliably and anticipating changes in the behavior of rock-fluid systems are the main challenges in petroleum reservoir engineering. This project proposes to deal with this problem through a data-driven approach and using machine learning methods. The objective is to develop a methodology to forecast production parameters based on simple data such as produced and injected volumes and, eventually, gauges located in wells, without depending on information from geological models, fluid properties or details of well completions and flow systems. Initially, we performed relevance analyses of the production and injection variables, as well as conditioning the data to suit the problem. As reservoir conditions change over time, concept drift is a priority concern and requires special attention to the observation windows and the periodicity of retraining, which are also objects of study. For the production forecasts, we study supervised learning methods, such as those based on regressions and Neural Networks, to define the most suitable for our application in terms of performance and complexity. In a first step, we evaluate the methodology using synthetic data generated from the UNISIM III compositional simulation model. Next, we applied it to cases of real plays in the Brazilian pre-salt. The expected result is the design of a reliable predictor for reproducing reservoir dynamics, with rapid response, capability of dealing with practical difficulties such as restrictions in wells and processing units, and that can be used in actions to support reservoir management, including the anticipation of deleterious behaviors, optimization of production and injection parameters and the analysis of the effects of probabilistic events, aiming to maximize oil recovery.

[LG-55] Reasoning Steps as Curriculum: Using Depth of Thought as a Difficulty Signal for Tuning LLMs

Link: https://arxiv.org/abs/2508.18279
Authors: Jeesu Jung,Sangkeun Jung
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 3 figures

Abstract:Curriculum learning for training LLMs requires a difficulty signal that aligns with reasoning while remaining scalable and interpretable. We propose a simple premise: tasks that demand deeper depth of thought for humans should also be harder for models. Accordingly, we define difficulty as depth of thought (DoT) and operationalize it by counting the discrete steps in a teacher model’s reasoning trace (e.g., Chain-of-Thought). We then train with a shallow to deep curriculum ordered by this DoT and outline how to derive, validate, and schedule it at scale. Our position yields three testable hypotheses: (i) DoT correlates with conventional difficulty on reasoning benchmarks, (ii) DoT-ordered curricula outperform length- or judge-scored curricula under matched budgets, and (iii) the difficulty is robust across teacher models given light formatting controls. We propose an evaluation framework and discuss threats to validity (teacher style, length confounds) alongside practical mitigations. Taken together, we aim to move toward cognitively grounded, interpretable curricula for reasoning-centric training.
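The depth-of-thought signal described in this abstract could be sketched as below. The abstract does not say how steps are delimited, so this sketch assumes one reasoning step per non-empty line of the teacher trace; `depth_of_thought`, `shallow_to_deep`, and the `"trace"` field are hypothetical names.

```python
def depth_of_thought(trace: str) -> int:
    # Assumption: each non-empty line of the teacher's reasoning trace
    # counts as one discrete step.
    return sum(1 for line in trace.splitlines() if line.strip())

def shallow_to_deep(examples):
    # Order training examples from few reasoning steps to many.
    # sorted() is stable, so ties keep their original relative order.
    return sorted(examples, key=lambda ex: depth_of_thought(ex["trace"]))
```

Counting lines makes the signal sensitive to the teacher's formatting style, which is presumably why the abstract lists teacher style and length confounds among its threats to validity.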

[LG-56] Approximating High-Dimensional Earth Movers Distance as Fast as Closest Pair

Link: https://arxiv.org/abs/2508.06774
Authors: Lorenzo Beretta,Vincent Cohen-Addad,Rajesh Jayaram,Erik Waingarten
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: FOCS 2025

Abstract:We give a reduction from $(1+\varepsilon)$-approximate Earth Mover’s Distance (EMD) to $(1+\varepsilon)$-approximate Closest Pair (CP). As a consequence, we improve the fastest known approximation algorithm for high-dimensional EMD. Here, given $p \in [1, 2]$ and two sets of $n$ points $X, Y \subseteq (\mathbb{R}^d, \ell_p)$, their EMD is the minimum cost of a perfect matching between $X$ and $Y$, where the cost of matching two vectors is their $\ell_p$ distance. Further, CP is the basic problem of finding a pair of points realizing $\min_{x \in X, y \in Y} \|x-y\|_p$. Our contribution is twofold: we show that if a $(1+\varepsilon)$-approximate CP can be computed in time $n^{2-\phi}$, then a $1+O(\varepsilon)$ approximation to EMD can be computed in time $n^{2-\Omega(\phi)}$; plugging in the fastest known algorithm for CP [Alman, Chan, Williams FOCS’16], we obtain a $(1+\varepsilon)$-approximation algorithm for EMD running in time $n^{2-\tilde{\Omega}(\varepsilon^{1/3})}$ for high-dimensional point sets, which improves over the prior fastest running time of $n^{2-\Omega(\varepsilon^2)}$ [Andoni, Zhang FOCS’23]. Our main technical contribution is a sublinear implementation of the Multiplicative Weights Update framework for EMD. Specifically, we demonstrate that the updates can be executed without ever explicitly computing or storing the weights; instead, we exploit the underlying geometric structure to perform the updates implicitly.
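To make the two problems in this abstract concrete, the sketch below computes both exactly by brute force on tiny inputs. It only illustrates the definitions (minimum-cost perfect matching and bichromatic closest pair); it is unrelated to the paper's subquadratic reduction, and the function names are hypothetical.

```python
from itertools import permutations

def lp_dist(x, y, p=2.0):
    # l_p distance between two points given as equal-length tuples.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def emd_bruteforce(X, Y, p=2.0):
    # Minimum cost of a perfect matching between X and Y.
    # Tries all n! matchings, so it is usable only for very small n.
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of points")
    return min(
        sum(lp_dist(x, Y[j], p) for x, j in zip(X, perm))
        for perm in permutations(range(len(Y)))
    )

def closest_pair(X, Y, p=2.0):
    # Smallest distance between a point of X and a point of Y, in O(n^2) time.
    return min(lp_dist(x, y, p) for x in X for y in Y)
```

For `X = [(0, 0), (1, 0)]` and `Y = [(0, 1), (1, 1)]` the straight matching costs 1 + 1 = 2, while the crossed matching costs about 2.83, so the EMD is 2 and the closest pair is at distance 1.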

[LG-57] Echoes of the past: A unified perspective on fading memory and echo states

Link: https://arxiv.org/abs/2508.19145
Authors: Juan-Pablo Ortega,Florian Rossmannek
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Recurrent neural networks (RNNs) have become increasingly popular in information processing tasks involving time series and temporal data. A fundamental property of RNNs is their ability to create reliable input/output responses, often linked to how the network handles its memory of the information it processed. Various notions have been proposed to conceptualize the behavior of memory in RNNs, including steady states, echo states, state forgetting, input forgetting, and fading memory. Although these notions are often used interchangeably, their precise relationships remain unclear. This work aims to unify these notions in a common language, derive new implications and equivalences between them, and provide alternative proofs to some existing results. By clarifying the relationships between these concepts, this research contributes to a deeper understanding of RNNs and their temporal information processing capabilities.

[LG-58] Universal Dynamics with Globally Controlled Analog Quantum Simulators

Link: https://arxiv.org/abs/2508.19075
Authors: Hong-Ye Hu,Abigail McClain Gomez,Liyuan Chen,Aaron Trowbridge,Andy J. Goldschmidt,Zachary Manchester,Frederic T. Chong,Arthur Jaffe,Susanne F. Yelin
Subjects: Quantum Physics (quant-ph); Quantum Gases (cond-mat.quant-gas); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 12 pages, 5 figures

Abstract:Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Recent breakthroughs, such as the coherent control of thousands of atoms, highlight the growing potential for quantum applications at scale. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Crucially, to connect the theoretical possibility with experimental reality, we introduce a new control technique into the experiment - direct quantum optimal control. This method enables the synthesis of complex effective Hamiltonians and allows us to incorporate realistic hardware constraints. To show its practical power, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg atom array. Using the new control framework, we overcome key experimental challenges, including hardware limitations and atom position fluctuations in the non-blockade regime, by identifying smooth, short-duration pulses that achieve high-fidelity dynamics. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our approach. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.

[LG-59] Is attention truly all we need? An empirical study of asset pricing in pretrained RNN sparse and global attention models

Link: https://arxiv.org/abs/2508.19006
Authors: Shanyan Lai
Subjects: Pricing of Securities (q-fin.PR); Machine Learning (cs.LG); Econometrics (econ.EM); Computational Finance (q-fin.CP)
Comments: 55 pages including appendix, 21 figures and 5 tables

Abstract:This study investigates the pretrained RNN attention models with the mainstream attention mechanisms such as additive attention, Luong’s three attentions, global self-attention (Self-att) and sliding window sparse attention (Sparse-att) for the empirical asset pricing research on top 420 large-cap US stocks. This is the first paper on the large-scale state-of-the-art (SOTA) attention mechanisms applied in the asset pricing context. They overcome the limitations of the traditional machine learning (ML) based asset pricing, such as mis-capturing the temporal dependency and short memory. Moreover, the enforced causal masks in the attention mechanisms address the future data leaking issue ignored by the more advanced attention-based models, such as the classic Transformer. The proposed attention models also consider the temporal sparsity characteristic of asset pricing data and mitigate potential overfitting issues by deploying the simplified model structures. This provides some insights for future empirical economic research. All models are examined in three periods, which cover pre-COVID-19 (mild uptrend), COVID-19 (steep uptrend with a large drawdown) and one year post-COVID-19 (sideways movement with high fluctuations), for testing the stability of these models under extreme market conditions. The study finds that in value-weighted portfolio back testing, Model Self-att and Model Sparse-att exhibit great capabilities in deriving the absolute returns and hedging downside risks, while they achieve an annualized Sortino ratio of 2.0 and 1.80 respectively in the period with COVID-19. And Model Sparse-att performs more stably than Model Self-att from the perspective of absolute portfolio returns with respect to the size of stocks’ market capitalization.

[LG-60] The GINN framework: a stochastic QED correspondence for stability and chaos in deep neural networks

Link: https://arxiv.org/abs/2508.18948
Authors: Rodrigo Carmo Terin
Subjects: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 3 figures, 1 table

Abstract:The development of a Euclidean stochastic field-theoretic approach that maps deep neural networks (DNNs) to quantum electrodynamics (QED) with local U(1) symmetry is presented. Neural activations and weights are represented by fermionic matter and gauge fields, with a fictitious Langevin time enabling covariant gauge fixing. This mapping identifies the gauge parameter with kernel design choices in wide DNNs, relating stability thresholds to gauge-dependent amplification factors. Finite-width fluctuations correspond to loop corrections in QED. As a proof of concept, we validate the theoretical predictions through numerical simulations of standard multilayer perceptrons and, in parallel, propose a gauge-invariant neural network (GINN) implementation using magnitude–phase parameterization of weights. Finally, a double-copy replica approach is shown to unify the computation of the largest Lyapunov exponent in stochastic QED and wide DNNs.

[LG-61] Forecasting Probability Distributions of Financial Returns with Deep Neural Networks

Link: https://arxiv.org/abs/2508.18921
Authors: Jakub Michańków
Subjects: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 5 tables

Abstract:This study evaluates deep neural networks for forecasting probability distributions of financial returns. 1D convolutional neural networks (CNN) and Long Short-Term Memory (LSTM) architectures are used to forecast parameters of three probability distributions: Normal, Student’s t, and skewed Student’s t. Using custom negative log-likelihood loss functions, distribution parameters are optimized directly. The models are tested on six major equity indices (S&P 500, BOVESPA, DAX, WIG, Nikkei 225, and KOSPI) using probabilistic evaluation metrics including Log Predictive Score (LPS), Continuous Ranked Probability Score (CRPS), and Probability Integral Transform (PIT). Results show that deep learning models provide accurate distributional forecasts and perform competitively with classical GARCH models for Value-at-Risk estimation. The LSTM with skewed Student’s t distribution performs best across multiple evaluation criteria, capturing both heavy tails and asymmetry in financial returns. This work shows that deep neural networks are viable alternatives to traditional econometric models for financial risk assessment and portfolio management.
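As an illustration of a negative log-likelihood loss over forecast distribution parameters, here is a minimal sketch for the Normal case only. It assumes the network emits a mean and a log standard deviation per observation (a common parameterization, not confirmed by the abstract); the Student's t variants are not shown, and the function name is hypothetical.

```python
import math

def normal_nll(returns, mus, log_sigmas):
    # Average negative log-likelihood of the observed returns under
    # per-observation Normal(mu, sigma^2) forecasts.
    # Using log(sigma) as the raw output keeps sigma positive.
    total = 0.0
    for r, mu, log_sigma in zip(returns, mus, log_sigmas):
        sigma = math.exp(log_sigma)
        z = (r - mu) / sigma
        total += 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * z * z
    return total / len(returns)
```

Minimizing this quantity with respect to the network weights fits the forecast mean and volatility jointly. A forecast whose mean is far from the realized return, or whose sigma is badly over- or under-stated, is penalized through the `z * z` and `log_sigma` terms respectively.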

[LG-62] Sparse minimum Redundancy Maximum Relevance for feature selection

Link: https://arxiv.org/abs/2508.18901
Authors: Peter Naylor,Benjamin Poignard,Héctor Climente-González,Makoto Yamada
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where the parameters estimated as zero coefficients represent the set of inactive features. We establish the conditions under which zero coefficients are correctly identified to guarantee accurate recovery of inactive features. We introduce a multi-stage procedure based on the knockoff filter enabling the penalized mRMR to discard inactive features while controlling the false discovery rate (FDR). Our method performs comparably to HSIC-LASSO but is more conservative in the number of selected features. It only requires setting an FDR threshold, rather than specifying the number of features to retain. The effectiveness of the method is illustrated through simulations and real-world datasets. The code to reproduce this work is available on the following GitHub: this https URL.

[LG-63] Temperature-Aware Recurrent Neural Operator for Temperature-Dependent Anisotropic Plasticity in HCP Materials

Link: https://arxiv.org/abs/2508.18806
Authors: Yannick Hollenweger,Dennis M. Kochman,Burigede Liu
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Neural network surrogate models for constitutive laws in computational mechanics have been in use for some time. In plasticity, these models often rely on gated recurrent units (GRUs) or long short-term memory (LSTM) cells, which excel at capturing path-dependent phenomena. However, they suffer from long training times and time-resolution-dependent predictions that extrapolate poorly. Moreover, most existing surrogates for macro- or mesoscopic plasticity handle only relatively simple material behavior. To overcome these limitations, we introduce the Temperature-Aware Recurrent Neural Operator (TRNO), a time-resolution-independent neural architecture. We apply the TRNO to model the temperature-dependent plastic response of polycrystalline magnesium, which shows strong plastic anisotropy and thermal sensitivity. The TRNO achieves high predictive accuracy and generalizes effectively across diverse loading cases, temperatures, and time resolutions. It also outperforms conventional GRU and LSTM models in training efficiency and predictive performance. Finally, we demonstrate multiscale simulations with the TRNO, yielding a speedup of at least three orders of magnitude over traditional constitutive models.

[LG-64] Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits

Link: https://arxiv.org/abs/2508.18768
Authors: Mengmeng Li,Philipp Schneider,Jelisaveta Aleksić,Daniel Kuhn
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We introduce the first best-of-both-worlds algorithm for contextual combinatorial semi-bandits that simultaneously guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime and $\widetilde{\mathcal{O}}(\ln T)$ regret in the corrupted stochastic regime. Our approach builds on the Follow-the-Regularized-Leader (FTRL) framework equipped with a Shannon entropy regularizer, yielding a flexible method that admits efficient implementations. Beyond regret bounds, we tackle the practical bottleneck in FTRL (or, equivalently, Online Stochastic Mirror Descent) arising from the high-dimensional projection step encountered in each round of interaction. By leveraging the Karush-Kuhn-Tucker conditions, we transform the $K$-dimensional convex projection problem into a single-variable root-finding problem, dramatically accelerating each round. Empirical evaluations demonstrate that this combined strategy not only attains the attractive regret bounds of best-of-both-worlds algorithms but also delivers substantial per-round speed-ups, making it well-suited for large-scale, real-time applications.
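The reduction from a $K$-dimensional projection to one-dimensional root-finding can be illustrated on a simple special case. The sketch below is not the paper's algorithm: it assumes a non-contextual m-set action set, i.e. it minimizes $\eta \langle L, x \rangle + \sum_i (x_i \ln x_i - x_i)$ over $\{x \in [0,1]^K : \sum_i x_i = m\}$ with $\eta > 0$. The KKT conditions give $x_i = \min(1, \exp(-\eta L_i - \lambda))$, so only the scalar multiplier $\lambda$ has to be found, here by bisection. All names are hypothetical.

```python
import math

def capped_entropy_projection(cum_losses, eta, m, iters=200):
    # Solve for the multiplier lam such that
    # sum_i min(1, exp(-eta * L_i - lam)) == m, then return x.
    K = len(cum_losses)
    if not 0 < m <= K:
        raise ValueError("need 0 < m <= K")

    def x_of(lam):
        out = []
        for L in cum_losses:
            arg = -eta * L - lam
            # Cap at 1; this also avoids overflow in exp for large arg.
            out.append(1.0 if arg >= 0.0 else math.exp(arg))
        return out

    # At lo every coordinate is capped at 1, so the sum is K >= m.
    lo = -eta * max(cum_losses)
    # At hi every coordinate is at most m / K, so the sum is <= m.
    hi = -eta * min(cum_losses) + math.log(K / m)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(x_of(mid)) > m:
            lo = mid  # sum too large: increase lam
        else:
            hi = mid
    return x_of(0.5 * (lo + hi))
```

The sum is non-increasing in $\lambda$, which is what makes bisection valid. Each evaluation costs $O(K)$, so a fixed number of bisection steps replaces a generic convex solver over $K$ variables.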

[LG-65] Data-Driven Discovery and Formulation Refines the Quasi-Steady Model of Flapping-Wing Aerodynamics

Link: https://arxiv.org/abs/2508.18703
Authors: Yu Kamimizu,Hao Liu,Toshiyuki Nakata
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: 27 pages, 13 figures

Abstract:Insects control unsteady aerodynamic forces on flapping wings to navigate complex environments. While understanding these forces is vital for biology, physics, and engineering, existing evaluation methods face trade-offs: high-fidelity simulations are computationally or experimentally expensive and lack explanatory power, whereas theoretical models based on quasi-steady assumptions offer insights but exhibit low accuracy. To overcome these limitations and thus enhance the accuracy of quasi-steady aerodynamic models, we applied a data-driven approach involving discovery and formulation of previously overlooked critical mechanisms. Through selection from 5,000 candidate kinematic functions, we identified mathematical expressions for three key additional mechanisms – the effect of advance ratio, effect of spanwise kinematic velocity, and rotational Wagner effect – which had been qualitatively recognized but were not formulated. Incorporating these mechanisms considerably reduced the prediction errors of the quasi-steady model using the computational fluid dynamics results as the ground truth, both in hawkmoth forward flight (at high Reynolds numbers) and fruit fly maneuvers (at low Reynolds numbers). The data-driven quasi-steady model enables rapid aerodynamic analysis, serving as a practical tool for understanding evolutionary adaptations in insect flight and developing bio-inspired flying robots.

[LG-66] ModAn-MulSupCon: Modality-and Anatomy-Aware Multi-Label Supervised Contrastive Pretraining for Medical Imaging

Link: https://arxiv.org/abs/2508.18613
Authors: Eichi Takaya,Ryusei Inamori
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Background and objective: Expert annotations limit large-scale supervised pretraining in medical imaging, while ubiquitous metadata (modality, anatomical region) remain underused. We introduce ModAn-MulSupCon, a modality- and anatomy-aware multi-label supervised contrastive pretraining method that leverages such metadata to learn transferable representations. Method: Each image’s modality and anatomy are encoded as a multi-hot vector. A ResNet-18 encoder is pretrained on a mini subset of RadImageNet (miniRIN, 16,222 images) with a Jaccard-weighted multi-label supervised contrastive loss, and then evaluated by fine-tuning and linear probing on three binary classification tasks: ACL tear (knee MRI), lesion malignancy (breast ultrasound), and nodule malignancy (thyroid ultrasound). Result: With fine-tuning, ModAn-MulSupCon achieved the best AUC on MRNet-ACL (0.964) and Thyroid (0.763), surpassing all baselines ($p < 0.05$), and ranked second on Breast (0.926) behind SimCLR (0.940; not significant). With the encoder frozen, SimCLR/ImageNet were superior, indicating that ModAn-MulSupCon representations benefit most from task adaptation rather than linear separability. Conclusion: Encoding readily available modality/anatomy metadata as multi-label targets provides a practical, scalable pretraining signal that improves downstream accuracy when fine-tuning is feasible. ModAn-MulSupCon is a strong initialization for label-scarce clinical settings, whereas SimCLR/ImageNet remain preferable for frozen-encoder deployments.
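One plausible reading of a Jaccard-weighted multi-label supervised contrastive loss is sketched below. The abstract does not give the formula, so this is an assumption: each other sample in the batch counts as a positive for the anchor with weight equal to the Jaccard similarity of their multi-hot label vectors, and embeddings are taken to be L2-normalized already. The function names are hypothetical.

```python
import math

def jaccard(a, b):
    # Jaccard similarity of two multi-hot label vectors.
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def jaccard_weighted_supcon(embeddings, labels, temperature=0.1):
    # Assumed form: for each anchor i, a Jaccard-weighted average of
    # -log softmax over similarities to all other samples in the batch.
    n = len(embeddings)
    total, used = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
            for j in others
        ]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        weights = [jaccard(labels[i], labels[j]) for j in others]
        wsum = sum(weights)
        if wsum == 0.0:
            continue  # anchor shares no label with any other sample
        total -= sum(w * (l - log_denom) for w, l in zip(weights, logits)) / wsum
        used += 1
    return total / used if used else 0.0
```

With multi-hot vectors covering both modality and anatomy, two images that share only the modality get a fractional weight, so they are pulled together less strongly than images sharing both attributes.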

[LG-67] Stress-testing cross-cancer generalizability of 3D nnU-Net for PET-CT tumor segmentation: multi-cohort evaluation with novel oesophageal and lung cancer datasets

Link: https://arxiv.org/abs/2508.18612
Authors: Soumen Ghosh,Christine Jestin Hannan,Rajat Vashistha,Parveen Kundu,Sandra Brosda,Lauren G.Aoude,James Lonie,Andrew Nathanson,Jessica Ng,Andrew P. Barbour,Viktor Vegh
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Robust generalization is essential for deploying deep learning based tumor segmentation in clinical PET-CT workflows, where anatomical sites, scanners, and patient populations vary widely. This study presents the first cross cancer evaluation of nnU-Net on PET-CT, introducing two novel, expert-annotated whole-body datasets: 279 patients with oesophageal cancer (Australian cohort) and 54 with lung cancer (Indian cohort). These cohorts complement the public AutoPET dataset and enable systematic stress-testing of cross domain performance. We trained and tested 3D nnUNet models under three paradigms: target only (oesophageal), public only (AutoPET), and combined training. On the test sets, the oesophageal only model achieved the best in-domain accuracy (mean DSC, 57.8) but failed on the external Indian lung cohort (mean DSC less than 3.4), indicating severe overfitting. The public only model generalized more broadly (mean DSC, 63.5 on AutoPET, 51.6 on the Indian lung cohort) but underperformed on the oesophageal Australian cohort (mean DSC, 26.7). The combined approach provided the most balanced results (mean DSC: lung 52.9, oesophageal 40.7, AutoPET 60.9), reducing boundary errors and improving robustness across all cohorts. These findings demonstrate that dataset diversity, particularly multi demographic, multi center and multi cancer integration, outweighs architectural novelty as the key driver of robust generalization. This work presents the demography based cross cancer deep learning segmentation evaluation and highlights dataset diversity, rather than model complexity, as the foundation for clinically robust segmentation.

[LG-68] Revisiting Follow-the-Perturbed-Leader with Unbounded Perturbations in Bandit Problems

Link: https://arxiv.org/abs/2508.18604
Authors: Jongyeong Lee, Junya Honda, Shinji Ito, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Follow-the-Regularized-Leader (FTRL) policies have achieved Best-of-Both-Worlds (BOBW) results in various settings through hybrid regularizers, whereas analogous results for Follow-the-Perturbed-Leader (FTPL) remain limited due to inherent analytical challenges. To advance the analytical foundations of FTPL, we revisit the classical FTRL-FTPL duality for unbounded perturbations and establish BOBW results for FTPL under a broad family of asymmetric unbounded Fréchet-type perturbations, including hybrid perturbations combining Gumbel-type and Fréchet-type tails. These results not only extend the BOBW results of FTPL but also offer new insights into designing alternative FTPL policies competitive with hybrid regularization approaches. Motivated by earlier observations in two-armed bandits, we further investigate the connection between the 1/2-Tsallis entropy and a Fréchet-type perturbation. Our numerical observations suggest that it corresponds to a symmetric Fréchet-type perturbation, and based on this, we establish the first BOBW guarantee for symmetric unbounded perturbations in the two-armed setting. In contrast, in general multi-armed bandits, we find an instance in which symmetric Fréchet-type perturbations violate the key condition for standard BOBW analysis, a problem not observed with asymmetric or nonnegative Fréchet-type perturbations. Although this example does not rule out alternative analyses achieving BOBW results, it suggests the limitations of directly applying the relationship observed in two-armed cases to the general case and thus emphasizes the need for further investigation to fully understand the behavior of FTPL in broader settings.
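For readers unfamiliar with FTPL, the arm-selection step can be sketched as follows: perturb each arm's cumulative loss estimate with an independent random draw and play the arm with the smallest perturbed value. The sketch assumes a nonnegative Fréchet perturbation with shape 2 and a fixed learning rate; the function names and parameters are hypothetical, and the loss-estimation step and learning-rate schedule that a full bandit algorithm needs are deliberately omitted. It is not the specific policy analyzed in the paper.

```python
import math
import random
from typing import Sequence


def sample_frechet(rng: random.Random, shape: float = 2.0) -> float:
    """Draw from a Fréchet distribution with CDF exp(-x**(-shape)) for x > 0."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    # Inverse-CDF sampling: solve exp(-x**(-shape)) = u for x.
    return (-math.log(u)) ** (-1.0 / shape)


def ftpl_select_arm(
    cum_loss_estimates: Sequence[float],
    learning_rate: float,
    rng: random.Random,
    shape: float = 2.0,
) -> int:
    """Return the index of the arm with the smallest perturbed cumulative loss."""
    perturbed = [
        loss - sample_frechet(rng, shape) / learning_rate
        for loss in cum_loss_estimates
    ]
    return min(range(len(perturbed)), key=perturbed.__getitem__)


# Hypothetical usage: three arms with made-up cumulative loss estimates.
rng = random.Random(0)
arm = ftpl_select_arm([3.0, 2.5, 4.0], learning_rate=0.5, rng=rng)
```

A small learning rate makes the perturbation dominate (more exploration), while a large one makes the policy follow the current leader almost deterministically.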

[LG-69] An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing

Link: https://arxiv.org/abs/2508.18513
Authors: Yusi Wei, Hande Y. Benson, Muge Capan
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments:

Abstract:The secondary use of healthcare data is vital for research and clinical innovation, but it raises concerns about patient privacy. This study investigates how to balance privacy preservation and data utility in healthcare data sharing, considering the perspectives of both data providers and data users. Using a dataset of adult patients hospitalized between 2013 and 2015, we predict whether sepsis was present at admission or developed during the hospital stay. We identify sub-populations, such as older adults, frequently hospitalized patients, and racial minorities, that are especially vulnerable to privacy attacks due to their unique combinations of demographic and healthcare utilization attributes. These groups are also critical for machine learning (ML) model performance. We evaluate three anonymization methods (k-anonymity, the technique by Zheng et al., and the MO-OBAM model) based on their ability to reduce re-identification risk while maintaining ML utility. Results show that k-anonymity offers limited protection. The methods of Zheng et al. and MO-OBAM provide stronger privacy safeguards, with MO-OBAM yielding the best utility outcomes: only a 2% change in precision and recall compared to the original dataset. This work provides actionable insights for healthcare organizations on how to share data responsibly. It highlights the need for anonymization methods that protect vulnerable populations without sacrificing the performance of data-driven models.
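To make the k-anonymity notion in the abstract concrete, the sketch below measures the k level of a small table: the size of the smallest group of records sharing the same quasi-identifier values. Records in a group of size 1 are the "unique combinations" that are easiest to re-identify. The column names and records are invented for illustration and are not taken from the study's dataset; the other two methods named in the abstract are not sketched here.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def group_sizes(
    records: Sequence[Record], quasi_identifiers: Sequence[str]
) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity_level(
    records: Sequence[Record], quasi_identifiers: Sequence[str]
) -> int:
    """Largest k for which the table is k-anonymous (0 for an empty table)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_combinations(
    records: Sequence[Record], quasi_identifiers: Sequence[str]
) -> List[Tuple[str, ...]]:
    """Quasi-identifier combinations held by exactly one record."""
    sizes = group_sizes(records, quasi_identifiers)
    return [combo for combo, n in sizes.items() if n == 1]


# Hypothetical toy table with made-up attribute names and values.
records = [
    {"age_band": "40-59", "race": "A", "admissions": "1"},
    {"age_band": "40-59", "race": "A", "admissions": "1"},
    {"age_band": "80+", "race": "B", "admissions": "5+"},
]
qi = ["age_band", "race", "admissions"]
k = k_anonymity_level(records, qi)          # smallest group has one record
at_risk = unique_combinations(records, qi)  # the single older, frequently admitted patient
```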

[LG-70] Huracan: A skillful end-to-end data-driven system for ensemble data assimilation and weather prediction

Link: https://arxiv.org/abs/2508.18486
Authors: Zekun Ni, Jonathan Weyn, Hang Zhang, Yanfei Xiang, Jiang Bian, Weixin Jin, Kit Thambiratnam, Qi Zhang, Haiyu Dong, Hongyu Sun
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Abstract:Over the past few years, machine learning-based data-driven weather prediction has been transforming operational weather forecasting by providing more accurate forecasts while using a mere fraction of computing power compared to traditional numerical weather prediction (NWP). However, those models still rely on initial conditions from NWP, putting an upper limit on their forecast abilities. A few end-to-end systems have since been proposed, but they have yet to match the forecast skill of state-of-the-art NWP competitors. In this work, we propose Huracan, an observation-driven weather forecasting system which combines an ensemble data assimilation model with a forecast model to produce highly accurate forecasts relying only on observations as inputs. Huracan is not only the first to provide ensemble initial conditions and end-to-end ensemble weather forecasts, but also the first end-to-end system to achieve an accuracy comparable with that of ECMWF ENS, the state-of-the-art NWP competitor, despite using a smaller amount of available observation data. Notably, Huracan matches or exceeds the continuous ranked probability score of ECMWF ENS on 75.4% of the variable and lead time combinations. Our work is a major step forward in end-to-end data-driven weather prediction and opens up opportunities for further improving and revolutionizing operational weather forecasting.
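The abstract compares systems by continuous ranked probability score (CRPS), where lower is better. A minimal sketch of one common empirical estimator for a finite ensemble is given below: the mean absolute error of the members minus half the mean absolute difference between member pairs. Which estimator the paper actually uses (for example, a "fair" variant with a different normalization) is not stated in the abstract, so treat this as an assumption.

```python
from typing import Sequence


def ensemble_crps(members: Sequence[float], observation: float) -> float:
    """Empirical CRPS of an ensemble forecast for one scalar observation."""
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    # Average distance between ensemble members and the observation.
    error_term = sum(abs(x - observation) for x in members) / m
    # Average distance between all member pairs (rewards ensemble spread).
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return error_term - spread_term


# Hypothetical two-member ensemble bracketing the observation.
crps_value = ensemble_crps([1.0, 3.0], observation=2.0)  # 1.0 - 0.5
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of MAE.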

[LG-71] From Prediction to Simulation: AlphaFold 3 as a Differentiable Framework for Structural Biology

Link: https://arxiv.org/abs/2508.18446
Authors: Alireza Abbaszadeh, Armita Shahlaee
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 37 pages, 5 figures. A perspective article on the conceptual advances of AlphaFold 3 and its paradigm shift toward differentiable simulation in structural biology

Abstract:AlphaFold 3 represents a transformative advancement in computational biology, enhancing protein structure prediction through novel multi-scale transformer architectures, biologically informed cross-attention mechanisms, and geometry-aware optimization strategies. These innovations dramatically improve predictive accuracy and generalization across diverse protein families, surpassing previous methods. Crucially, AlphaFold 3 embodies a paradigm shift toward differentiable simulation, bridging traditional static structural modeling with dynamic molecular simulations. By reframing protein folding predictions as a differentiable process, AlphaFold 3 serves as a foundational framework for integrating deep learning with physics-based molecular

[LG-72] Deterministic Coreset Construction via Adaptive Sensitivity Trimming

Link: https://arxiv.org/abs/2508.18340
Authors: Faruk Alpay, Taylan Alpay
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 6 pages, 5 algorithms, 1 table

Abstract:We develop a rigorous framework for deterministic coreset construction in empirical risk minimization (ERM). Our central contribution is the Adaptive Deterministic Uniform-Weight Trimming (ADUWT) algorithm, which constructs a coreset by excising points with the lowest sensitivity bounds and applying a data-dependent uniform weight to the remainder. The method yields a uniform $(1\pm\varepsilon)$ relative-error approximation for the ERM objective over the entire hypothesis space. We provide complete analysis, including (i) a minimax characterization proving the optimality of the adaptive weight, (ii) an instance-dependent size analysis in terms of a Sensitivity Heterogeneity Index, and (iii) tractable sensitivity oracles for kernel ridge regression, regularized logistic regression, and linear SVM. Reproducibility is supported by precise pseudocode for the algorithm, sensitivity oracles, and evaluation pipeline. Empirical results align with the theory. We conclude with open problems on instance-optimal oracles, deterministic streaming, and fairness-constrained ERM.

Information Retrieval

[IR-0] Extracting Information from Scientific Literature via Visual Table Question Answering Models

Link: https://arxiv.org/abs/2508.18661
Authors: Dongyoun Kim, Hyung-do Choi, Youngsun Jang, John Kim
Categories: Information Retrieval (cs.IR)
Comments: Accepted at ACM International Conference on Research in Adaptive and Convergent Systems, November 5-8, 2024, Pompei, Italy

Abstract:This study explores three approaches to processing table data in scientific papers to enhance extractive question answering and develop a software tool for the systematic review process. The methods evaluated include: (1) Optical Character Recognition (OCR) for extracting information from documents, (2) pre-trained models for document visual question answering, and (3) table detection and structure recognition to extract and merge key information from tables with textual content to answer extractive questions. In exploratory experiments, we augmented ten sample test documents containing tables and relevant content against RF-EMF-related scientific papers with seven predefined extractive question-answer pairs. The results indicate that approaches preserving table structure outperform the others, particularly in representing and organizing table content. Accurately recognizing specific notations and symbols within the documents emerged as a critical factor for improved results. Our study concludes that preserving the structural integrity of tables is essential for enhancing the accuracy and reliability of extractive question answering in scientific documents.

[IR-1] REALM: Recursive Relevance Modeling for LLM-based Document Re-Ranking EMNLP2025

Link: https://arxiv.org/abs/2508.18379
Authors: Pinhuan Wang, Zhiqiu Xia, Chunhua Liao, Feiyi Wang, Hang Liu
Categories: Information Retrieval (cs.IR)
Comments: Accepted to EMNLP 2025 (Main Conference). 13 pages, 2 figures

Abstract:Large Language Models (LLMs) have shown strong capabilities in document re-ranking, a key component in modern Information Retrieval (IR) systems. However, existing LLM-based approaches face notable limitations, including ranking uncertainty, unstable top-k recovery, and high token cost due to token-intensive prompting. To effectively address these limitations, we propose REALM, an uncertainty-aware re-ranking framework that models LLM-derived relevance as Gaussian distributions and refines them through recursive Bayesian updates. By explicitly capturing uncertainty and minimizing redundant queries, REALM achieves better rankings more efficiently. Experimental results demonstrate that our REALM surpasses state-of-the-art re-rankers while significantly reducing token usage and latency, promoting it as the next-generation re-ranker for modern IR systems.
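The idea of modeling relevance as a Gaussian and refining it with recursive Bayesian updates can be illustrated with the textbook conjugate update for a Gaussian prior and a Gaussian-noise observation. This is only a generic sketch: the class and function names, the known observation variance, and the scalar relevance scores are assumptions, and the abstract does not specify REALM's actual update rule or how LLM outputs are turned into observations.

```python
from dataclasses import dataclass


@dataclass
class GaussianBelief:
    """Belief about one document's relevance: mean and variance."""
    mean: float
    variance: float


def bayes_update(
    belief: GaussianBelief, observation: float, obs_variance: float
) -> GaussianBelief:
    """Combine a Gaussian prior with one noisy relevance observation."""
    # Precisions (inverse variances) add under the conjugate Gaussian model.
    precision = 1.0 / belief.variance + 1.0 / obs_variance
    # Posterior mean is the precision-weighted average of prior and observation.
    mean = (belief.mean / belief.variance + observation / obs_variance) / precision
    return GaussianBelief(mean=mean, variance=1.0 / precision)


# Hypothetical usage: two noisy relevance judgments for the same document.
belief = GaussianBelief(mean=0.0, variance=1.0)
belief = bayes_update(belief, observation=1.0, obs_variance=1.0)  # mean 0.5, variance 0.5
belief = bayes_update(belief, observation=1.0, obs_variance=1.0)  # variance shrinks again
```

Because the variance shrinks with each observation, a re-ranker built this way could stop querying documents whose rank is already settled, which matches the abstract's stated goal of minimizing redundant queries.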

Attachment Download

Click to download today's full paper list