Today's arXiv Papers | 2025-09-15

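This post lists the latest papers retrieved from arXiv.org on 2025-09-15. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.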

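[Quick read]: This paper addresses the performance drop of pretrained automatic speech recognition (ASR) models on unseen vocabulary and domain-specific parlance, and asks how to adapt them to a new domain when speech data cannot be collected. It proposes WhisTLE, which trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder with deep supervision through this learned text-to-latent encoder, so adaptation needs text only. At inference the original encoder is restored, so there is no extra runtime cost. WhisTLE can also be combined with text-to-speech (TTS) adaptation: with TTS it reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.

Link: https://arxiv.org/abs/2509.10452
Authors: Akshat Pandey, Karun Kumar, Raphael Tang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures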

Abstract:Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
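
The abstract reports results as a relative word error rate (WER) reduction. As a minimal sketch of what that metric means, the snippet below computes WER as word-level edit distance divided by reference length, and then a relative reduction between two WER values. The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction; 0.123 corresponds to 12.3% relative."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A relative reduction of 12.3% means the adapted WER is 87.7% of the TTS-only WER, whatever the absolute values are.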

[NLP-1] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

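[Quick read]: This paper targets two weaknesses of open large language models (LLMs) used as deep search agents with browsing tools: limited long-horizon reasoning, and a lack of sufficiently difficult supervised data. DeepDive addresses them in two ways. First, it automatically synthesizes complex, difficult, and hard-to-find questions from open knowledge graphs to build more challenging training data. Second, it applies end-to-end multi-turn reinforcement learning (RL) to strengthen long-horizon reasoning during deep search. DeepDive-32B achieves a new competitive result among open-source models on BrowseComp, multi-turn RL contributes significantly to the gains across several benchmarks, and the approach supports test-time scaling of tool calls and parallel sampling.

Link: https://arxiv.org/abs/2509.10446
Authors: Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Affiliations: Tsinghua University; Northeastern University
Subjects: Computation and Language (cs.CL)
Comments: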

Abstract:Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at this https URL.

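[NLP-2] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

[Quick read]: This paper aims to improve the reasoning and problem-solving capabilities of large language models (LLMs), particularly on complex multi-domain coding tasks. It proposes a cloud-edge collaborative architecture built around a multi-agent prompting framework with a division of labor: a lightweight GuideLLM deployed at the edge provides methodological guidance, a more powerful SolverLLM in the cloud generates the code solution, and a JudgeLLM automatically evaluates the correctness and quality of the result. The paper also introduces RefactorCoderQA, a multi-domain coding benchmark, on which the fine-tuned RefactorCoder-MoE model reaches 76.84% overall accuracy, ahead of leading open-source and commercial baselines.

Link: https://arxiv.org/abs/2509.10436
Authors: Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
Affiliations: Algoma University; Carleton University; Brandon University
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, submitted to IEEE Transactions on Services Computing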

Abstract:To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
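
The three-role flow described in the abstract can be sketched as a simple chain of calls. This is only an illustration of the division of labor: the `LLM` callable interface, the prompt wording, and the stub models are all assumptions made so that the example runs without any model or API access.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a model is any function from prompt text to reply text.
LLM = Callable[[str], str]


@dataclass
class Result:
    guidance: str
    solution: str
    verdict: str


def solve(question: str, guide: LLM, solver: LLM, judge: LLM) -> Result:
    """Run guide -> solver -> judge in sequence, passing outputs forward."""
    guidance = guide(f"Outline an approach for: {question}")
    solution = solver(f"Question: {question}\nGuidance: {guidance}\nWrite the code.")
    verdict = judge(f"Question: {question}\nSolution: {solution}\nIs it correct?")
    return Result(guidance, solution, verdict)


# Stub models standing in for the edge, cloud, and evaluator components.
def stub_guide(prompt: str) -> str:
    return "Use a dictionary to count occurrences."


def stub_solver(prompt: str) -> str:
    return "counts = {}\nfor x in xs:\n    counts[x] = counts.get(x, 0) + 1"


def stub_judge(prompt: str) -> str:
    return "correct"


result = solve("Count how often each item appears in a list.",
               stub_guide, stub_solver, stub_judge)
print(result.verdict)
```

Keeping the three roles behind one callable type makes it easy to swap a small local model in for the guide while the solver stays remote, which is the split the abstract describes.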

[NLP-3] Long Context Automated Essay Scoring with Language Models

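[Quick read]: Transformer-based language models can only process inputs up to a fixed maximum length, while essays written by higher-grade students often exceed that limit. In Automated Essay Scoring (AES) the usual workaround is to truncate the essay, which undermines the model's ability to assess rubric dimensions that depend on the whole text, such as organization. The study evaluates fine-tuned models that modify the standard Transformer architecture to overcome the length limit, namely XLNet, Longformer, ModernBERT, Mamba, and Llama, on the Kaggle ASAP 2.0 dataset. These models rely on long-sequence mechanisms such as local-global attention, sliding-window attention, or state-space models, so an essay can be scored with its full context intact.

Link: https://arxiv.org/abs/2509.10417
Authors: Christopher Ormerod, Gitit Kehat
Affiliations: Cambium Assessment Inc.
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, 2 tables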

Abstract:Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model’s ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.

[NLP-4] Is In-Context Learning Learning?

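[Quick read]: This paper addresses the debate over whether in-context learning (ICL) really constitutes learning, that is, whether a model can generalise to unseen tasks from a few exemplars in the prompt without further training. It argues that, mathematically, ICL does constitute learning, but that a full characterisation requires empirical work, and it runs a large-scale analysis that ablates or accounts for memorisation, pretraining, distributional shifts, and prompting style and phrasing. ICL turns out to be an effective learning paradigm with limited ability to generalise to unseen tasks. As exemplars become more numerous, accuracy becomes insensitive to exemplar distribution, model, prompt style, and the input's linguistic features; the model instead deduces patterns from regularities in the prompt, which makes it sensitive to distributional shifts, especially with prompting styles such as chain-of-thought. The author concludes that this ad-hoc encoding is not a robust mechanism and suggests limited all-purpose generalisability.

Link: https://arxiv.org/abs/2509.10414
Authors: Adrian de Wynter
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Director’s cut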

Abstract:In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models’ ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input’s linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression’s ad-hoc encoding is not a robust mechanism, and suggests limited all-purpose generalisability.

[NLP-5] Abduct Act Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems

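[Quick read]: This paper tackles failure attribution in multi-agent systems: pinpointing the exact step at which a decisive error occurs. Current methods treat the problem as pattern recognition over long conversation logs and reach very low step-level accuracy (below 17%), which makes them impractical for debugging complex systems. Their core weakness is the lack of robust counterfactual reasoning, i.e. they cannot determine whether correcting a single action would actually have averted the failure. The proposed Abduct-Act-Predict (A2P) Scaffolding reframes failure attribution as a structured causal inference task carried out in three steps within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent's actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify that the intervention resolves the failure. This keeps the full conversation context while imposing a rigorous causal logic, and it substantially improves attribution accuracy.

Link: https://arxiv.org/abs/2509.10401
Authors: Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Yue Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: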

Abstract:Failure attribution in multi-agent systems – pinpointing the exact step where a decisive error occurs – is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent’s actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model’s analysis. Our extensive experiments on the Who&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46% step-level accuracy, a 2.85× improvement over the 16.67% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31% step accuracy, a 2.43× improvement over the baseline’s 12.07%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution.
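
The improvement factors quoted in the abstract are ratios of the reported step-level accuracies. The short check below recomputes them from the four percentages given above; the helper name is an illustrative assumption.

```python
def improvement_factor(new_accuracy: float, baseline_accuracy: float) -> float:
    """Ratio of the new accuracy to the baseline accuracy."""
    return new_accuracy / baseline_accuracy


# Step-level accuracies (in percent) as reported in the abstract.
algorithm_generated = improvement_factor(47.46, 16.67)
hand_crafted = improvement_factor(29.31, 12.07)

print(f"{algorithm_generated:.2f}x")  # 2.85x
print(f"{hand_crafted:.2f}x")         # 2.43x
```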

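[NLP-6] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs EMNLP2025

[Quick read]: Sparse Mixture-of-Experts (SMoE) large language models (LLMs) activate only a few experts per token but still have to load all expert parameters, which leads to high memory usage and makes deployment difficult. Earlier work reduces this overhead by pruning and merging at the expert level and leaves neuron-level structure underexplored. The paper proposes DERN (Dropping Experts, Recombining Neurons), a task-agnostic, retraining-free framework with three steps: prune redundant experts using router statistics; decompose the pruned experts into neuron-level expert segments and assign each segment to its most compatible retained expert; and merge the segments within each retained expert into a compact representation. This mitigates neuron-level misalignment and semantic conflicts between experts. On Mixtral, Qwen, and DeepSeek SMoE models, DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity without extra training, while greatly reducing the number of experts and memory usage.

Link: https://arxiv.org/abs/2509.10377
Authors: Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, zhiliang wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
Affiliations: Zhejiang University; Shanghai Innovation Institute; Southeast University; Shanghai AI Laboratory; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP2025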

Abstract:Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
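
To make the first two steps of the abstract concrete, here is a toy sketch of pruning by router statistics and then assigning each neuron of a dropped expert to a retained expert. The abstract does not specify how compatibility is measured, so cosine similarity to the mean neuron of each retained expert is an assumption made here, as are all names and the tiny 2-dimensional data. The merging step is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def prune_experts(router_counts, keep):
    """Keep the `keep` most frequently routed experts; return (retained, dropped)."""
    ranked = sorted(router_counts, key=router_counts.get, reverse=True)
    return ranked[:keep], ranked[keep:]


def assign_segments(experts, retained, dropped):
    """Map each neuron of a dropped expert to the most similar retained expert.

    Assumed compatibility measure: cosine similarity to the retained
    expert's mean neuron (not taken from the paper).
    """
    centroids = {e: mean_vector(experts[e]) for e in retained}
    assignment = {}
    for d in dropped:
        for index, neuron in enumerate(experts[d]):
            best = max(retained, key=lambda e: cosine(neuron, centroids[e]))
            assignment[(d, index)] = best
    return assignment


# Toy data: three experts with two 2-dimensional "neurons" each.
experts = {
    "e0": [[1.0, 0.0], [0.9, 0.1]],
    "e1": [[0.0, 1.0], [0.1, 0.9]],
    "e2": [[1.0, 0.2], [0.1, 1.0]],
}
router_counts = {"e0": 120, "e1": 95, "e2": 10}  # how often each expert was selected

retained, dropped = prune_experts(router_counts, keep=2)
print(assign_segments(experts, retained, dropped))
```

In this toy case the two neurons of the dropped expert go to different retained experts, which is the situation that motivates working at the neuron level instead of merging whole experts.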

[NLP-7] SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning

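[Quick read]: This paper addresses unfaithful responses that large language models (LLMs) produce in knowledge-intensive tasks because of knowledge conflict, i.e. a preference for internal parametric knowledge over the information provided in context. It proposes a self-improving framework, Self Improving Faithfulness Aware Contrastive Tuning (SI FACT). A self-instruct mechanism lets the base LLM automatically generate high-quality, structured contrastive learning data consisting of anchor samples, semantically equivalent positive samples, and negative samples that simulate unfaithful responses. Contrastive learning then trains the model to pull faithful responses closer together and push unfaithful responses farther apart in representation space, which improves contextual faithfulness while greatly reducing the cost of manual annotation.

Link: https://arxiv.org/abs/2509.10208
Authors: Shengqiang Fu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: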

Abstract:Large Language Models often generate unfaithful responses in knowledge intensive tasks due to knowledge conflict, that is, a preference for relying on internal parametric knowledge rather than the provided context. To address this issue, we propose a novel self improving framework, Self Improving Faithfulness Aware Contrastive Tuning. The framework uses a self instruct mechanism that allows the base LLM to automatically generate high quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples simulating unfaithful responses. This approach significantly reduces the cost of manual annotation. Then, contrastive learning is applied to train the model, enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space. Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method, while significantly reducing dependence on internal knowledge. The results indicate that SI FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
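
The pull-closer, push-apart objective can be illustrated with an InfoNCE-style loss over one anchor, one positive, and a set of negatives. The abstract does not state the exact loss, so this form, the temperature value, and the toy 2-dimensional embeddings are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is near the positive and far from negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings: a faithful response close to the anchor, an unfaithful one far away.
anchor = [1.0, 0.0]
faithful = [0.9, 0.1]
unfaithful = [0.0, 1.0]

well_separated = contrastive_loss(anchor, faithful, [unfaithful])
confused = contrastive_loss(anchor, unfaithful, [faithful])
print(well_separated < confused)
```

Minimizing a loss of this shape moves the positive toward the anchor and the negatives away from it, which matches the behavior the abstract describes.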

[NLP-8] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

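[Quick read]: This paper examines the input-length limits of the language models most widely used in the social sciences, in the setting of multiclass classification of laws and bills, which can run to several hundred pages and do not fit models limited to, for example, 512 tokens. It compares XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 on the policy-topic classification task of the Comparative Agendas Project (21 topic labels) across 5 languages. Longformer, although pre-trained specifically for long inputs, shows no particular advantage; the best-performing open model has an edge over the GPT variants; and a class-level analysis points to class support and substantive overlap between categories as key factors for performance on long texts.

Link: https://arxiv.org/abs/2509.10199
Authors: Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille
Affiliations: HUN-REN Centre for Social Sciences; Aarhus University; Institut de Recherche en Informatique de Toulouse
Subjects: Computation and Language (cs.CL)
Comments: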

Abstract:The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
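
To illustrate the 512-token constraint mentioned in the abstract, the sketch below splits a long token sequence into overlapping windows that each fit the limit. This is a common generic workaround and not necessarily what the authors did; the window size, stride, and the use of integers as stand-in tokens are assumptions.

```python
def sliding_windows(tokens, size=512, stride=256):
    """Split `tokens` into overlapping windows of at most `size` items."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return windows


# Stand-in for a tokenized bill of 1,200 tokens.
document = list(range(1200))
windows = sliding_windows(document)
print(len(windows), [len(w) for w in windows])
```

Plain truncation would keep only the first window; windowing keeps every token visible to the model at the cost of one forward pass per window and a rule for combining the per-window predictions.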

[NLP-9] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations

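[Quick read]: This paper studies incongruent positivity in emotionally supportive conversations: well-intended but miscalibrated positive support that comes across as dismissive, minimizing, or unrealistically optimistic, in both human and LLM-generated responses. The authors collect real user-assistant dialogues from Reddit, generate additional LLM responses for the same contexts, and split the conversations by emotional intensity into Mild (relationship tension and general advice) and Severe (grief and anxiety). LLMs prove more prone to unrealistic positivity, particularly in high-stakes contexts. The authors then fine-tune LLMs on datasets with strong and weak emotional reactions and build a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that detects incongruent positivity types more reliably across both levels. They argue for moving from generic positive responses toward congruent support that balances positive affect with emotional acknowledgment, as a step toward context-aware, trust-preserving online conversation systems.

Link: https://arxiv.org/abs/2509.10184
Authors: Leen Almajed, Abeer ALdayel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This paper is under review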

Abstract:In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower and higher stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in the online supportive dialogue, paving the way toward context-aware and trust preserving online conversation systems.
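
One simple way a multilabel classifier ensemble can combine its members is to average per-label probabilities and apply a threshold. The abstract does not say how the DeBERTa and MentalBERT outputs are combined, so the averaging rule, the threshold, the label names, and the probabilities below are all assumptions for illustration.

```python
def ensemble_predict(prob_sets, threshold=0.5):
    """Average per-label probabilities across classifiers, then threshold each label."""
    labels = list(prob_sets[0])
    averaged = {
        label: sum(probs[label] for probs in prob_sets) / len(prob_sets)
        for label in labels
    }
    predicted = {label for label, p in averaged.items() if p >= threshold}
    return predicted, averaged


# Made-up probabilities standing in for the outputs of two classifiers.
deberta_probs = {"dismissive": 0.8, "minimizing": 0.3, "unrealistic_optimism": 0.6}
mentalbert_probs = {"dismissive": 0.6, "minimizing": 0.5, "unrealistic_optimism": 0.2}

predicted, averaged = ensemble_predict([deberta_probs, mentalbert_probs])
print(sorted(predicted))
```

Because each label is thresholded independently, a response can receive several labels at once or none, which is what distinguishes multilabel from multiclass classification.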

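[NLP-10] Benchmark of stylistic variation in LLM-generated texts

[Quick read]: This paper asks whether texts generated by large language models (LLMs) differ systematically from human writing in register and style, and how such differences can be quantified and compared. It applies Biber's multidimensional analysis (MDA) to human-written texts and to LLM-generated counterparts, using the new comparable corpora AI-Brown (English, comparable to BE-21) and AI-Koditex (Czech). From the analysis of 16 frontier models under various settings and prompts, it builds a benchmark that compares and ranks models along interpretable dimensions, with emphasis on the difference between base models and instruction-tuned models.

Link: https://arxiv.org/abs/2509.10179
Authors: Jiří Milička, Anna Marklová, Václav Cvrček
Affiliations: Department of Linguistics, Faculty of Arts, Charles University, Prague
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: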

Abstract:This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber’s multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using AI-Koditex corpus and Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.

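[NLP-11] Towards Reliable and Interpretable Document Question Answering via VLMs

[Quick read]: Vision-Language Models (VLMs) extract textual information from complex documents well but struggle to localize answers accurately within the document, which limits interpretability and real-world applicability. The paper introduces DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. Because of this decoupling it can be attached to existing VLMs, including proprietary systems that cannot be fine-tuned. A systematic evaluation quantifies the gap between textual accuracy and spatial grounding and establishes a benchmark for more interpretable and robust document information extraction.

Link: https://arxiv.org/abs/2509.10129
Authors: Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
Affiliations: Università degli Studi di Firenze; Letxbe.ai
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: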

Abstract:Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.
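
The gap between textual accuracy and spatial grounding can be made concrete with a bounding-box overlap score. The abstract does not name its localization metric, so intersection over union (IoU), the 0.5 threshold, and the exact-match text comparison below are assumptions for illustration.

```python
def area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred_text, gold_text, pred_box, gold_box, iou_threshold=0.5):
    """True only if the answer text matches AND the predicted box overlaps enough."""
    text_ok = pred_text.strip().lower() == gold_text.strip().lower()
    return text_ok and iou(pred_box, gold_box) >= iou_threshold


# A textually correct answer whose box is poorly localized.
print(is_grounded("42 EUR", "42 eur", (0, 0, 2, 2), (1, 1, 3, 3)))
```

The example shows the case the abstract highlights: the answer string is right, yet the prediction fails the grounding check because the box overlap is only 1/7.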

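[NLP-12] Population-Aligned Persona Generation for LLM-based Social Simulation

[Quick read]: This paper addresses population-level bias in LLM-driven social simulation caused by unrepresentative persona sets, i.e. how to build high-quality persona sets aligned with the distribution of real populations. The proposed framework has three stages: it uses large language models (LLMs) to generate narrative personas from long-term social media data and filters out low-fidelity profiles through rigorous quality assessment; it applies importance sampling to align the set globally with reference psychometric distributions such as the Big Five personality traits; and it adds a task-specific module that adapts the globally aligned persona set to targeted subpopulations, so that varied simulation scenarios are supported without losing overall representativeness.

Link: https://arxiv.org/abs/2509.10127
Authors: Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Affiliations: HKUST; Microsoft Research Asia; Duke University; Northwestern University; Johns Hopkins University; Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: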

Abstract:Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
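
The importance-sampling step can be sketched on a single discretized trait. Each persona gets a weight equal to the target probability of its trait bucket divided by the bucket's share in the generated pool, and personas are then resampled in proportion to those weights. The bucket names, the target distribution, and the one-dimensional setup are assumptions for illustration; the paper aligns with psychometric distributions such as the Big Five.

```python
import random
from collections import Counter


def importance_weights(buckets, target):
    """Weight each item by target probability / empirical probability of its bucket."""
    counts = Counter(buckets)
    total = len(buckets)
    return [target[b] / (counts[b] / total) for b in buckets]


def resample(personas, buckets, target, k, seed=0):
    """Draw k personas with replacement, proportionally to their importance weights."""
    rng = random.Random(seed)
    weights = importance_weights(buckets, target)
    return rng.choices(personas, weights=weights, k=k)


# Hypothetical pool: 80% of generated personas are high in extraversion.
buckets = ["high"] * 80 + ["low"] * 20
personas = [f"persona_{i}" for i in range(100)]
target = {"high": 0.5, "low": 0.5}  # assumed reference distribution

bucket_of = dict(zip(personas, buckets))
sample = resample(personas, buckets, target, k=10_000)
share_low = sum(bucket_of[p] == "low" for p in sample) / len(sample)
print(round(share_low, 2))
```

Over-represented buckets receive weights below 1 and under-represented ones above 1, so the resampled set approaches the target proportions in expectation.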

[NLP-13] Prominence-aware automatic speech recognition for conversational speech

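[Quick read]: This paper investigates prominence-aware automatic speech recognition (ASR) for conversational Austrian German by combining prosodic prominence detection with speech recognition. Word-level prominence detectors are first obtained by fine-tuning wav2vec2 models and are used to annotate prominence automatically in a large corpus. Prominence-aware ASR systems are then trained on these annotations to transcribe words and their prominence levels simultaneously. Adding prominence information left ASR performance unchanged relative to the baseline, while prominence detection reached 85.53% accuracy on utterances whose recognized word sequence was correct, which shows that transformer-based models can encode prosodic information effectively.

Link: https://arxiv.org/abs/2509.10116
Authors: Julian Linke, Barbara Schuppler
Affiliations: Signal Processing and Speech Communication Laboratory
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: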

Abstract:This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. First, prominence detectors were developed by fine-tuning wav2vec2 models to classify word-level prominence. The detector was then used to automatically annotate prosodic prominence in a large corpus. Based on those annotations, we trained novel prominence-aware ASR systems that simultaneously transcribe words and their prominence levels. The integration of prominence information did not change performance compared to our baseline ASR system, while reaching a prominence detection accuracy of 85.53% for utterances where the recognized word sequence was correct. This paper shows that transformer-based models can effectively encode prosodic information and represents a novel contribution to prosody-enhanced ASR, with potential applications for linguistic research and prosody-informed dialogue systems.

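[NLP-14] Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records

[Quick read]: The development of Arabic medical chatbots is constrained by the scarcity of large-scale, high-quality annotated data, which limits model scalability and generalization. The paper proposes a scalable synthetic data augmentation strategy: ChatGPT-4o and Gemini 2.5 Pro generate 80,000 contextually relevant, medically coherent question-answer pairs grounded in the structure of the original 20,000-record dataset; these are semantically filtered, manually validated, and added to the training pipeline, which expands the corpus to 100,000 records. Across five fine-tuned LLMs, including Mistral-7B and AraGPT2, an ablation study shows that data generated by ChatGPT-4o consistently yields higher F1-scores and fewer hallucinations than Gemini-generated data, supporting synthetic augmentation as a practical option for low-resource medical NLP.

Link: https://arxiv.org/abs/2509.10108
Authors: Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
Affiliations: MSA University; Qatar University
Subjects: Computation and Language (cs.CL)
Comments: Accepted in AICCSA 2025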

Abstract:The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using the advanced generative AI systems ChatGPT-4o and Gemini 2.5 Pro, we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.

[NLP-15] VARCO-VISION-2.0 Technical Report

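[Quick read]: This report introduces VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English that improves on the previous VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and performs layout-aware OCR by predicting both the textual content and its spatial location. It is trained with a four-stage curriculum and memory-efficient techniques to strengthen multimodal alignment, while preference optimization improves safety and core language abilities are preserved. The 14B model ranks 8th on the OpenCompass VLM leaderboard among models of comparable scale, and a 1.7B version is released for on-device deployment.

Link: https://arxiv.org/abs/2509.10105
Authors: Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim
Affiliations: NC AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities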

Abstract:We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

[NLP-16] Arabic Large Language Models for Medical Text Generation

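[Quick read]: Hospital management systems (HMS) worldwide face overcrowding, limited resources, and poor availability of urgent care, and existing methods struggle to give accurate, real-time medical advice for irregular inputs and underrepresented languages such as Arabic. The paper fine-tunes large language models (LLMs) for Arabic medical text generation on a dataset of real patient-doctor conversations collected from social media, cleaned and preprocessed to account for multiple Arabic dialects. The fine-tuned Mistral-7B-Instruct-v0.2 model achieves the best BERTScore among the models compared and produces coherent, relevant medical replies to informal input, which the authors present as a scalable, adaptable approach for linguistically and culturally diverse settings.

Link: https://arxiv.org/abs/2509.10095
Authors: Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Ammar Mohammed
Affiliations: October University for Modern Sciences & Arts (MSA)
Subjects: Computation and Language (cs.CL)
Comments: Published in 2025 4th International Conference on Computer Technologies (ICCTech)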

Abstract:Efficient hospital management systems (HMS) are critical worldwide to address challenges such as overcrowding, limited resources, and poor availability of urgent health care. Existing methods often lack the ability to provide accurate, real-time medical advice, particularly for irregular inputs and underrepresented languages. To overcome these limitations, this study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input. The research methodology required the collection of a unique dataset from social media platforms, capturing real-world medical conversations between patients and doctors. The dataset, which includes patient complaints together with medical advice, was properly cleaned and preprocessed to account for multiple Arabic dialects. Fine-tuning state-of-the-art generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium, optimized the system’s ability to generate reliable medical text. Results from evaluations indicate that the fine-tuned Mistral-7B model outperformed the other models, achieving average BERT (Bidirectional Encoder Representations from Transformers) Score values in precision, recall, and F1-scores of 68.5%, 69.08%, and 68.5%, respectively. Comparative benchmarking and qualitative assessments validate the system’s ability to produce coherent and relevant medical replies to informal input. This study highlights the potential of generative artificial intelligence (AI) in advancing HMS, offering a scalable and adaptable solution for global healthcare challenges, especially in linguistically and culturally diverse environments.

[NLP-17] Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery SIGIR2025

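[Quick read]: This paper addresses the information-retrieval difficulty created by the growing complexity and volume of climate science literature, which makes it hard for researchers to find relevant knowledge across models, datasets, regions, and variables. The key to the solution is a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts. The KG supports structured semantic queries that surface precise connections, such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. The authors answer such questions with Cypher queries and outline how the KG can be integrated into Retrieval-Augmented Generation (RAG) systems built on large language models to improve the transparency and reliability of climate-related question answering.

Link: https://arxiv.org/abs/2509.10087
Authors: Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
Affiliations: Temple University
Subjects: Computation and Language (cs.CL)
Comments: ACM SIGIR 2025 Workshop MANILA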

Abstract: The growing complexity and volume of climate science literature make it increasingly difficult for researchers to find relevant information across models, datasets, regions, and variables. This paper introduces a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts, aimed at improving how climate knowledge is accessed and used. Unlike keyword-based search, our KG supports structured, semantic queries that help researchers discover precise connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. We demonstrate how the KG answers such questions using Cypher queries, and outline its integration with large language models in RAG systems to improve transparency and reliability in climate-related question answering. This work moves beyond KG construction to show its real-world value for climate researchers, model developers, and others who rely on accurate, contextual scientific information.

[NLP-18] Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models

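[Quick read]: This paper addresses the lack of ecological validity in the human-designed psychometric questionnaires (e.g., BFI, PVQ) widely used to assess the personality traits and values of Large Language Models (LLMs). The key contribution is a systematic comparison of established questionnaires with ecologically valid ones. It finds that established questionnaires yield substantially different LLM profiles, contain too few items for stable measurement, create the misleading impression that LLMs possess stable psychological constructs, and exaggerate profiles for persona-prompted LLMs. The paper therefore cautions against using established psychological questionnaires to assess LLMs and stresses the importance of measurement tools that are closer to real interaction contexts.

Link: https://arxiv.org/abs/2509.10078
Authors: Dongmin Choi, Woojung Song, Jongwook Han, Eun-Ju Lee, Yohan Jo
Affiliations: Graduate School of Data Science, Seoul National University; Department of Communication, Interdisciplinary Program in Artificial Intelligence, Seoul National University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures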

Abstract: Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity, that is, the extent to which survey questions adequately reflect and resemble the real-world contexts in which LLMs generate text in response to user queries. It remains unclear, though, how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.

[NLP-19] !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment EMNLP2025

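[Quick read]: This paper addresses the challenges of fine-grained Arabic readability assessment, in particular class imbalance and data scarcity. The key to the solution is a confidence-weighted ensemble of four complementary Transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with a different loss function to capture diverse readability signals. Data scarcity is mitigated through weighted training, advanced preprocessing, relabeling of the SAMER corpus with the strongest model, and about 10,000 synthetic rare-level samples generated with Gemini 2.5 Flash. A targeted post-processing step corrects skew in the prediction distribution, and the system reaches 87.5% QWK at the sentence level and 87.4% at the document level.

Link: https://arxiv.org/abs/2509.10040
Authors: Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
Affiliations: MSA University
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 8 figures, ArabicNLP 2025, co-located with EMNLP 2025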

Abstract: We present MSA's winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment, achieving first place in six of six tracks. Our approach is a confidence-weighted ensemble of four complementary transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with distinct loss functions to capture diverse readability signals. To tackle severe class imbalance and data scarcity, we applied weighted training, advanced preprocessing, SAMER corpus relabeling with our strongest model, and synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level samples. A targeted post-processing step corrected prediction distribution skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level, demonstrating the power of model and loss diversity, confidence-informed fusion, and intelligent augmentation for robust Arabic readability prediction.
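
Since the results above are reported in Quadratic Weighted Kappa (QWK), a minimal sketch of how that metric is usually computed may help. It assumes readability levels are encoded as integers from 0 to `num_levels - 1`; the function name and the toy labels are illustrative and are not taken from the paper or the shared task.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    """Return QWK for integer labels in the range 0..num_levels-1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # confusion-matrix counts
    true_hist = Counter(y_true)              # marginal counts of gold labels
    pred_hist = Counter(y_pred)              # marginal counts of predictions
    max_penalty = (num_levels - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: disagreements cost more the further apart they are.
            weight = (i - j) ** 2 / max_penalty
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two marginals.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    if weighted_expected == 0.0:
        # Both sequences are constant and identical: treat as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy labels on a 4-level scale.
gold = [0, 1, 2, 3, 3]
pred = [0, 1, 2, 2, 3]
print(round(quadratic_weighted_kappa(gold, pred, num_levels=4), 3))
```

Because the penalty grows with the squared distance between levels, a prediction one level off costs far less than one several levels off, which suits an ordinal task such as readability grading.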

[NLP-20] Linguistic trajectories of bipolar disorder on social media

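[Quick read]: This paper addresses how social media (SM) language can be used to quantify the long-term linguistic trajectories of people with bipolar disorder (BD) before and after diagnosis. Clinical assessments are limited in scale and temporal resolution, so the key contribution is a method for determining when users were diagnosed, which is then used to build longitudinal language trajectories from 3 years before to 21 years after BD diagnosis. The trajectories reveal pervasive linguistic alterations in the acute and chronic phases of BD that reflect mood disturbance, psychiatric comorbidity, substance abuse, and disorganized thought, together with a pronounced 12-month periodicity suggestive of seasonal mood episodes. The findings support social media language analysis as a tool for scalable mental health monitoring.

Link: https://arxiv.org/abs/2509.10035
Authors: Laurin Plank, Armin Zlomuzica
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Pre-print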

Abstract: Language provides valuable markers of affective disorders such as bipolar disorder (BD), yet clinical assessments remain limited in scale. In response, analyses of social media (SM) language have gained prominence due to their high temporal resolution and longitudinal scope. Here, we introduce a method to determine the timing of users’ diagnoses and apply it to study language trajectories from 3 years before to 21 years after BD diagnosis, contrasted with users reporting unipolar depression (UD) and non-affected users (HC). We show that BD diagnosis is accompanied by pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, unusual thought content, and disorganized thought. We further observe recurring mood-related language changes across two decades after the diagnosis, with a pronounced 12-month periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence suggests an increased periodicity in users estimated to be female. In sum, our findings provide evidence for language alterations in the acute and chronic phase of BD. This validates and extends recent efforts leveraging SM for scalable monitoring of mental health.

[NLP-21] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs

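[Quick read]: This paper addresses how to use open-source Large Language Models (LLMs) for efficient, accurate multi-label intent classification in a few-shot setting, with the aim of improving the natural language understanding of task-oriented chatbots. The approach has two parts. First, three popular open-source pretrained LLMs (LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B) are compared systematically on the MultiWOZ 2.1 dataset using prompts that contain 20 examples. Second, the instruction-based approach is compared with a supervised BERT baseline (BertForSequenceClassification), exposing differences in accuracy, F1 score, inference time, and VRAM usage. Mistral-7B-v0.1 performs best on most classes, but the supervised BERT-based classifier still outperforms the best few-shot generative LLM overall, giving a reproducible framework and reference point for applying small open-source models to complex multi-intent dialogue.

Link: https://arxiv.org/abs/2509.10010
Authors: Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: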

Abstract: In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1 outperforms two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
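
The two multi-label metrics named in the abstract, Hamming Loss and Jaccard Similarity, can be sketched in a few lines. This is an illustrative implementation that represents each utterance's intents as a Python set; the intent names and the example predictions are hypothetical and are not drawn from MultiWOZ 2.1 or from the paper's results.

```python
def hamming_loss(y_true, y_pred, labels):
    """Fraction of (utterance, label) slots where gold and prediction disagree."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # The symmetric difference holds labels that are in exactly one of the two sets.
    wrong = sum(len(gold ^ pred) for gold, pred in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


def jaccard_similarity(y_true, y_pred):
    """Sample-averaged |intersection| / |union|; two empty sets score 1.0."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    scores = []
    for gold, pred in zip(y_true, y_pred):
        union = gold | pred
        scores.append(len(gold & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical intent inventory and predictions for two utterances.
intents = ["hotel-inform", "restaurant-inform", "taxi-request"]
gold = [{"hotel-inform", "taxi-request"}, {"restaurant-inform"}]
pred = [{"hotel-inform"}, {"restaurant-inform", "taxi-request"}]

print(hamming_loss(gold, pred, intents))
print(jaccard_similarity(gold, pred))
```

Lower Hamming Loss and higher Jaccard Similarity both indicate better multi-label predictions, which is why the abstract cites the two together.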

[NLP-22] Unsupervised Hallucination Detection by Inspecting Reasoning Processes EMNLP2025

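[Quick read]: This paper addresses the poor generalization of unsupervised hallucination detection: existing methods rely on proxy signals unrelated to factual correctness, which biases detection probes toward superficial features rather than truthfulness and limits their applicability across datasets and scenarios. The key to the solution is IRIS, a framework that exploits the LLM's internal representations related to factual correctness. It prompts the LLM to carefully verify the truthfulness of a given statement and extracts the contextualized embedding as a discriminative feature, while treating the uncertainty of each response as a soft pseudolabel for training the detector. The method is fully unsupervised, computationally inexpensive, and effective even with little training data.

Link: https://arxiv.org/abs/2509.10004
Authors: Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
Affiliations: Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in EMNLP 2025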

Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.

[NLP-23] CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

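[Quick read]: This paper addresses the severe shortage of corpora for minority languages in China (such as Tibetan, Uyghur, and Traditional Mongolian), whose writing systems differ from international standards, with a focus on the supervised task of headline generation. The key contribution is a large-scale, high-quality multilingual dataset, Chinese Minority Headline Generation (CMHG), with 100,000 Tibetan entries and 50,000 entries each for Uyghur and Mongolian, together with a test set annotated by native speakers that is intended as a benchmark for future research.

Link: https://arxiv.org/abs/2509.09990
Authors: Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
Affiliations: Minzu University of China; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL)
Comments: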

Abstract:Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.

[NLP-24] Large Language Models Meet Legal Artificial Intelligence: A Survey

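[Quick read]: This paper addresses the fragmentation of research on Large Language Models (LLMs) in Legal Artificial Intelligence (Legal AI), the lack of unified evaluation standards, and the absence of a systematic overview of application frameworks. It does so in three ways: it reviews 16 series of legal LLMs and 47 LLM-based frameworks for legal tasks; it collects 15 benchmarks and 29 datasets for evaluating different legal capabilities; and it analyzes current challenges and discusses future directions, providing a systematic introduction for beginners.

Link: https://arxiv.org/abs/2509.09969
Authors: Zhitian Hou, Zihan Ye, Nanli Zeng, Tianyong Hao, Kun Zeng
Affiliations: Sun Yat-sen University; China Mobile Internet Co., Ltd.; South China Normal University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: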

Abstract: Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in the legal domain, this paper provides a comprehensive review of 16 series of legal LLMs and 47 LLM-based frameworks for legal tasks, and also gathers 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at this https URL.

[NLP-25] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case

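[Quick read]: This paper addresses the reliability and bias of LLM-generated synthetic respondents in survey research, in particular whether they recover the item distributions of a real population and whether they reproduce social biases. The key to the approach is a large-scale empirical comparison against ground-truth human responses from a Chilean probabilistic public opinion survey. It benchmarks 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pools accuracy, precision, recall, and F1-score in a meta-analysis to test for biases along key sociodemographic dimensions, which quantifies how closely synthetic data fit human data and informs later calibration and distributional tests.

Link: https://arxiv.org/abs/2509.09871
Authors: Bastián González-Bustamante, Nando Verelst, Carla Cisternas
Affiliations: Universidad Diego Portales; Leiden University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Working paper: 18 pages, 4 tables, 2 figures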

Abstract:Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.

[NLP-26] Vibe Check: Understanding the Effects of LLM-Based Conversational Agents Personality and Alignment on User Perceptions in Goal-Oriented Tasks

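[Quick read]: This paper addresses how the level of personality expression and user-agent personality alignment affect user perceptions of LLM-based conversational agents (CAs) in goal-oriented tasks. The key to the approach is the Trait Modulation Keys framework, which controls the expression intensity (low, medium, or high) of the Big Five traits. Medium expression shows an inverted-U effect, significantly outperforming both extremes, and personality alignment (especially on Extraversion and Emotional Stability) further improves ratings of Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability.

Link: https://arxiv.org/abs/2509.09870
Authors: Hasibur Rahman, Smit Desai
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: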

Abstract:Large language models (LLMs) enable conversational agents (CAs) to express distinctive personalities, raising new questions about how such designs shape user perceptions. This study investigates how personality expression levels and user-agent personality alignment influence perceptions in goal-oriented tasks. In a between-subjects experiment (N=150), participants completed travel planning with CAs exhibiting low, medium, or high expression across the Big Five traits, controlled via our novel Trait Modulation Keys framework. Results revealed an inverted-U relationship: medium expression produced the most positive evaluations across Intelligence, Enjoyment, Anthropomorphism, Intention to Adopt, Trust, and Likeability, significantly outperforming both extremes. Personality alignment further enhanced outcomes, with Extraversion and Emotional Stability emerging as the most influential traits. Cluster analysis identified three distinct compatibility profiles, with “Well-Aligned” users reporting substantially positive perceptions. These findings demonstrate that personality expression and strategic trait alignment constitute optimal design targets for CA personality, offering design implications as LLM-based CAs become increasingly prevalent.

[NLP-27] LLMs as Agentic Cooperative Players in Multiplayer UNO

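[Quick read]: This paper asks whether Large Language Models (LLMs) can act as active participants that help a human reach a goal in a non-adversarial task, rather than only answering questions passively. It evaluates whether an LLM agent playing the card game UNO can help another player win. The key to the approach is a tool that lets decoder-only LLMs take part as agents in the RLCard game environment: the models receive full game-state information and respond in text under two prompting strategies, which makes it possible to test models from 1B to 70B parameters.

Link: https://arxiv.org/abs/2509.09867
Authors: Yago Romano Matinez, Jesse Roberts
Affiliations: Tennessee Tech University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: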

Abstract: LLMs promise to assist humans not just by answering questions, but by offering useful guidance across a wide range of tasks. But how far does that assistance go? Can an LLM-based agent actually help someone accomplish their goal as an active participant? We test this question by engaging an LLM in UNO, a turn-based card game, asking it not to win but instead to help another player do so. We built a tool that allows decoder-only LLMs to participate as agents within the RLCard game environment. These models receive full game-state information and respond using simple text prompts under two distinct prompting strategies. We evaluate models ranging from small (1B parameters) to large (70B parameters) and explore how model scale impacts performance. We find that while all models were able to successfully outperform a random baseline when playing UNO, few were able to significantly aid another player.

[NLP-28] Latency and Token-Aware Test-Time Compute

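[Quick read]: This paper addresses inefficient allocation of inference-time compute for Large Language Models (LLMs). Existing work on dynamic allocation typically considers only parallel generation methods such as best-of-N, overlooks incremental decoding methods such as beam search, and focuses on token usage while ignoring wall-clock latency, which matters especially for agentic workflows. The key to the solution is to formulate inference-time scaling as a joint problem of dynamic compute allocation and method selection that explicitly accounts for both token cost and wall-clock latency, deciding per query which strategy (for example beam search or best-of-N) to apply and how much compute to allocate. This yields better accuracy-cost trade-offs while remaining practical for deployment.

Link: https://arxiv.org/abs/2509.09864
Authors: Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
Affiliations: MIT; IBM Research; MIT-IBM Watson AI Lab
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: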

Abstract:Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
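
To make the per-query decision concrete, here is a minimal sketch of strategy selection that trades estimated accuracy against token cost and wall-clock latency. The linear utility, the `Strategy` fields, the weights, and all numbers are assumptions chosen for illustration; the abstract does not specify the paper's actual objective or how the estimates are obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One (method, budget) option with hypothetical per-query estimates."""
    name: str
    est_accuracy: float   # predicted probability of a correct answer
    est_tokens: float     # predicted number of generated tokens
    est_latency_s: float  # predicted wall-clock latency in seconds


def select_strategy(candidates, token_weight, latency_weight):
    """Pick the candidate with the highest utility (assumed linear penalty form)."""
    def utility(s):
        return (s.est_accuracy
                - token_weight * s.est_tokens
                - latency_weight * s.est_latency_s)
    return max(candidates, key=utility)


# Hypothetical estimates for a single query.
candidates = [
    Strategy("best-of-1", est_accuracy=0.60, est_tokens=500, est_latency_s=2.0),
    Strategy("best-of-8", est_accuracy=0.72, est_tokens=4000, est_latency_s=2.5),
    Strategy("beam-4", est_accuracy=0.74, est_tokens=2500, est_latency_s=6.0),
]

# A latency-sensitive deployment and a latency-insensitive one can choose differently.
print(select_strategy(candidates, token_weight=1e-5, latency_weight=0.02).name)
print(select_strategy(candidates, token_weight=1e-5, latency_weight=0.0).name)
```

In this toy setup, parallel sampling is token-hungry but fast, while beam search is cheaper in tokens but slower, so the preferred option shifts as the latency weight changes. That is the kind of trade-off a token-only objective cannot express.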

[NLP-29] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization

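[Quick read]: This paper addresses the challenge in Multi-Document Summarization (MDS) of integrating information from multiple sources while keeping the summary coherent and topically relevant. The key to the solution is a topic-guided reinforcement learning approach. The authors first show that explicitly prompting models with topic labels makes generated summaries more informative, then design a topic reward within the Group Relative Policy Optimization (GRPO) framework that measures topic alignment between the generated summary and the source documents, improving content selection. On the Multi-News and Multi-XScience datasets the method consistently outperforms strong baselines.

Link: https://arxiv.org/abs/2509.09852
Authors: Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini
Affiliations: University of British Columbia; Salesforce AI Research
Subjects: Computation and Language (cs.CL)
Comments: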

Abstract:A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
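
The idea of a topic reward inside GRPO can be sketched as follows. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO advantage; the reward itself is an assumption, defined here as the fraction of source topics that the summary covers, because the abstract does not describe how the paper measures topic alignment. The topic labels are hypothetical.

```python
from statistics import mean, pstdev


def topic_reward(summary_topics, source_topics):
    """Assumed reward: share of source topics that also appear in the summary."""
    source = set(source_topics)
    if not source:
        return 0.0
    return len(set(summary_topics) & source) / len(source)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical topics for one document cluster and three sampled summaries.
source_topics = ["election", "economy", "health", "climate"]
sampled_summaries = [
    ["election", "economy", "health", "climate"],
    ["election", "economy"],
    ["sports"],
]

rewards = [topic_reward(topics, source_topics) for topics in sampled_summaries]
print(rewards)
print(group_relative_advantages(rewards))
```

Summaries that cover more of the source topics than their siblings in the same group receive positive advantages, so the policy update favors better content selection without needing a separate value model.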

[NLP-30] Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization

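[Quick read]: This paper addresses the lack of systematic modeling of conversational turn organization in multimodal dialogue, in particular the fact that existing datasets do not encode the interactive gestures that communicators use to manage turns in natural settings. The solution has two parts. First, based on an analysis of how pragmatic frames are conceptualized and evoked, it proposes correlations between language and gesture. Second, it develops an annotation methodology that adds pragmatic frame annotation to the existing Frame2 multimodal dataset (previously annotated only for semantic frames), recording how gestures are used to pass, take, and keep turns. The work fills a gap in datasets usable for machine learning and suggests that gesture use arises from cognitive mechanisms involving mental spaces, blending, and conceptual metaphors.

Link: https://arxiv.org/abs/2509.09804
Authors: Helen de Andrade Abreu, Tiago Timponi Torrent, Ely Edison da Silva Matos
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Paper submitted to Language Sciences Journal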

Abstract:This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language.

[NLP-31] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning

[Quick read]: This paper addresses the difficulty of adapting Large Language Models (LLMs) to specialized reasoning tasks under limited computational resources. The key to the solution is Hierarchical Efficient Fine-Tuning (HEFT), which composes two Parameter-Efficient Fine-Tuning (PEFT) methods in a coarse-to-fine manner: a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise refinement of internal activations using Representation Fine-Tuning (ReFT).

Link: https://arxiv.org/abs/2509.09801
Authors: Brennen Hill
Affiliations: University of Wisconsin-Madison
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model’s weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.
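
The two adaptation stages named in the abstract can be written compactly. The block below is a sketch that uses the standard LoRA update and the LoReFT intervention from the ReFT literature. The abstract does not say which ReFT variant or which notation HEFT uses, so the symbols ($W_0$, $A$, $B$, $R$, $W$, $b$, rank $r$, scaling $\alpha$) and the choice of LoReFT are assumptions.

```latex
% Stage 1 (coarse, weight space): LoRA adds a trainable low-rank update to a frozen weight W_0.
W' = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% Stage 2 (fine, representation space): LoReFT edits a hidden state h within an r-dimensional subspace.
\Phi(h) = h + R^{\top}\left(W h + b - R h\right),
\qquad R, W \in \mathbb{R}^{r \times d},\; b \in \mathbb{R}^{r},\; R R^{\top} = I_r
```

Read this way, the coarse stage changes how every input is transformed by the adapted layers, while the fine stage only nudges selected hidden states inside a small learned subspace, which matches the coarse-to-fine description in the abstract.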

[NLP-32] Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture

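[Quick read]: This paper addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies in modeling complex dynamic systems, such as inflexible handling of event-driven behavior, the separation of data from business logic, and the lack of runtime modifiability. The key to the solution is boldsea, a semantic-event approach that builds executable ontologies in which semantic models act as dynamic structures that directly control process execution. Its core contributions are the formal BSL (boldsea Semantic Language) with its BNF grammar and the boldsea-engine architecture, which interprets semantic models directly without compilation, enabling runtime modification of event models, temporal transparency, and the merging of data and business logic within a unified semantic framework.

Link: https://arxiv.org/abs/2509.09775
Authors: Aleksandr Boldachev
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Software Engineering (cs.SE)
Comments: 22 pages, 6 figures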

Abstract: This paper presents boldsea (Boldachev’s semantic-event approach), an architecture for modeling complex dynamic systems using executable ontologies: semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine’s architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.

[NLP-33] Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在决策和摘要任务中因背景、性别和年龄等因素引发的社会不平等与信息偏见问题，尤其关注这些偏见的跨语言传播特性及缓解策略的有效性。其解决方案的关键在于设计并测试基于提示（prompt-instructed）的干预措施，结果显示，尽管无法完全消除偏见，但新提出的指令可显著降低不同群体间的偏见差距——最高实现27%的平均降幅；同时发现GPT-4o相较GPT-3.5在英文环境下表现出更低的偏见水平，表明新一代模型对提示驱动的偏见缓解具有更强适应性。

Link: https://arxiv.org/abs/2509.09735
Authors: Willem Huijzer, Jieying Chen
Affiliations: Vrije Universiteit Amsterdam
Categories: Computation and Language (cs.CL)
Comments: 7 pages

Abstract:The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.
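
The headline metric above, the gap between the most and least favourable demographics, can be made concrete with a short sketch. The favourable-decision rates below are invented purely for illustration (they are not taken from the paper), and `gap` and `relative_reduction` are hypothetical helper names.

```python
def gap(rates):
    """Difference between the most and least favoured group's rate."""
    return max(rates.values()) - min(rates.values())

def relative_reduction(before, after):
    """Share of the original gap removed by a mitigation instruction."""
    return (gap(before) - gap(after)) / gap(before)

# Hypothetical favourable-decision rates per group (illustrative only).
baseline = {"group_a": 0.80, "group_b": 0.70, "group_c": 0.60}
mitigated = {"group_a": 0.76, "group_b": 0.70, "group_c": 0.61}

reduction = relative_reduction(baseline, mitigated)
print(f"gap before: {gap(baseline):.2f}, after: {gap(mitigated):.2f}, "
      f"reduction: {reduction:.0%}")
```

A reported mean reduction would then be the average of this quantity over the tested demographic variables, under whatever aggregation the study actually used.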

[NLP-34] MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

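[Quick read]: This paper addresses the inability of current benchmarks to accurately reflect how agents built on the Model Context Protocol (MCP) interact with tools in real-world scenarios, which leads to misjudged agent performance and makes it hard to differentiate capability levels reliably. The key to its solution is MCP-AgentBench, a comprehensive benchmark designed to rigorously evaluate language agents in MCP-mediated tool interactions. Its core components are a robust MCP testbed comprising 33 operational servers with 188 distinct tools; a benchmark of 600 systematically designed queries across 6 categories of increasing interaction complexity; and MCP-Eval, an outcome-oriented evaluation methodology that prioritizes real-world task success. The benchmark gives the research community a standardized, reliable evaluation framework for building interoperable and practically useful AI agents.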

Link: https://arxiv.org/abs/2509.09734
Authors: Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP’s growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench – a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP’s transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.

[NLP-35] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

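[Quick read]: This paper addresses the challenges that current Vision-Language Models (VLMs) face with ancient Chinese documents: traditional digitization methods only scan images without understanding their content, and existing VLMs struggle with the visual complexity and linguistic diversity of ancient texts. The key to its solution is AncientDoc, the first benchmark for Chinese ancient documents. It spans five tasks from page-level optical character recognition (OCR) to knowledge reasoning and covers 14 document types, over 100 books, and about 3,000 pages, allowing a systematic evaluation of VLMs on ancient-text understanding and knowledge reasoning. A human-aligned large language model is used for scoring to improve evaluation reliability.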

Link: https://arxiv.org/abs/2509.09731
Authors: Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. School of Information Science and Technology, Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.

[NLP-36] MultimodalHugs: Enabling Sign Language Processing in Hugging Face

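[Quick read]: This paper addresses the low reproducibility and unfair comparisons in sign language processing (SLP) caused by complex, non-standardized code. Existing tools such as Hugging Face support fast experimentation but are not flexible enough to integrate sign language tasks. The key to its solution is MultimodalHugs, a framework built on top of Hugging Face that adds an abstraction layer for multimodal data. It accommodates diverse input modalities, such as pose estimation data (for example, hand keypoints) and pixel data, while retaining the ease of use and extensibility of the Hugging Face ecosystem, making SLP research more standardized and more broadly applicable.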

Link: https://arxiv.org/abs/2509.09729
Authors: Gerard Sant, Zifan Jiang, Carlos Escolano, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling
Affiliations: University of Zurich; Barcelona Supercomputing Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.

[NLP-37] A meta-analysis on the performance of machine-learning based language models for sentiment analysis

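[Quick read]: This paper addresses the inconsistency and lack of comparability in how machine learning (ML) performance is evaluated for Twitter sentiment analysis, in particular the misleading results produced by overall accuracy, a widely used metric that is sensitive to class imbalance and to the number of sentiment classes. The key to its solution is twofold. First, a meta-analysis systematically quantifies the average performance of ML models in Twitter sentiment analysis and exposes the limitations of overall accuracy. Second, it stresses the importance of standardized performance reporting, including confusion matrices for independent test sets, to enable reliable comparisons across studies and improve reproducibility and transparency in the field.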

Link: https://arxiv.org/abs/2509.09728
Authors: Elena Rohde, Jonas Klingwort, Christian Borgs
Affiliations: Institute of Sociology, Faculty of Social Sciences, University of Duisburg-Essen; Department of Research & Development, Statistics Netherlands (CBS); IT.NRW – Statistical Office of NRW
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.
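
The first insight above, that overall accuracy is sensitive to class imbalance, is easy to reproduce from a confusion matrix. The sketch below uses an invented 90/10 test set and compares overall accuracy with balanced accuracy (mean per-class recall). Balanced accuracy is used here only as one familiar normalization and is an assumption; the paper may normalize differently.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def balanced_accuracy(cm):
    """Mean per-class recall; rows are true classes, columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    recalls = np.diag(cm) / cm.sum(axis=1)
    return recalls.mean()

# Hypothetical test set: 90 negative and 10 positive tweets, scored by a
# classifier that labels every tweet as negative.
cm_majority = [[90, 0],
               [10, 0]]

acc = overall_accuracy(cm_majority)
bal = balanced_accuracy(cm_majority)
print(f"overall accuracy: {acc:.2f}, balanced accuracy: {bal:.2f}")
```

Both numbers can only be recomputed by a reader if the confusion matrix itself is reported, which is the practical point behind the second insight.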

[NLP-38] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs

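[Quick read]: This paper addresses the difficulty existing large language model (LLM) approaches have in capturing complex quantitative reasoning, domain-specific terminology, and real-world scenarios in financial education question answering (QA). Its key solution is a multi-agent framework coordinated through role-based prompting: a Base Generator, an Evidence Retriever, and an Expert Reviewer generate and refine an answer in a single-pass iteration. The framework uses retrieval-augmented generation (RAG) to obtain contextual evidence from 6 finance textbooks and applies domain-expert prompting strategies for critique-based revision. This improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines and offers a cost-effective way to improve financial QA performance.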

Link: https://arxiv.org/abs/2509.09727
Authors: Andy Zhu, Yingjun Du
Affiliations: Rensselaer Polytechnic Institute; University of Amsterdam
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: 8 pages, 6 figures, under review

Abstract:Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from this http URL, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.
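
The three roles described above can be sketched as a single-pass pipeline. This is a minimal illustration, not the authors' code: the prompts are invented, and `llm` and `retrieve` are hypothetical callables that the caller would back with a real model and a textbook index.

```python
from typing import Callable, List

LLM = Callable[[str], str]                    # prompt -> completion
Retriever = Callable[[str, int], List[str]]   # (query, k) -> passages

def answer_question(question: str, llm: LLM, retrieve: Retriever, k: int = 3) -> str:
    # Base Generator: draft an answer with step-by-step reasoning.
    draft = llm(
        "You are a finance student. Think step by step and answer.\n"
        f"Question: {question}"
    )
    # Evidence Retriever: fetch supporting textbook passages.
    passages = retrieve(question, k)
    evidence = "\n".join(f"- {p}" for p in passages)
    # Expert Reviewer: critique the draft against the evidence and refine it once.
    return llm(
        "You are a finance expert reviewing a draft answer.\n"
        f"Question: {question}\nDraft: {draft}\nEvidence:\n{evidence}\n"
        "Point out any errors, then give the corrected final answer."
    )
```

Passing the model and retriever in as arguments keeps the sketch independent of any particular provider, so the same flow could be tried with different backends.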

[NLP-39] Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure

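[Quick read]: This paper addresses the difficulty humans have in directly understanding formal proofs; the core challenge is the semantic gap between formal and natural language. The key to its solution is using the informalization and summarization capabilities of large language models (LLMs) to turn machine-verifiable formal proof steps into readable, accurate natural-language text, improving the human readability of formal mathematical content and making it easier to disseminate.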

Link: https://arxiv.org/abs/2509.09726
Authors: Seiji Hattori, Takuya Matsuzaki, Makoto Fujiwara
Affiliations: Tokyo University of Science
Categories: Computation and Language (cs.CL)
Comments: Submitted to INLG 2025 (accepted)

Abstract:This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we demonstrate that this method can output highly readable and accurate natural language proofs by applying it to an existing formal proof library of the Lean proof assistant.
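
The title's "recursive summarization along proof structure" can be read as a bottom-up traversal of a proof tree. The sketch below is one such reading, not the paper's algorithm: `ProofNode`, `informalize`, and `summarize` are hypothetical names, and in practice the two callables would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProofNode:
    step: str                                            # formal proof step
    children: List["ProofNode"] = field(default_factory=list)

def translate(node: ProofNode,
              informalize: Callable[[str], str],
              summarize: Callable[[str, List[str]], str]) -> str:
    """Informalize this step, translate its sub-proofs, then summarize them."""
    own_text = informalize(node.step)
    child_texts = [translate(c, informalize, summarize) for c in node.children]
    if not child_texts:
        return own_text          # leaf: nothing to merge
    return summarize(own_text, child_texts)

# Toy tree with stand-in functions so the control flow is visible.
tree = ProofNode("a", [ProofNode("b", [ProofNode("c")]), ProofNode("d")])
text = translate(tree,
                 informalize=str.upper,
                 summarize=lambda own, kids: f"{own}({','.join(kids)})")
```

Each summary sees only the already-condensed text of its sub-proofs, which keeps the input to any single call small even for deep proofs.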

[NLP-40] BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025

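[Quick read]: This paper addresses a research gap in biomedical entity linking (EL) for multilingual text with nested named entities. Existing methods are mainly built on English-only flat mentions and struggle with the nested structures and cross-lingual complexity common in real text. The key to its solution is a lightweight pipeline (BIBERT-Pipe) whose changes focus on three task-aligned components: (1) two-stage retrieval-ranking, where the retrieval stage uses the original pre-trained model and the ranking stage applies domain-specific fine-tuning to improve accuracy; (2) boundary cues, where learnable [Ms]/[Me] tags provide explicit span information and improve robustness to nested and overlapping mentions; and (3) dataset augmentation, which automatically expands the ranking-stage training corpus with three complementary data sources, improving coverage without manual annotation. The approach leaves the original EL model intact and modifies only these components; it ranked third in the BioNNE 2025 multilingual nested entity linking evaluation.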

Link: https://arxiv.org/abs/2509.09725
Authors: Chunyu Li, Xindi Zheng, Siqi Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English and Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: (1) Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. (2) Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span cue and improving robustness to overlap and nesting. (3) Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two-stage system, bilingual BERT (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code is publicly available at this https URL.
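
The boundary-cue idea above amounts to a small preprocessing step before the ranking encoder. A minimal sketch follows, assuming character offsets for each mention; the function name `mark_mention` and the example phrase are illustrative and not taken from the paper's code.

```python
def mark_mention(text: str, start: int, end: int,
                 open_tag: str = "[Ms]", close_tag: str = "[Me]") -> str:
    """Wrap text[start:end] with boundary tags so the encoder sees the span."""
    if not 0 <= start < end <= len(text):
        raise ValueError("span out of range")
    return f"{text[:start]}{open_tag} {text[start:end]} {close_tag}{text[end:]}"

sentence = "chronic kidney disease"
outer = mark_mention(sentence, 0, len(sentence))   # the whole phrase as a mention
inner = mark_mention(sentence, 8, 14)              # the nested mention "kidney"
```

Because each candidate mention gets its own tagged copy of the sentence, nested and overlapping spans never compete for the same markers. For the tags to be learnable they would also need to be registered as special tokens in the tokenizer, which is omitted here.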

[NLP-41] DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

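[Quick read]: This paper addresses how to identify emerging technology opportunities effectively, in particular the challenge of capturing potential directions for innovation as technologies evolve. The key to its solution is a framework based on temporal relationships between technologies: it extracts text from patent data, maps topic-based relationships between technologies, and identifies opportunities by tracking how these topics change over time. For efficiency, the framework uses a large language model (LLM) to extract topics and prompts a chat-based language model to support the discovery of technology opportunities, with the aim of identifying and anticipating trends in fields such as artificial intelligence.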

Link: https://arxiv.org/abs/2509.09724
Authors: Wonyoung Kim, Sujeong Seo, Juhyun Lee
Affiliations: Chung-Ang University; Postal Savings & Insurance Development Institute; Korea Institute of Science and Technology Information
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 figures

Abstract:Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.

[NLP-42] ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

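[Quick read]: This paper addresses a long-standing challenge in psychometrics: building nomological networks, which systematically lay out the theoretical relationships between concepts and measures in order to establish validity. The problem has remained a foundational difficulty in measurement validation since Cronbach and Meehl raised it 70 years ago, with the consequence that clinical trials may fail to detect treatment effects and public policy may target the wrong outcomes. The key to its solution is ALIGNS (Analysis of Latent Indicators to Generate Nomological Structures), a large language model-based system trained with validated questionnaire measures. It provides large-scale nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields, and is presented as the first application of large language models to this core problem of measurement validation.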

Link: https://arxiv.org/abs/2509.09723
Authors: Kai R. Larsen, Sen Yan, Roland Müller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system’s importance, accessibility, and suitability. ALIGNS is freely available at this http URL, complementing traditional validation methods with large-scale nomological analysis.

[NLP-43] Improving MLLM Historical Record Extraction with Test-Time Image

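[Quick read]: This paper addresses the instability of text extraction from noisy historical documents with large language models (LLMs). The key to its solution is a novel ensemble framework: Gemini 2.0 Flash transcribes multiple augmented variants of each image, and a custom Needleman-Wunsch-style aligner fuses the outputs into a consensus transcription with a confidence score. Validated on a new dataset of 622 Pennsylvania death records, the method improves accuracy by 4 percentage points over a single-shot baseline. Padding and blurring are the most useful augmentations for improving accuracy, while grid warp perturbations are best for separating high- and low-confidence cases.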

Link: https://arxiv.org/abs/2509.09722
Authors: Taylor Archibald, Tony Martinez
Affiliations: Brigham Young University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We present a novel ensemble framework that stabilizes LLM-based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman-Wunsch-style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records, and demonstrate our method improves transcription accuracy by 4 percentage points relative to a single-shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high- and low-confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models.
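
The fusion step above can be approximated with a textbook Needleman-Wunsch alignment followed by per-character voting. The sketch below is a simplification and not the authors' aligner: the scoring constants are arbitrary, every transcript is aligned to the first one, and characters inserted relative to that reference are ignored.

```python
from collections import Counter

GAP = "\x00"  # sentinel assumed never to occur in a real transcription

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return both padded with GAP."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def consensus(transcripts):
    """Vote per character of the first transcript; return (text, confidence)."""
    ref = transcripts[0]
    votes = [Counter({ch: 1}) for ch in ref]
    for other in transcripts[1:]:
        al_ref, al_other = needleman_wunsch(ref, other)
        pos = 0
        for r, o in zip(al_ref, al_other):
            if r == GAP:
                continue      # insertion relative to ref: ignored in this sketch
            votes[pos][o] += 1
            pos += 1
    chars, agreement = [], []
    for v in votes:
        ch, count = v.most_common(1)[0]
        agreement.append(count / len(transcripts))
        if ch != GAP:
            chars.append(ch)
    return "".join(chars), sum(agreement) / len(agreement)

# Three hypothetical transcriptions of the same field from augmented images.
text, confidence = consensus(["John Smith", "John Smth", "Jahn Smith"])
```

Here the confidence is simply the mean share of transcripts agreeing with the winning character at each position; the paper's score may be defined differently.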

[NLP-44] VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

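[Quick read]: This paper addresses the difficulty current spoken language models (SLMs) have in adapting their speaking style in a controllable way when given natural-language instructions, that is, whether a model can adjust style attributes such as timbre, prosody, or persona in response to spoken commands. The key to its solution is a new task, Voice Style Adaptation (VSA), together with VStyle, the first bilingual (Chinese-English) benchmark for it, which covers four categories: acoustic attributes, natural language instruction, role play, and implicit empathy. The paper also designs the Large Audio Language Model as a Judge (LALM as a Judge) evaluation framework, which scores outputs progressively on textual faithfulness, style adherence, and naturalness to provide an objective, reproducible assessment and advance human-centered spoken interaction.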

Link: https://arxiv.org/abs/2509.09716
Authors: Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese and English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project’s homepage (this https URL).

[NLP-45] Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA

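[Quick read]: This paper addresses the root causes of hallucination in Large Language Models (LLMs), namely identifying and characterizing the key properties that make LLMs intrinsically vulnerable to it. The core of its approach is to convert the question-answering format of existing evaluation datasets (HaluEval and TruthfulQA) into various symbolic forms and systematically analyze how different symbolic properties affect the hallucination rate. The study finds that, although hallucination declines as models scale (from 79.0% for Gemma-2-2B to 63.9% for Gemma-2-27B), symbolic properties such as modifiers and named entities still trigger very high hallucination rates (83.87%-94.98%). This indicates a structural weakness in how current LLMs process symbolic inputs, which remains a fundamental obstacle to suppressing hallucination.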

Link: https://arxiv.org/abs/2509.09715
Authors: Naveen Lamba, Sanju Tiwari, Manas Gaur
Affiliations: Sharda University; University of Maryland, Baltimore County
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hallucination in Large Language Models (LLMs) is a well-studied problem. However, the properties that make LLMs intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model’s internal mechanisms. To solidify these properties, we used two established datasets, HaluEval and TruthfulQA, and converted their existing question-answering format into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs–regardless of their scale.
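
The "15 percentage point reduction" above is an absolute difference, which is easy to confuse with a relative one. The short calculation below uses only the three averages quoted in the abstract; the variable names are illustrative.

```python
# Average hallucination rates (percent) quoted in the abstract.
rates = {"Gemma-2-2B": 79.0, "Gemma-2-9B": 73.6, "Gemma-2-27B": 63.9}

# Absolute change from the smallest to the largest model, in percentage points.
absolute_drop = rates["Gemma-2-2B"] - rates["Gemma-2-27B"]

# The same change expressed relative to the 2B model's rate, in percent.
relative_drop = absolute_drop / rates["Gemma-2-2B"] * 100

print(f"{absolute_drop:.1f} points absolute, {relative_drop:.1f}% relative")
```

Either way of stating it, roughly 64% of responses remain hallucinated at the largest scale, which is the point the abstract goes on to make.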

[NLP-46] How Small Transformations Expose the Weakness of Semantic Similarity Measures

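[Quick read]: This paper addresses the validity of semantic similarity measures currently used in software engineering, and in particular whether large language models (LLMs) truly understand semantic relationships or merely recognize surface patterns. The study systematically evaluates 18 methods (word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms) and applies small, controlled changes to text and code to measure how sensitive each method is to semantic relationships. Its key findings are that embedding methods perform poorly partly because of how they calculate distances (for example, using Euclidean distance), and that switching to cosine similarity improves results by 24 to 66 percent. By contrast, LLM-based methods are more robust at distinguishing semantic differences: they assign low similarity (0.00-0.29) to semantically opposite content, whereas embedding methods often assign high similarity (0.82-0.99) incorrectly. The proposed remedy is therefore to choose the distance measure carefully and to favor LLM-based methods that rely on semantic reasoning.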

Link: https://arxiv.org/abs/2509.09714
Authors: Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Aziz Bonkoungou, Micheline Bénédicte Moumoula, Jordan Samhi, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande
Affiliations: University of Luxembourg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods’ poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
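
The distance-measure finding above has a simple geometric reading: Euclidean distance mixes vector length with direction, while cosine similarity looks at direction only. The toy vectors below are invented for illustration and are not embeddings from the study.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (ignores vector length)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between u and v (sensitive to length)."""
    return float(np.linalg.norm(u - v))

# Hypothetical embeddings: b points the same way as a but is three times longer;
# c has the same length as a but points in the opposite direction.
a = np.array([1.0, 2.0, 2.0])
b = 3.0 * a
c = -a

for name, v in (("b", b), ("c", c)):
    print(name, euclidean_distance(a, v), cosine_similarity(a, v))
```

In this construction Euclidean distance places `b` and `c` equally far from `a`, even though one is aligned with `a` and the other opposes it, while cosine similarity separates the two cases.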

[NLP-47] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

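[Quick read]: This paper addresses the inefficiency and noise accumulation of current Retrieval-Augmented Generation (RAG) methods on multi-hop queries. Existing methods often rely too heavily on iterative retrieval and waste many retrieval steps, and retrieving directly with the original complex query tends to return insufficiently relevant content, introducing redundant noise that degrades generation quality. The key to its solution is HANRAG, a heuristic-based framework driven by a powerful "revelator" that routes queries, decomposes complex queries into sub-queries, and filters noise from the retrieved documents, improving the system's adaptability and noise resistance across tasks of varying complexity.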

Link: https://arxiv.org/abs/2509.09713
Authors: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
Affiliations: Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system’s adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
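
The route, decompose, and filter steps described above can be laid out as a small control-flow sketch. It is an assumption-laden outline rather than HANRAG itself: in the paper a single "revelator" model drives these decisions, whereas here each decision is a hypothetical callable supplied by the caller.

```python
from typing import Callable, List

def hanrag_style_answer(query: str,
                        is_compound: Callable[[str], bool],
                        decompose: Callable[[str], List[str]],
                        retrieve: Callable[[str], List[str]],
                        is_relevant: Callable[[str, str], bool],
                        generate: Callable[[str, List[str]], str]) -> str:
    # Route: simple queries are retrieved directly, compound ones are split.
    sub_queries = decompose(query) if is_compound(query) else [query]
    evidence: List[str] = []
    for sq in sub_queries:
        # Filter noise: keep only documents judged relevant to this sub-query.
        evidence.extend(d for d in retrieve(sq) if is_relevant(sq, d))
    # Generate the final answer from the original query and the kept evidence.
    return generate(query, evidence)
```

Filtering per sub-query, before anything reaches the generator, is what keeps irrelevant passages from accumulating across hops in this outline.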

[NLP-48] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

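[Quick read]: This paper addresses how to use a small open-weight large language model (LLM) to deliver Acceptance and Commitment Therapy (ACT) accurately, in particular how to improve its intervention fidelity and empathy in simulated therapy sessions. The key to its solution is a preference-aligned policy optimization method, odds ratio policy optimization (ORPO). Compared with conventional supervised fine-tuning (SFT), ORPO is better at learning the therapeutic "process" of ACT rather than merely imitating "content", which significantly improves scores on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES). Explicit chain-of-thought (COT) reasoning yields a significant gain only for SFT-trained models, suggesting that its role as a cognitive scaffold depends on the training paradigm, whereas ORPO already provides a stronger intrinsic mechanism for high-quality ACT delivery.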

Link: https://arxiv.org/abs/2509.09712
Authors: Talha Tahir
Affiliations: University of Toronto
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ( \chi^2(5) = 185.15, p .001 ) and therapeutic empathy ( \chi^2(5) = 140.37, p .001 ). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ( p .001 ), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic process' over imitating content,’ a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.

[NLP-49] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry

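[Quick read]: This paper addresses the shortage of clinically valid resources for evaluating large language models (LLMs) in psychiatric practice: existing evaluation data rely mostly on small clinical interview corpora, social media text, or synthetic dialogues, which fail to reflect the complexity and professional depth of psychiatric reasoning. The key to its solution is PsychiatryBench, a rigorously curated benchmark grounded entirely in authoritative, expert-validated psychiatric textbooks and casebooks. It covers 11 question-answering task types (such as diagnostic reasoning, treatment planning, and longitudinal follow-up) with more than 5,300 expert-annotated items. Frontier models (such as Google Gemini and LLaMA 3) are evaluated systematically with both conventional metrics and an "LLM-as-judge" similarity scoring framework. The evaluation exposes shortcomings in clinical consistency and safety on multi-turn follow-up and management tasks, and it provides an extensible basis for developing specialized models suited to high-stakes mental health settings.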

Link: https://arxiv.org/abs/2509.09711
Authors: Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda
Affiliations: Compumacy for Artificial Intelligence solutions; Department of Behavioural Health, Saint Elizabeths Hospital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats, totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBioLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.

[NLP-50] Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data

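[Quick read]: This paper addresses the reliance of individual travel diary generation in traditional agent-based transportation models on large volumes of proprietary household travel survey data. It proposes a new approach that uses a large language model (LLM) to generate personas stochastically from open data (the American Community Survey, ACS, and the Smart Location Database, SLD) and then synthesizes travel diaries through direct prompting. The key points are as follows. First, open data are used to build demographically matched cohorts, and the LLM then simulates trip purpose, mode, count, and time intervals end to end. Second, a novel one-to-cohort realism score is introduced, which uses Jensen-Shannon divergence to quantify how closely the distributions of generated diaries match those of real diaries. This yields a quantifiable measure of synthetic diary quality that requires no additional calibration and supports the LLM's zero-shot viability and statistical representativeness.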

Link: https://arxiv.org/abs/2509.09710
Authors: Sepehr Golrokh Amin, Devin Rhoads, Fatemeh Fakhrmoosavi, Nicholas E. Lownes, John N. Ivan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM’s statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM’s zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
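The abstract above relies on Jensen-Shannon divergence to compare generated and real diary distributions. The following is a minimal sketch of that measure only, not the paper's scoring code: the mode shares are made-up illustrative numbers, and turning the divergence into a similarity as `1 - JSD` is an assumption made here for illustration.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    # Mixture distribution halfway between p and q.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical mode shares (car, transit, walk, bike); NOT taken from the paper.
real = [0.70, 0.10, 0.15, 0.05]
generated = [0.65, 0.15, 0.15, 0.05]

# Assumed mapping from divergence to a similarity-style score.
similarity = 1.0 - js_divergence(real, generated)
```

Identical distributions give a divergence of 0, and distributions with disjoint support give 1, which is what makes the measure convenient for a bounded score.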

[NLP-51] Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement

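[Quick read]: This paper addresses the ethical concerns raised when large language models (LLMs) generate incorrect or fabricated references in academic writing, as well as the reliance of current content quality evaluation on subjective human judgment, which lacks objectivity and consistency. The key to the solution is two quantitative evaluation metrics, content quality and reference validity, together with an iterative prompting method driven by the scores on these two metrics. Repeatedly refining the generation in this way significantly improves content quality and reduces inaccurate and fabricated references, addressing a critical ethical challenge in academic settings.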

Link: https://arxiv.org/abs/2509.09709
Authors: Jing Ren, Weiqi Wang
Affiliations: University of Technology Sydney
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics–content quality and reference validity–and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT’s writing performance. Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts.

[NLP-52] Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

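[Quick read]: This paper addresses the poorly understood internal mechanism behind a key safety behavior of instruction-tuned large language models (LLMs): refusing to respond to harmful prompts. The key to the solution is to use sparse autoencoders (SAEs) trained on residual-stream activations and a three-stage search (refusal direction, greedy filtering, and interaction discovery) to identify feature sets that causally flip the model from refusal to compliance, enabling fine-grained analysis of, and intervention in, safety behavior.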

Link: https://arxiv.org/abs/2509.09708
Authors: Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
Affiliations: 1. National University of Singapore; 2. Nanyang Technological University; 3. University of Malaya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
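Stage (2) of the search above, pruning a candidate set down to a minimal one, can be illustrated with a generic greedy loop. This is only a sketch of the general idea under stated assumptions: `still_flips` is a hypothetical callback standing in for "ablating this feature set still changes the behaviour", and the toy predicate below is invented for the example. The paper's actual procedure may differ.

```python
def greedy_prune(features, still_flips):
    """Drop features one at a time, keeping a removal only if the
    remaining set still satisfies the (hypothetical) predicate."""
    kept = list(features)
    for f in list(features):
        trial = [g for g in kept if g != f]
        if still_flips(trial):
            kept = trial  # f was not needed
    return kept

# Toy predicate (an assumption): the effect needs features 3 and 7 together.
def toy_predicate(feature_set):
    return {3, 7} <= set(feature_set)

minimal = greedy_prune([1, 3, 5, 7, 9], toy_predicate)
```

A single greedy pass like this returns a set from which no single feature can be removed, which is minimal in that sense but not necessarily the smallest possible set.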

[NLP-53] LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm

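[Quick read]: This paper addresses the inefficient search of traditional metaheuristics on complex combinatorial optimization problems, which stems from their failure to exploit the structural properties of individual problem instances. The key to the solution is a new framework that tightly integrates large language models (LLMs) with a Biased Random-Key Genetic Algorithm (BRKGA). A human-LLM collaborative process designs a set of computationally efficient, instance-specific metrics, and the LLM uses these metrics to generate an a priori, instance-driven heuristic bias that steers the BRKGA toward more promising regions of the search space. This improves solution quality, particularly on highly complex instances.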

Link: https://arxiv.org/abs/2509.09707
Authors: Camilo Chacón Sartori, Martín Isla Pino, Pedro Pinacho-Davidson, Christian Blum
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to a journal for review


Abstract:Integrating Large Language Models (LLMs) within metaheuristics opens a novel path for solving complex combinatorial optimization problems. While most existing approaches leverage LLMs for code generation to create or refine specific heuristics, they often overlook the structural properties of individual problem instances. In this work, we introduce a novel framework that integrates LLMs with a Biased Random-Key Genetic Algorithm (BRKGA) to solve the NP-hard Longest Run Subsequence problem. Our approach extends the instance-driven heuristic bias paradigm by introducing a human-LLM collaborative process to co-design and implement a set of computationally efficient metrics. The LLM analyzes these instance-specific metrics to generate a tailored heuristic bias, which steers the BRKGA toward promising areas of the search space. We conduct a comprehensive experimental evaluation, including rigorous statistical tests, convergence and behavioral analyses, and targeted ablation studies, comparing our method against a standard BRKGA baseline across 1,050 generated instances of varying complexity. Results show that our top-performing hybrid, BRKGA+Llama-4-Maverick, achieves statistically significant improvements over the baseline, particularly on the most complex instances. Our findings confirm that leveraging an LLM to produce an a priori, instance-driven heuristic bias is a valuable approach for enhancing metaheuristics in complex optimization domains.

[NLP-54] Differential Robustness in Transformer Language Models: Empirical Evaluation Under Adversarial Text Attacks ACL

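[Quick read]: This paper addresses the limited robustness of large language models (LLMs) under adversarial attack, and in particular evaluates how sensitive different model architectures are to text perturbations. Using systematically designed adversarial tests with TextFooler and BERTAttack, the study finds that RoBERTa-Base and Flan-T5 are highly resistant, each with an attack success rate of 0%, whereas BERT-Base is highly vulnerable, with an attack success rate of up to 93.75%. The key takeaway is that current defensive mechanisms are effective but computationally expensive, and that more efficient and practical defensive strategies are needed to improve LLM security.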

Link: https://arxiv.org/abs/2509.09706
Authors: Taniya Gidatkar, Oluwaseun Ajao, Matthew Shardlow
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 4 tables, to appear in proceedings of Recent Advances in Natural Language Processing (RANLP 2025) and ACL Anthology


Abstract:This study evaluates the resilience of large language models (LLMs) against adversarial attacks, specifically focusing on Flan-T5, BERT, and RoBERTa-Base. Using systematically designed adversarial tests through TextFooler and BERTAttack, we found significant variations in model robustness. RoBERTa-Base and Flan-T5 demonstrated remarkable resilience, maintaining accuracy even when subjected to sophisticated attacks, with attack success rates of 0%. In contrast, BERT-Base showed considerable vulnerability, with TextFooler achieving a 93.75% success rate in reducing model accuracy from 48% to just 3%. Our research reveals that while certain LLMs have developed effective defensive mechanisms, these safeguards often require substantial computational resources. This study contributes to the understanding of LLM security by identifying existing strengths and weaknesses in current safeguarding approaches and proposes practical recommendations for developing more efficient and effective defensive strategies.
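The 93.75% figure is consistent with a common definition of attack success rate: the share of originally correct predictions that the attack turns into errors. The short check below assumes that definition (the abstract does not spell it out) and assumes the attack never turns a wrong prediction into a correct one.

```python
def attack_success_rate(acc_before, acc_after):
    """Share of originally correct predictions flipped by the attack.
    Assumes only correctly classified examples are attacked."""
    return (acc_before - acc_after) / acc_before

# Accuracy figures reported in the abstract: 48% before, 3% after.
rate = attack_success_rate(0.48, 0.03)  # roughly 0.9375, i.e. 93.75%
```

Under this reading, 45 of every 48 correct predictions are flipped, which matches the reported rate.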

[NLP-55] The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks

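[Quick read]: This paper addresses the consistency of small language models (small LLMs, 2B-8B parameters) when answering the same question multiple times, and in particular how to obtain stable answers while preserving accuracy. The key to the approach is a systematic evaluation of how model size (small vs. medium-sized models, 50B-80B), inference temperature, and fine-tuning status (finetuned vs. base), among other factors, affect answer consistency, together with new analytical and visualization tools for quantifying the trade-off between consistency and accuracy. Experiments show that at low inference temperatures small models typically answer 50%-80% of questions consistently, that accuracy among consistent answers correlates positively with overall accuracy, and that medium-sized models show much higher answer consistency.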

Link: https://arxiv.org/abs/2509.09705
Authors: Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Yago Primerano
Affiliations: IBM Research Brazil
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:This work explores the consistency of small LLMs (2B-8B parameters) in answering the same question multiple times. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
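The two quantities discussed above, the share of consistently answered questions and the accuracy among those, can be sketched as follows. The threshold for "consistent" is an assumption made here (all 10 trials agree), the data are invented, and the function name is hypothetical; the paper may define consistency differently.

```python
from collections import Counter

def consistency_report(answers, gold, min_agree=10):
    """answers: {question_id: [choice per trial]}; gold: {question_id: choice}.
    Returns (share of consistent questions, accuracy among consistent ones)."""
    consistent = {}
    for qid, trials in answers.items():
        choice, count = Counter(trials).most_common(1)[0]
        if count >= min_agree:  # assumed criterion: enough trials agree
            consistent[qid] = choice
    share = len(consistent) / len(answers)
    correct = sum(consistent[q] == gold[q] for q in consistent)
    accuracy = correct / len(consistent) if consistent else 0.0
    return share, accuracy

# Invented toy data: 4 questions, 10 trials each.
answers = {
    "q1": ["A"] * 10,
    "q2": ["B"] * 9 + ["C"],  # one deviating trial, so not consistent
    "q3": ["D"] * 10,
    "q4": ["A"] * 10,
}
gold = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
share, accuracy = consistency_report(answers, gold)
```

Lowering `min_agree` relaxes the criterion, which is one way to explore the trade-off between consistency and accuracy.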

[NLP-56] Temporal Preferences in Language Models for Long-Horizon Assistance

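[Quick read]: This paper investigates whether language models (LMs) show future- or present-oriented preferences in intertemporal choice, and whether those preferences can be systematically manipulated. The core questions are whether LM time preferences differ in direction and whether prompt engineering can shift them. The key to the approach is an operational, quantifiable metric, the Manipulability of Time Orientation (MTO), which measures the change in a model's time preference between future-oriented and present-oriented prompts. The study finds that reasoning-focused models (such as DeepSeek-Reasoner and grok-3-mini) lean toward delayed rewards under future-oriented prompts and can internalize a future orientation as part of how they frame themselves as AI decision makers. This suggests that prompt design can effectively steer model behavior, which provides design guidance for AI assistants aligned with heterogeneous long-horizon goals and motivates research on personalized contextual calibration and socially aware deployment.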

Link: https://arxiv.org/abs/2509.09704
Authors: Ali Mazyaki, Mohammad Naghizadeh, Samaneh Ranjkhah Zonouzaghi, Hossein Setareh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:


Abstract:We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM’s revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.
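The abstract defines MTO as the change in revealed time preference between future- and present-oriented prompts. A minimal sketch follows, assuming that "revealed time preference" is summarized as the share of choices for the later option; that operationalization, the labels, and the toy data are assumptions made here, not details from the paper.

```python
def later_share(choices):
    """Share of trials in which the later (delayed) option was chosen."""
    return sum(c == "later" for c in choices) / len(choices)

def mto(future_prompt_choices, present_prompt_choices):
    """Assumed form of MTO: difference in later-option share between
    future-oriented and present-oriented prompting."""
    return later_share(future_prompt_choices) - later_share(present_prompt_choices)

# Invented toy outcomes for one model on ten time-tradeoff items.
future = ["later"] * 8 + ["sooner"] * 2
present = ["later"] * 3 + ["sooner"] * 7
score = mto(future, present)
```

A score near 0 would indicate that prompting barely moves the model's choices, while a large positive score indicates strong manipulability toward the future.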

[NLP-57] CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor EMNLP2025

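[Quick read]: This paper addresses intellectual property (IP) protection for large language models (LLMs) in real-world deployment, in particular the risks of model theft and unauthorized redistribution. Existing fingerprinting techniques face an inherent trade-off among stealth, robustness, and generalizability: some are detectable through distributional shifts, some are vulnerable to adversarial modifications, and some are invalidated once the fingerprint is revealed. The paper proposes CTCC, a novel rule-driven fingerprinting framework whose core innovation is to embed the fingerprint by encoding contextual correlations across multiple dialogue turns (such as counterfactual relations) rather than relying on single-turn or token-level triggers. This design allows fingerprint verification under black-box access alone, reduces false positives and fingerprint leakage, and supports continued fingerprint construction under a shared semantic rule, so that the scheme remains effective even if some triggers are exposed. Experiments across multiple LLM architectures show that CTCC achieves stronger stealth and robustness than prior methods.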

Link: https://arxiv.org/abs/2509.09703
Authors: Zhenhua Xu, Xixiang Zhao, Xubin Yue, Shengwei Tian, Changting Lin, Meng Han
Affiliations: Zhejiang University; GenTel.io; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025 Main Conference


Abstract:The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at this https URL.

[NLP-58] Creativity Benchmark: A benchmark for marketing creativity for LLM models

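[Quick read]: This paper addresses the lack of a reliable, reproducible evaluation framework for large language models (LLMs) on marketing creativity tasks. The key to the solution is Creativity Benchmark, a systematic evaluation framework covering 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). A total of 678 practising creatives made pairwise preference judgments over 11,012 anonymised comparisons, and Bradley-Terry models were used to quantify model performance. The results show tightly clustered performance across brands and prompt types (a top-bottom spread of about 0.45, corresponding to a win probability of about 61%), notable variation in model diversity, and strong sensitivity to prompt reframing. Automated judges correlate weakly with human rankings and show biases, which indicates that expert human evaluation remains necessary and that diversity-aware workflows are needed to improve creative output.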

Link: https://arxiv.org/abs/2509.09702
Authors: Ninad Bhat, Kieran Browne, Pip Bingemann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 14 figures


Abstract:We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about 61% of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
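The step from a spread of about 0.45 to a win probability of about 0.61 follows from the standard logistic form of the Bradley-Terry model, in which the probability that item i beats item j is the logistic function of the difference in their strengths. The check below assumes that parameterization, which the reported numbers are consistent with.

```python
import math

def bt_win_probability(theta_i, theta_j):
    """Bradley-Terry win probability under the logistic parameterization."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

# Top-bottom spread reported in the abstract: about 0.45.
p_top_beats_bottom = bt_win_probability(0.45, 0.0)  # about 0.61
```

Two equally rated models would each win half the time, so a value of 0.61 corresponds to a fairly small edge.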

[NLP-59] Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task

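[Quick read]: This paper addresses the limited performance of end-to-end speech-to-text translation caused by the scarcity of paired speech-text data. The key to the solution is to re-examine the Multi-Task Learning (MTL) framework from a regularization perspective, using three sources of regularization: consistency regularization across modalities, R-drop within the same modality (to improve robustness), and the coefficient of the MT loss as an additional source. The authors further introduce the "regularization horizon", the optimal regularization contour in the high-dimensional space, and show that tuning hyperparameters within this region achieves near state-of-the-art performance.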

Link: https://arxiv.org/abs/2509.09701
Authors: JungHo Jung, Junhyun Lee
Affiliations: University of Pennsylvania; Samsung Research
Categories: Computation and Language (cs.CL)
Comments:


Abstract:End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset.

[NLP-60] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection ECAI2025

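[Quick read]: This paper addresses the reliability problem caused by large language models (LLMs) generating false or inaccurate text (hallucinations) in practical applications. The key to the solution is Cross-Layer Attention Probing (CLAP), a novel activation probing technique that processes the activations across the entire residual stream as a joint sequence. This enables fine-grained hallucination detection: distinguishing hallucinated from non-hallucinated content among different sampled responses to the same prompt. That property lets CLAP support a detect-then-mitigate strategy that improves detection accuracy while reducing hallucinations and increasing overall model reliability.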

Link: https://arxiv.org/abs/2509.09700
Authors: Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
Affiliations: Robert Gordon University; Toyota Motor Europe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published at the TRUST-AI workshop, ECAI 2025


Abstract:With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.

[NLP-61] Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs

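[Quick read]: This paper addresses automated coding of clinical documents: mapping unstructured clinical text efficiently and accurately to a standardized disease classification (such as ICD-9) to support clinical research, hospital administration, and patient care. Manual coding is slow, and existing automated methods perform poorly on high-dimensional, long-tailed target spaces. The key to the solution is a structured input representation based on document-level knowledge graphs (KGs). Built from the patient-centred input documents, the KG retains 90% of the information using 23% of the original text, which strengthens the model's understanding of the input. Integrated into the state-of-the-art PLM-ICD architecture, the method improves Macro-F1 by up to 3.20% on popular benchmarks while improving training efficiency and explainability.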

Link: https://arxiv.org/abs/2509.09699
Authors: Mingyang Li, Viktor Schlegel, Tingting Mu, Warren Del-Pinto, Goran Nenadic
Affiliations: The University of Manchester; Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient’s condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline.

[NLP-62] Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors

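[Quick read]: This paper addresses the long-standing challenge of accurately simulating user behavior in recommender systems, which stems from the complex and stochastic nature of user interactions. Existing approaches mostly rely on large language models (LLMs) for behavior modeling but face three key problems: (i) parsing large-scale tabular user-item interaction data effectively and continuously; (ii) overcoming pre-training-induced inductive biases to learn user-specific knowledge accurately; and (iii) achieving both at the scale of millions of users. The core of the proposed solution is to extract robust textual user representations with a frozen LLM and to build cost-effective, resource-efficient user agents on fine-tuned Small Language Models (SLMs). Training multiple low-rank adapters for different user groups, or personas, then balances scalability and performance, helping to narrow the gap between offline metrics and real-world performance.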

Link: https://arxiv.org/abs/2509.09689
Authors: Himanshu Thakur, Eshani Agrawal, Smruthi Mukund
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:A long-standing challenge in developing accurate recommendation models is simulating user behavior, mainly due to the complex and stochastic nature of user interactions. Towards this, one promising line of work has been the use of Large Language Models (LLMs) for simulating user behavior. However, aligning these general-purpose large pre-trained models with user preferences necessitates: (i) effectively and continuously parsing large-scale tabular user-item interaction data, (ii) overcoming pre-training-induced inductive biases to accurately learn user specific knowledge, and (iii) achieving the former two at scale for millions of users. While most previous works have focused on complex methods to prompt an LLM or fine-tune it on tabular interaction datasets, our approach shifts the focus to extracting robust textual user representations using a frozen LLM and simulating cost-effective, resource-efficient user agents powered by fine-tuned Small Language Models (SLMs). Further, we showcase a method for training multiple low-rank adapters for groups of users or persona, striking an optimal balance between scalability and performance of user behavior agents. Our experiments provide compelling empirical evidence of the efficacy of our methods, demonstrating that user agents developed using our approach have the potential to bridge the gap between offline metrics and real-world performance of recommender systems.

[NLP-63] AI-Powered Assistant for Long-Term Access to RHIC Knowledge

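[Quick read]: This paper addresses how to preserve, over the long term, the vast data holdings (about 1 ExaByte) of the Relativistic Heavy Ion Collider (RHIC) and the scientific knowledge embedded in them as the collider concludes 25 years of operation. The key to the solution is an AI-powered assistant built on generative AI that uses Retrieval-Augmented Generation (RAG) and the Model Context Protocol to index structured and unstructured content from RHIC experiments and to provide domain-adapted natural language interaction, with the aim of supporting reproducibility, education, and future discovery.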

Link: https://arxiv.org/abs/2509.09688
Authors: Mohammad Atif, Vincent Garonne, Eric Lancon, Jerome Lauret, Alexandr Prozorov, Michal Vranovsky
Affiliations: Brookhaven National Laboratory; Czech Technical University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:As the Relativistic Heavy Ion Collider (RHIC) at Brookhaven National Laboratory concludes 25 years of operation, preserving not only its vast data holdings ($\sim$1 ExaByte) but also the embedded scientific knowledge becomes a critical priority. The RHIC Data and Analysis Preservation Plan (DAPP) introduces an AI-powered assistant system that provides natural language access to documentation, workflows, and software, with the aim of supporting reproducibility, education, and future discovery. Built upon Large Language Models using Retrieval-Augmented Generation and the Model Context Protocol, this assistant indexes structured and unstructured content from RHIC experiments and enables domain-adapted interaction. We report on the deployment, computational performance, ongoing multi-experiment integration, and architectural features designed for a sustainable and explainable long-term AI access. Our experience illustrates how modern AI/ML tools can transform the usability and discoverability of scientific legacy data.

[NLP-64] Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation

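[Quick read]: This paper addresses the lack of a dedicated dataset for the text-to-SQL task in the process mining domain, with the aim of supporting natural language querying of databases, improving accessibility for users without SQL expertise and productivity for experts. The key to the solution is text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset comprising 1,655 natural language utterances (including human-generated paraphrases), 205 SQL statements, and ten qualifiers. Data quality is ensured through manual curation by experts, professional translation, and a detailed annotation process, so that the dataset fits the specialized vocabulary and single-table relational structures of process mining and provides a reliable basis for evaluating and developing text-to-SQL models.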

Link: https://arxiv.org/abs/2509.09684
Authors: Bruno Yui Yamate, Thais Rodrigues Neubauer, Marcelo Fantinato, Sarajane Marques Peres
Affiliations: University of São Paulo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: 33 pages


Abstract:This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those that are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks.

[NLP-65] DB3 Team's Solution For Meta KDD Cup '25

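[Quick read]: This paper addresses complex information retrieval and hallucination control in multi-modal, multi-turn question answering, specifically the cross-modal knowledge reasoning challenge posed by the Meta CRAG-MM Challenge 2025. The key to the solution is an integrated architecture. On one side, task-specific retrieval pipelines handle image-indexed knowledge graphs, web sources, and multi-turn conversation context. On the other, a unified LLM-tuning strategy uses supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) for refusal training to suppress hallucinations. The system placed 2nd in Task 1, 2nd in Task 2, and 1st in Task 3, and won the grand prize for excellence in ego-centric (first-person perspective) queries.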

Link: https://arxiv.org/abs/2509.09681
Authors: Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao
Affiliations: Peking University; Huawei Technologies Co., Ltd
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:This paper presents the db3 team’s winning solution for the Meta CRAG-MM Challenge 2025 at KDD Cup’25. Addressing the challenge’s unique multi-modal, multi-turn question answering benchmark (CRAG-MM), we developed a comprehensive framework that integrates tailored retrieval pipelines for different tasks with a unified LLM-tuning approach for hallucination control. Our solution features (1) domain-specific retrieval pipelines handling image-indexed knowledge graphs, web sources, and multi-turn conversations; and (2) advanced refusal training using SFT, DPO, and RL. The system achieved 2nd place in Task 1, 2nd place in Task 2, and 1st place in Task 3, securing the grand prize for excellence in ego-centric queries through superior handling of first-person perspective challenges.

[NLP-66] Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

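[Quick read]: This paper addresses length unfairness in sequence-level reinforcement learning caused by a fixed clip range: when PPO/GRPO-style clipping is transplanted to the sequence level, a fixed clip interval systematically reweights short versus long responses and distorts the effective objective. The proposed FSPO (Fair Sequence Policy Optimization) enforces length fairness directly in the importance-sampling (IS) weight space. It uses a Gaussian-motivated clipping band that includes a KL-corrected drift term and scales as $\sqrt{L}$, which keeps the Length Reweighting Error (LRE) small; the theory shows that a small LRE yields a directional cosine guarantee between the clipped and true updates. Experiments show that the method flattens clip rates across length bins, stabilizes training, and outperforms all baselines.

Link: https://arxiv.org/abs/2509.09177
Authors: Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
Affiliations: University of Chicago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract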

Abstract:We propose FSPO (Fair Sequence Policy Optimization), a sequence-level reinforcement learning method for LLMs that enforces length-fair clipping directly in the importance-sampling (IS) weight space. We revisit sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping is transplanted to sequences: a fixed clip range systematically reweights short vs. long responses, distorting the effective objective. Theoretically, we formalize length fairness via a Length Reweighting Error (LRE) and prove that small LRE yields a directional cosine guarantee between the clipped and true updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the sequence log-IS ratio with a band that applies a KL-corrected drift term and scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins, stabilizes training, and outperforms all baselines across multiple evaluation datasets.
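
The length-aware band described in the abstract can be illustrated with a minimal sketch. It assumes the band is centered on a drift that grows linearly with the response length L and has a half-width proportional to sqrt(L); the function name and the parameters `drift_per_token` and `width` are hypothetical, and the paper's exact KL-corrected form is not given in this digest.

```python
import math

def length_fair_clip(log_ratio, length, drift_per_token=0.0, width=0.2):
    """Clip a sequence-level log importance ratio to a length-aware band.

    drift_per_token and width are hypothetical knobs for illustration,
    not values or names taken from the paper.
    """
    center = drift_per_token * length        # drift term grows linearly with L
    half_width = width * math.sqrt(length)   # band widens as sqrt(L)
    return max(center - half_width, min(center + half_width, log_ratio))

# The same log ratio is clipped for a short response but not for a long one.
short = length_fair_clip(1.0, length=4)     # band is [-0.4, 0.4]
long_ = length_fair_clip(1.0, length=100)   # band is [-2.0, 2.0]
```

With a fixed clip range, both responses would be treated identically regardless of length, which is the mismatch the abstract points out.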

[NLP-67] Generative Engine Optimization: How to Dominate AI Search

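[Quick read]: This paper addresses the breakdown of traditional search engine optimization (SEO) caused by the rise of generative AI search engines such as ChatGPT, Perplexity, and Gemini, and proposes a strategic framework for these systems called Generative Engine Optimization (GEO). The core challenge is that AI search synthesizes answers with cited sources and strongly prefers third-party authoritative (earned) media, so SEO strategies built on brand-owned content and social-platform traffic are no longer effective. The recommended strategy has four parts: (1) make content machine-scannable and easy to justify, so that AI systems can understand and cite it; (2) dominate earned media to build AI-perceived authority; (3) adopt strategies specific to each AI search engine and language; and (4) help niche brands overcome the "big brand bias" of generative search. The study is based on large-scale controlled experiments and gives practitioners an empirically grounded path.

Link: https://arxiv.org/abs/2509.08919
Authors: Mahe Chen, Xiaoxuan Wang, Kaiwen Chen, Nick Koudas
Affiliations: University of Toronto
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Click to view abstract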

Abstract:The rapid adoption of generative AI-powered search engines like ChatGPT, Perplexity, and Gemini is fundamentally reshaping information retrieval, moving from traditional ranked lists to synthesized, citation-backed answers. This shift challenges established Search Engine Optimization (SEO) practices and necessitates a new paradigm, which we term Generative Engine Optimization (GEO). This paper presents a comprehensive comparative analysis of AI Search and traditional web search (Google). Through a series of large-scale, controlled experiments across multiple verticals, languages, and query paraphrases, we quantify critical differences in how these systems source information. Our key findings reveal that AI Search exhibits a systematic and overwhelming bias towards Earned media (third-party, authoritative sources) over Brand-owned and Social content, a stark contrast to Google’s more balanced mix. We further demonstrate that AI Search services differ significantly from each other in their domain diversity, freshness, cross-language stability, and sensitivity to phrasing. Based on these empirical results, we formulate a strategic GEO agenda. We provide actionable guidance for practitioners, emphasizing the critical need to: (1) engineer content for machine scannability and justification, (2) dominate earned media to build AI-perceived authority, (3) adopt engine-specific and language-aware strategies, and (4) overcome the inherent “big brand bias” for niche players. Our work provides the foundational empirical analysis and a strategic framework for achieving visibility in the new generative search landscape.

[NLP-68] Error Analysis in a Modular Meeting Transcription System

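[Quick read]: This paper studies how cross-channel leakage in speech separation affects meeting transcription, focusing on leakage in regions where only the primary speaker is active. It extends a leakage analysis framework to be sensitive to temporal locality, which makes such leakage easier to detect. Comparing segmentation strategies, it finds that advanced speaker diarization reduces the gap to oracle segmentation by a third relative to a simple energy-based voice activity detection (VAD), and it identifies the main factors behind the remaining gap. The system achieves state-of-the-art performance on LibriCSS among systems whose recognition module is trained on LibriSpeech data only.

Link: https://arxiv.org/abs/2509.10143
Authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
Affiliations: RWTH Aachen University; Fraunhofer Institute for Digital Media Technology IDMT; Institute for Speech Communication
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at ITG Conference on Speech Communication 2025

Click to view abstract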

Abstract:Meeting transcription is a field of high relevance and remarkable progress in recent years. Still, challenges remain that limit its performance. In this work, we extend a previously proposed framework for analyzing leakage in speech separation with proper sensitivity to temporal locality. We show that there is significant leakage to the cross channel in areas where only the primary speaker is active. At the same time, the results demonstrate that this does not affect the final performance much as these leaked parts are largely ignored by the voice activity detection (VAD). Furthermore, different segmentations are compared showing that advanced diarization approaches are able to reduce the gap to oracle segmentation by a third compared to a simple energy-based VAD. We additionally reveal what factors contribute to the remaining difference. The results represent state-of-the-art performance on LibriCSS among systems that train the recognition module on LibriSpeech data only.

[NLP-69] Unified Learnable 2D Convolutional Feature Extraction for ASR

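[Quick read]: This paper addresses limitations in the design of neural front-ends for automatic speech recognition (ASR): existing methods remain heavily influenced by classical feature extraction, and their architectures are assembled from disparate layer types rather than unified, which limits generality and parameter efficiency. The proposed solution is a unified 2D convolutional front-end that systematically reduces reliance on existing techniques, yielding a generic, parameter-efficient feature extractor suited to settings with limited computational resources. Experiments show that it matches the performance of existing supervised learnable feature extractors.

Link: https://arxiv.org/abs/2509.10031
Authors: Peter Vieting, Benedikt Hilmes, Ralf Schlüter, Hermann Ney
Affiliations: RWTH Aachen University; Fraunhofer Institute for Telecommunications
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at ITG Conference on Speech Communication 2025

Click to view abstract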

Abstract:Neural front-ends represent a promising approach to feature extraction for automatic speech recognition (ASR) systems as they make it possible to learn specifically tailored features for different tasks. Yet, many of the existing techniques remain heavily influenced by classical methods. While this inductive bias may ease the system design, our work aims to develop a more generic front-end for feature extraction. Furthermore, we seek to unify the front-end architecture, in contrast to existing approaches that apply a composition of several layer topologies originating from different sources. The experiments systematically show how to reduce the influence of existing techniques to achieve a generic front-end. The resulting 2D convolutional front-end is parameter-efficient and suitable for a scenario with limited computational resources, unlike large models pre-trained on unlabeled audio. The results demonstrate that this generic unified approach is not only feasible but also matches the performance of existing supervised learnable feature extractors.

[NLP-70] Whisper Has an Internal Word Aligner

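[Quick read]: This paper addresses how to obtain accurate word-level timestamps from strong automatic speech recognizers such as Whisper; existing approaches either need additional training or perform poorly under strict tolerances (20 to 100 ms). The key findings are, first, that Whisper contains attention heads that capture accurate word alignments and differ distinctly from heads that do not, and second, that using characters instead of wordpieces as input gives finer alignments. The method filters attention heads while teacher forcing Whisper with characters, so it produces more accurate word-level timestamps without any training.

Link: https://arxiv.org/abs/2509.09987
Authors: Sung-Lin Yeh, Yen Meng, Hao Tang
Affiliations: Centre for Speech Technology Research, University of Edinburgh
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: ASRU 2025

Click to view abstract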

Abstract:There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper. Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.
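
The role of the tolerance in this kind of evaluation can be shown with a small sketch. It assumes a simple metric in which a predicted word boundary counts as correct when it lies within the tolerance of the reference boundary; the function name and the example timestamps are hypothetical, and the paper's exact metric may differ.

```python
def boundary_accuracy(pred_ms, ref_ms, tolerance_ms):
    """Fraction of predicted boundaries within tolerance_ms of the reference.

    Timestamps are integers in milliseconds; the two lists are assumed to be
    aligned one-to-one (same words, same order).
    """
    if len(pred_ms) != len(ref_ms) or not ref_ms:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(abs(p - r) <= tolerance_ms for p, r in zip(pred_ms, ref_ms))
    return hits / len(ref_ms)

pred = [500, 1020, 1800]   # hypothetical predicted boundaries
ref = [510, 1000, 1950]    # hypothetical reference boundaries

loose = boundary_accuracy(pred, ref, tolerance_ms=200)   # all three count
strict = boundary_accuracy(pred, ref, tolerance_ms=50)   # the 150 ms miss fails
```

A loose 200 ms tolerance hides the 150 ms error on the third boundary, which is why a stricter tolerance gives a more discriminating comparison.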

[NLP-71] HypoGeneAgent : A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets

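[Quick read]: This paper addresses the subjectivity of cluster resolution selection and functional annotation in single-cell and Perturb-seq studies, where conventional practice relies on heuristics and expert curation to choose a resolution and assign Gene Ontology (GO) annotations, with no objective quantitative criterion. The proposed framework, HYPOGENEAGENT, is driven by a large language model (LLM). The LLM first generates GO-based hypotheses with confidence scores for each gene program or perturbation module. Sentence embeddings of these hypotheses are then used to compute intra-cluster agreement and inter-cluster separation, and the two are combined into an optimizable Resolution Score that selects clustering granularity and functional annotation automatically.

Link: https://arxiv.org/abs/2509.09740
Authors: Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov
Affiliations: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract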

Abstract:Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pair-wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra-cluster agreement, and (ii) their external distinctiveness, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit alignment with known pathways compared to classical metrics such as silhouette score and modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies.
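
The two quantities in the abstract can be sketched in a few lines. The example assumes that each cluster is given as a list of embedding vectors for its predicted descriptions, and that the two quantities are combined by a plain difference (mean intra-cluster similarity minus mean inter-cluster similarity); that combination rule and the function names are assumptions for illustration, since the abstract does not state the exact formula.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(values):
    return sum(values) / len(values) if values else 0.0

def resolution_score(clusters):
    """clusters: list of clusters, each a list of embedding vectors."""
    intra, inter = [], []
    labeled = [(c, vec) for c, vecs in enumerate(clusters) for vec in vecs]
    for (ca, u), (cb, v) in combinations(labeled, 2):
        # Same cluster -> agreement term; different clusters -> separation term.
        (intra if ca == cb else inter).append(cosine(u, v))
    return mean(intra) - mean(inter)

# Toy embeddings: two coherent, mutually orthogonal clusters.
well_separated = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
# Toy embeddings: every description looks the same, so clusters are redundant.
redundant = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
```

Evaluating `resolution_score` at several clustering granularities and keeping the highest-scoring one mirrors the selection step described above, under the stated assumptions.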

Computer Vision

[CV-0] GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

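[Quick read]: This paper addresses the limited generalization of existing zero-shot vision-and-language navigation (VLN) methods in continuous environments, particularly for real-world deployment. Prior methods mostly target discrete environments or require unsupervised training in continuous simulators, so they transfer poorly to the real world. The proposed solution is a training-free graph constraint optimization framework. It decomposes natural-language instructions into explicit spatial constraints and builds a constraint library covering all types of spatial relationship, so that navigation guidance becomes a graph constraint optimization problem. Concretely, the instruction is converted into a directed acyclic graph (DAG) whose nodes and edges are used as queries to retrieve constraints from the library; a constraint solver then determines waypoint positions, from which the robot's path and goal are planned. A navigation tree with backtracking handles cases with no solution or multiple solutions, improving zero-shot adaptation to unseen environments and navigation efficiency.

Link: https://arxiv.org/abs/2509.10454
Authors: Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu
Affiliations: Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Beijing National Research Center for Information Science and Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CoRL 2025. Project page: this https URL

Click to view abstract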

Abstract:In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationship mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot’s navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and the backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
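
The pipeline in the abstract (instruction to DAG, library lookup, constraint solving, backtracking) can be illustrated with a toy sketch. Everything here is hypothetical: the grid world, the object positions, the two relations in `LIBRARY`, and the brute-force backtracking search merely stand in for the paper's constraint library and solver, which are not described in this digest.

```python
from graphlib import TopologicalSorter
from itertools import product

# Hypothetical scene: object positions on a small 2D grid.
objects = {"sofa": (2, 2), "door": (6, 2)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Hypothetical constraint library: relation name -> predicate(waypoint, anchor).
LIBRARY = {
    "near": lambda w, a: 0 < manhattan(w, a) <= 1,
    "right_of": lambda w, a: w[0] > a[0] and w[1] == a[1],
}

# Instruction as a DAG: each waypoint lists (relation, anchor) edges, where an
# anchor is an object or an earlier waypoint.
graph = {
    "w1": [("near", "sofa")],
    "w2": [("right_of", "w1"), ("near", "door")],
}

def solve(graph, objects, grid=range(8)):
    """Place waypoints in topological order, backtracking on dead ends."""
    deps = {w: {a for _, a in edges if a in graph} for w, edges in graph.items()}
    order = list(TopologicalSorter(deps).static_order())
    candidates = list(product(grid, grid))

    def backtrack(i, placed):
        if i == len(order):
            return dict(placed)
        w = order[i]
        for pos in candidates:
            known = {**objects, **placed}
            if all(LIBRARY[rel](pos, known[a]) for rel, a in graph[w]):
                placed[w] = pos
                result = backtrack(i + 1, placed)
                if result is not None:
                    return result
                del placed[w]          # undo and try the next candidate
        return None                    # no consistent placement exists

    return backtrack(0, {})

path = solve(graph, objects)
```

When several placements satisfy the constraints, this sketch returns the first one in grid order; when none does, it returns `None`, the two situations that the abstract handles with its navigation tree and backtracking mechanism.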

[CV-1] SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimers Prediction Tasks and Datasets

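[Quick read]: This paper addresses three limitations of current deep learning models for Alzheimer's disease (AD) prediction: scarce labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. It adapts three state-of-the-art temporal self-supervised learning (SSL) approaches and extends them to handle variable-length input sequences and learn robust spatial features. Pre-trained on four public datasets comprising 3,161 patients, the SSL model outperforms supervised learning on six of seven downstream tasks and shows adaptability across tasks, which suggests potential for clinical use.

Link: https://arxiv.org/abs/2509.10453
Authors: Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract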

Abstract:Alzheimer’s disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer’s prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer’s prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at this https URL.

[CV-2] InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis ICCV2025

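[Quick read]: This paper addresses the steep growth in computational cost of current diffusion models for high-resolution image generation, in particular 4K generation times above 100 seconds. The proposed framework, InfGen, adds a second generation stage on top of latent diffusion models: it treats the fixed-size latent produced by the diffusion model as a content representation and uses a one-step generator to decode images at arbitrary resolution. High-resolution generation therefore needs no retraining of the diffusion model, and 4K generation time drops to under 10 seconds.

Link: https://arxiv.org/abs/2509.10441
Authors: Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, Lei Bai
Affiliations: Hong Kong University of Science and Technology; Shanghai Artificial Intelligence Laboratory; The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025

Click to view abstract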

Abstract:Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present InfGen, replacing the VAE decoder with the new generator, for generating images at any resolution from a fixed-size latent without retraining the diffusion models, which simplifies the process, reduces computational complexity, and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.

[CV-3] Multimodal SAM-adapter for Semantic Segmentation

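[Quick read]: This paper addresses the degraded performance of RGB-based semantic segmentation under adverse conditions such as poor lighting, occlusion, and bad weather, which limits robustness in critical settings like autonomous driving, medical imaging, and robotics. The proposed MM SAM-adapter framework uses an adapter network to inject fused multimodal features (for example from LiDAR or infrared data) into the RGB features of the Segment Anything Model (SAM). The model keeps SAM's strong generalization while drawing on auxiliary modalities only when they contribute additional cues, giving a balanced and efficient use of multimodal information.

Link: https://arxiv.org/abs/2509.10408
Authors: Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano
Affiliations: University of Bologna; SINA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract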

Abstract:Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM’s rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: this https URL.

[CV-4] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards

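[Quick read]: This paper addresses three gaps in current research on compressed video quality enhancement (CVQE): no taxonomy that systematically links methods to specific coding standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. It makes three contributions: a new taxonomy covering architectural paradigms, coding standards, and compressed-domain feature utilization; a unified benchmarking framework that integrates modern compression protocols and standard test sequences for fair multi-criteria evaluation; and a systematic analysis of the trade-off between reconstruction performance and computational complexity in state-of-the-art methods, with directions for future research.

Link: https://arxiv.org/abs/2509.10407
Authors: Xiem HoangVan, Dang BuiDinh, Sang NguyenQuang, Wen-Hsiao Peng
Affiliations: VNU University of Engineering and Technology; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract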

Abstract:Compressed video quality enhancement (CVQE) is crucial for improving user experience with lossy video codecs like H.264/AVC, H.265/HEVC, and H.266/VVC. While deep learning based CVQE has driven significant progress, existing surveys still suffer from limitations: lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. To address these gaps, this paper presents three key contributions. First, it introduces a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization. Second, it proposes a unified benchmarking framework integrating modern compression protocols and standard test sequences for fair multi-criteria evaluation. Third, it provides a systematic analysis of the critical trade-offs between reconstruction performance and computational complexity observed in state-of-the-art methods and highlights promising directions for future research. This comprehensive review aims to establish a foundation for consistent assessment and informed model selection in CVQE research and deployment.

[CV-5] Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition

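[Quick read]: This paper addresses intrinsic image decomposition for real scenes, that is, separating shading from reflectance. Existing methods are limited by the lack of large-scale real-world ground truth and rely on synthetic data or sparse annotations, so they generalize poorly to complex outdoor scenes. The proposed approach is training-free and self-supervises an optimization using only a pair of visible and thermal images. It relies on an optical principle: light that an opaque surface does not reflect is absorbed and converted to heat, which a thermal camera detects. The ordinal relations between visible and thermal intensities can therefore be mapped to ordinal relations of shading and reflectance, providing dense self-supervision for a neural network that recovers both.

Link: https://arxiv.org/abs/2509.10388
Authors: Zeqing Leo Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract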

Abstract:Decomposing an image into its intrinsic photometric factors–shading and reflectance–is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. Recent methods rely on synthetic data or sparse annotations for limited indoor and even fewer outdoor scenes. We introduce a novel training-free approach for intrinsic image decomposition using only a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities between visible and thermal image intensities to the ordinalities of shading and reflectance, which can densely self-supervise an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over recent learning-based models and point toward a scalable path to curating real-world ordinal supervision, previously infeasible via manual labeling.
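
One way to read the ordinality argument is through a deliberately simplified toy model, which is an assumption of this sketch and not the paper's formulation: visible intensity is reflectance times shading, and the thermal signal is proportional to the absorbed remainder, (1 - reflectance) times shading. Under that model, a pixel that is both brighter and hotter than another must receive more light, while a pixel that is brighter but cooler must have higher reflectance. The function names are hypothetical.

```python
def render(reflectance, shading):
    """Toy forward model: returns (visible, thermal) intensities."""
    return reflectance * shading, (1.0 - reflectance) * shading

def ordinal_cues(vis_p, thr_p, vis_q, thr_q):
    """Ordinal relations between pixels p and q implied by the toy model."""
    cues = {}
    # Same ordering in both images: visible + thermal (= shading) is ordered.
    if vis_p > vis_q and thr_p > thr_q:
        cues["shading"] = "p > q"
    elif vis_p < vis_q and thr_p < thr_q:
        cues["shading"] = "p < q"
    # Opposite ordering: the visible/thermal ratio (hence reflectance) is ordered.
    if vis_p > vis_q and thr_p < thr_q:
        cues["reflectance"] = "p > q"
    elif vis_p < vis_q and thr_p > thr_q:
        cues["reflectance"] = "p < q"
    return cues

# Same illumination, different materials: only a reflectance cue is produced.
material_cue = ordinal_cues(*render(0.8, 1.0), *render(0.3, 1.0))
# Same material, one pixel in shadow: only a shading cue is produced.
shadow_cue = ordinal_cues(*render(0.5, 1.0), *render(0.5, 0.4))
```

Real thermal images depend on many other factors (emissivity, heat conduction, ambient temperature), so this only conveys why pairwise orderings, rather than absolute values, are a plausible supervision signal.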

[CV-6] Efficient Learned Image Compression Through Knowledge Distillation

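[Quick read]: This paper addresses the difficulty of deploying neural image compression models in real time on resource-constrained platforms such as mobile or embedded devices, given their high compute and energy demands. The solution applies knowledge distillation: a trained large teacher model guides the training of a small student model, which retains strong compression performance at much lower computational cost. Experiments show that the approach improves small models across architecture sizes and across quality/bit-rate trade-offs while saving processing and energy resources, offering a practical path to deploying neural image compression.

Link: https://arxiv.org/abs/2509.10366
Authors: Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz
Affiliations: Télécom SudParis; Télécom Paris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 21 figures

Click to view abstract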

Abstract:Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: this https URL .
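
The idea of a student "partially trained on the outputs" of a teacher can be sketched as a blended loss. This example assumes a plain mean-squared-error distortion on flattened pixel values and a hypothetical mixing weight `alpha`; it leaves out the rate (bit-cost) term that a learned codec would also optimize, and it is not the loss used in the paper.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Blend of ground-truth distortion and a teacher-matching term.

    alpha is a hypothetical weight: 1.0 trains on the original image only,
    0.0 trains on the teacher's reconstruction only.
    """
    return alpha * mse(student_out, target) + (1.0 - alpha) * mse(student_out, teacher_out)

target = [0.0, 0.0]    # original pixels (toy values)
teacher = [0.0, 0.5]   # teacher reconstruction
student = [0.0, 1.0]   # student reconstruction

loss = distillation_loss(student, teacher, target, alpha=0.5)
```

The weight between the two terms is one of the new hyperparameters that the abstract says this approach introduces.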

[CV-7] Immunizing Images from Text to Image Editing via Adversarial Cross-Attention

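[Quick read]: This paper studies the vulnerability of text-guided image editing methods to adversarial attacks, in particular the limited robustness of their visual component. It proposes Attention Attack, which uses an automatically generated caption of the source image as a proxy for the edit prompt and disrupts the cross-attention between the text prompt and the image's visual representation. This breaks the alignment between image content and its textual description without requiring knowledge of the editing model or the edit prompt. The attack significantly degrades editing performance while remaining imperceptible. The paper also questions the reliability of existing metrics for immunization success and introduces two measures: Caption Similarity, based on semantic consistency, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks.

Link: https://arxiv.org/abs/2509.10359
Authors: Matteo Trippodo, Federico Becattini, Lorenzo Seidenari
Affiliations: University of Florence; University of Siena
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as Regular Paper at ACM Multimedia 2025

Click to view abstract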

Abstract:Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.

[CV-8] Towards Understanding Visual Grounding in Visual Language Models

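[Quick read]: This survey examines the role and mechanisms of visual grounding in general-purpose vision language models (VLMs), that is, how a model maps a natural-language description to a specific region of an image or video to support fine-grained understanding and interaction. It reviews the paradigms for building grounded VLMs, delineates their core components (such as cross-modal alignment, attention mechanisms, and multimodal reasoning structures), and examines applications including referring expression comprehension, visual question answering, image captioning, and control in simulated and real environments. It also discusses how visual grounding relates to multimodal chain-of-thought and reasoning, providing a basis for future research.

Link: https://arxiv.org/abs/2509.10345
Authors: Georgios Pantazopoulos, Eda B. Özyiğit
Affiliations: The Alan Turing Institute; Heriot-Watt University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract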

Abstract:Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, captioning visual context by explicitly referring to entities, as well as low and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.

[CV-9] GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography MICCAI2025

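[Quick read]: This paper addresses the performance bottleneck of existing visual language models (VLMs) for mammography, caused by limited data and the domain gap between natural and medical images. In particular, current methods ignore the multi-view relationships specific to mammography and fail to model ipsilateral correspondence, losing key geometric context and hurting prediction accuracy. The proposed GLAM model uses geometry-guided global and local alignment: pretraining combines global and local visual-visual and visual-language contrastive learning to learn cross-view alignment and fine-grained local features, improving multi-view understanding. Pretrained on the EMBED dataset, the model outperforms baselines across multiple datasets under different settings.

Link: https://arxiv.org/abs/2509.10344
Authors: Yuexi Du, Lihui Chen, Nicha C. Dvornek
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2025

Click to view abstract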

Abstract:Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists who analyze both views together to process ipsilateral correspondence, current methods treat them as independent images or do not properly model the multi-view correspondence learning, losing critical geometric context and resulting in suboptimal prediction. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging the prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings.

[CV-10] GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT

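[Quick Read]: This paper addresses the loss of detail caused by speckle noise in Optical Coherence Tomography (OCT) images, which hinders accurate interpretation of retinal anatomy. The key to the solution is a Denoising Diffusion Gamma Model, a diffusion probabilistic model built on the gamma distribution that reflects the statistics of speckle more accurately than the usual Gaussian assumption, together with a Noise-Reduced Fidelity Term that uses a pre-processed, less-noisy image to guide denoising and prevent high-frequency noise from being reintroduced. Adapting the Denoising Diffusion Implicit Model (DDIM) framework to the gamma model accelerates inference. On paired noisy and less-noisy OCT B-scans, the method outperforms traditional and state-of-the-art deep learning denoisers in PSNR, SSIM, and MSE while producing sharper edges and better preserving fine anatomical detail.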

Link: https://arxiv.org/abs/2509.10341
Authors: Botond Fazekas, Thomas Pinetz, Guilherme Aresta, Taha Emre, Hrvoje Bogunovic
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical Coherence Tomography (OCT) is a vital imaging modality for diagnosing and monitoring retinal diseases. However, OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation. While numerous denoising methods exist, many struggle to balance noise reduction with the preservation of crucial anatomical structures. This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for OCT image despeckling that leverages the strengths of diffusion probabilistic models. Unlike conventional diffusion models that assume Gaussian noise, GARD employs a Denoising Diffusion Gamma Model to more accurately reflect the statistical properties of speckle. Furthermore, we introduce a Noise-Reduced Fidelity Term that utilizes a pre-processed, less-noisy image to guide the denoising process. This crucial addition prevents the reintroduction of high-frequency noise. We accelerate the inference process by adapting the Denoising Diffusion Implicit Model framework to our Gamma-based model. Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details.
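
As a rough illustration of why a gamma model can suit speckle better than a Gaussian one, the sketch below simulates multiplicative, unit-mean gamma noise on a toy intensity row. The function name `add_gamma_speckle`, the `looks` shape parameter, and the toy values are assumptions made for illustration; this is not GARD's diffusion model.

```python
import random

def add_gamma_speckle(row, looks=4, seed=0):
    """Multiply each pixel by unit-mean gamma noise (shape=looks, scale=1/looks)."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) has mean alpha * beta, so the mean here is 1.
    return [p * rng.gammavariate(looks, 1.0 / looks) for p in row]

clean = [10.0] * 5 + [200.0] * 5          # dark region next to a bright region
noisy = add_gamma_speckle(clean)

# Estimate the mean of the noise itself from many samples on a constant signal.
noise_samples = add_gamma_speckle([1.0] * 20000, seed=1)
mean_noise = sum(noise_samples) / len(noise_samples)
```

Because the noise is multiplicative, its spread grows with the underlying intensity and the result stays positive, which additive Gaussian noise does not guarantee.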

[CV-11] I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

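[Quick Read]: This paper addresses the high memory footprint and computational cost that limit deployment of Vision Transformers (ViTs) on resource-constrained devices, and in particular the sharp accuracy loss that ViT segmentation models suffer when quantization errors accumulate across deep encoder-decoder pipelines. The key to the solution is I-Segmenter, presented as the first fully integer-only ViT segmentation framework. Building on Segmenter, it systematically replaces floating-point operations with integer-only counterparts, introduces the λ-ShiftGELU activation to mitigate the limits of uniform quantization on long-tailed activation distributions and to stabilize training and inference, removes the L2 normalization layer, and replaces bilinear interpolation in the decoder with nearest-neighbor upsampling so that the whole computational graph runs on integers. Accuracy stays within 5.1% of the FP32 baseline on average, model size shrinks by up to 3.8x, and inference is up to 1.2x faster with optimized runtimes; even one-shot PTQ with a single calibration image yields competitive accuracy.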

Link: https://arxiv.org/abs/2509.10334
Authors: Jordan Sassoon, Michal Szczepanski, Martyna Poreba
Affiliations: Université Paris-Saclay; CEA; List
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
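
The swap from bilinear to nearest-neighbor upsampling can be pictured with a small sketch: nearest-neighbor upsampling needs only integer index arithmetic, whereas bilinear interpolation blends neighbors with fractional weights. The function name `nearest_upsample_int` and the toy grid are assumptions for illustration, not the paper's implementation.

```python
def nearest_upsample_int(grid, scale):
    """Upsample a 2D grid of integers by an integer factor using only integer ops."""
    out = []
    for y in range(len(grid) * scale):
        src_row = grid[y // scale]        # floor division selects the source row
        out.append([src_row[x // scale] for x in range(len(src_row) * scale)])
    return out

logits = [[1, 2],
          [3, 4]]                         # toy quantized decoder output
up = nearest_upsample_int(logits, 2)
# up == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Every output value is a copy of an input value, so no new quantization step is needed after upsampling.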

[CV-12] Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching ACM-MM2025

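[Quick Read]: This paper addresses the high computational cost of diffusion transformers in image and video generation, which comes from their iterative denoising process. Existing feature caching reuses features computed at earlier timesteps and so exploits only temporal similarity, ignoring similarity in the spatial dimension. The key to the solution is Cluster-Driven Feature Caching (ClusCa), an orthogonal and complementary approach: at each timestep it clusters tokens spatially, computes only one token per cluster, and propagates its information to the remaining tokens, cutting the number of computed tokens by over 90%. It applies to any diffusion transformer without training and is evaluated on DiT, FLUX, and HunyuanVideo for text-to-image and text-to-video generation; on FLUX it reaches a 4.96x speedup with an ImageReward of 99.49%, 0.51% above the original model.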

Link: https://arxiv.org/abs/2509.10312
Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 11 figures; Accepted by ACM MM2025; Mainly focus on feature caching for diffusion transformers acceleration

Abstract:Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at this https URL.
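
A minimal sketch of the "compute one token per cluster, then propagate" idea, assuming cluster assignments are already given. The names `expensive_block` and `clustered_forward` are hypothetical, the first token of each cluster is arbitrarily taken as its representative, and propagation by plain copying is an assumption; the abstract does not specify how ClusCa selects or propagates.

```python
def expensive_block(token):
    """Stand-in for a costly transformer block (hypothetical)."""
    return [2.0 * v + 1.0 for v in token]

def clustered_forward(tokens, assignments):
    """Run the block once per cluster and copy the result to every member."""
    representatives = {}
    for idx, cluster in enumerate(assignments):
        representatives.setdefault(cluster, idx)   # first token seen represents the cluster
    computed = {c: expensive_block(tokens[i]) for c, i in representatives.items()}
    outputs = [list(computed[c]) for c in assignments]
    return outputs, len(computed)

tokens = [[0.0], [0.1], [5.0], [5.2]]
assignments = [0, 0, 1, 1]                         # toy spatial clusters
outputs, num_computed = clustered_forward(tokens, assignments)
# Only 2 of the 4 tokens go through expensive_block.
```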

[CV-13] A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments

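[Quick Read]: This paper addresses the precise geolocation of street furniture in complex urban environments, a task critical for local authorities and private stakeholders who monitor and maintain public infrastructure. The key to the solution is a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Because the energy is represented in a map-based, geopositioned format, the optimisation can integrate external geospatial information such as GIS layers, road maps, or placement constraints, improving contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm then infers the most probable configuration of assets. The approach is evaluated in a realistic simulation informed by a geolocated dataset of street lighting in Dublin city centre, showing its potential for scalable and accurate urban asset mapping.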

Link: https://arxiv.org/abs/2509.10310
Authors: Evan Murphy, Marco Viola, Vladimir A. Krylov
Affiliations: Dublin City University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the Proceedings of the 27th Irish Machine Vision and Image Processing Conference (IMVIP 2025)

Abstract:In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository this https URL.
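
To give a feel for birth-and-death optimisation over an energy map, here is a heavily simplified sketch. It assumes lower energy means a more likely object location and omits interaction terms between objects; the function name, the parameters (`birth_rate`, `beta`, `cooling`), and the toy map are all illustrative assumptions, not the authors' algorithm.

```python
import math
import random

def birth_death(energy, steps=200, birth_rate=0.3, beta=1.0, cooling=1.02, seed=0):
    """Simplified birth-and-death search for low-energy cells on a grid."""
    rng = random.Random(seed)
    h, w = len(energy), len(energy[0])
    config = set()
    for _ in range(steps):
        # Birth: every empty cell may receive a new object with a small probability.
        for y in range(h):
            for x in range(w):
                if (y, x) not in config and rng.random() < birth_rate / (h * w):
                    config.add((y, x))
        # Death: visit objects from highest to lowest energy; the higher the
        # energy, the more likely the object is removed.
        for y, x in sorted(config, key=lambda c: -energy[c[0]][c[1]]):
            z = max(-50.0, min(50.0, beta * energy[y][x]))   # clamp to avoid overflow
            if rng.random() < 1.0 / (1.0 + math.exp(-z)):
                config.discard((y, x))
        beta *= cooling   # annealing: deaths become increasingly deterministic
    return config

# Toy energy map: the single negative cell marks a likely object location.
energy_map = [[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, 1.0]]
assets = birth_death(energy_map)
```

As the inverse temperature `beta` grows, objects born in high-energy cells are removed almost surely, so any surviving object sits in the low-energy cell.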

[CV-14] Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks

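[Quick Read]: This paper addresses the high vulnerability of deep neural networks and Vision Transformers to adversarial perturbations, given that standard defenses often incur high computational cost or lack formal guarantees. The key to the solution is a Lipschitz-guided stochastic depth (DropPath) method in which drop probabilities increase with depth to control the effective Lipschitz constant of the network. This regularizes deeper layers, improving robustness while preserving clean accuracy and reducing computation. On CIFAR-10 with ViT-Tiny, the depth-dependent schedule keeps clean accuracy near baseline, improves robustness under FGSM, PGD-20, and AutoAttack, and significantly reduces FLOPs compared with the baseline and linear DropPath schedules.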

Link: https://arxiv.org/abs/2509.10298
Authors: Laith Nayal, Mahmoud Mousatat, Bader Rasheed
Affiliations: Innopolis University; LLC NUHA TECH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 tables

Abstract:Deep neural networks and Vision Transformers achieve state-of-the-art performance in computer vision but are highly vulnerable to adversarial perturbations. Standard defenses often incur high computational cost or lack formal guarantees. We propose a Lipschitz-guided stochastic depth (DropPath) method, where drop probabilities increase with depth to control the effective Lipschitz constant of the network. This approach regularizes deeper layers, improving robustness while preserving clean accuracy and reducing computation. Experiments on CIFAR-10 with ViT-Tiny show that our custom depth-dependent schedule maintains near-baseline clean accuracy, enhances robustness under FGSM, PGD-20, and AutoAttack, and significantly reduces FLOPs compared to baseline and linear DropPath schedules.
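
The abstract does not give the custom depth-dependent schedule, so the sketch below shows only the linear DropPath schedule it is compared against, plus a generic stochastic-depth residual block. All names are illustrative, and rescaling the kept branch by 1/(1 - p) is a common convention assumed here rather than something the abstract states.

```python
import random

def linear_drop_schedule(num_layers, p_max):
    """Drop probability grows linearly from p_max / num_layers to p_max with depth."""
    return [p_max * (i + 1) / num_layers for i in range(num_layers)]

def drop_path_block(x, branch, p_drop, rng, training=True):
    """Residual block x + branch(x) whose branch is skipped with probability p_drop."""
    if training and rng.random() < p_drop:
        return x                                   # branch skipped: block acts as identity
    scale = 1.0 / (1.0 - p_drop) if training else 1.0   # keep the expected output unchanged
    return x + scale * branch(x)

def expected_active_blocks(drop_probs):
    """Expected number of residual branches evaluated per training forward pass."""
    return sum(1.0 - p for p in drop_probs)

schedule = linear_drop_schedule(num_layers=4, p_max=0.2)
active = expected_active_blocks(schedule)          # about 3.5 of 4 branches on average
```

A skipped block is the identity map, so dropping deep blocks more often both removes their compute and limits how much they can amplify a perturbation during training.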

[CV-15] MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection

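[Quick Read]: This paper addresses a limitation of zero-shot 3D (ZS-3D) anomaly detection: most existing methods rely on point clouds alone and neglect complementary cues from RGB images and text semantics, even though the zero-shot setting is most valuable when data are scarce, privacy is constrained, or annotation is costly. The key to the solution is MCL-AD, a multimodal collaboration learning framework with two core mechanisms. The Multimodal Prompt Learning Mechanism (MPLM) uses an object-agnostic decoupled text prompt and a multimodal contrastive loss to strengthen intra-modal representation and inter-modal collaborative learning. The Collaborative Modulation Mechanism (CMM) jointly modulates the RGB image-guided and point cloud-guided branches to exploit their complementary representations.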

Link: https://arxiv.org/abs/2509.10282
Authors: Gang Li, Tianjiao Chen, Mingle Zhou, Min Li, Delong Han, Jin Wan
Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:Zero-shot 3D (ZS-3D) anomaly detection aims to identify defects in 3D objects without relying on labeled training data, making it especially valuable in scenarios constrained by data scarcity, privacy, or high annotation cost. However, most existing methods focus exclusively on point clouds, neglecting the rich semantic cues available from complementary modalities such as RGB images and text priors. This paper introduces MCL-AD, a novel framework that leverages multimodal collaboration learning across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection. Specifically, we propose a Multimodal Prompt Learning Mechanism (MPLM) that enhances the intra-modal representation capability and inter-modal collaborative learning by introducing an object-agnostic decoupled text prompt and a multimodal contrastive loss. In addition, a collaborative modulation mechanism (CMM) is proposed to fully leverage the complementary representations of point clouds and RGB images by jointly modulating the RGB image-guided and point cloud-guided branches. Extensive experiments demonstrate that the proposed MCL-AD framework achieves state-of-the-art performance in ZS-3D anomaly detection.

[CV-16] Detecting Text Manipulation in Images using Vision Language Models BMVC-2025 WWW

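[Quick Read]: This paper addresses a gap in current research: Large Vision Language Models (LVLMs) have proven effective for image manipulation detection, but text manipulation detection is largely missing from these studies. The key to the solution is a systematic evaluation of closed- and open-source LVLMs on several text manipulation datasets. The results suggest that open-source models are closing in on, but still trail, closed-source ones such as GPT-4o. The paper also benchmarks VLMs built specifically for image manipulation detection and shows that they generalize poorly to text manipulation. Benchmarks cover manipulations of in-the-wild scene text and of fantasy ID cards, the latter mimicking a challenging real-world misuse.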

Link: https://arxiv.org/abs/2509.10278
Authors: Vidit Vidit, Pavel Korshunov, Amir Mohammadi, Christophe Ecabert, Ketan Kotwal, Sébastien Marcel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Synthetic Realities and Biometric Security Workshop BMVC-2025. For paper page see this https URL

Abstract:Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but are still behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimic a challenging real-world misuse.

[CV-17] SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

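[Quick Read]: This paper addresses the loss of accuracy in sign language translation (SLT) that results from overlooking non-manual cues such as mouthing. Most existing methods focus on manual signals (hand gestures), although mouthing plays a crucial role in disambiguating visually similar signs. The key to the solution is SignClip, which fuses spatial gesture features with lip movement features and introduces a hierarchical contrastive learning framework with multi-level alignment objectives to ensure semantic consistency across sign-lip and visual-text modalities. On PHOENIX14T in the gloss-free setting, SignClip improves BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38 over the previous state-of-the-art model SpaMo.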

Link: https://arxiv.org/abs/2509.10266
Authors: Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sign language translation (SLT) aims to translate natural language from sign language videos, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. Besides, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.
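
As background on contrastive alignment between two modalities, the sketch below implements a generic InfoNCE-style loss over paired embeddings (for example, a gesture feature and the lip feature from the same clip). This is a standard formulation chosen for illustration; the abstract does not specify SignClip's hierarchical, multi-level objective, and the names and `temperature` value are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchors, positives, temperature=0.1):
    """Average cross-entropy of picking positives[i] for anchors[i] among all positives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        m = max(logits)                                   # log-sum-exp for stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

sign_features = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(sign_features, [[1.0, 0.0], [0.0, 1.0]])   # matching pairs
shuffled = info_nce(sign_features, [[0.0, 1.0], [1.0, 0.0]])  # mismatched pairs
```

The loss is small when each anchor is most similar to its own partner and large when the pairing is scrambled.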

[CV-18] MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation

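[Quick Read]: This paper addresses physical artifacts in text-to-image (T2I) generation, such as anatomical and structural flaws, which are prevalent, hard to evaluate, and severely degrade perceptual quality and limit application. The key to the solution is MagicMirror, a systematic assessment framework comprising: (1) a detailed taxonomy of generated image artifacts; (2) MagicData340K, the first large-scale human-annotated dataset of 340K generated images with fine-grained artifact labels; (3) MagicAssessor, a Vision-Language Model (VLM) trained to provide detailed assessments and corresponding labels; and (4) a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO) to counter class imbalance and reward hacking. MagicAssessor is then used to build MagicBench, an automated benchmark, which reveals that even top-tier T2I models such as GPT-image-1 still show significant artifacts, marking artifact reduction as a critical direction for future work.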

Link: https://arxiv.org/abs/2509.10260
Authors: Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng, Xiaoming Wei
Affiliations: University of Chinese Academy of Sciences; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: this https URL.

[CV-19] Mask Consistency Regularization in Object Removal

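[Quick Read]: This paper addresses two key problems in object removal, a task within image inpainting: mask hallucination, where the model generates irrelevant or spurious content inside the masked region, and mask-shape bias, where the model fills the masked area with an object that mimics the mask's shape rather than with content that fits the surroundings. The key to the solution is Mask Consistency Regularization (MCR), a training strategy that introduces two mask perturbations, dilation and reshape, and enforces consistency between the outputs of the perturbed branches and the output for the original mask. Dilated masks help align the output with the surrounding content, while reshaped masks push the model to break its dependence on mask shape, reducing hallucination and shape bias and yielding more robust, contextually coherent inpainting.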

Link: https://arxiv.org/abs/2509.10259
Authors: Hua Yuan, Jin Yuan, Yicheng Jiang, Yao Zhang, Xin Geng, Yong Rui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object removal, a challenging task within image inpainting, involves seamlessly filling the removed region with content that matches the surrounding context. Despite advancements in diffusion models, current methods still face two critical challenges. The first is mask hallucination, where the model generates irrelevant or spurious content inside the masked region, and the second is mask-shape bias, where the model fills the masked area with an object that mimics the mask’s shape rather than surrounding content. To address these issues, we propose Mask Consistency Regularization (MCR), a novel training strategy designed specifically for object removal tasks. During training, our approach introduces two mask perturbations: dilation and reshape, enforcing consistency between the outputs of these perturbed branches and the original mask. The dilated masks help align the model’s output with the surrounding content, while reshaped masks encourage the model to break the mask-shape bias. This combination of strategies enables MCR to produce more robust and contextually coherent inpainting results. Our experiments demonstrate that MCR significantly reduces hallucinations and mask-shape bias, leading to improved performance in object removal.
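
The dilation perturbation and a consistency term can be sketched as follows. The square structuring element, the mean-squared form of the loss, and the function names are assumptions for illustration; the abstract specifies neither the exact perturbations nor the loss, and the reshape perturbation is not shown.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2D 0/1 mask with a (2*radius+1) square neighborhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out

def consistency_loss(out_a, out_b):
    """Mean squared difference between two same-sized 2D outputs."""
    diffs = [(a - b) ** 2 for row_a, row_b in zip(out_a, out_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
dilated = dilate(mask)
# In training, the model would be run with both `mask` and `dilated`, and
# consistency_loss would penalize disagreement between the two outputs.
```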

[CV-20] Robustness and Diagnostic Performance of Super-Resolution Fetal Brain MRI MICCAI2025

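[Quick Read]: This paper addresses the limitations of fetal brain MRI acquired as rapid multi-view 2D slice stacks, which are typically low resolution, may be motion corrupted, and do not adequately capture 3D anatomy. The study centers on super-resolution reconstruction (SRR), which combines slice-to-volume registration with super-resolution techniques to generate high-resolution 3D volumes from low-resolution 2D slices, and it compares three SRR methods (NiftyMIC, SVRTK, and NeSVoR) on 140 scans of healthy controls and pathological cases with ventriculomegaly, evaluating visual quality, reconstruction success rate, volumetric measurement agreement, and diagnostic classification performance. NeSVoR showed the highest and most consistent reconstruction success rate (90%) across both groups. Although volumetric estimates differed significantly between SRR methods, classification performance for ventriculomegaly was unaffected by the choice of method, highlighting NeSVoR's robustness and the resilience of diagnostic performance.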

Link: https://arxiv.org/abs/2509.10257
Authors: Ema Masterl, Tina Vipotnik Vesnaver, Žiga Špiclin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the PIPPI Workshop of MICCAI 2025

Abstract:Fetal brain MRI relies on rapid multi-view 2D slice acquisitions to reduce motion artifacts caused by fetal movement. However, these stacks are typically low resolution, may suffer from motion corruption, and do not adequately capture 3D anatomy. Super-resolution reconstruction (SRR) methods aim to address these limitations by combining slice-to-volume registration and super-resolution techniques to generate high-resolution (HR) 3D volumes. While several SRR methods have been proposed, their comparative performance - particularly in pathological cases - and their influence on downstream volumetric analysis and diagnostic tasks remain underexplored. In this study, we applied three state-of-the-art SRR methods - NiftyMIC, SVRTK, and NeSVoR - to 140 fetal brain MRI scans, including both healthy controls (HC) and pathological cases (PC) with ventriculomegaly (VM). Each HR reconstruction was segmented using the BoUNTi algorithm to extract volumes of nine principal brain structures. We evaluated visual quality, SRR success rates, volumetric measurement agreement, and diagnostic classification performance. NeSVoR demonstrated the highest and most consistent reconstruction success rate (90%) across both HC and PC groups. Although significant differences in volumetric estimates were observed between SRR methods, classification performance for VM was not affected by the choice of SRR method. These findings highlight NeSVoR’s robustness and the resilience of diagnostic performance despite SRR-induced volumetric variability.

[CV-21] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection

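[Quick Read]: This paper addresses the limited generalization of current AI-generated image detectors to unseen generative models, which is largely attributed to their reliance on generation-specific artifacts such as stylistic priors and compression patterns. The key to the solution is GAMMA, a training framework that reduces domain bias and enhances semantic alignment. It introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations; applies multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across generative domains; and adds a reverse cross-attention mechanism that lets the segmentation heads guide and correct biased representations in the classification branch. On the GenImage benchmark it improves accuracy by 5.8% and remains robust on newly released generative models such as GPT-4o.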

Link: https://arxiv.org/abs/2509.10250
Authors: Haozhen Yan, Yan Hong, Suning Lang, Jiahui Zhan, Yikun Ji, Yujie Gao, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Abstract:With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-generated image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.

[CV-22] On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints

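[Quick Read]: This paper addresses how implicit and explicit Novel View Synthesis (NVS) methods compare for space-based 3D object reconstruction, and in particular how appearance embeddings affect geometric accuracy. The key contribution is a systematic comparison of K-Planes, Gaussian Splatting, and Convex Splatting on the SPEED+ dataset. Appearance embeddings improve photometric fidelity by modeling lighting variation, but they do not yield meaningful gains in geometric accuracy, which is critical for space robotics; for explicit methods they mainly reduce the number of primitives needed. Convex Splatting also produces more compact and clutter-free representations than Gaussian Splatting, an advantage for safety-critical applications such as interaction and collision avoidance.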

Link: https://arxiv.org/abs/2509.10241
Authors: Elias De Smijter, Renaud Detry, Christophe De Vleeschouwer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, to be presented at ASTRA25

Abstract:We present the first systematic comparison of implicit and explicit Novel View Synthesis methods for space-based 3D object reconstruction, evaluating the role of appearance embeddings. While embeddings improve photometric fidelity by modeling lighting variation, we show they do not translate into meaningful gains in geometric accuracy - a critical requirement for space robotics applications. Using the SPEED+ dataset, we compare K-Planes, Gaussian Splatting, and Convex Splatting, and demonstrate that embeddings primarily reduce the number of primitives needed for explicit methods rather than enhancing geometric fidelity. Moreover, convex splatting achieves more compact and clutter-free representations than Gaussian splatting, offering advantages for safety-critical applications such as interaction and collision avoidance. Our findings clarify the limits of appearance embeddings for geometry-centric tasks and highlight trade-offs between reconstruction quality and representation efficiency in space scenarios.

[CV-23] LayerLock: Non-collapsing Representation Learning with Progressive Freezing ICCV2025

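[Quick Read]: This paper addresses slow training and representation collapse in self-supervised visual representation learning. The key to the solution is LayerLock, a progressive layer-freezing strategy that gradually transitions from pixel prediction to latent prediction. It builds on the observation that, when training video masked autoencoding (MAE) models, ViT layers converge in order of depth, shallower layers early and deeper layers late, so layers can be frozen progressively according to an explicit schedule. This accelerates standard video MAE training and, with the same schedule, gives a simple and scalable approach to latent prediction that avoids representation collapse. Applied to models of up to 4B parameters, it surpasses non-latent masked prediction on the 4DS perception suite.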

Link: https://arxiv.org/abs/2509.10156
Authors: Goker Erdogan, Nikhil Parthasarathy, Catalin Ionescu, Drew Hudson, Alexander Lerchner, Andrew Zisserman, Mehdi Sajjadi, Joao Carreira
Affiliations: Google DeepMind; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from “representation collapse”. We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
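
One way to picture an explicit freezing schedule is a function from training step to the number of frozen layers, shallowest first. The step-wise linear schedule and every parameter name below are hypothetical; the abstract says only that freezing follows an explicit schedule in order of depth.

```python
def num_frozen_layers(step, start_step, steps_per_layer, max_frozen):
    """Number of shallow layers frozen at a given training step (hypothetical schedule)."""
    if step < start_step:
        return 0
    return min(max_frozen, 1 + (step - start_step) // steps_per_layer)

def trainable_flags(num_layers, frozen):
    """True for layers that still receive gradient updates; shallow layers freeze first."""
    return [layer_index >= frozen for layer_index in range(num_layers)]

# Example: freezing starts at step 100 and one more layer is frozen every 50 steps.
frozen_at_150 = num_frozen_layers(150, start_step=100, steps_per_layer=50, max_frozen=4)
flags_at_150 = trainable_flags(num_layers=6, frozen=frozen_at_150)
# flags_at_150 == [False, False, True, True, True, True]
```

Frozen layers need no backward pass, which is where a training speedup would come from under this kind of schedule.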

[CV-24] Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

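[Quick Read]: This paper addresses the training instability of vector quantization (VQ) in discrete tokenizers for image generation, caused by straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction and low codebook usage. The key to the solution is VQBridge, a robust, scalable, and efficient projector based on the map function method that optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. Combined with learning annealing, it gives FVQ (FullVQ), which reaches 100% codebook usage across diverse codebook configurations, including a 262k-entry codebook, and improves image reconstruction and generation.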

Link: https://arxiv.org/abs/2509.10140
Authors: Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, Xingang Wang
Affiliations: CASIA; Meituan; GigaAI; UCAS; Luoyang Institute for Robot and Intelligent Equipment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
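
Codebook usage, the quantity FVQ drives to 100%, can be measured as the fraction of code vectors selected at least once by nearest-neighbor assignment. The sketch below is a generic illustration with made-up data and hypothetical function names; it does not implement VQBridge.

```python
def nearest_code(vector, codebook):
    """Index of the code vector with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vector, codebook[k])),
    )

def codebook_usage(vectors, codebook):
    """Fraction of codebook entries that are the nearest code for at least one vector."""
    used = {nearest_code(v, codebook) for v in vectors}
    return len(used) / len(codebook)

codebook = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [-10.0, -10.0]]
encoder_outputs = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]]
usage = codebook_usage(encoder_outputs, codebook)
# Only the first two codes are ever selected, so usage is 0.5.
```

Codes that are never selected receive no gradient from the reconstruction loss, which is the sparse-gradient problem the abstract describes.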

[CV-25] Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment BMVC2025

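[Quick Read]: This paper addresses the performance drop of medical image segmentation models under cross-domain distribution shift, specifically in optic disc and optic cup segmentation, where models degrade markedly when the target images come from different imaging protocols or conditions than the training data. The key to the solution is Grad-CL, a source-free domain adaptation framework with two stages. The first stage uses a gradient-guided pseudolabel refinement module that extracts salient class-specific features, enabling more accurate uncertainty quantification and robust prototype estimation for refining noisy pseudolabels. The second stage applies a cosine similarity-based contrastive loss that explicitly enforces inter-class separability between the gradient-informed features of the optic cup and disc, improving segmentation accuracy and boundary delineation.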

Link: https://arxiv.org/abs/2509.10134
Authors: Rini Smita Thakur, Rajeev Ranjan Dwivedi, Vinod K Kurmi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in BMVC 2025

Abstract:Accurate segmentation of the optic disc and cup is critical for the early diagnosis and management of ocular diseases such as glaucoma. However, segmentation models trained on one dataset often suffer significant performance degradation when applied to target data acquired under different imaging protocols or conditions. To address this challenge, we propose Grad-CL, a novel source-free domain adaptation framework that leverages a pre-trained source model and unlabeled target data to robustly adapt segmentation performance without requiring access to the original source data. Grad-CL combines a gradient-guided pseudolabel refinement module with a cosine similarity-based contrastive learning strategy. In the first stage, salient class-specific features are extracted via a gradient-based mechanism, enabling more accurate uncertainty quantification and robust prototype estimation for refining noisy pseudolabels. In the second stage, a contrastive loss based on cosine similarity is employed to explicitly enforce inter-class separability between the gradient-informed features of the optic cup and disc. Extensive experiments on challenging cross-domain fundus imaging datasets demonstrate that Grad-CL outperforms state-of-the-art unsupervised and source-free domain adaptation methods, achieving superior segmentation accuracy and improved boundary delineation. Project and code are available at this https URL.

[CV-26] Realism Control One-step Diffusion for Real-World Image Super-Resolution

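[Quick Read]: This paper addresses the difficulty one-step diffusion (OSD) methods have in balancing fidelity and realism in real-world image super-resolution (Real-ISR). Because OSD models are usually trained or distilled at a single timestep, they lack a flexible mechanism for trading off fidelity against realism across scenarios, whereas multi-step methods can do so by adjusting the number of sampling steps. The key to the solution is Realism Controlled One-step Diffusion (RCOD), whose core components are: (1) a latent domain grouping strategy that gives explicit control over the fidelity-realism trade-off during noise prediction; (2) a degradation-aware sampling strategy that aligns distillation regularization with the grouping strategy and strengthens control; and (3) a visual prompt injection module that replaces conventional text prompts with degradation-aware visual tokens, improving restoration accuracy and semantic consistency. The approach needs only minimal changes to the training paradigm and uses the original training data.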

链接: https://arxiv.org/abs/2509.10122
作者: Zongliang Wu,Siming Zheng,Peng-Tao Jiang,Xin Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since OSD models for SR are usually trained or distilled with a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase, with minimal modifications to the training paradigm and using the original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and strengthen control over the trade-offs. Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual quality, with flexible realism control capabilities in the inference stage. The code will be released.

[CV-27] A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss

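Quick read: This paper targets two common problems in no-reference face image quality assessment (FIQA): general-purpose image quality assessment methods fail to capture face-specific degradations, and state-of-the-art FIQA models are computationally expensive, which limits deployment. The key to the solution is a lightweight FIQA method that ensembles two compact convolutional neural networks (MobileNetV3-Small and ShuffleNetV2) with prediction-level fusion by simple averaging, and trains with a correlation-aware loss (MSECorrLoss) that combines mean squared error (MSE) with a Pearson correlation regularizer to align model outputs with human perceptual judgments. The design keeps accuracy high while substantially reducing computational cost, making it suitable for real-time use in real-world settings.

Link: https://arxiv.org/abs/2509.10114
Authors: MohammadAli Hamidi, Hadi Amirpour, Luigi Atzori, Christian Timmerer
Affiliations: University of Cagliari; CNIT; Alpen-Adria Universität Klagenfurt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: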

Abstract:Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
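
The loss and fusion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `corr_weight` hyperparameter are assumptions, and the regularizer is taken to be `1 - r`, where `r` is the Pearson correlation, which is one common way to write such a term.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def mse_corr_loss(preds, targets, corr_weight=0.5):
    """MSE plus a penalty that shrinks as the Pearson correlation approaches 1.

    corr_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    return mse + corr_weight * (1.0 - pearson(preds, targets))


def ensemble_average(scores_a, scores_b):
    """Prediction-level fusion: average the quality scores of two models."""
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
```

The correlation term penalizes predictions whose ranking or linear trend disagrees with the targets even when the squared error is moderate, which is the stated motivation for pairing it with MSE.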

[CV-28] HHI-Assist: A Dataset and Benchmark of Human-Human Interaction in Physical Assistance Scenario

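Quick read: This paper addresses accurate human motion prediction for robots in physically interactive elder-care and disability-assistance scenarios, which is needed for safe and responsive assistance. The task is hard because assistive settings vary widely and the coupled dynamics between caregiver and care recipient during physical interaction are complex. The solution rests on two contributions: HHI-Assist, a new dataset of motion capture clips of humans performing assistive tasks; and a conditional Transformer-based denoising diffusion model that captures the coupled pose dynamics of the two interacting agents, improving on baselines in prediction accuracy and generalizing to unseen scenarios.

Link: https://arxiv.org/abs/2509.10096
Authors: Saeed Saadatnejad, Reyhaneh Hosseininejad, Jose Barreiros, Katherine M. Tsui, Alexandre Alahi
Affiliations: Toyota Research Institute (TRI); Stanford University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to RA-L 2025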

Abstract:The increasing labor shortage and aging population underline the need for assistive robots to support human care recipients. To enable safe and responsive assistance, robots require accurate human motion prediction in physical interaction scenarios. However, this remains a challenging task due to the variability of assistive settings and the complexity of coupled dynamics in physical interactions. In this work, we address these challenges through two key contributions: (1) HHI-Assist, a dataset comprising motion capture clips of human-human interactions in assistive tasks; and (2) a conditional Transformer-based denoising diffusion model for predicting the poses of interacting agents. Our model effectively captures the coupled dynamics between caregivers and care receivers, demonstrating improvements over baselines and strong generalization to unseen scenarios. By advancing interaction-aware motion prediction and introducing a new dataset, our work has the potential to significantly enhance robotic assistance policies. The dataset and code are available at: this https URL

[CV-29] Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing

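Quick read: This paper addresses the drop in segmentation accuracy that multi-human parsing suffers when bodies overlap, focusing on the occlusion bottleneck in combining instance-level and part-level information. The key idea is to use multi-view information to help the model handle occlusion: weakly supervised human instance annotations and a multi-view consistency loss integrate complementary information from different viewpoints during training, making parsing more robust in heavily occluded scenes.

Link: https://arxiv.org/abs/2509.10093
Authors: Laura Bragagnolo, Matteo Terreran, Leonardo Barcellona, Stefano Ghidoni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICIAP 2025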

Abstract:Multi-human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance-level and part-level information for fine-grained human understanding. In this work, we demonstrate that, while state-of-the-art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi-view information to improve multi-human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi-view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi-automatic annotation strategy to generate human instance segmentation masks from multi-view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios.

[CV-30] BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals

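Quick read: This paper addresses the limitations that reliance on pre-built high-definition (HD) maps imposes on trajectory prediction for autonomous driving: limited map coverage, inability to adapt to changes in the environment, and local map construction modules that recognize only predefined elements and may therefore miss critical scene information or introduce errors. The key to the solution is BEVTraj, a trajectory prediction framework that operates directly in bird's-eye view (BEV) space. It uses no pre-built map; instead it extracts dense BEV features from real-time sensor data and applies deformable attention to gather relevant context efficiently. A Sparse Goal Candidate Proposal (SGCP) module enables end-to-end prediction without post-processing. The result is performance comparable to state-of-the-art HD map-based models with improved flexibility and robustness.

Link: https://arxiv.org/abs/2509.10080
Authors: Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee
Affiliations: Kookmin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Intelligent Transportation Systems (under review)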

Abstract:In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird’s-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird’s-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. The BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables full end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that the BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at this https URL.

[CV-31] Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking Analysis and Exploration

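Quick read: This paper addresses the weak mathematical reasoning of current vision-language models (VLMs) in unmanned aerial vehicle (UAV) remote sensing, and in particular the lack of systematic evaluation of domain knowledge in geometry, logic, and algebra. Existing VLMs do well on general multimodal tasks but degrade markedly on tasks that require precise distance computation, trajectory estimation, and spatial analysis. The key contribution is AVI-Math, the first benchmark for multimodal mathematical reasoning on aerial imagery. It contains 3,773 high-quality, diverse questions captured from real UAV views, covering 6 mathematical subjects and 20 topics, and an evaluation of 14 mainstream VLMs on it reveals widespread limitations. The paper also explores Chain-of-Thought prompting and fine-tuning as promising ways to make VLM mathematical reasoning more trustworthy in real applications.

Link: https://arxiv.org/abs/2509.10059
Authors: Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 16 figures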

Abstract:Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at this https URL

[CV-32] Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation

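Quick read: This paper addresses inaccurate color alignment in text-to-image (T2I) generation, especially when ambiguous or compound color descriptions (e.g., "Tiffany blue", "lime green") yield images that do not match human intent. Existing methods rely on cross-attention manipulation, reference images, or fine-tuning and do not systematically resolve color ambiguity. The key to the solution is a training-free framework: a large language model (LLM) first resolves the ambiguous color terms in the text prompt, and the text embeddings are then refined by guiding color blending operations in the text embedding space according to the spatial relationships of those colors in the CIELAB color space. This improves color fidelity without additional training or external reference images.

Link: https://arxiv.org/abs/2509.10058
Authors: Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng, Bo-Kai Ruan, Wen-Sheng Lien, Hong-Han Shuai
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2025 (MM '25)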

Abstract:Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
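
To make "spatial relationships in the CIELAB color space" concrete, the sketch below converts sRGB colors to CIELAB with the standard D65 formulas and derives blending weights from Lab distances. The conversion is standard; the inverse-distance weighting and the `blend_weights` and `blend_embeddings` helpers are illustrative assumptions, since the abstract does not specify how the embeddings are combined.

```python
import math


def srgb_to_lab(rgb):
    """Convert an sRGB triple with components in [0, 1] to CIELAB (D65)."""
    def linearize(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


def blend_weights(target_rgb, anchor_rgbs):
    """Hypothetical rule: weight each anchor color by inverse Lab distance."""
    target = srgb_to_lab(target_rgb)
    dists = [math.dist(target, srgb_to_lab(a)) for a in anchor_rgbs]
    zeros = sum(1 for d in dists if d == 0)
    if zeros:
        # The target coincides with one or more anchors: use only those.
        return [1.0 / zeros if d == 0 else 0.0 for d in dists]
    inverse = [1.0 / d for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]


def blend_embeddings(embeddings, weights):
    """Weighted sum of equal-length embedding vectors."""
    dim = len(embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, embeddings)) for k in range(dim)]
```

Distances are measured in Lab rather than RGB because Lab is designed to be approximately perceptually uniform, so nearby points correspond to colors that look similar.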

[CV-33] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

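Quick read: This paper addresses the limited cross-lingual, cross-modal reasoning of current models in multilingual visual question answering (mVQA): existing chain-of-thought (CoT) methods rely mainly on textual reasoning and offer little support for jointly modeling multilingual and multimodal information. The key to the solution is LaV-CoT, the first language-aware visual CoT framework. Its core contributions are: (1) an interpretable multi-stage reasoning pipeline covering text summary with bounding boxes (BBox), language identification, spatial object-level captioning, and step-by-step logical reasoning; (2) an automated data curation method that produces high-quality multilingual CoT annotations at scale through iterative generation, correction, and refinement; and (3) a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards (language consistency, structural accuracy, and semantic alignment) to improve reasoning and generalization.

Link: https://arxiv.org/abs/2509.10026
Authors: Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Jianshu Li
Affiliations: Ant Group; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 Pages, 12 Figures, 2 Tables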

Abstract:As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: this https URL
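
The core of GRPO-style training is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that normalization together with a weighted multi-aspect reward. It is a simplified illustration under assumptions: the equal default weights and the function names are hypothetical, and the language-aware details of the paper's variant are not reproduced.

```python
from statistics import mean, pstdev


def combined_reward(language_consistency, structural_accuracy, semantic_alignment,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward aspects; the weights are placeholders."""
    aspects = (language_consistency, structural_accuracy, semantic_alignment)
    return sum(w * a for w, a in zip(weights, aspects))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a response
    is reinforced only to the extent that it beats its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed, and a group whose responses all receive the same reward contributes zero advantage.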

[CV-34] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images

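Quick read: This paper addresses reconstructing high-quality 3D face models from single in-the-wild 2D images, a task made difficult by the complexity of real-world scenes and the lack of ground-truth labeled data. The key to the solution is the Hierarchical Multi-Level Attention Network (MLANet), a convolutional neural network that introduces multi-level attention at different stages of 2D face feature extraction to better capture key facial structure. It is trained with a semi-supervised strategy that combines 3D Morphable Model (3DMM) parameters from public datasets with a differentiable renderer, enabling end-to-end optimization and improving reconstruction accuracy and robustness.

Link: https://arxiv.org/abs/2509.10024
Authors: Danling Cao
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work was completed during the author’s MPhil studies at the University of Manchester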

Abstract:Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.

[CV-35] Efficient and Accurate Downfacing Visual Inertial Odometry

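Quick read: This paper addresses the difficulty of running high-accuracy Visual Inertial Odometry (VIO) in real time on micro- and nano-UAVs, where limited compute is hard to reconcile with accuracy. Conventional VIO systems depend on powerful computing platforms, while lightweight implementations for embedded microcontrollers tend to sacrifice localization accuracy. The key to the solution is an optimized VIO pipeline that integrates and quantizes state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB) for RISC-V-based ultra-low-power systems on chips (SoCs), and adds a rigid body motion model to reduce estimation error in planar motion. On the GAP9 low-power SoC with the ORB feature tracker, the pipeline reduces average RMSE by up to a factor of 3.65 relative to the baseline; PX4FLOW matches ORB's tracking accuracy at lower runtime when movement speed is below 24 pixels/frame.

Link: https://arxiv.org/abs/2509.10021
Authors: Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini
Affiliations: ETH Zurich; University of Bologna
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: This article has been accepted for publication in the IEEE Internet of Things Journal (IoT-J)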

Abstract:Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent’s movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems on chips (SoCs). Furthermore, by employing a rigid body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline’s suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.
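
One way to picture a rigid body motion model for a downward-facing camera is to constrain the motion of tracked features between two frames to a single 2D rotation plus translation, fitted in closed form by least squares. The sketch below shows that fit. It is an assumption about what such a model can look like, not the paper's formulation, and `fit_rigid_2d` is a hypothetical name.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares 2D rotation and translation mapping src points onto dst.

    src and dst are equal-length lists of (x, y) feature positions in two
    consecutive frames. Returns (theta, (tx, ty)) with dst ~= R(theta) @ src + t.
    """
    n = len(src)
    src_cx = sum(p[0] for p in src) / n
    src_cy = sum(p[1] for p in src) / n
    dst_cx = sum(q[0] for q in dst) / n
    dst_cy = sum(q[1] for q in dst) / n

    sum_dot = 0.0
    sum_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - src_cx, py - src_cy  # centered source point
        bx, by = qx - dst_cx, qy - dst_cy  # centered destination point
        sum_dot += ax * bx + ay * by
        sum_cross += ax * by - ay * bx

    theta = math.atan2(sum_cross, sum_dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    tx = dst_cx - (cos_t * src_cx - sin_t * src_cy)
    ty = dst_cy - (sin_t * src_cx + cos_t * src_cy)
    return theta, (tx, ty)
```

Fitting three parameters to many feature tracks averages out per-feature tracking noise, which is one plausible reason a rigid model helps in planar motion.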

[CV-36] Few-Part-Shot Font Generation ICDAR2025

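Quick read: This paper addresses the efficiency bottleneck of conventional few-shot font generation, which depends on complete character shapes. The key to the solution is a new font generation model based on partial design elements (partial shapes): a small set of partial strokes or local structures is enough to generate a complete font, which makes font creation more efficient and reveals how local design details influence overall character structure.

Link: https://arxiv.org/abs/2509.10006
Authors: Masaki Akiba, Shumpei Takezaki, Daichi Haraguchi, Seiichi Uchida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICDAR 2025 Workshop on Machine Learning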

Abstract:This paper proposes a novel model of few-part-shot font generation, which designs an entire font based on a set of partial design elements, i.e., partial shapes. Unlike conventional few-shot font generation, which requires entire character shapes for a couple of character classes, our approach only needs partial shapes as input. The proposed model not only improves the efficiency of font creation but also provides insights into how partial design details influence the entire structure of the individual characters.

[CV-37] TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion

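Quick read: This paper addresses insufficient thermal feature extraction and suboptimal cross-modal fusion in RGB-thermal (RGB-T) semantic segmentation, along with the loss of real-time performance caused by redundant encoders. The key to the solution is TUNI, whose unified RGB-T encoder performs multi-modal feature extraction and cross-modal fusion together and learns both jointly through large-scale pre-training on RGB and pseudo-thermal data. Slimming the thermal branch reduces model complexity, and an RGB-T local module uses adaptive cosine similarity to selectively emphasize salient local features that are consistent or distinct across modalities. The model keeps accuracy competitive while improving computational efficiency and real-time performance.

Link: https://arxiv.org/abs/2509.10005
Authors: Xiaodong Guo, Tong Liu, Yike Li, Zi’ang Lin, Zhihong Deng
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: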

Abstract:RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Code is available at this https URL.

[CV-38] FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction ICONIP2025

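Quick read: This paper addresses the poor performance of solar flare prediction models caused by severe class imbalance across flare classes (frequent weak flares versus rare strong ones), which undermines both accuracy and reliability. The key to the solution is a prediction framework built on multiple deep state space models together with a new frequency local-boundary-aware reliability loss (FLARE loss), which improves predictive performance and calibration reliability under class imbalance. Experiments show the method outperforms baselines on two standard metrics, the Gandin-Murphy-Gerrity score and the true skill statistic.

Link: https://arxiv.org/abs/2509.09988
Authors: Yusuke Takagi, Shunya Nagashima, Komei Sugiura
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Solar and Stellar Astrophysics (astro-ph.SR)
Comments: Accepted for presentation at ICONIP2025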

Abstract:Accurate and reliable solar flare predictions are essential to mitigate potential impacts on critical infrastructure. However, the current performance of solar flare forecasting is insufficient. In this study, we address the task of predicting the class of the largest solar flare expected to occur within the next 72 hours. Existing methods often fail to adequately address the severe class imbalance across flare classes. To address this issue, we propose a solar flare prediction model based on multiple deep state space models. In addition, we introduce the frequency local-boundary-aware reliability loss (FLARE loss) to improve predictive performance and reliability under class imbalance. Experiments were conducted on a multi-wavelength solar image dataset covering a full 11-year solar activity cycle. As a result, our method outperformed baseline approaches in terms of both the Gandin-Murphy-Gerrity score and the true skill statistic, which are standard metrics for performance and reliability.
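
The true skill statistic (TSS) is popular for imbalanced forecasting because it is unaffected by the ratio of positive to negative cases. The sketch below uses the standard binary definition, hit rate minus false alarm rate. Applying it to a multi-class flare problem requires choosing a threshold class first, which is left out here, and the function names are illustrative.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = hit rate - false alarm rate, computed from confusion-matrix counts."""
    hit_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return hit_rate - false_alarm_rate


def accuracy(tp, fn, fp, tn):
    """Fraction of correct predictions, shown for contrast with TSS."""
    return (tp + tn) / (tp + fn + fp + tn)
```

For example, a forecaster that never predicts a flare on a set with 10 flare cases and 990 quiet cases reaches an accuracy of 0.99 but a TSS of 0, which is why accuracy alone is misleading under class imbalance.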

[CV-39] ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking

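Quick read: This paper addresses two problems in RGB-Event visual object tracking: existing artificial neural networks (ANNs) struggle to exploit the sparse, asynchronous nature of event streams, and hybrid ANN-SNN architectures struggle to fuse features across modalities and paradigms. The key to the solution is ISTASTrack, the first transformer-based hybrid ANN-SNN tracker. Its core component is the ISTA adapter, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm (ISTA), which provides bidirectional feature interaction between the two branches. A temporal downsampling attention module aligns multi-step SNN features with single-step ANN features in the latent space. Together these fuse the spatial context of RGB images with the spatio-temporal dynamics of event streams, improving tracking performance and energy efficiency.

Link: https://arxiv.org/abs/2509.09977
Authors: Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng
Affiliations: Tsinghua University; Taiyuan University of Technology; Beijing Institute of Technology; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures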

Abstract:RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at this https URL.
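
For readers unfamiliar with the algorithm being unfolded, the sketch below is plain ISTA for the sparse coding problem of minimizing 0.5·||Ax − y||² + λ·||x||₁. It is the textbook iteration only. In an unfolded network each iteration becomes a layer whose step size and threshold are learned, and the paper's adapter design is not reproduced here.

```python
def soft_threshold(values, tau):
    """Shrink each entry toward zero by tau; entries within [-tau, tau] become 0."""
    return [max(v - tau, 0.0) - max(-v - tau, 0.0) for v in values]


def ista(A, y, lam, step, n_iters):
    """Iterative shrinkage thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1.

    A is a list of rows. For convergence, step should be at most 1 / L, where
    L is the largest eigenvalue of A^T A.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        # Residual A x - y.
        residual = [sum(a * xi for a, xi in zip(row, x)) - yi
                    for row, yi in zip(A, y)]
        # Gradient of the quadratic term: A^T (A x - y).
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by the shrinkage (proximal) step.
        x = soft_threshold([xj - step * gj for xj, gj in zip(x, grad)],
                           step * lam)
    return x
```

The shrinkage step is what produces sparsity: small coefficients are set exactly to zero rather than merely reduced.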

[CV-40] Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey

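Quick read: This survey addresses the performance limits of conventional frame-based vision systems under demands for low latency, high dynamic range, and difficult lighting, especially the loss of visual quality that motion blur, low light, and noise cause in video enhancement and 3D reconstruction. The key idea is to fuse event streams with conventional frame data, combining the asynchronous operation, high temporal resolution, and low power consumption of event cameras with deep learning to improve image/video enhancement and 3D reconstruction along two dimensions: temporally, better frame interpolation and motion deblurring; spatially, better super-resolution, low-light enhancement, high dynamic range (HDR) synthesis, and artifact suppression. By reviewing the relevant deep learning models and compiling public datasets, the survey provides a foundation and practical guidance for future work on event-driven fusion for visual media restoration and enhancement.

Link: https://arxiv.org/abs/2509.09971
Authors: Aupendu Kar, Vishnu Raj, Guan-Ming Su
Affiliations: Dolby Laboratories, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: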

Abstract:Event camera sensors are bio-inspired sensors which asynchronously capture per-pixel brightness changes and output a stream of events encoding the polarity, location and time of these changes. These systems are witnessing rapid advancements as an emerging field, driven by their low latency, reduced power consumption, and ultra-high capture rates. This survey explores the evolution of fusing event-stream capture with traditional frame-based capture, highlighting how this synergy significantly benefits various video restoration and 3D reconstruction tasks. The paper systematically reviews major deep learning contributions to image/video enhancement and restoration, focusing on two dimensions: temporal enhancement (such as frame interpolation and motion deblurring) and spatial enhancement (including super-resolution, low-light and HDR enhancement, and artifact reduction). This paper also explores how the 3D reconstruction domain evolves with the advancement of event driven fusion. Diverse topics are covered, with in-depth discussions on recent works for improving visual quality under challenging conditions. Additionally, the survey compiles a comprehensive list of openly available datasets, enabling reproducible research and benchmarking. By consolidating recent progress and insights, this survey aims to inspire further research into leveraging event camera systems, especially in combination with deep learning, for advanced visual media restoration and enhancement.
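
An event stream of (location, time, polarity) tuples usually has to be converted into a dense representation before it can be fused with frames. The sketch below shows the simplest such representation, a signed per-pixel event count over a time window. The `(x, y, t, polarity)` tuple layout and the function name are assumptions for illustration, and the surveyed methods use a variety of richer encodings.

```python
def accumulate_events(events, height, width, t_start, t_end):
    """Sum event polarities per pixel for events with t_start <= t < t_end.

    events is an iterable of (x, y, t, polarity) tuples, where polarity is
    positive for a brightness increase and negative for a decrease.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        inside_window = t_start <= t < t_end
        inside_image = 0 <= x < width and 0 <= y < height
        if inside_window and inside_image:
            frame[y][x] += 1 if polarity > 0 else -1
    return frame
```

Shorter windows preserve more of the sensor's temporal resolution at the cost of sparser frames, which is the basic trade-off behind event-guided interpolation and deblurring.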

[CV-41] An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock CVPR

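Quick read: This paper addresses the decline in long-term multi-object tracking (MOT) performance over time caused by identity switches. Existing MOT methods are prone to ID drift in long videos and struggle to keep identities consistent. The key to the solution is a Hidden Markov Model (HMM) framework that fuses sparse but real-world identifications (such as livestock identified at a feeding station) with tracking, using probabilistic modeling to combine uncertain identity information with motion trajectories and so improve tracking accuracy and robustness. Experiments show the method improves the F1 score of ByteTrack on a 10-minute pig tracking dataset, and validation on the MOT17 and MOT20 benchmarks shows it generalizes and is robust to identification uncertainty.

Link: https://arxiv.org/abs/2509.09962
Authors: Anne Marthe Sophie Ngo Bibinbe, Chiron Bang, Patrick Gagnon, Jamie Ahloy-Dallaire, Eric R. Paquet
Affiliations: Université Laval; Florida Atlantic University; CDPQ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, 1 table, accepted at CVPR animal workshop 2024, submitted to IJCV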

Abstract:The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach even with re-identification, on a 10 minute pig tracking dataset with 21 identifications at the pen’s feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: this https URL
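
The idea of combining sparse, uncertain identifications with tracking can be illustrated with Viterbi decoding on a small HMM for a single track: hidden states are identities, transitions model a small per-frame chance of an identity switch, and a frame with no identification contributes a uniform likelihood. This is a deliberately simplified sketch. The paper's model covers multiple tracks, and `p_switch` and the observation format are assumptions.

```python
import math


def viterbi_identities(observations, n_ids, p_switch=0.01):
    """Most likely identity sequence for one track (requires n_ids >= 2).

    observations[t] is None when frame t has no identification, otherwise a
    list of n_ids likelihoods (an uncertain identification).
    """
    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch / (n_ids - 1))
    uniform = [1.0 / n_ids] * n_ids

    def log_emit(obs):
        probs = uniform if obs is None else obs
        return [math.log(max(p, 1e-12)) for p in probs]

    score = [math.log(1.0 / n_ids) + e for e in log_emit(observations[0])]
    backpointers = []
    for obs in observations[1:]:
        emit = log_emit(obs)
        new_score, pointers = [], []
        for j in range(n_ids):
            # Best previous identity i for ending in identity j.
            candidates = [score[i] + (log_stay if i == j else log_move)
                          for i in range(n_ids)]
            best_i = max(range(n_ids), key=lambda i: candidates[i])
            pointers.append(best_i)
            new_score.append(candidates[best_i] + emit[j])
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final identity.
    state = max(range(n_ids), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

A single identification in the middle of a track propagates both forward and backward in time, and a weak contradictory reading does not flip an identity that earlier evidence strongly supports.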

[CV-42] Augment to Segment: Tackling Pixel-Level Imbalance in Wheat Disease and Pest Segmentation

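[Quick Read]: This paper addresses pixel-level imbalance in segmenting wheat foliar diseases and insect damage: insect damage occupies only a tiny fraction of annotated pixels, which leads models to overfit common classes and under-learn rare ones, hurting overall segmentation performance. The key to the solution is the Random Projected Copy-and-Paste (RPCP) augmentation: rare insect-damage patches are extracted from annotated training images, given random geometric transformations to simulate variation, and pasted into suitable regions that do not overlap lesions or existing damage. A random projection filter is then applied to the pasted regions to refine local features and blend them naturally with the new background. This substantially improves segmentation of the insect-damage class without degrading the other classes.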

Link: https://arxiv.org/abs/2509.09961
Authors: Tianqi Wei, Xin Yu, Zhi Chen, Scott Chapman, Zi Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate segmentation of foliar diseases and insect damage in wheat is crucial for effective crop management and disease control. However, the insect damage typically occupies only a tiny fraction of annotated pixels. This extreme pixel-level imbalance poses a significant challenge to the segmentation performance, which can result in overfitting to common classes and insufficient learning of rare classes, thereby impairing overall performance. In this paper, we propose a Random Projected Copy-and-Paste (RPCP) augmentation technique to address the pixel imbalance problem. Specifically, we extract rare insect-damage patches from annotated training images and apply random geometric transformations to simulate variations. The transformed patches are then pasted in appropriate regions while avoiding overlaps with lesions or existing damaged regions. In addition, we apply a random projection filter to the pasted regions, refining local features and ensuring a natural blend with the new background. Experiments show that our method substantially improves segmentation performance on the insect damage class, while maintaining or even slightly enhancing accuracy on other categories. Our results highlight the effectiveness of targeted augmentation in mitigating extreme pixel imbalance, offering a straightforward yet effective solution for agricultural segmentation problems.
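
The copy-and-paste step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: label 0 is taken to mean "free background", the only geometric transform is a random horizontal flip, and the random projection filter from the abstract is left out because its details are not given here. The function name and parameters are hypothetical.

```python
import numpy as np


def paste_rare_patch(image, label, patch, patch_mask, rare_id, rng, max_tries=50):
    """Paste a rare-class patch onto free background; returns modified copies.

    image: (H, W, 3), label: (H, W) ints, patch: (h, w, 3), patch_mask: (h, w) bool.
    """
    h, w = patch_mask.shape
    H, W = label.shape
    if rng.random() < 0.5:  # simple geometric transform: horizontal flip
        patch, patch_mask = patch[:, ::-1], patch_mask[:, ::-1]
    for _ in range(max_tries):
        y = int(rng.integers(0, H - h + 1))
        x = int(rng.integers(0, W - w + 1))
        region = label[y:y + h, x:x + w]
        if np.any(region[patch_mask] != 0):
            continue  # would overlap an existing lesion or damaged region
        out_img, out_lab = image.copy(), label.copy()
        out_img[y:y + h, x:x + w][patch_mask] = patch[patch_mask]
        out_lab[y:y + h, x:x + w][patch_mask] = rare_id
        return out_img, out_lab
    return image.copy(), label.copy()  # no valid location found
```

Rejecting candidate locations instead of overwriting existing labels keeps the original annotations intact, which matches the abstract's requirement of avoiding overlaps.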

[CV-43] Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification

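[Quick Read]: This paper addresses the heavy reliance of Referring Expression Comprehension (REC) on task-specific training: conventional methods use grounding models trained specifically for REC. The key to the solution is a zero-shot workflow that recasts REC as box-wise visual-language verification. A COCO-clean generic detector (YOLO-World) proposes candidate regions, and a general-purpose vision-language model (VLM) independently answers a True/False query for each region about whether it matches the expression. The procedure reduces cross-box interference, supports abstention and multiple matches, and needs no task-specific fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg it surpasses a zero-shot GroundingDINO baseline and exceeds reported results for GroundingDINO trained on REC and for GroundingDINO+CRG, indicating that workflow design rather than task-specific pretraining drives strong zero-shot REC performance.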

Link: https://arxiv.org/abs/2509.09958
Authors: Jeffrey Liu, Rongbin Hu
Affiliations: mycube.tv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
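
The control flow of box-wise verification is simple enough to sketch. In the snippet below the VLM is replaced by a caller-supplied function; `verify_boxes`, `ask_true_false`, and `toy_verifier` are hypothetical names, and no real detector or VLM API is called. The point is only that each box is judged independently, so the result can contain zero, one, or several boxes.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def verify_boxes(
    boxes: Sequence[Box],
    expression: str,
    ask_true_false: Callable[[Box, str], bool],
) -> List[Box]:
    """Query each proposal independently; an empty result means abstention."""
    return [box for box in boxes if ask_true_false(box, expression)]


def toy_verifier(box: Box, expression: str) -> bool:
    # Hypothetical stand-in for a VLM call: "large" matches boxes over 100 px^2.
    x1, y1, x2, y2 = box
    return expression == "large" and (x2 - x1) * (y2 - y1) > 100
```

A selection-style prompt would instead show all boxes at once and ask for a single index, which can neither abstain nor return multiple matches.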

[CV-44] Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge

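[Quick Read]: This paper addresses the difficulty of deploying large-scale transformers on resource-constrained edge devices, where high computation and communication costs limit practical use. The key to the solution is a training-free adaptive token merging mechanism that compresses transformer representations at runtime: under per-layer similarity thresholds, it selectively merges semantically redundant tokens, so the amount of merging follows the redundancy of the input rather than a fixed ratio. Finding merging strategies is cast as a multi-objective optimization problem, and Bayesian optimization is used to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification the method matches the accuracy of the unmodified transformer at lower compute and communication cost, and on visual question answering it is competitive with the full LLaVA model at a fraction of the compute and bandwidth.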

Link: https://arxiv.org/abs/2509.09955
Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat
Affiliations: KU 6G Research Center, Khalifa University, Abu Dhabi, UAE; Department of Electrical and Software Engineering, University of Calgary, Calgary, Alberta, Canada; Centre for Wireless Communications, University of Oulu, Finland
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to IEEE Journals

Abstract:Large-scale transformers are central to modern semantic communication, yet their high computational and communication costs hinder deployment on resource-constrained edge devices. This paper introduces a training-free framework for adaptive token merging, a novel mechanism that compresses transformer representations at runtime by selectively merging semantically redundant tokens under per-layer similarity thresholds. Unlike prior fixed-ratio reduction, our approach couples merging directly to input redundancy, enabling data-dependent adaptation that balances efficiency and task relevance without retraining. We cast the discovery of merging strategies as a multi-objective optimization problem and leverage Bayesian optimization to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification, we match the accuracy of the unmodified transformer with 30% fewer floating-point operations per second and under 20% of the original communication cost, while for visual question answering our method achieves performance competitive with the full LLaVA model at less than one-third of the compute and one-tenth of the bandwidth. Finally, we show that our adaptive merging is robust across varying channel conditions and provides inherent privacy benefits, substantially degrading the efficacy of model inversion attacks. Our framework provides a practical and versatile solution for deploying powerful transformer models in resource-limited edge intelligence scenarios.
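
Threshold-based merging differs from fixed-ratio reduction in that the number of surviving tokens depends on the input. The sketch below is a greedy illustration of that property, not the paper's algorithm: each token is averaged into the most similar kept token if their cosine similarity reaches `threshold` (a hypothetical per-layer parameter), otherwise it is kept as a new token.

```python
import numpy as np


def merge_tokens(tokens, threshold):
    """Greedily merge tokens by cosine similarity; returns the merged tokens.

    tokens: (N, D) array. Output has between 1 and N rows, depending on redundancy.
    """
    sums, counts = [], []
    for tok in tokens:
        best, best_sim = -1, threshold
        for i, s in enumerate(sums):
            mean = s / counts[i]
            denom = np.linalg.norm(tok) * np.linalg.norm(mean) + 1e-12
            sim = float(tok @ mean / denom)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No kept token is similar enough: keep this one as is.
            sums.append(tok.astype(float).copy())
            counts.append(1)
        else:
            # Fold the token into the running sum of its closest match.
            sums[best] += tok
            counts[best] += 1
    return np.stack([s / c for s, c in zip(sums, counts)])
```

A redundant input such as three near-identical tokens collapses to one row, while mutually orthogonal tokens pass through unchanged, so the compression rate adapts to the data without retraining.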

[CV-45] Chord: Chain of Rendering Decomposition for PBR Material Estimation from Generated Texture Images SIGGRAPH

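[Quick Read]: This paper addresses the heavy dependence of traditional PBR (Physically Based Rendering) material creation and reconstruction on artists' time and expertise. Existing methods use visual foundation models to synthesize PBR materials from user inputs but fall short in quality, flexibility, and user control. The key to the solution is a two-stage generate-and-estimate framework: in the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with the user input; in the estimation stage, a chained decomposition scheme sequentially predicts SVBRDF (Spatially Varying Bidirectional Reflectance Distribution Function) channels with a single-step image-conditional diffusion model, feeding previously extracted representations back in as input. The result is efficient, high-quality material generation and estimation with flexible user control.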

Link: https://arxiv.org/abs/2509.09952
Authors: Zhi Ying, Boxiang Rong, Jingyu Wang, Maoyuan Xu
Affiliations: Ubisoft La Forge; ETH Zürich
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH Asia 2025. Project page: this https URL

Abstract:Material creation and reconstruction are crucial for appearance modeling but traditionally require significant time and expertise from artists. While recent methods leverage visual foundation models to synthesize PBR materials from user-provided inputs, they often fall short in quality, flexibility, and user control. We propose a novel two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with user input. In the estimation stage, we introduce a chained decomposition scheme that sequentially predicts SVBRDF channels by passing previously extracted representation as input into a single-step image-conditional diffusion model. Our method is efficient, high quality, and enables flexible user control. We evaluate our approach against existing material generation and estimation methods, demonstrating superior performance. Our material estimation method shows strong robustness on both generated textures and in-the-wild photographs. Furthermore, we highlight the flexibility of our framework across diverse applications, including text-to-material, image-to-material, structure-guided generation, and material editing.

[CV-46] Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation ICCV

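[Quick Read]: This paper addresses the challenge of extending Multi-Target Multi-Camera Tracking (MTMC) systems from 2D to 3D space: how to use depth information for accurate 3D tracking without rebuilding the entire 2D tracking pipeline. The solution has two key parts. First, depth information is used to reconstruct each target in point-cloud space, and its 3D box is recovered after tracking through clustering and yaw refinement. Second, an enhanced online data association mechanism uses the consistency of a target's local IDs to assign global IDs across frames, so 3D MTMC tracking runs online. The framework placed 3rd on the leaderboard of the 2025 AI City Challenge 3D MTMC dataset.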

Link: https://arxiv.org/abs/2509.09946
Authors: Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran
Affiliations: Optoelectronics Center, Viettel Aerospace Institute, Viettel Group; University of Engineering and Technology, Vietnam National University; School of Electrical and Electronic Engineering, Hanoi University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCVW 2025

Abstract:Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduced an enhanced online data association mechanism that leverages the target’s local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge’s 3D MTMC dataset, achieving 3rd place on the leaderboard.
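
The first step of such a 2D-to-3D extension, lifting the pixels of a tracked 2D box into a point cloud, follows from the pinhole camera model. The sketch below assumes a depth map aligned with the image and known intrinsics `fx, fy, cx, cy`; the function name is hypothetical, and the clustering and yaw refinement that the abstract mentions are not shown.

```python
import numpy as np


def backproject_box(depth, box, fx, fy, cx, cy):
    """Lift pixels inside an integer 2D box (x1, y1, x2, y2) to camera-frame points.

    depth: (H, W) metric depth map; non-positive values are treated as invalid.
    Returns an (N, 3) array of (X, Y, Z) points.
    """
    x1, y1, x2, y2 = box
    vs, us = np.mgrid[y1:y2, x1:x2]  # pixel rows (v) and columns (u)
    z = depth[y1:y2, x1:x2]
    valid = z > 0
    z, us, vs = z[valid], us[valid], vs[valid]
    # Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

A robust statistic of the returned points, such as the per-axis median, gives a rough 3D position for the target before any box fitting.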

[CV-47] Segment Anything for Cell Tracking

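[Quick Read]: This paper addresses cell tracking and mitosis detection in time-lapse microscopy, which is difficult because of dividing objects, low signal-to-noise ratios, indistinct boundaries, dense clusters, and visually similar cells. Existing deep-learning methods depend on manually labeled training datasets, which are costly and time-consuming to produce, and they generalize poorly to unseen data. The key to the solution is to integrate Segment Anything 2 (SAM2), a large foundation model for general image and video segmentation, into the tracking pipeline. The resulting zero-shot cell tracking framework generalizes across diverse microscopy datasets without dataset-specific fine-tuning and achieves competitive accuracy on both 2D and large-scale 3D time-lapse videos.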

Link: https://arxiv.org/abs/2509.09943
Authors: Zhu Chen, Mert Edgü, Er Jin, Johannes Stegmaier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tracking cells and detecting mitotic events in time-lapse microscopy image sequences is a crucial task in biomedical research. However, it remains highly challenging due to dividing objects, low signal-to-noise ratios, indistinct boundaries, dense clusters, and the visually similar appearance of individual cells. Existing deep learning-based methods rely on manually labeled datasets for training, which is both costly and time-consuming. Moreover, their generalizability to unseen datasets remains limited due to the vast diversity of microscopy data. To overcome these limitations, we propose a zero-shot cell tracking framework by integrating Segment Anything 2 (SAM2), a large foundation model designed for general image and video segmentation, into the tracking pipeline. As a fully-unsupervised approach, our method does not depend on or inherit biases from any specific training dataset, allowing it to generalize across diverse microscopy datasets without finetuning. Our approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos while eliminating the need for dataset-specific adaptation.

[CV-48] SCoDA: Self-supervised Continual Domain Adaptation

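[Quick Read]: This paper addresses a key challenge in Source-Free Domain Adaptation (SFDA): adapting a model to a target domain without access to source-domain data. Existing methods typically start from a source model pre-trained with full supervision and distill knowledge by aligning instance-level features, but because they rely on cosine similarity over L2-normalized feature vectors, they inadvertently discard important geometric information about the source model's latent manifold. The paper proposes Self-supervised Continual Domain Adaptation (SCoDA), with two core departures. First, it drops supervised pre-training and initializes the framework with a teacher pre-trained entirely through self-supervised learning (SSL). Second, it brings geometric manifold alignment to the SFDA setting, training the student with a composite objective that combines instance-level feature matching with a Space Similarity Loss. To mitigate catastrophic forgetting, the teacher's parameters are updated as an Exponential Moving Average (EMA) of the student's. Experiments on benchmark datasets show that SCoDA significantly outperforms state-of-the-art SFDA methods.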

Link: https://arxiv.org/abs/2509.09935
Authors: Chirayu Agrawal, Snehasis Mukherjee
Affiliations: Shiv Nadar Institution of Eminence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICVGIP 2025

Abstract:Source-Free Domain Adaptation (SFDA) addresses the challenge of adapting a model to a target domain without access to the data of the source domain. Prevailing methods typically start with a source model pre-trained with full supervision and distill the knowledge by aligning instance-level features. However, these approaches, relying on cosine similarity over L2-normalized feature vectors, inadvertently discard crucial geometric information about the latent manifold of the source model. We introduce Self-supervised Continual Domain Adaptation (SCoDA) to address these limitations. We make two key departures from standard practice: first, we avoid the reliance on supervised pre-training by initializing the proposed framework with a teacher model pre-trained entirely via self-supervision (SSL). Second, we adapt the principle of geometric manifold alignment to the SFDA setting. The student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss. To combat catastrophic forgetting, the teacher’s parameters are updated via an Exponential Moving Average (EMA) of the student’s parameters. Extensive experiments on benchmark datasets demonstrate that SCoDA significantly outperforms state-of-the-art SFDA methods.
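
The EMA teacher update mentioned in the abstract is a standard rule: each teacher parameter becomes `m * teacher + (1 - m) * student`. The sketch below shows it on plain NumPy parameter dictionaries; the momentum value and the dictionary layout are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.99):
    """In-place EMA: teacher <- momentum * teacher + (1 - momentum) * student.

    teacher, student: dicts mapping parameter names to float NumPy arrays.
    """
    for name, w_student in student.items():
        teacher[name] *= momentum
        teacher[name] += (1.0 - momentum) * w_student
    return teacher
```

Because the teacher moves only a small step toward the student on each update, it changes slowly and retains earlier knowledge, which is why EMA teachers are commonly used against catastrophic forgetting.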

[CV-49] LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios

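[Quick Read]: This paper addresses overconfidence and low-quality pseudo-labels in Long-Tailed Semi-Supervised Learning (LTSSL), problems that arise because most prior methods train models from scratch. The core solution extends LTSSL to the foundation-model fine-tuning paradigm with LoFT (Long-tailed semi-supervised learning via parameter-efficient Fine-Tuning), in which fine-tuned foundation models generate more reliable pseudo-labels and thereby benefit imbalanced learning. For open-world conditions, where the unlabeled data may contain out-of-distribution (OOD) samples, the paper further proposes LoFT-OW to improve discriminative ability. Experiments on multiple benchmarks show superior performance over previous approaches, even when using only 1% of the unlabeled data used in previous works.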

Link: https://arxiv.org/abs/2509.09926
Authors: Jiahao Chen, Zhiyuan Huang, Yurou Liu, Bing Su
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long-tailed learning has garnered increasing attention due to its wide applicability in real-world scenarios. Among existing approaches, Long-Tailed Semi-Supervised Learning (LTSSL) has emerged as an effective solution by incorporating a large amount of unlabeled data into the imbalanced labeled dataset. However, most prior LTSSL methods are designed to train models from scratch, which often leads to issues such as overconfidence and low-quality pseudo-labels. To address these challenges, we extend LTSSL into the foundation model fine-tuning paradigm and propose a novel framework: LoFT (Long-tailed semi-supervised learning via parameter-efficient Fine-Tuning). We demonstrate that fine-tuned foundation models can generate more reliable pseudolabels, thereby benefiting imbalanced learning. Furthermore, we explore a more practical setting by investigating semi-supervised learning under open-world conditions, where the unlabeled data may include out-of-distribution (OOD) samples. To handle this problem, we propose LoFT-OW (LoFT under Open-World scenarios) to improve the discriminative ability. Experimental results on multiple benchmarks demonstrate that our method achieves superior performance compared to previous approaches, even when utilizing only 1% of the unlabeled data compared with previous works.
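
Why overconfidence matters for pseudo-labels is easiest to see in the generic confidence-thresholding rule used by many semi-supervised methods. The sketch below is that generic rule, not LoFT's specific procedure: an unlabeled sample receives a pseudo-label only if its maximum softmax probability reaches `threshold` (a hypothetical value), so an overconfident model lets wrong labels through, while low-confidence samples, which may include OOD data, are skipped.

```python
import numpy as np


def pseudo_label(logits, threshold=0.95):
    """Return (labels, indices) for samples whose confidence reaches the threshold.

    logits: (N, C) array of unnormalized class scores for unlabeled samples.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    keep = probs.max(axis=1) >= threshold
    return probs.argmax(axis=1)[keep], np.flatnonzero(keep)
```

Better-calibrated predictions make the retained subset both larger and cleaner, which is the benefit the abstract attributes to fine-tuned foundation models.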

[CV-50] An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars

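[Quick Read]: This paper addresses the limited interpretability of deep learning models in high-stakes forensic applications such as dental age estimation, where their "black box" nature hinders adoption. The core difficulty is that even a well-performing model gives no clear diagnosis of where its uncertainty comes from, which limits experts' trust in its decisions. The key to the solution is a framework that combines a convolutional autoencoder (AE) with a Vision Transformer (ViT). It raises classification accuracy over a baseline ViT, from 0.712 to 0.815 for the mandibular second molar (tooth 37) and from 0.462 to 0.543 for the mandibular third molar (tooth 38). Analysis of the AE's latent-space metrics and image reconstructions indicates that the remaining performance gap is data-centric, with high intra-class morphological variability in the tooth 38 dataset as a primary limiting factor. This multi-faceted diagnosis compensates for the limits of relying on a single interpretability mode such as attention maps, and offers a more robust tool for supporting expert decisions in forensic age estimation.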

Link: https://arxiv.org/abs/2509.09911
Authors: Barkin Buyukcakir, Jannick De Tobel, Patrick Thevissen, Dirk Vandermeulen, Peter Claes
Affiliations: KU Leuven; Ghent University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 21 pages, 11 figures, Scientific Reports

Abstract:The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the ‘black box’ nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE’s latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.

[CV-51] Surrogate Supervision for Robust and Generalizable Deformable Image Registration

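[Quick Read]: This paper addresses the limited robustness of deep-learning-based deformable image registration to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality differences. The key to the solution is **surrogate supervision**: estimated spatial transformations are applied to surrogate images, which decouples the input domain from the supervision domain. Models can then be trained on heterogeneous inputs while the supervision signal is computed in a domain where similarity is well defined, improving robustness and generalizability.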

Link: https://arxiv.org/abs/2509.09869
Authors: Yihao Liu, Junyu Chen, Lianrui Zuo, Shuwen Wei, Brian D. Boyd, Carmen Andreescu, Olusola Ajilore, Warren D. Taylor, Aaron Carass, Bennett A. Landman
Affiliations: Vanderbilt University; Johns Hopkins University; University of Pittsburgh; University of Illinois College of Medicine; Vanderbilt University Medical Center; Veterans Affairs Tennessee Valley Health System
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Objective: Deep learning-based deformable image registration has achieved strong accuracy, but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. We aim to develop a general training paradigm that improves the robustness and generalizability of registration networks. Methods: We introduce surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This allows training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. We evaluate the framework through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Results: Across tasks, surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. Conclusions: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Significance: Surrogate supervision offers a practical pathway to more robust and generalizable medical image registration, enabling broader applicability in diverse biomedical imaging scenarios.
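
The decoupling of input and supervision domains can be shown with a deliberately tiny example. In the sketch below the "transformation" is an integer translation applied with `np.roll`, standing in for the dense deformation field a registration network would estimate from the (possibly corrupted) input pair; the loss is then computed on clean surrogate images rather than on the inputs. All names are hypothetical and the setup is far simpler than the paper's.

```python
import numpy as np


def surrogate_loss(shift, moving_surrogate, fixed_surrogate):
    """MSE between the warped moving surrogate and the fixed surrogate.

    shift: (dy, dx) integer translation, a stand-in for a predicted transform.
    The surrogates are clean images in a domain where MSE is meaningful.
    """
    warped = np.roll(moving_surrogate, shift, axis=(0, 1))
    return float(np.mean((warped - fixed_surrogate) ** 2))
```

The network's inputs never appear in the loss, so they can contain artifacts, cropped fields of view, or another modality, as long as the surrogates share the same geometry.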

[CV-52] WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector

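[Quick Read]: This paper addresses the drop in UAV detection performance under challenging environmental conditions, in particular how to stay robust when visual information is limited. The key to the solution is WAVE-DETR, a multi-modal detector that fuses visible RGB images with acoustic signals in a unified model built on the Deformable DETR and Wav2Vec2 architectures. Acoustic embeddings are fused with the multi-resolution feature maps of Deformable DETR through four fusion configurations (gated mechanism, linear layer, MLP, and cross-attention). Gated fusion performs best: on the in-distribution and out-of-distribution ARDrone datasets it improves mAP for small drones by 11.1% to 15.3% across IoU thresholds between 0.5 and 0.9, with overall gains across all drone sizes of 3.27% to 5.84%.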

Link: https://arxiv.org/abs/2509.09859
Authors: Razvan Stefanescu, Ethan Oh, Ruben Vazquez, Chris Mesterharm, Constantin Serban, Ritu Chadha
Affiliations: Peraton Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 11 figures

Abstract:We introduce a multi-modal WAVE-DETR drone detector combining visible RGB and acoustic signals for robust real-life UAV object detection. Our approach fuses visual and acoustic features in a unified object detector model relying on the Deformable DETR and Wav2Vec2 architectures, achieving strong performance under challenging environmental conditions. Our work leverages the existing Drone-vs-Bird dataset and the newly generated ARDrone dataset containing more than 7,500 synchronized images and audio segments. We show how the acoustic information is used to improve the performance of the Deformable DETR object detector on the real ARDrone dataset. We developed, trained and tested four different fusion configurations based on a gated mechanism, linear layer, MLP and cross attention. The Wav2Vec2 acoustic embeddings are fused with the multi-resolution feature mappings of the Deformable DETR and enhance the object detection performance across all drone sizes. The best performer is the gated fusion approach, which improves the mAP of the Deformable DETR object detector on our in-distribution and out-of-distribution ARDrone datasets by 11.1% to 15.3% for small drones across all IoU thresholds between 0.5 and 0.9. The mAP scores for medium and large drones are also enhanced, with overall gains across all drone sizes ranging from 3.27% to 5.84%.
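
One common way to build a gated fusion of an audio embedding into a visual feature map is sketched below. The abstract does not specify the gate's design, so everything here is an assumption: the audio vector is projected to the visual channel count with `w_audio`, a sigmoid gate is computed from the pooled visual features and the projected audio, and the gated audio is added to every spatial location.

```python
import numpy as np


def gated_fusion(visual, audio, w_gate, b_gate, w_audio):
    """Add a gated, projected audio embedding to a visual feature map.

    visual: (C, H, W), audio: (D,), w_audio: (C, D), w_gate: (C, 2C), b_gate: (C,).
    """
    a = w_audio @ audio                  # project audio to C channels
    pooled = visual.mean(axis=(1, 2))    # global average pool, shape (C,)
    logits = w_gate @ np.concatenate([pooled, a]) + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))  # per-channel gate in (0, 1)
    return visual + (gate * a)[:, None, None]
```

The gate lets the model suppress the audio contribution per channel when it is uninformative, while a silent (all-zero) embedding leaves the visual features untouched.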

[CV-53] Investigating the Impact of Various Loss Functions and Learnable Wiener Filter for Laparoscopic Image Desmoking

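[Quick Read]: This paper examines the effectiveness and necessity of the individual components of the recently proposed ULW framework for laparoscopic image desmoking. ULW combines a U-Net based backbone, a compound loss function (mean squared error (MSE), structural similarity index (SSIM) loss, and perceptual loss), and a differentiable, learnable Wiener filter module. The key to the study is a systematic ablation that evaluates each component's contribution to overall performance, covering removal of the learnable Wiener filter and selective use of individual loss terms. All variants are benchmarked on a publicly available paired laparoscopic image dataset with quantitative metrics (SSIM, PSNR, MSE, and CIEDE-2000) and qualitative visual comparisons.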

Link: https://arxiv.org/abs/2509.09849
Authors: Chengyu Yang, Chengjun Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To rigorously assess the effectiveness and necessity of individual components within the recently proposed ULW framework for laparoscopic image desmoking, this paper presents a comprehensive ablation study. The ULW approach combines a U-Net based backbone with a compound loss function that comprises mean squared error (MSE), structural similarity index (SSIM) loss, and perceptual loss. The framework also incorporates a differentiable, learnable Wiener filter module. In this study, each component is systematically ablated to evaluate its specific contribution to the overall performance of the whole framework. The analysis includes: (1) removal of the learnable Wiener filter, (2) selective use of individual loss terms from the composite loss function. All variants are benchmarked on a publicly available paired laparoscopic images dataset using quantitative metrics (SSIM, PSNR, MSE and CIEDE-2000) alongside qualitative visual comparisons.
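
For readers unfamiliar with the Wiener filter, the classical frequency-domain form is X = conj(H) / (|H|^2 + k) * Y, where H is the degradation kernel's spectrum, Y the observed image's spectrum, and k a noise-to-signal constant. The sketch below implements that textbook form with NumPy FFTs. How the ULW module parameterizes its learnable filter is not described in this listing, so treating `k` (or the kernel) as the trainable quantity is only an assumption.

```python
import numpy as np


def wiener_deconvolve(observed, kernel, k):
    """Classical Wiener deconvolution of a 2D image under circular convolution.

    observed: (H, W) degraded image, kernel: small 2D blur kernel,
    k: non-negative noise-to-signal constant (k = 0 gives the plain inverse filter).
    """
    H = np.fft.fft2(kernel, s=observed.shape)  # zero-pad kernel to image size
    Y = np.fft.fft2(observed)
    X = np.conj(H) / (np.abs(H) ** 2 + k) * Y
    return np.real(np.fft.ifft2(X))
```

Every operation here is differentiable with respect to `k` and the kernel, which is what makes a learnable variant possible inside an end-to-end network.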

[CV-54] Privacy-Preserving Automated Rosacea Detection Based on Medically Inspired Region of Interest Selection

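[Quick Read]: This paper addresses three challenges in automated rosacea detection: the diffuse nature of symptoms, the scarcity of labeled data, and the privacy concerns of using identifiable facial images. The key to the solution is a privacy-preserving detection method that combines clinical priors with synthetic data. First, based on the clinical observation that rosacea manifests mainly as central facial erythema, a fixed redness-informed mask is built by selecting regions with consistently high red-channel intensity across facial images; it focuses on diagnostically relevant areas such as the cheeks, nose, and forehead and excludes identity-revealing features. Second, a ResNet-18 model trained on the masked synthetic images outperforms full-face baselines in accuracy, recall, and F1 score when evaluated on real-world test data.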

Link: https://arxiv.org/abs/2509.09844
Authors: Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu
Affiliations: New Jersey Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rosacea is a common but underdiagnosed inflammatory skin condition that primarily affects the central face and presents with subtle redness, pustules, and visible blood vessels. Automated detection remains challenging due to the diffuse nature of symptoms, the scarcity of labeled datasets, and privacy concerns associated with using identifiable facial images. A novel privacy-preserving automated rosacea detection method inspired by clinical priors and trained entirely on synthetic data is presented in this paper. Specifically, the proposed method, which leverages the observation that rosacea manifests predominantly through central facial erythema, first constructs a fixed redness-informed mask by selecting regions with consistently high red channel intensity across facial images. The mask thus is able to focus on diagnostically relevant areas such as the cheeks, nose, and forehead and exclude identity-revealing features. Second, the ResNet-18 deep learning method, which is trained on the masked synthetic images, achieves superior performance over the full-face baselines with notable gains in terms of accuracy, recall and F1 score when evaluated using the real-world test data. The experimental results demonstrate that the synthetic data and clinical priors can jointly enable accurate and ethical dermatological AI systems, especially for privacy sensitive applications in telemedicine and large-scale screening.
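
A fixed redness-informed mask of the kind described above can be approximated in a few lines. The abstract does not say how "consistently high" is measured, so this sketch makes an assumption: it averages the red channel over a stack of aligned face images and keeps pixels at or above a chosen quantile of that average. Function names and the `quantile` parameter are hypothetical.

```python
import numpy as np


def redness_mask(images, quantile=0.8):
    """Fixed boolean mask of pixels whose mean red intensity is high.

    images: (N, H, W, 3) aligned RGB images as floats. Returns an (H, W) mask.
    """
    mean_red = images[..., 0].mean(axis=0)  # average red channel over the stack
    return mean_red >= np.quantile(mean_red, quantile)


def apply_mask(image, mask):
    """Zero out everything outside the mask; image is (H, W, 3)."""
    return image * mask[..., None]
```

Because the mask is computed once and then applied unchanged to every image, identity-revealing regions outside it never reach the classifier.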

[CV-55] Patch-based Automatic Rosacea Detection Using the ResNet Deep Learning Framework

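[Quick Read]: This paper addresses the precise, early detection of rosacea, which improves treatment effectiveness. The key to the solution is a set of patch-based automatic detection strategies built on the ResNet-18 deep learning framework: image patches of different sizes, shapes, and locations are extracted from facial images to focus the model on clinically relevant regions. Several patch-based strategies achieve accuracy and sensitivity competitive with or superior to full-image methods while enhancing robustness and interpretability. Because they use only localized patches that exclude identifiable facial features, they also inherently preserve patient privacy.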

Link: https://arxiv.org/abs/2509.09841
Authors: Chengyu Yang, Rishik Reddy Yesgari, Chengjun Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rosacea, which is a chronic inflammatory skin condition that manifests with facial redness, papules, and visible blood vessels, often requires precise and early detection for significantly improving treatment effectiveness. This paper presents new patch-based automatic rosacea detection strategies using the ResNet-18 deep learning framework. The contributions of the proposed strategies come from the following aspects. First, various image patches are extracted from the facial images of people in different sizes, shapes, and locations. Second, a number of investigation studies are carried out to evaluate how the localized visual information influences the deep learning model performance. Third, thorough experiments are implemented to reveal that several patch-based automatic rosacea detection strategies achieve accuracy and sensitivity competitive with or superior to the full-image based methods. And finally, the proposed patch-based strategies, which use only localized patches, inherently preserve patient privacy by excluding any identifiable facial features from the data. The experimental results indicate that the proposed patch-based strategies guide the deep learning model to focus on clinically relevant regions, enhance robustness and interpretability, and protect patient privacy. As a result, the proposed strategies offer practical insights for improving automated dermatological diagnostics.

[CV-56] DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception

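[Quick Read]: This paper addresses a performance bottleneck of multi-sensor fusion for semantic perception in autonomous driving: existing methods treat sensor data uniformly across the spatial extent of the input, which hinders performance under challenging conditions. The key to the solution is DGFusion, a depth-guided multimodal fusion method that uses lidar measurements both as a model input and as ground truth for learning depth. An auxiliary depth head learns depth-aware features, which are encoded into spatially varying local depth tokens. Together with a global condition token, these tokens condition the attentive cross-modal fusion and adapt it dynamically to the spatially varying reliability of each sensor, which depends largely on depth. The paper also proposes a robust depth loss for learning from lidar inputs that are typically sparse and noisy in adverse conditions. The method achieves state-of-the-art panoptic and semantic segmentation performance on the MUSES and DELIVER datasets.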

Link: https://arxiv.org/abs/2509.09828
Authors: Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
Affiliations: ETH Zürich; Toyota Motor Europe; INSAIT, Sofia University St. Kliment Ohridski
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Code and models will be available at this https URL

Abstract:Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model’s inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at this https URL
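
The abstract does not spell out the robust depth loss, so the sketch below substitutes a well-known alternative to convey the two requirements it names: sparsity is handled by averaging only over pixels that have a lidar return, and noise is handled by a Huber penalty, which grows linearly rather than quadratically for large errors. The function name and `delta` are hypothetical, and this should not be read as the paper's loss.

```python
import numpy as np


def masked_huber(pred, target, delta=1.0):
    """Huber loss averaged over pixels with a valid (positive) lidar depth.

    pred, target: (H, W) depth maps; target is 0 where no lidar return exists.
    """
    valid = target > 0
    if not valid.any():
        return 0.0
    err = np.abs(pred[valid] - target[valid])
    quad = np.minimum(err, delta)
    # 0.5 * e^2 when e <= delta, delta * (e - 0.5 * delta) otherwise.
    return float(np.mean(0.5 * quad ** 2 + delta * (err - quad)))
```

With sparse targets, an unmasked loss would push predictions toward zero at every empty pixel, which is why the validity mask matters as much as the robust penalty.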

[CV-57] Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test

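[Quick Read]: This paper addresses the difficulty of early screening for vision abnormalities in children, especially where specialist equipment and medical resources are lacking. The key to the solution is KidsVisionCheck, a free mobile application that analyzes red-eye reflex images taken with a smartphone, using deep neural networks trained on children's pupil images collected and labeled by an ophthalmologist. The model reaches 90% accuracy on unseen test data without specialist equipment, and the study identifies the optimal conditions for data collection, which can be used to give users immediate feedback.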

Link: https://arxiv.org/abs/2509.09808
Authors: Judith Massmann, Alexander Lichtenstein, Francisco M. López
Affiliations: Health Access LLC; Frankfurt Institute for Advanced Studies
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IEEE ICDL 2025. 6 pages, 7 figures, 2 tables

Abstract:Numerous visual impairments can be detected in red-eye reflex images from young children. The so-called Bruckner test is traditionally performed by ophthalmologists in clinical settings. Thanks to the recent technological advances in smartphones and artificial intelligence, it is now possible to recreate the Bruckner test using a mobile device. In this paper, we present a first study conducted during the development of KidsVisionCheck, a free application that can perform vision screening with a mobile device using red-eye reflex images. The underlying model relies on deep neural networks trained on children’s pupil images collected and labeled by an ophthalmologist. With an accuracy of 90% on unseen test data, our model provides highly reliable performance without the necessity of specialist equipment. Furthermore, we can identify the optimal conditions for data collection, which can in turn be used to provide immediate feedback to the users. In summary, this work marks a first step toward accessible pediatric vision screenings and early intervention for vision abnormalities worldwide.

[CV-58] Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors

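[Quick Read]: This paper addresses fine-grained cross-view (ground-to-aerial) localization: accurately estimating the 3-degrees-of-freedom (3DoF) camera pose of a ground-level image while keeping the method highly interpretable. Previous methods typically transform the ground image into a bird's-eye view (BEV) representation before aligning it with the aerial image, but this transformation loses information through perspective distortion or compression of height information and degrades alignment quality. The key to the solution is to establish local feature correspondences directly between the ground and aerial images and to lift only the matched keypoints to BEV space using a monocular depth prior, which avoids the information loss of a whole-image transformation. The method supports both metric and relative depth, and uses a scale-aware Procrustes alignment to estimate the pose from the correspondences, optionally recovering the scale when relative depth is used. Experiments show that, with only weak supervision on camera pose, the method learns accurate local feature correspondences, performs strongly under challenging conditions such as cross-area generalization and unknown orientation, and is compatible with various relative depth models without per-model fine-tuning.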

链接: https://arxiv.org/abs/2509.09792
作者: Zimin Xia,Chenghao Xu,Alexandre Alahi
机构: École Polytechnique Fédérale de Lausanne (EPFL), Switzerland(瑞士洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose an accurate and highly interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image by matching its local features with a reference aerial image. Previous methods typically transform the ground image into a bird’s-eye view (BEV) representation and then align it with the aerial image for localization. However, this transformation often leads to information loss due to perspective distortion or compression of height information, thereby degrading alignment quality with the aerial view. In contrast, our method directly establishes correspondences between ground and aerial images and lifts only the matched keypoints to BEV space using monocular depth prior. Notably, modern depth predictors can provide reliable metric depth when the test samples are similar to the training data. When the depth distribution differs, they still produce consistent relative depth, i.e., depth accurate up to an unknown scale. Our method supports both metric and relative depth. It employs a scale-aware Procrustes alignment to estimate the camera pose from the correspondences and optionally recover the scale when using relative depth. Experimental results demonstrate that, with only weak supervision on camera pose, our method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions, such as cross-area generalization and unknown orientation. Moreover, our method is compatible with various relative depth models without requiring per-model finetuning. This flexibility, combined with strong localization performance, makes it well-suited for real-world deployment.
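
The scale-aware Procrustes alignment mentioned above can be illustrated with a generic closed-form similarity fit (in the style of Umeyama's method). This is a minimal sketch, not the authors' implementation: the function name `procrustes_2d` and the assumption that matched keypoints are already available as 2D BEV coordinates are ours.

```python
import numpy as np

def procrustes_2d(src, dst, estimate_scale=True):
    """Least-squares fit of dst ≈ s * R @ src + t for matched 2D points."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                        # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0   # rule out reflections
    R = U @ np.diag([1.0, d]) @ Vt
    # Scale is only estimated when requested (e.g. for relative depth).
    s = (S[0] + d * S[1]) / (xs ** 2).sum(axis=1).mean() if estimate_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t

# Hypothetical matched keypoints: ground points lifted to BEV vs. aerial points.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 2))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = 2.5 * src @ R_true.T + np.array([1.0, -2.0])
s, R, t = procrustes_2d(src, dst)
yaw = np.arctan2(R[1, 0], R[0, 0])                    # the rotational degree of freedom
```

With metric depth one would presumably call it with `estimate_scale=False`, so that only the rotation and the 2D translation (the 3DoF pose) are fitted.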

[CV-59] Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Clouds Classification via Token Purging

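[Quick Read]: This paper addresses the performance degradation caused by distribution shift in 3D point cloud classification, the core challenge of test-time adaptation (TTA). The key to its solution is Token Purging (PG), a backpropagation-free method that identifies and removes tokens strongly affected by domain shift before they reach the attention layers, improving robustness on the target domain. Because PG intervenes at the token level, it avoids iterative updates. Two variants are proposed: PG-SP, which uses source-domain statistics, and PG-SF, a fully source-free version. They deliver strong adaptation performance and efficiency in the source-available and source-free settings, respectively.

Link: https://arxiv.org/abs/2509.09785
Authors: Moslem Yazdanpanah, Ali Bahri, Mehrdad Noori, Sahar Dastani, Gustavo Adolfo Vargas Hakim, David Osowiechi, Ismail Ben Ayed, Christian Desrosiers
Affiliations: LIVIA, ÉTS Montréal, Canada; International Laboratory on Learning Systems (ILLS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: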

Abstract:Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4 times faster and 5.5 times more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at this https URL
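
To make the idea of purging tokens with source statistics concrete, here is a rough sketch. The abstract does not state the actual scoring rule, so the z-score distance, the `keep_ratio` parameter, and the function name `purge_tokens` are all assumptions for illustration only.

```python
import numpy as np

def purge_tokens(tokens, source_mean, source_std, keep_ratio=0.8):
    """Keep the tokens whose features deviate least from source statistics."""
    z = (tokens - source_mean) / (source_std + 1e-6)    # per-feature z-scores
    deviation = np.linalg.norm(z, axis=1)                # one score per token
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(deviation)[:n_keep])       # preserve token order
    return tokens[keep], keep

# Toy example: 10 tokens with 4 features, one of them heavily shifted.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
tokens[3] += 50.0                                        # simulated corrupted token
kept, kept_idx = purge_tokens(tokens, np.zeros(4), np.ones(4), keep_ratio=0.9)
```

No gradients are computed here, which is the property the paper emphasizes; the surviving tokens would then be passed on to the attention layers.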

[CV-60] A Co-Training Semi-Supervised Framework Using Faster R-CNN and YOLO Networks for Object Detection in Densely Packed Retail Images

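[Quick Read]: This paper addresses object detection for densely packed products in retail environments, where scarce labeled data and complex scenes (occlusion, overlapping objects) reduce detection accuracy. The key to its solution is a semi-supervised co-training framework that combines the precise localization of Faster R-CNN (ResNet backbone) with the global context of YOLO (Darknet backbone), exchanging pseudo-labels between the two models to improve robustness. It also adds an ensemble classifier of XGBoost, Random Forest, and SVM that uses diverse feature representations to improve classification accuracy, and tunes hyperparameters with a metaheuristic algorithm. The approach significantly improves detection performance while reducing reliance on manual annotation, and suits retail settings where products and layouts change frequently.

Link: https://arxiv.org/abs/2509.09750
Authors: Hossein Yazdanjouei, Arash Mansouri, Mohammad Shokouhifar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: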

Abstract:This study proposes a semi-supervised co-training framework for object detection in densely packed retail environments, where limited labeled data and complex conditions pose major challenges. The framework combines Faster R-CNN (utilizing a ResNet backbone) for precise localization with YOLO (employing a Darknet backbone) for global context, enabling mutual pseudo-label exchange that improves accuracy in scenes with occlusion and overlapping objects. To strengthen classification, it employs an ensemble of XGBoost, Random Forest, and SVM, utilizing diverse feature representations for higher robustness. Hyperparameters are optimized using a metaheuristic-driven algorithm, enhancing precision and efficiency across models. By minimizing reliance on manual labeling, the approach reduces annotation costs and adapts effectively to frequent product and layout changes common in retail. Experiments on the SKU-110k dataset demonstrate strong performance, highlighting the scalability and practicality of the proposed framework for real-world retail applications such as automated inventory tracking, product monitoring, and checkout systems.
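
One way to picture the mutual pseudo-label exchange is sketched below. The selection rule (confident detections that the peer model has no overlapping box for), the thresholds, and the function names are assumptions; the abstract does not describe the exact exchange criterion.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pseudo_labels_for_peer(own_dets, peer_dets, conf_thr=0.9, iou_thr=0.5):
    """Confident boxes from one detector that the peer detector did not find."""
    out = []
    for box, score in own_dets:
        if score < conf_thr:
            continue                                   # too uncertain to teach with
        if all(iou(box, pbox) < iou_thr for pbox, _ in peer_dets):
            out.append(box)                            # new information for the peer
    return out

# Hypothetical detections on one unlabeled image: (box, confidence).
frcnn = [((0, 0, 10, 10), 0.95), ((20, 20, 30, 30), 0.95), ((40, 40, 50, 50), 0.5)]
yolo = [((1, 1, 10, 10), 0.8)]
for_yolo = pseudo_labels_for_peer(frcnn, yolo)
for_frcnn = pseudo_labels_for_peer(yolo, frcnn)
```

Calling the function in both directions gives the "mutual" part: each detector receives the other's confident boxes as extra training targets for the next round.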

[CV-61] Images in Motion?: A First Look into Video Leakage in Collaborative Deep Learning

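[Quick Read]: This paper addresses the possible leakage of video data in Federated Learning (FL) under gradient inversion attacks. Existing work has shown that such attacks can reconstruct image, text, and tabular data from shared gradients, but the privacy risk for video data has not been systematically evaluated. The paper is the first to compare two mainstream video classification approaches, one based on pre-trained feature extractors and one that processes raw video frames directly, and finds that feature extractors offer markedly greater resistance to gradient inversion attacks. It also shows that leakage remains possible even with a feature extractor if the classifier lacks sufficient complexity, and that image super-resolution can improve the quality of the video an attacker reconstructs, exposing the privacy vulnerability of video data in FL and the factors that influence it.

Link: https://arxiv.org/abs/2509.09742
Authors: Md Fazle Rasul, Alanood Alqobaisi, Bruhadeshwar Bezawada, Indrakshi Ray
Affiliations: Colorado State University; Southern Arkansas University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: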

Abstract:Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.
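
A small worked example shows why shared gradients can leak inputs at all. For a single sample passing through a fully connected layer with a bias, the weight gradient is the outer product of the bias gradient and the input, so the input can be read off directly. This is a textbook special case, not the attack evaluated in the paper, which targets video classifiers; the layer sizes and loss below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # private input, e.g. a flattened frame feature
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
target = np.array([1.0, 0.0, 0.0])

# Client side: squared-error loss L = 0.5 * ||W x + b - target||^2
residual = W @ x + b - target
grad_W = np.outer(residual, x)         # dL/dW = residual * x^T
grad_b = residual                      # dL/db = residual

# Attacker side: only grad_W and grad_b are observed.
row = np.argmax(np.abs(grad_b))        # any row with a non-zero bias gradient works
x_recovered = grad_W[row] / grad_b[row]
```

Deeper networks and batched updates do not admit such a closed form, which is where optimization-based gradient inversion comes in.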

[CV-62] World Modeling with Probabilistic Structure Integration

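[Quick Read]: This paper addresses how to learn richly controllable and flexibly promptable world models from large-scale data. Conventional methods struggle to combine accurate modeling of complex data distributions with support for diverse control instructions. The proposed Probabilistic Structure Integration (PSI) system achieves this through a three-step cycle. First, it builds Psi, an autoregressive sequence model in the form of a probabilistic graphical model, which supports conditional distributions between arbitrary variables. Second, it uses causal inference to extract low-dimensional intermediate structures from the model in a zero-shot fashion, such as optical flow, depth, and object segmentation. Third, it converts these structures into new token types and feeds them back into training as conditioning signals, continually strengthening the model and introducing an LLM-like universal prompting interface. The key innovation is coupling structure extraction with iterative training in a closed loop, which markedly improves the expressiveness and controllability of the world model.

Link: https://arxiv.org/abs/2509.09737
Authors: Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
Affiliations: Stanford NeuroAI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: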

Abstract:We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful “intermediate structures”, in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles – akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.

[CV-63] Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs

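[Quick Read]: This paper examines the limited performance of Vision Language Models (VLMs) on fine-grained visual classification and large hierarchical label spaces, and asks whether structured, tree-based reasoning can improve it. The key to its approach is a decision-tree framework that decomposes classification into a sequence of interpretable decisions, with prompt engineering used to improve the semantic alignment of the tree, for example by enriching prompts with class names and image descriptions generated by Large Language Models (LLMs). Experiments show that although the model understands the tree knowledge with high accuracy (98.2%), tree-based reasoning still underperforms standard zero-shot prompting, indicating the limits of current structured reasoning for visual classification.

Link: https://arxiv.org/abs/2509.09732
Authors: Sary Elmansoury, Islam Mesabah, Gerrit Großmann, Peter Neigel, Raj Bhalwankar, Daniel Kondermann, Sebastian J. Vollmer
Affiliations: 1: University of Stuttgart; 2: Fraunhofer Institute for Intelligent Analysis and Information Systems; 3: University of Bonn; 4: Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: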

Abstract:Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
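
Decomposing classification into a chain of decisions can be sketched as a plain tree walk. The tree below is a hypothetical example over four CIFAR-10 classes, and `ask` is a stand-in for a VLM query that returns one of the offered options; neither is taken from the paper.

```python
def classify_with_tree(node, ask):
    """Walk a decision tree; `ask(question, options)` returns one chosen option."""
    while isinstance(node, dict):
        choice = ask(node["question"], list(node["children"]))
        node = node["children"][choice]
    return node                                  # leaf: the predicted class name

# Hypothetical tree; in practice each question would be sent with the image.
tree = {
    "question": "Is the object an animal or a vehicle?",
    "children": {
        "animal": {"question": "Does it fly?",
                   "children": {"yes": "bird", "no": "cat"}},
        "vehicle": {"question": "Does it travel on water?",
                    "children": {"yes": "ship", "no": "truck"}},
    },
}

# Scripted answers stand in for model responses.
answers = iter(["vehicle", "yes"])
prediction = classify_with_tree(tree, lambda question, options: next(answers))
```

A single wrong answer near the root cannot be corrected further down, which is one plausible reason a tree can underperform direct zero-shot prompting.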

[CV-64] MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance

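[Quick Read]: This paper addresses the limited performance of general-domain Large Multimodal Models (LMMs) in Intelligent Traffic Surveillance (ITS), which stems from the lack of dedicated multimodal datasets for ITS tasks. The key to its solution is MITS, the first large-scale multimodal benchmark dataset for ITS. It contains 170,400 real-world traffic surveillance images with fine-grained annotations covering 8 main categories and 24 subcategories of objects and events. A systematic data generation pipeline adds high-quality image captions and 5 million instruction-following visual question-answer pairs covering five core ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. Experiments show that fine-tuning mainstream LMMs on MITS substantially improves their performance on ITS tasks, confirming the dataset's value as key infrastructure.

Link: https://arxiv.org/abs/2509.09730
Authors: Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian
Affiliations: China Unicom
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by Image and Vision Computing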

Abstract:General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS’s effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5’s performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6’s from 0.678 to 0.921 (+35.8%), Qwen2-VL’s from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL’s from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
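
The percentages quoted in the abstract are relative gains, (after - before) / before, rather than absolute differences. A short check using only the scores reported above:

```python
# Scores before and after fine-tuning on MITS, as quoted in the abstract.
scores = {
    "LLaVA-1.5": (0.494, 0.905),
    "LLaVA-1.6": (0.678, 0.921),
    "Qwen2-VL": (0.584, 0.926),
    "Qwen2.5-VL": (0.732, 0.930),
}

def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in scores.items()}
print(gains)
# {'LLaVA-1.5': 83.2, 'LLaVA-1.6': 35.8, 'Qwen2-VL': 58.6, 'Qwen2.5-VL': 27.0}
```

The weakest starting model gains the most in relative terms, while all four end within a narrow band between 0.905 and 0.930.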

[CV-65] A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval

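[Quick Read]: This paper addresses the accuracy of housing damage assessment after natural disasters, in support of insurance claims response and resource planning. Its core solution is a Multimodal Retrieval-Augmented Generation (MM-RAG) framework built around a two-branch encoder. The image branch uses a visual encoder combining ResNet and a Transformer to extract features of post-disaster building damage; the text branch uses a BERT retriever to vectorize social media posts and insurance policies and to build a retrievable restoration index. A cross-modal interaction module aligns image and text semantics through multi-head attention, and a modal attention gating mechanism in the generation module dynamically controls the contributions of visual evidence and textual prior information. The whole model is trained end to end with a multi-task objective combining contrastive, retrieval, and generation losses, so that image understanding and policy matching are learned jointly. It improves both retrieval accuracy and damage severity classification, raising Top-1 retrieval accuracy by 9.6%.

Link: https://arxiv.org/abs/2509.09721
Authors: Jiayi Miao, Dingxin Lu, Zhuqi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: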

Abstract:After natural disasters, accurate evaluation of housing damage is important for insurance claims response and resource planning. In this work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG) framework. On top of the classical RAG architecture, we devise a two-branch multimodal encoder structure: the image branch employs a visual encoder composed of ResNet and Transformer to extract features of post-disaster building damage, and the text branch uses a BERT retriever to vectorize posts and insurance policies and to construct a retrievable restoration index. To impose cross-modal semantic alignment, the model integrates a cross-modal interaction module that bridges the semantic representations of image and text via multi-head attention. Meanwhile, in the generation module, a modal attention gating mechanism dynamically controls the roles of visual evidence and text prior information during generation. The entire framework is trained end to end, combining the comparison loss, the retrieval loss, and the generation loss into a multi-task optimization objective, and achieves image understanding and policy matching through collaborative learning. The results demonstrate superior performance in retrieval accuracy and in the classification metric for damage severity, with Top-1 retrieval accuracy improved by 9.6%.
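
The gating idea in the generation module can be illustrated with a common element-wise gate. The abstract does not give the gate's exact form, so the sigmoid-of-linear formulation, the parameter names `W_g` and `b_g`, and the vector sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(visual, text, W_g, b_g):
    """Blend two modality vectors with an element-wise gate in (0, 1)."""
    gate = sigmoid(W_g @ np.concatenate([visual, text]) + b_g)
    return gate * visual + (1.0 - gate) * text, gate

# Toy 4-dimensional embeddings for the image and the retrieved policy text.
visual = np.array([1.0, 2.0, 3.0, 4.0])
text = np.array([-1.0, 0.0, 1.0, 2.0])

# Untrained gate (all zeros) weights both modalities equally.
fused_equal, gate_equal = gated_fusion(visual, text, np.zeros((4, 8)), np.zeros(4))

# A strongly positive bias pushes the gate toward the visual evidence.
fused_visual, _ = gated_fusion(visual, text, np.zeros((4, 8)), np.full(4, 20.0))
```

In a trained model the gate parameters would be learned, so the balance between visual evidence and text priors could change per example and per feature.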

[CV-66] Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision

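[Quick Read]: This paper addresses a common limitation of existing datasets for robotics and computer vision: most rely on synthetic models or on specialized objects that are hard to obtain, and so lack the diversity and accessibility of real-world items. The key to its solution is the Australian Supermarket Object Set (ASOS), which provides high-quality textured 3D meshes of 50 common supermarket items, acquired with structure-from-motion techniques and high-resolution imaging to produce watertight meshes. All objects come from a major Australian supermarket chain and are inexpensive and easy to obtain, which improves the applicability of benchmarks in real-world settings.

Link: https://arxiv.org/abs/2509.09720
Authors: Akansel Cosgun, Lachlan Chumbley, Benjamin J. Meyer
Affiliations: Deakin University; Monash University; Coles Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: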

Abstract:This paper introduces the Australian Supermarket Object Set (ASOS), a comprehensive dataset comprising 50 readily available supermarket items with high-quality 3D textured meshes designed for benchmarking in robotics and computer vision applications. Unlike existing datasets that rely on synthetic models or specialized objects with limited accessibility, ASOS provides a cost-effective collection of common household items that can be sourced from a major Australian supermarket chain. The dataset spans 10 distinct categories with diverse shapes, sizes, and weights. 3D meshes are acquired using structure-from-motion techniques with high-resolution imaging to generate watertight meshes. The dataset’s emphasis on accessibility and real-world applicability makes it valuable for benchmarking object detection, pose estimation, and robotics applications.

[CV-67] Multi-pathology Chest X-ray Classification with Rejection Mechanisms

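[Quick Read]: This paper addresses overconfidence of deep learning models in high-stakes medical imaging tasks, in particular multi-label classification of chest X-rays, where the model must detect multiple co-occurring pathologies and conventional models often fail to assess the uncertainty of their own predictions. The key to its solution is an uncertainty-aware diagnostic framework with two selective prediction mechanisms: entropy-based rejection and confidence interval-based rejection. These let the model abstain when it is uncertain and defer ambiguous cases to clinical experts. Rejection thresholds are tuned with a quantile-based calibration procedure using either a global or a class-specific strategy. Experiments show that the approach improves the trade-off between diagnostic accuracy and coverage, with entropy-based rejection achieving the highest average AUC across all pathologies, supporting its safe and practical use in AI-assisted clinical diagnostic workflows.

Link: https://arxiv.org/abs/2509.10348
Authors: Yehudit Aperstein, Amit Tzahar, Alon Gottlib, Tal Verber, Ravit Shagan Damti, Alexander Apartsin
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures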

Abstract:Overconfidence in deep learning models poses a significant risk in high-stakes medical imaging tasks, particularly in multi-label classification of chest X-rays, where multiple co-occurring pathologies must be detected simultaneously. This study introduces an uncertainty-aware framework for chest X-ray diagnosis based on a DenseNet-121 backbone, enhanced with two selective prediction mechanisms: entropy-based rejection and confidence interval-based rejection. Both methods enable the model to abstain from uncertain predictions, improving reliability by deferring ambiguous cases to clinical experts. A quantile-based calibration procedure is employed to tune rejection thresholds using either global or class-specific strategies. Experiments conducted on three large public datasets (PadChest, NIH ChestX-ray14, and MIMIC-CXR) demonstrate that selective rejection improves the trade-off between diagnostic accuracy and coverage, with entropy-based rejection yielding the highest average AUC across all pathologies. These results support the integration of selective prediction into AI-assisted diagnostic workflows, providing a practical step toward safer, uncertainty-aware deployment of deep learning in clinical settings.
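
Entropy-based rejection with quantile-calibrated thresholds might look roughly like the sketch below. The per-label binary entropy, the `quantile` value, and the use of -1 to mark an abstention are assumptions chosen for illustration, not details from the paper.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli probability, element-wise."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def fit_thresholds(val_probs, quantile=0.8, per_class=True):
    """Entropy thresholds from validation predictions (samples x classes)."""
    h = binary_entropy(val_probs)
    if per_class:
        return np.quantile(h, quantile, axis=0)          # class-specific strategy
    return np.full(val_probs.shape[1], np.quantile(h, quantile))  # global strategy

def predict_or_abstain(probs, thresholds):
    """Return 1/0 labels, or -1 where entropy exceeds the class threshold."""
    labels = (probs >= 0.5).astype(int)
    labels[binary_entropy(probs) > thresholds] = -1
    return labels

# Two hypothetical studies, two pathologies, fixed thresholds of 0.5 bits.
probs = np.array([[0.99, 0.55],
                  [0.40, 0.02]])
decisions = predict_or_abstain(probs, np.array([0.5, 0.5]))
```

Raising the quantile rejects fewer cases (higher coverage) at the cost of keeping more uncertain predictions, which is the accuracy-coverage trade-off the abstract refers to.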

[CV-68] Polarization Denoising and Demosaicking: Dataset and Baseline Method ICIP2025 WWW

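[Quick Read]: This paper addresses the joint handling of noise and demosaicking in division-of-focal-plane (DoFP) polarization imaging, that is, how to reconstruct high-quality polarization images from noisy inputs. Existing work focuses mostly on polarization demosaicking in the noise-free case and lacks a suitable evaluation dataset and a solid baseline method. The key to its solution is a new dataset covering 40 real-world scenes and three noise levels, together with a denoising-then-demosaicking method built from well-established signal processing components, providing a reproducible and strong baseline for polarization image reconstruction.

Link: https://arxiv.org/abs/2509.10098
Authors: Muhamad Daniel Ariff Bin Abdul Rahman, Yusuke Monno, Masayuki Tanaka, Masatoshi Okutomi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in ICIP2025; Project page: this http URL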

Abstract:A division-of-focal-plane (DoFP) polarimeter enables us to acquire images with multiple polarization orientations in one shot and thus it is valuable for many applications using polarimetric information. The image processing pipeline for a DoFP polarimeter entails two crucial tasks: denoising and demosaicking. While polarization demosaicking for a noise-free case has increasingly been studied, the research for the joint task of polarization denoising and demosaicking is scarce due to the lack of a suitable evaluation dataset and a solid baseline method. In this paper, we propose a novel dataset and method for polarization denoising and demosaicking. Our dataset contains 40 real-world scenes and three noise-level conditions, consisting of pairs of noisy mosaic inputs and noise-free full images. Our method takes a denoising-then-demosaicking approach based on well-accepted signal processing components to offer a reproducible method. Experimental results demonstrate that our method exhibits higher image reconstruction performance than other alternative methods, offering a solid baseline.

[CV-69] Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms

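[Quick Read]: This paper addresses the growing threat that branched broomrape (Phelipanche ramosa), a root-parasitic weed, poses to California's tomato industry. The parasite's underground life cycle makes early detection difficult, and conventional chemical controls are costly, environmentally risky, and of limited effectiveness. The key to its solution is combining drone-based multispectral remote sensing with a Long Short-Term Memory (LSTM) deep learning network, using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance, so that the parasite can be detected early from time-series multispectral features. The study was validated on a known infested field in Woodland, California, across five key growth stages defined by Growing Degree Days (GDD). The model that integrated all growth stages with SMOTE augmentation achieved 88.37% overall accuracy and 95.37% recall, clearly better than single-stage detection, demonstrating the feasibility of temporal multispectral analysis with LSTM for early broomrape detection in precision agriculture.

Link: https://arxiv.org/abs/2509.09972
Authors: Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Mohsen Mesgaran, Parastoo Farajpoor, Hamid Jafarbiglu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Author-accepted version (no publisher header/footer). 10 pages + presentation. Published in Proceedings of SPIE Defense + Commercial Sensing 2024, Vol. 13053, Paper 1305304. Event: National Harbor, Maryland, USA. Official version: this https URL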

Abstract:This study addresses the escalating threat of branched broomrape (Phelipanche ramosa) to California’s tomato industry, which supplies over 90 percent of U.S. processing tomatoes. The parasite’s largely underground life cycle makes early detection difficult, while conventional chemical controls are costly, environmentally harmful, and often ineffective. To address this, we combined drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep learning networks, using the Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance. Research was conducted on a known broomrape-infested tomato farm in Woodland, Yolo County, CA, across five key growth stages determined by growing degree days (GDD). Multispectral images were processed to isolate tomato canopy reflectance. At 897 GDD, broomrape could be detected with 79.09 percent overall accuracy and 70.36 percent recall without integrating later stages. Incorporating sequential growth stages with LSTM improved detection substantially. The best-performing scenario, which integrated all growth stages with SMOTE augmentation, achieved 88.37 percent overall accuracy and 95.37 percent recall. These results demonstrate the strong potential of temporal multispectral analysis and LSTM networks for early broomrape detection. While further real-world data collection is needed for practical deployment, this study shows that UAV-based multispectral sensing coupled with deep learning could provide a powerful precision agriculture tool to reduce losses and improve sustainability in tomato production.
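
SMOTE balances classes by creating synthetic minority samples on the line segments between a minority sample and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that core step; the function name, `k`, and the toy feature vectors are illustrative, and a real pipeline would more likely use the SMOTE implementation in the imbalanced-learn library.

```python
import numpy as np

def smote(minority, n_new, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest minority neighbours
    base = rng.integers(0, len(X), size=n_new)       # which sample to start from
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X[base] + gap * (X[pick] - X[base])

# Hypothetical reflectance features for five infested plants (minority class).
infested = np.array([[0.1, 0.8], [0.2, 0.7], [0.15, 0.9], [0.3, 0.75], [0.25, 0.85]])
synthetic = smote(infested, n_new=20, k=2)
```

For LSTM inputs, each plant's multi-stage feature sequence would presumably be flattened into one vector before interpolation and reshaped afterwards; that handling is an assumption, not something the abstract specifies.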

[CV-70] Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining

[Quick Read]: This paper addresses the performance limits of diffusion models for inverse problems (such as accelerated MRI reconstruction) that arise from poorly set data fidelity weights. Under fast sampling schedules (few denoising steps) and irregular timestep schedules, existing methods that rely on heuristics or fixed weights generalize poorly. The key to its solution is Zero-shot Adaptive Diffusion Sampling (ZADS), which tunes the fidelity weights at test time through self-supervised optimization using only the undersampled measurements. It adapts to arbitrary noise schedules without retraining the diffusion prior, improving reconstruction quality and robustness.

Link: https://arxiv.org/abs/2509.09880
Authors: Yaşar Utku Alçalar, Junno Yun, Mehmet Akçakaya
Affiliations: University of Minnesota
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2025

Abstract:Diffusion/score-based models have recently emerged as powerful generative priors for solving inverse problems, including accelerated MRI reconstruction. While their flexibility allows decoupling the measurement model from the learned prior, their performance heavily depends on carefully tuned data fidelity weights, especially under fast sampling schedules with few denoising steps. Existing approaches often rely on heuristics or fixed weights, which fail to generalize across varying measurement conditions and irregular timestep schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling (ZADS), a test-time optimization method that adaptively tunes fidelity weights across arbitrary noise schedules without requiring retraining of the diffusion prior. ZADS treats the denoising process as a fixed unrolled sampler and optimizes fidelity weights in a self-supervised manner using only undersampled measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS consistently outperforms both traditional compressed sensing and recent diffusion-based methods, showcasing its ability to deliver high-fidelity reconstructions across varying noise schedules and acquisition settings.
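
The role of a data fidelity weight is easiest to see in a single data-consistency step. The sketch below assumes a single-coil Cartesian MRI model (a binary k-space mask applied after an orthonormal 2D FFT) and treats the weight as a plain argument; in ZADS the weights are tuned per step at test time, which is not shown here.

```python
import numpy as np

def fidelity_step(x, y, mask, weight):
    """One gradient step on 0.5 * ||M F x - y||^2 with step size `weight`."""
    residual = mask * np.fft.fft2(x, norm="ortho") - y      # mismatch in k-space
    return x - weight * np.fft.ifft2(mask * residual, norm="ortho")

# Toy setup: an 8x8 "image", roughly half of k-space sampled.
rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 8))
mask = (rng.random((8, 8)) < 0.5).astype(float)
y = mask * np.fft.fft2(truth, norm="ortho")                 # undersampled measurements

x = rng.normal(size=(8, 8))                                 # stand-in for a denoiser output
x_full = fidelity_step(x, y, mask, weight=1.0)              # hard data consistency
x_half = fidelity_step(x, y, mask, weight=0.5)              # partial correction
```

With `weight=1.0` the sampled k-space locations are replaced exactly by the measurements, while smaller weights trust the prior more; choosing this balance at each denoising step is the tuning problem the paper targets.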

Artificial Intelligence

[AI-0] Mutual Information Tracks Policy Coherence in Reinforcement Learning

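[Quick Read]: This paper addresses the performance degradation that Reinforcement Learning (RL) agents suffer in real-world deployment from sensor faults, actuator wear, and environmental changes, which existing methods have no intrinsic mechanism to detect and diagnose. The key to its solution is an information-theoretic framework that analyzes patterns of state-action mutual information and reveals characteristic information signatures of successful learning. For example, the mutual information between states and actions grows from 0.84 bits to 2.83 bits (238% growth) even as state entropy rises, indicating that the agent increasingly focuses on task-relevant patterns. The joint mutual information between state, action, and next state, MI(S,A;S’), follows an inverted U-curve, peaking in early learning and then declining, which reflects a transition from broad exploration to efficient exploitation. More importantly, the framework can differentially diagnose system faults: observation-space noise (such as sensor faults) causes a broad collapse across all information channels and markedly weakens state-action coupling, whereas action-space noise (such as actuator faults) selectively disrupts action-outcome predictability while preserving state-action relationships. This diagnostic capability requires no architectural changes and incurs no performance loss, providing a basis for adaptive RL systems capable of autonomous fault detection and policy adjustment.

Link: https://arxiv.org/abs/2509.10423
Authors: Cameron Reid, Wael Hafez, Amirhossein Nazeri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 10 pages, 4 figures, 1 table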

Abstract:Reinforcement Learning (RL) agents deployed in real-world environments face degradation from sensor faults, actuator wear, and environmental shifts, yet lack intrinsic mechanisms to detect and diagnose these failures. We present an information-theoretic framework that both reveals the fundamental dynamics of RL and provides practical methods for diagnosing deployment-time anomalies. Through analysis of state-action mutual information patterns in a robotic control task, we first demonstrate that successful learning exhibits characteristic information signatures: mutual information between states and actions steadily increases from 0.84 to 2.83 bits (238% growth) despite growing state entropy, indicating that agents develop increasingly selective attention to task-relevant patterns. Intriguingly, the joint mutual information between states, actions, and next states, MI(S,A;S’), follows an inverted U-curve, peaking during early learning before declining as the agent specializes, suggesting a transition from broad exploration to efficient exploitation. More immediately actionable, we show that information metrics can differentially diagnose system failures: observation-space (i.e., state) noise (sensor faults) produces broad collapses across all information channels with pronounced drops in state-action coupling, while action-space noise (actuator faults) selectively disrupts action-outcome predictability while preserving state-action relationships. This differential diagnostic capability, demonstrated through controlled perturbation experiments, enables precise fault localization without architectural modifications or performance degradation. By establishing information patterns as both signatures of learning and diagnostics for system health, we provide the foundation for adaptive RL systems capable of autonomous fault detection and policy adjustment based on information-theoretic principles.
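
State-action mutual information can be estimated from logged trajectories once states and actions are discrete. The sketch below uses the plain plug-in estimator on paired samples; how the paper discretizes continuous robot states and which estimator it uses are not stated in the abstract, so this is only a generic illustration.

```python
import numpy as np

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    xs, ys = np.asarray(xs).ravel(), np.asarray(ys).ravel()
    _, xi = np.unique(xs, return_inverse=True)
    _, yi = np.unique(ys, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)                  # co-occurrence counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)            # marginal over states
    py = joint.sum(axis=0, keepdims=True)            # marginal over actions
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# A deterministic policy over four equally likely states: action = state.
states = [0, 1, 2, 3] * 25
mi_coupled = mutual_information_bits(states, states)

# Actions chosen independently of the state carry no information about it.
mi_independent = mutual_information_bits([0, 0, 1, 1], [0, 1, 0, 1])
```

Tracking such an estimate over training, or after injecting noise into observations or actions, is the kind of signal the diagnostic framework relies on.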

[AI-1] Diversified recommendations of cultural activities with personalized determinantal point processes RECSYS

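[Quick Read]: This paper addresses an industry challenge for recommender systems: how to diversify recommendations effectively while improving user engagement, without harming core business metrics. The key to its solution is sampling recommendations with a personalized Determinantal Point Process (DPP), using a quality-diversity decomposition of the similarity kernel to give more weight to user preferences, so that recommendations become more diverse while remaining relevant.

Link: https://arxiv.org/abs/2509.10392
Authors: Carole Ibrahim, Hiba Bederina, Daniel Cuesta, Laurent Montier, Cyrille Delabre, Jill-Jênn Vie
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 7 pages, accepted at RecSys workshop RecSoGood 2025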

Abstract:While optimizing recommendation systems for user engagement is a well-established practice, effectively diversifying recommendations without negatively impacting core business metrics remains a significant industry challenge. In line with our initiative to broaden our audience’s cultural practices, this study investigates using personalized Determinantal Point Processes (DPPs) to sample diverse and relevant recommendations. We rely on a well-known quality-diversity decomposition of the similarity kernel to give more weight to user preferences. In this paper, we present our implementations of the personalized DPP sampling, evaluate the trade-offs between relevance and diversity through both offline and online metrics, and give insights for practitioners on their use in a production environment. For the sake of reproducibility, we release the full code for our platform and experiments on GitHub.
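
The quality-diversity decomposition writes the DPP kernel as L = diag(q) S diag(q), where q holds per-item quality scores and S is an item similarity matrix; the probability of a subset is proportional to the determinant of the corresponding submatrix of L. The sketch below assumes cosine similarity for S and toy quality values; in a personalized setting q would come from a user-specific relevance model, which is not shown.

```python
import numpy as np

def dpp_kernel(quality, features):
    """L = diag(q) @ S @ diag(q), with S the cosine similarity of item features."""
    phi = features / np.linalg.norm(features, axis=1, keepdims=True)
    return quality[:, None] * (phi @ phi.T) * quality[None, :]

def subset_score(L, items):
    """Unnormalised DPP probability of a subset: det of the principal submatrix."""
    idx = np.asarray(items)
    return float(np.linalg.det(L[np.ix_(idx, idx)]))

# Three hypothetical activities: items 0 and 1 are near-duplicates, item 2 differs.
features = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
quality = np.array([1.0, 2.0, 1.0])
L = dpp_kernel(quality, features)

similar_pair = subset_score(L, [0, 1])
diverse_pair = subset_score(L, [0, 2])
```

Even though item 1 has the highest quality, the pair of dissimilar items scores far higher, which is how the determinant trades relevance against redundancy.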

[AI-2] Improving Audio Event Recognition with Consistency Regularization

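[Quick Read]: This paper addresses the limited model performance in Audio Event Recognition caused by scarce labeled data. Its core solution is Consistency Regularization (CR), which forces the model's predictions on different augmented views to agree, so that unlabeled data can be used to improve generalization. The main contribution is applying CR to AudioSet and validating it through extensive ablations in both small-scale (about 20k samples) and large-scale (about 1.8M samples) supervised training. For the small training set in particular, stronger augmentation brings further gains. In the semi-supervised setting (20K labeled samples plus 1.8M unlabeled samples), CR also improves on the purely supervised baseline.

Link: https://arxiv.org/abs/2509.10391
Authors: Shanmuka Sadhu, Weiran Wang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Under Review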

Abstract:Consistency regularization (CR), which enforces agreement between model predictions on augmented views, has found recent benefits in automatic speech recognition [1]. In this paper, we propose the use of consistency regularization for audio event recognition, and demonstrate its effectiveness on AudioSet. With extensive ablation studies for both small (~20k) and large (~1.8M) supervised training sets, we show that CR brings consistent improvement over supervised baselines which already heavily utilize data augmentation, and CR using stronger augmentation and multiple augmentations leads to additional gain for the small training set. Furthermore, we extend the use of CR into the semi-supervised setup with 20K labeled samples and 1.8M unlabeled samples, and obtain performance improvement over our best model trained on the small set.
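
A consistency-regularized objective adds a term that penalizes disagreement between predictions on two augmented views of the same clip. The sketch below assumes multi-label sigmoid outputs (as is usual for AudioSet), a squared-difference consistency term, and a scalar `weight`; the paper's actual divergence measure and weighting are not given in the abstract.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y, eps=1e-7):
    """Mean binary cross-entropy over all labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def cr_loss(logits_a, logits_b, labels=None, weight=1.0):
    """Supervised BCE on both views plus a squared-difference consistency term."""
    pa, pb = sigmoid(logits_a), sigmoid(logits_b)
    consistency = float(((pa - pb) ** 2).mean())
    if labels is None:                      # unlabeled batch: consistency only
        return weight * consistency
    return 0.5 * (bce(pa, labels) + bce(pb, labels)) + weight * consistency

# Logits for one clip under two augmentations, with two event classes.
view_a = np.array([[2.0, -1.0]])
view_b = np.array([[1.5, -0.5]])
labels = np.array([[1.0, 0.0]])

labeled_loss = cr_loss(view_a, view_b, labels)
unlabeled_loss = cr_loss(view_a, view_b)    # usable in the semi-supervised setup
```

Because the consistency term needs no labels, the same function covers the semi-supervised case: unlabeled clips contribute only the agreement penalty.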

[AI-3] Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

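[Quick read]: This paper examines how strongly contrastive learning for electrocardiogram (ECG) pretraining depends on the composition of the pretraining dataset, in particular how the distribution of demographics and health status affects downstream performance. The study finds that the distributional properties of the pretraining cohort significantly affect downstream prediction accuracy, and that multi-centre, demographically diverse pretraining improves in-distribution accuracy but reduces out-of-distribution (OOD) generalisation because the model encodes cohort-specific artifacts. To address this, the authors propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and improves OOD robustness, supporting the development of clinically fair and generalisable foundation models.

Link: https://arxiv.org/abs/2509.10369
Authors: Gul Rukh Khattak,Konstantinos Patlatzoglou,Joseph Barker,Libor Pastika,Boroumand Zeidaabadi,Ahmed El-Medany,Hesham Aggour,Yixiu Liang,Antonio H. Ribeiro,Jeffrey Annis,Antonio Luiz Pinho Ribeiro,Junbo Ge,Daniel B. Kramer,Jonathan W. Waks,Evan Brittain,Nicholas Peters,Fu Siong Ng,Arunashis Sau
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Tissues and Organs (q-bio.TO)
Comments: Currently under review at npj Digital Medicine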

Abstract:Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We present the Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain it on four cohorts (n = 5,203,352), from diverse populations across three continents (North America, South America, Asia). We systematically assess how cohort demographics, health status, and population diversity influence downstream performance on prediction tasks, which also include two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts. To address this, we propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances OOD robustness. This work provides important insights for developing clinically fair and generalisable foundation models.
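One way to read "preserves intra-cohort consistency during pretraining" is that every contrastive batch is drawn from a single cohort, so negatives never differ from the anchor merely by cohort-specific artifacts. The sketch below shows a batch sampler built on that reading; it is an interpretation of the abstract, not the authors' implementation, and the function name and toy cohort labels are hypothetical.

```python
import random
from collections import defaultdict

def in_distribution_batches(cohort_labels, batch_size, seed=0):
    """Return batches of sample indices, each drawn from exactly one cohort."""
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for index, cohort in enumerate(cohort_labels):
        by_cohort[cohort].append(index)
    batches = []
    for indices in by_cohort.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    # Shuffle batch order so training still alternates between cohorts.
    rng.shuffle(batches)
    return batches

# Toy example: 12 ECG recordings from three hypothetical cohorts.
cohorts = ["A"] * 5 + ["B"] * 4 + ["C"] * 3
batches = in_distribution_batches(cohorts, batch_size=2)
```

Each sample still appears once per epoch; only the composition of each batch changes relative to uniform shuffling over the pooled data.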

[AI-4] State Algebra for Propositional Logic

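[Quick read]: This paper addresses how to represent and manipulate propositional logic efficiently with algebraic methods while keeping the semantics clear. The key to the solution is a framework called State Algebra, which maps propositional logic into an algebraic setting through a hierarchy of three representations (Set, Coordinate, and Row Decomposition) and relies on an algebraic engine for computation. A central observation is that the default reduction of a state vector is not canonical, but a unique canonical form can be obtained by applying a fixed variable order during reduction. Giving up guaranteed canonicity buys flexibility and can yield more compact representations for certain classes of problems. The framework also provides tools for expressing both search-based and knowledge compilation algorithms, and extends naturally to probabilistic logic and Weighted Model Counting.

Link: https://arxiv.org/abs/2509.10326
Authors: Dmitry Lesnik,Tobias Schäfer
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 47 pages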

Abstract:This paper presents State Algebra, a novel framework designed to represent and manipulate propositional logic using algebraic methods. The framework is structured as a hierarchy of three representations: Set, Coordinate, and Row Decomposition. These representations anchor the system in well-known semantics while facilitating the computation using a powerful algebraic engine. A key aspect of State Algebra is its flexibility in representation. We show that although the default reduction of a state vector is not canonical, a unique canonical form can be obtained by applying a fixed variable order during the reduction process. This highlights a trade-off: by foregoing guaranteed canonicity, the framework gains increased flexibility, potentially leading to more compact representations of certain classes of problems. We explore how this framework provides tools to articulate both search-based and knowledge compilation algorithms and discuss its natural extension to probabilistic logic and Weighted Model Counting.

[AI-5] Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data

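[Quick read]: This paper targets two limitations of online reinforcement learning (RL) methods for the Job-Shop Scheduling Problem (JSP) and the Flexible Job-Shop Scheduling Problem (FJSP): they need large numbers of interactions with simulated environments that may not capture real-world industrial complexity, and random policy initialization makes them sample-inefficient. The authors propose a new offline RL algorithm, Conservative Discrete Quantile Actor-Critic (CDQAC), whose core idea is to couple a quantile-based critic with a delayed policy update, estimating the return distribution of each machine-operation pair rather than selecting pairs outright, so that high-quality scheduling policies can be learned from historical data. The method needs no costly online interaction, can improve upon low-quality training data, and is reported to be highly sample-efficient while outperforming the baselines.

Link: https://arxiv.org/abs/2509.10303
Authors: Jesse van Remmerden,Zaharah Bukhsh,Yingqian Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: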

Abstract:The Job-Shop Scheduling Problem (JSP) and Flexible Job-Shop Scheduling Problem (FJSP) are canonical combinatorial optimization problems with wide-ranging applications in industrial operations. In recent years, many online reinforcement learning (RL) approaches have been proposed to learn constructive heuristics for JSP and FJSP. Although effective, these online RL methods require millions of interactions with simulated environments that may not capture real-world complexities, and their random policy initialization leads to poor sample efficiency. To address these limitations, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL algorithm that learns effective scheduling policies directly from historical data, eliminating the need for costly online interactions, while maintaining the ability to improve upon suboptimal training data. CDQAC couples a quantile-based critic with a delayed policy update, estimating the return distribution of each machine-operation pair rather than selecting pairs outright. Our extensive experiments demonstrate CDQAC’s remarkable ability to learn from diverse data sources. CDQAC consistently outperforms the original data-generating heuristics and surpasses state-of-the-art offline and online RL baselines. In addition, CDQAC is highly sample efficient, requiring only 10-20 training instances to learn high-quality policies. Surprisingly, we find that CDQAC performs better when trained on data generated by a random heuristic than when trained on higher-quality data from genetic algorithms and priority dispatching rules.
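A quantile-based critic is usually trained with the quantile Huber loss from distributional RL. The sketch below shows that standard loss for one machine-operation pair; the abstract does not give CDQAC's exact objective, so this is an assumption, and the conservative penalty and delayed policy update are omitted. The loss is averaged over both axes here for readability.

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile Huber loss between N predicted quantiles and M return samples.

    pred_quantiles: shape (N,), critic outputs for quantile midpoints tau_i.
    target_samples: shape (M,), samples of the target return distribution.
    """
    n = len(pred_quantiles)
    taus = (np.arange(n) + 0.5) / n  # quantile midpoints, e.g. 0.125, 0.375, ...
    # u[i, j] = target_j - prediction_i
    u = target_samples[None, :] - pred_quantiles[:, None]
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight: over-estimates and under-estimates cost tau vs (1 - tau).
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return float(np.mean(weight * huber / kappa))
```

Because the critic outputs a return distribution rather than a single value, the policy can rank machine-operation pairs by a summary of that distribution, such as its mean.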

[AI-6] The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis

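[Quick read]: This paper asks how machine decision-making can be aligned with human moral values, and in particular which values state-of-the-art large language models (LLMs) implicitly favour when confronted with moral dilemmas and what influences those preferences. Its two central questions are which moral values LLMs favour, and how model architecture, cultural origin, and explainability affect these preferences. The approach is a large-scale quantitative experiment in which six LLMs rank and score outcomes across 18 representative moral dilemmas. Care and Virtue values are consistently rated most moral, while libertarian choices are consistently penalized. Reasoning-enabled models show greater sensitivity to context and give richer explanations, whereas non-reasoning models produce more uniform but less transparent judgments. The authors conclude that explainability and cultural awareness should be core design principles for guiding human-AI collaboration toward a transparent, aligned, and sustainable future.

Link: https://arxiv.org/abs/2509.10297
Authors: Eoin O’Doherty,Nicole Weinrauch,Andrew Talone,Uri Klempner,Xiaoyuan Yi,Xing Xie,Yi Zeng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in progress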

Abstract:Artificial intelligence (AI) is advancing at a pace that raises urgent questions about how to align machine decision-making with human moral values. This working paper investigates how leading AI systems prioritize moral outcomes and what this reveals about the prospects for human-AI symbiosis. We address two central questions: (1) What moral values do state-of-the-art large language models (LLMs) implicitly favour when confronted with dilemmas? (2) How do differences in model architecture, cultural origin, and explainability affect these moral preferences? To explore these questions, we conduct a quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks. Our findings uncover strikingly consistent value biases. Across all models, outcomes reflecting Care and Virtue values were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models exhibited greater sensitivity to context and provided richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. This research makes three contributions: (i) Empirically, it delivers a large-scale comparison of moral reasoning across culturally distinct LLMs; (ii) Theoretically, it links probabilistic model behaviour with underlying value encodings; (iii) Practically, it highlights the need for explainability and cultural awareness as critical design principles to guide AI toward a transparent, aligned, and symbiotic future.

[AI-7] We Need a New Ethics for a World of AI Agents

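[Quick read]: This paper addresses the emerging questions of safety, human-machine relationships, and social coordination raised by the spread of capable AI agents. The core challenge is ensuring that interactions between humans and agents, and among agents themselves, remain broadly beneficial. The proposed way forward is greater engagement and systematic reflection by scientists, scholars, engineers, and policymakers on the consequences of widespread AI agent deployment, with interdisciplinary collaboration toward an agent ecosystem that is safe, controllable, and consistent with social values.

Link: https://arxiv.org/abs/2509.10289
Authors: Iason Gabriel,Geoff Keeling,Arianna Manzini,James Evans
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 6 pages, no figures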

Abstract:The deployment of capable AI agents raises fresh questions about safety, human-machine relationships and social coordination. We argue for greater engagement by scientists, scholars, engineers and policymakers with the implications of a world increasingly populated by AI agents. We explore key challenges that must be addressed to ensure that interactions between humans and agents, and among agents themselves, remain broadly beneficial.

[AI-8] Investigating Language Model Capabilities to Represent and Process Formal Knowledge: A Preliminary Study to Assist Ontology Engineering

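[Quick read]: This paper tries to address the weak performance of Small Language Models (SLMs) on reasoning tasks, especially as it limits their use in ontology engineering. The key to the approach is to introduce formal methods: logical problems are expressed in a more compact logical language instead of Natural Language (NL), and the effect on SLM performance on a predefined reasoning task is measured. The experiments indicate that this change of language preserves strong reasoning performance and offers a viable route to refining the role of SLMs in ontology construction.

Link: https://arxiv.org/abs/2509.10249
Authors: Hanna Abi Akl
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: accepted for the International Joint Conference on Rules and Reasoning (RuleML+RR) 2025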

Abstract:Recent advances in Language Models (LMs) have failed to mask their shortcomings, particularly in the domain of reasoning. This limitation impacts several tasks, most notably those involving ontology engineering. As part of PhD research, we investigate the consequences of incorporating formal methods on the performance of Small Language Models (SLMs) on reasoning tasks. Specifically, we aim to orient our work toward using SLMs to bootstrap ontology construction and set up a series of preliminary experiments to determine the impact of expressing logical problems with different grammars on the performance of SLMs on a predefined reasoning task. Our findings show that it is possible to substitute Natural Language (NL) with a more compact logical language while maintaining a strong performance on reasoning tasks, and we hope to use these results to further refine the role of SLMs in ontology engineering.

[AI-9] Compartmentalised Agentic Reasoning for Clinical NLI

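[Quick read]: This paper addresses the limited reliability of Large Language Models (LLMs) in Clinical Natural Language Inference (Clinical NLI), which stems from reasoning that is neither structured nor auditable. A common assumption holds that scaling data and parameters alone yields more structured and generalisable internal representations; the paper questions this assumption and provides empirical evidence by introducing the CARENLI (Compartmentalised Agentic Reasoning for Clinical NLI) framework. The key to the solution is separating knowledge access from principled inference: CARENLI builds family-specific solvers for four reasoning families (Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction), and a planner, verifier, and refiner jointly carry out a traceable reasoning procedure. This improves fidelity by up to 42 points, reaching 98.0% in Causal Attribution, and identifies routing errors as the main remaining bottleneck. The design moves models from reliance on heuristics toward explicit, structured reasoning, offering a framework for safer, auditable clinical inference.

Link: https://arxiv.org/abs/2509.10222
Authors: Maël Jullien,Lei Xu,Marco Valentino,André Freitas
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: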

Abstract:A common assumption holds that scaling data and parameters yields increasingly structured, generalisable internal representations. We interrogate this assumption in clinical natural language inference (NLI) by adopting a benchmark decomposed into four reasoning families, Causal Attribution, Compositional Grounding, Epistemic Verification, and Risk State Abstraction, and introducing CARENLI, a Compartmentalised Agentic Reasoning for Clinical NLI that separates knowledge access from principled inference. CARENLI routes each premise, statement pair to a family specific solver and enforces auditable procedures via a planner, verifier, and refiner. Across four LLMs, CARENLI improves fidelity by up to 42 points, reaching 98.0% in Causal Attribution and 81.2% in Risk State Abstraction. Verifiers flag violations with near-ceiling reliability, while refiners correct a substantial share of epistemic errors. Remaining failures cluster in routing, identifying family classification as the main bottleneck. These results show that LLMs often retain relevant facts but default to heuristics when inference is underspecified, a dissociation CARENLI makes explicit while offering a framework for safer, auditable reasoning.

[AI-10] Openness in AI and downstream governance: A global value chain approach

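[Quick read]: This paper asks the following question: given that the AI industry is highly concentrated in a small number of “big tech” firms, in a landscape of data power and platform capitalism, can the growing “openness” in AI (open-source models, free datasets, and toolchains) support technology transfer and catch-up by latecomer countries or firms, and how is this openness embedded in global AI value chains and their governance? The key to the approach is to conceptualise openness in AI as a distinct type of interfirm relation and to apply value chain analysis, showing how “outsourcing” by foundational AI firms shapes control and governance in downstream application ecosystems. This extends the understanding of AI as a productive sector, and suggests that openness may generate spillovers stemming from the competition for global technological leadership.

Link: https://arxiv.org/abs/2509.10220
Authors: Christopher Foster
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: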

Abstract:The rise of AI has been rapid, becoming a leading sector for investment and promising disruptive impacts across the economy. Within the critical analysis of the economic impacts, AI has been aligned to the critical literature on data power and platform capitalism - further concentrating power and value capture amongst a small number of “big tech” leaders. The equally rapid rise of openness in AI (here taken to be claims made by AI firms about openness, “open source” and free provision) signals an interesting development. It highlights an emerging ecosystem of open AI models, datasets and toolchains, involving massive capital investment. It poses questions as to whether open resources can support technological transfer and the ability for catch-up, even in the face of AI industry power. This work seeks to add conceptual clarity to these debates by conceptualising openness in AI as a unique type of interfirm relation and therefore amenable to value chain analysis. This approach then allows consideration of the capitalist dynamics of “outsourcing” of foundational firms in value chains, and consequently the types of governance and control that might emerge downstream as AI is adopted. This work, therefore, extends previous mapping of AI value chains to build a framework which links foundational AI with downstream value chains. Overall, this work extends our understanding of AI as a productive sector. While the work remains critical of the power of leading AI firms, openness in AI may lead to potential spillovers stemming from the intense competition for global technological leadership in AI.
ACM classes: K.4.1; K.4.3. Submission history: v1, Fri, 12 Sep 2025 13:12:09 UTC (464 KB), from Christopher Foster.

[AI-11] Towards Fully Automated Molecular Simulations: Multi-Agent Framework for Simulation Setup and Force Field Extraction

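[Quick read]: This paper addresses the efficiency bottleneck in automated characterization of porous materials caused by the complexity of simulation setup and the difficulty of force field selection. The key to the solution is a multi-agent framework based on Large Language Models (LLMs), in which agents autonomously understand a characterization task, plan the simulation workflow, extract force field parameters from the literature, run RASPA simulations automatically, and interpret the results to guide subsequent steps, aiming at automated materials characterization with high correctness and reproducibility.

Link: https://arxiv.org/abs/2509.10210
Authors: Marko Petković,Vlado Menkovski,Sofía Calero
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: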

Abstract:Automated characterization of porous materials has the potential to accelerate materials discovery, but it remains limited by the complexity of simulation setup and force field selection. We propose a multi-agent framework in which LLM-based agents can autonomously understand a characterization task, plan appropriate simulations, assemble relevant force fields, execute them and interpret their results to guide subsequent steps. As a first step toward this vision, we present a multi-agent system for literature-informed force field extraction and automated RASPA simulation setup. Initial evaluations demonstrate high correctness and reproducibility, highlighting this approach’s potential to enable fully autonomous, scalable materials characterization.

[AI-12] Online Robust Planning under Model Uncertainty: A Sample-Based Approach

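[Quick read]: This paper addresses degraded performance and unsafe behavior in online planning caused by model uncertainty, especially when the generative model is learned from limited data. Existing methods such as Sparse Sampling and Monte Carlo Tree Search (MCTS) approximate optimal actions effectively but are sensitive to model error. The authors propose Robust Sparse Sampling (RSS), whose core is to use Sample Average Approximation (SAA) to compute a robust value function, giving a tractable robust policy in online settings with finite-sample theoretical guarantees. RSS applies to infinite or continuous state spaces, and its sample and computational complexities are independent of the size of the state space.

Link: https://arxiv.org/abs/2509.10162
Authors: Tamir Shazman,Idan Lev-Yehudi,Ron Benchetit,Vadim Indelman
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: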

Abstract:Online planning in Markov Decision Processes (MDPs) enables agents to make sequential decisions by simulating future trajectories from the current state, making it well-suited for large-scale or dynamic environments. Sample-based methods such as Sparse Sampling and Monte Carlo Tree Search (MCTS) are widely adopted for their ability to approximate optimal actions using a generative model. However, in practical settings, the generative model is often learned from limited data, introducing approximation errors that can degrade performance or lead to unsafe behaviors. To address these challenges, Robust MDPs (RMDPs) offer a principled framework for planning under model uncertainty, yet existing approaches are typically computationally intensive and not suited for real-time use. In this work, we introduce Robust Sparse Sampling (RSS), the first online planning algorithm for RMDPs with finite-sample theoretical performance guarantees. Unlike Sparse Sampling, which estimates the nominal value function, RSS computes a robust value function by leveraging the efficiency and theoretical properties of Sample Average Approximation (SAA), enabling tractable robust policy computation in online settings. RSS is applicable to infinite or continuous state spaces, and its sample and computational complexities are independent of the state space size. We provide theoretical performance guarantees and empirically show that RSS outperforms standard Sparse Sampling in environments with uncertain dynamics.
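The difference between a nominal and a robust sampled backup can be illustrated with a toy example. This is not RSS itself: the abstract does not specify the uncertainty set or the SAA formulation, so the sketch assumes a small finite set of candidate generative models and takes the worst case over them. All names and numbers are hypothetical.

```python
import random

def sampled_backup(sampler, reward, value, gamma, n_samples, rng):
    """Sample-average estimate of r + gamma * E[V(s')] under one generative model."""
    total = 0.0
    for _ in range(n_samples):
        next_state = sampler(rng)
        total += reward + gamma * value(next_state)
    return total / n_samples

def robust_backup(samplers, reward, value, gamma, n_samples, rng):
    """Worst-case backup over a finite set of candidate models (assumed uncertainty set)."""
    return min(sampled_backup(s, reward, value, gamma, n_samples, rng) for s in samplers)

# Two candidate models of one action: next state is 1.0 (good) or 0.0 (bad).
def nominal_model(rng):
    return 1.0 if rng.random() < 0.9 else 0.0

def pessimistic_model(rng):
    return 1.0 if rng.random() < 0.6 else 0.0

def value(state):
    return state  # leaf value equals the state in this toy problem

rng = random.Random(0)
nominal_estimate = sampled_backup(nominal_model, 0.0, value, 0.95, 2000, rng)
robust_estimate = robust_backup([nominal_model, pessimistic_model], 0.0, value, 0.95, 2000, rng)
```

A planner that ranks actions by the robust estimate prefers actions whose value holds up under every candidate model, at the cost of being conservative when the nominal model is in fact accurate.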

[AI-13] BenchECG and xECG: a benchmark and baseline for ECG foundation models

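[Quick read]: This paper addresses the lack of a unified evaluation standard in electrocardiogram (ECG) representation learning: prior work often uses narrow task selections and inconsistent datasets, which makes fair comparison of models difficult. The key to the solution is BenchECG, a standardised benchmark comprising multiple publicly available ECG datasets and a variety of downstream tasks. The authors also propose xECG, an xLSTM-based model trained with SimDINOv2 self-supervised learning, which achieves the best BenchECG score among publicly available models and is the only one to perform strongly on all datasets and tasks, setting a new baseline for ECG foundation models.

Link: https://arxiv.org/abs/2509.10151
Authors: Riccardo Lunelli,Angus Nicolson,Samuel Martin Pröll,Sebastian Johannes Reinstadler,Axel Bauer,Clemens Dlaska
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 22 tables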

Abstract:Electrocardiograms (ECGs) are inexpensive, widely used, and well-suited to deep learning. Recently, interest has grown in developing foundation models for ECGs - models that generalise across diverse downstream tasks. However, consistent evaluation has been lacking: prior work often uses narrow task selections and inconsistent datasets, hindering fair comparison. Here, we introduce BenchECG, a standardised benchmark comprising a comprehensive suite of publicly available ECG datasets and versatile tasks. We also propose xECG, an xLSTM-based recurrent model trained with SimDINOv2 self-supervised learning, which achieves the best BenchECG score compared to publicly available state-of-the-art models. In particular, xECG is the only publicly available model to perform strongly on all datasets and tasks. By standardising evaluation, BenchECG enables rigorous comparison and aims to accelerate progress in ECG representation learning. xECG achieves superior performance over earlier approaches, defining a new baseline for future ECG foundation models.

[AI-14] Virtual Agent Economies

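[Quick read]: This paper addresses the following problem: with the rapid adoption of autonomous AI agents, a new economic layer is emerging in which transactions and coordination happen at scales and speeds beyond direct human oversight, which may bring challenges such as systemic economic risk and exacerbated inequality. The key to the approach is the “sandbox economy” framework, which analyses this emerging system along two dimensions: its origins (emergent vs. intentional) and its degree of separateness from the human economy (permeable vs. impermeable). The authors argue for the proactive design of steerable AI agent markets, including auction mechanisms for fair resource allocation, AI “mission economies” organised around collective goals, and the socio-technical infrastructure needed for trust, safety, and accountability.

Link: https://arxiv.org/abs/2509.10147
Authors: Nenad Tomasev,Matija Franklin,Joel Z. Leibo,Julian Jacobs,William A. Cunningham,Iason Gabriel,Simon Osindero
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: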

Abstract:The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the “sandbox economy” as a framework for analyzing this emergent system, characterizing it along two key dimensions: its origins (emergent vs. intentional) and its degree of separateness from the established human economy (permeable vs. impermeable). Our current trajectory points toward a spontaneous emergence of a vast and highly permeable AI agent economy, presenting us with opportunities for an unprecedented degree of coordination as well as significant challenges, including systemic economic risk and exacerbated inequality. Here we discuss a number of possible design choices that may lead to safely steerable AI agent markets. In particular, we consider auction mechanisms for fair resource allocation and preference resolution, the design of AI “mission economies” to coordinate around achieving collective goals, and socio-technical infrastructure needed to ensure trust, safety, and accountability. By doing this, we argue for the proactive design of steerable agent markets to ensure the coming technological shift aligns with humanity’s long-term collective flourishing.

[AI-15] Efficient Learning-Based Control of a Legged Robot in Lunar Gravity

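[Quick read]: This paper addresses the difficulty of achieving efficient motion control for legged robots in low-gravity environments, where power and thermal budgets are tightly restricted. The key to the solution is a control approach based on Reinforcement Learning (RL) with gravity-scaled power-optimized reward functions, which lets the controllers transfer across gravity environments from lunar gravity (1.62 m/s²) to a hypothetical super-Earth (19.62 m/s²). With this approach, the team develops and validates a locomotion controller and a base pose controller. In Earth gravity, locomotion consumes 23.4 W, a 23% improvement over the baseline policy; in lunar gravity, consumption drops to 12.2 W, 36% less than a baseline not optimized for power. A constant-force spring offload system enables real-world experiments under emulated lunar gravity, supporting the scalability and practicality of the method across gravity levels.

Link: https://arxiv.org/abs/2509.10128
Authors: Philip Arm,Oliver Fischer,Joseph Church,Adrian Fuhrer,Hendrik Kolvenbach,Marco Hutter
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: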

Abstract:Legged robots are promising candidates for exploring challenging areas on low-gravity bodies such as the Moon, Mars, or asteroids, thanks to their advanced mobility on unstructured terrain. However, as planetary robots’ power and thermal budgets are highly restricted, these robots need energy-efficient control approaches that easily transfer to multiple gravity environments. In this work, we introduce a reinforcement learning-based control approach for legged robots with gravity-scaled power-optimized reward functions. We use our approach to develop and validate a locomotion controller and a base pose controller in gravity environments from lunar gravity (1.62 m/s²) to a hypothetical super-Earth (19.62 m/s²). Our approach successfully scales across these gravity levels for locomotion and base pose control with the gravity-scaled reward functions. The power-optimized locomotion controller reached a power consumption for locomotion of 23.4 W in Earth gravity on a 15.65 kg robot at 0.4 m/s, a 23 % improvement over the baseline policy. Additionally, we designed a constant-force spring offload system that allowed us to conduct real-world experiments on legged locomotion in lunar gravity. In lunar gravity, the power-optimized control policy reached 12.2 W, 36 % less than a baseline controller which is not optimized for power efficiency. Our method provides a scalable approach to developing power-efficient locomotion controllers for legged robots across multiple gravity levels.
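The reported power figures can be put in context with a short calculation. The sketch assumes that "improvement" means a fractional reduction relative to the baseline and that Earth gravity is 9.81 m/s² (consistent with the 19.62 m/s² super-Earth being twice Earth gravity). The baseline powers and the dimensionless cost of transport are derived here, not reported in the abstract.

```python
def implied_baseline(power_w, fractional_reduction):
    """Baseline power implied by a reported power and its fractional reduction."""
    return power_w / (1.0 - fractional_reduction)

def cost_of_transport(power_w, mass_kg, gravity, speed):
    """Dimensionless cost of transport: P / (m * g * v)."""
    return power_w / (mass_kg * gravity * speed)

# Earth gravity: 23.4 W, stated as a 23 % improvement over the baseline policy.
earth_baseline = implied_baseline(23.4, 0.23)   # about 30.4 W
# Lunar gravity: 12.2 W, stated as 36 % less than the baseline controller.
lunar_baseline = implied_baseline(12.2, 0.36)   # about 19.1 W
# Earth case only: mass and speed are given (15.65 kg at 0.4 m/s).
earth_cot = cost_of_transport(23.4, 15.65, 9.81, 0.4)  # about 0.38
```

The lunar cost of transport is not computed because the abstract does not state the walking speed for that experiment.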

[AI-16] AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework

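[Quick read]: This paper addresses the problem that current Artificial Intelligence (AI) risk assessment models focus too much on internal compliance and neglect diverse stakeholder perspectives and real-world consequences. Its core solution is AI Harmonics, a human-centric assessment paradigm that adapts to harm severity. The key innovation is a novel AI Harm Assessment Metric (AIH) that uses ordinal severity data to capture relative impact without requiring precise numerical estimates, combined with a data-driven, stakeholder-aware framework for identifying and prioritizing AI harms. Experiments indicate that political and physical harms show the highest concentration and warrant urgent mitigation, helping policymakers and organizations target their mitigation efforts.

Link: https://arxiv.org/abs/2509.10104
Authors: Sofia Vei,Paolo Giudici,Pavlos Sermpezis,Athena Vakali,Adelaide Emma Bernardelli
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: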

Abstract:The absolute dominance of Artificial Intelligence (AI) introduces unprecedented societal harms and risks. Existing AI risk assessment models focus on internal compliance, often neglecting diverse stakeholder perspectives and real-world consequences. We propose a paradigm shift to a human-centric, harm-severity adaptive approach grounded in empirical incident data. We present AI Harmonics, which includes a novel AI harm assessment metric (AIH) that leverages ordinal severity data to capture relative impact without requiring precise numerical estimates. AI Harmonics combines a robust, generalized methodology with a data-driven, stakeholder-aware framework for exploring and prioritizing AI harms. Experiments on annotated incident data confirm that political and physical harms exhibit the highest concentration and thus warrant urgent mitigation: political harms erode public trust, while physical harms pose serious, even life-threatening risks, underscoring the real-world relevance of our approach. Finally, we demonstrate that AI Harmonics consistently identifies uneven harm distributions, enabling policymakers and organizations to target their mitigation efforts effectively.

[AI-17] Generating Energy-Efficient Code via Large-Language Models – Where are we now?

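[Quick read]: This paper examines the effect of generative AI on the energy efficiency of generated code, specifically how the energy consumption of Python code produced by Large Language Models (LLMs) compares with code written by human developers and by a Green software expert. The approach is empirical: for 9 coding problems from the EvoEval benchmark, code is generated with 6 widely used LLMs and 4 prompting techniques, and energy consumption is measured on three hardware platforms (a server, a PC, and a Raspberry Pi) for a total of about 881 hours (36.7 days). The results show that although LLMs do well in some scenarios, none of them beats the code written by an experienced Green software developer, which highlights the current limits of LLMs in energy-efficiency optimization and the continued need for human expertise in sustainable software development.

Link: https://arxiv.org/abs/2509.10099
Authors: Radu Apsan,Vincenzo Stoico,Michel Albonico,Rudra Dhar,Karthik Vaidhyanathan,Ivano Malavolta
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: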

Abstract:Context. The rise of Large Language Models (LLMs) has led to their widespread adoption in development pipelines. Goal. We empirically assess the energy efficiency of Python code generated by LLMs against human-written code and code developed by a Green software expert. Method. We test 363 solutions to 9 coding problems from the EvoEval benchmark using 6 widespread LLMs with 4 prompting techniques, and compare them to human-developed solutions. Energy consumption is measured on three different hardware platforms: a server, a PC, and a Raspberry Pi for a total of ~881h (36.7 days). Results. Human solutions are 16% more energy-efficient on the server and 3% on the Raspberry Pi, while LLMs outperform human developers by 25% on the PC. Prompting does not consistently lead to energy savings, and the most energy-efficient prompts vary by hardware platform. The code developed by a Green software expert is consistently more energy-efficient by at least 17% to 30% against all LLMs on all hardware platforms. Conclusions. Even though LLMs exhibit relatively good code generation capabilities, no LLM-generated code was more energy-efficient than that of an experienced Green software developer, suggesting that as of today there is still a great need for human expertise in developing energy-efficient Python code.

[AI-18] Predictive Spike Timing Enables Distributed Shortest Path Computation in Spiking Neural Networks

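[Quick read]: This paper addresses how biological neural systems could compute shortest paths efficiently. Classical algorithms such as Dijkstra’s or A* depend on global state and biologically implausible backtracing, while reinforcement learning relies on slow gradient-based updates that are hard to reconcile with the rapid behavioral adaptation seen in natural systems. The key to the solution is a biologically plausible algorithm based on local spike-based message passing, which identifies nodes on optimal paths through the temporal coincidence of inhibitory-excitatory signal pairs: neurons that receive an inhibitory-excitatory pair earlier than predicted reduce their response delay, producing a temporal compression that propagates backwards from target to source, so that the network converges on all shortest paths using timing alone. The mechanism shows that short-term timing dynamics can suffice for complex computation, offers a new perspective on distributed computation in biological networks, and has possible implications for computational neuroscience, AI, reinforcement learning, and neuromorphic systems.

Link: https://arxiv.org/abs/2509.10077
Authors: Simen Storesund,Kristian Valset Aars,Robin Dietrich,Nicolai Waniek
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: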

Abstract:Efficient planning and sequence selection are central to intelligence, yet current approaches remain largely incompatible with biological computation. Classical graph algorithms like Dijkstra’s or A* require global state and biologically implausible operations such as backtracing, while reinforcement learning methods rely on slow gradient-based policy updates that appear inconsistent with rapid behavioral adaptation observed in natural systems. We propose a biologically plausible algorithm for shortest-path computation that operates through local spike-based message-passing with realistic processing delays. The algorithm exploits spike-timing coincidences to identify nodes on optimal paths: Neurons that receive inhibitory-excitatory message pairs earlier than predicted reduce their response delays, creating a temporal compression that propagates backwards from target to source. Through analytical proof and simulations on random spatial networks, we demonstrate that the algorithm converges and discovers all shortest paths using purely timing-based mechanisms. By showing how short-term timing dynamics alone can compute shortest paths, this work provides new insights into how biological networks might solve complex computational problems through purely local computation and relative spike-time prediction. These findings open new directions for understanding distributed computation in biological and artificial systems, with possible implications for computational neuroscience, AI, reinforcement learning, and neuromorphic systems.

[AI-19] TwinTac: A Wide-Range Highly Sensitive Tactile Sensor with Real-to-Sim Digital Twin Sensor Model IROS2025

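[Quick read]: This paper addresses the difficulty of training tactile perception policies in reinforcement-learning-driven robot skill acquisition, which is caused by the lack of simulation models for tactile sensors. The key solution is TwinTac, a system that pairs a physical tactile sensor designed for high sensitivity and a wide measurement range with a digital twin built using a real-to-sim approach: synchronized cross-domain data (finite element method results and the physical sensor’s outputs) are collected first, and neural networks are then trained to map simulated data to real sensor responses. The resulting simulation is consistent enough with the physical sensor that simulated tactile data can augment real-world data and improve task performance.

Link: https://arxiv.org/abs/2509.10063
Authors: Xiyan Huang,Zhe Xu,Chenxi Xiao
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 9 figures, 1 table, to be published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)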

Abstract:Robot skill acquisition processes driven by reinforcement learning often rely on simulations to efficiently generate large-scale interaction data. However, the absence of simulation models for tactile sensors has hindered the use of tactile sensing in such skill learning processes, limiting the development of effective policies driven by tactile perception. To bridge this gap, we present TwinTac, a system that combines the design of a physical tactile sensor with its digital twin model. Our hardware sensor is designed for high sensitivity and a wide measurement range, enabling high quality sensing data essential for object interaction tasks. Building upon the hardware sensor, we develop the digital twin model using a real-to-sim approach. This involves collecting synchronized cross-domain data, including finite element method results and the physical sensor’s outputs, and then training neural networks to map simulated data to real sensor responses. Through experimental evaluation, we characterized the sensitivity of the physical sensor and demonstrated the consistency of the digital twin in replicating the physical sensor’s output. Furthermore, by conducting an object classification task, we showed that simulation data generated by our digital twin sensor can effectively augment real-world data, leading to improved accuracy. These results highlight TwinTac’s potential to bridge the gap in cross-domain learning tasks.

[AI-20] XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph

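[Quick Read]: This paper addresses the problem that Multi-Agent Systems (MAS) produce misleading or incorrect outputs when handling highly complex tasks with uncertainty, because they lack effective task planning. The key to the solution is the XAgents framework, whose core innovation is dynamic task planning based on a multipolar task processing graph, combined with domain-specific IF-THEN rules that constrain agent behavior and global rules that strengthen inter-agent collaboration, improving the accuracy and robustness of complex task execution.

Link: https://arxiv.org/abs/2509.10054
Authors: Hailong Yang, Mingxian Gu, Jianqi Wang, Guanjin Wang, Zhaohong Deng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: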

Abstract:The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi-Agent Systems (MAS) in supporting humans with complex, real-world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi-agent cooperative framework built on a multipolar task processing graph and IF-THEN rules. XAgents uses the multipolar task processing graph to enable dynamic task planning and handle task uncertainty. During subtask processing, it integrates domain-specific IF-THEN rules to constrain agent behaviors, while global rules enhance inter-agent collaboration. We evaluate the performance of XAgents across three distinct datasets, demonstrating that it consistently surpasses state-of-the-art single-agent and multi-agent approaches in both knowledge-typed and logic-typed question-answering tasks. The codes for XAgents are available at: this https URL.

[AI-21] Exploring Expert Specialization through Unsupervised Training in Sparse Mixture of Experts

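[Quick Read]: This paper addresses a core challenge in deep learning interpretability: understanding the internal organization of neural networks. The key to its solution is a novel Sparse Mixture of Experts Variational Autoencoder (SMoE-VAE) architecture that discovers latent structure in the data automatically through unsupervised expert routing. Experiments show that unsupervised routing achieves better reconstruction performance than a supervised baseline guided by ground-truth labels, and that the experts identify meaningful sub-categorical structures that transcend human-defined class boundaries, revealing a data organization that is better aligned with the model's objective.

Link: https://arxiv.org/abs/2509.10025
Authors: Strahinja Nikolic, Ilker Oguz, Demetri Psaltis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 14 pages, 7 figures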

Abstract:Understanding the internal organization of neural networks remains a fundamental challenge in deep learning interpretability. We address this challenge by exploring a novel Sparse Mixture of Experts Variational Autoencoder (SMoE-VAE) architecture. We test our model on the QuickDraw dataset, comparing unsupervised expert routing against a supervised baseline guided by ground-truth labels. Surprisingly, we find that unsupervised routing consistently achieves superior reconstruction performance. The experts learn to identify meaningful sub-categorical structures that often transcend human-defined class boundaries. Through t-SNE visualizations and reconstruction analysis, we investigate how MoE models uncover fundamental data structures that are more aligned with the model’s objective than predefined labels. Furthermore, our study on the impact of dataset size provides insights into the trade-offs between data quantity and expert specialization, offering guidance for designing efficient MoE architectures.

[AI-22] GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method

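[Quick Read]: This paper addresses the security of private data handled by Large Language Models (LLMs) in Multi-Agent Systems (MAS). High-performance LLMs are usually deployed on public remote servers, so using them directly leaks sensitive information; privacy must therefore be protected without sacrificing task performance. The key to the solution is the General Anonymizing Multi-Agent system (GAMA), whose core mechanism divides the agents' workspace into a private space, which handles the original sensitive data, and a public space, which interacts using anonymized data only. To mitigate the semantic loss caused by anonymization, GAMA introduces two key modules, Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE), maintaining task performance while preserving privacy.

Link: https://arxiv.org/abs/2509.10018
Authors: Hailong Yang, Renhuo Zhao, Guanjin Wang, Zhaohong Deng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: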

Abstract:With the rapid advancement of Large Language Model (LLM), LLM-based agents exhibit exceptional abilities in understanding and generating natural language, facilitating human-like collaboration and information transmission in LLM-based Multi-Agent System (MAS). High-performance LLMs are often hosted on remote servers in public spaces. When tasks involve privacy data, MAS cannot securely utilize these LLMs without implementing privacy-preserving mechanisms. To address this challenge, we propose a General Anonymizing Multi-Agent system (GAMA), which divides the agents’ workspace into private and public spaces and protects privacy through the anonymizing mechanism. In the private space, agents handle sensitive data, while in the public space, only anonymized data is utilized. GAMA incorporates two key modules to mitigate semantic loss caused by anonymization: Domain-Rule-based Knowledge Enhancement (DRKE) and Disproof-based Logic Enhancement (DLE). We evaluate GAMA on two public question-answering datasets: Trivia Creative Writing and Logic Grid Puzzle. The results demonstrate that GAMA has superior performance compared to the state-of-the-art models. To further assess its privacy-preserving capabilities, we designed two new datasets: Knowledge Privacy Preservation and Logic Privacy Preservation. The final results highlight GAMA’s exceptional effectiveness in both task processing and privacy preservation.
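
The private/public split described in the abstract can be pictured with a toy sketch. It is only an illustration under stated assumptions: the placeholder format `<ENTITY_i>`, the helper names `anonymize`, `public_llm`, and `deanonymize`, and the explicit list of sensitive terms are all hypothetical and not taken from GAMA, and the DRKE and DLE modules are not modeled.

```python
def anonymize(text, sensitive_terms):
    """Private space: swap each sensitive term for a placeholder (hypothetical format)."""
    mapping = {}
    for index, term in enumerate(sensitive_terms):
        placeholder = f"<ENTITY_{index}>"
        mapping[placeholder] = term
        text = text.replace(term, placeholder)
    return text, mapping


def public_llm(prompt):
    """Stand-in for a remote model; it only ever receives anonymized text."""
    return f"Summary: {prompt}"


def deanonymize(text, mapping):
    """Private space: restore the original terms in the model's reply."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


# Made-up example data.
question = "Alice Chen visited Northside Clinic on Monday."
masked, mapping = anonymize(question, ["Alice Chen", "Northside Clinic"])
reply = deanonymize(public_llm(masked), mapping)
```

In this sketch the mapping never leaves the private side, which is the property the abstract attributes to the private space; how GAMA actually detects sensitive spans is not stated in the abstract.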

[AI-23] Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss

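[Quick Read]: This paper addresses the joint problem of estimating the intrinsic dimension of high-dimensional data and reconstructing it with high quality, particularly for datasets whose samples lie on linear or nonlinear manifolds. Conventional methods can often only estimate the dimension and cannot preserve the original structure, which limits downstream analysis or compression. The key to the solution is a new autoencoder architecture, the Intrinsic Dimension Estimating Autoencoder (IDEA), whose core innovation is a projected reconstruction loss term that continuously assesses reconstruction quality during training when one latent dimension is removed, guiding the model toward an optimal low-dimensional representation. IDEA also structures its latent space with re-weighted double CancelOut layers, enabling projection from high to low dimension and accurate reconstruction of the original data; experiments on theoretical benchmarks and physical simulation data show good accuracy and high versatility.

Link: https://arxiv.org/abs/2509.10011
Authors: Antoine Orioua, Philipp Krah, Julian Koellermeier
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Notes: Preprint with 12 pages and 12 figures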

Abstract:This paper introduces the Intrinsic Dimension Estimating Autoencoder (IDEA), which identifies the underlying intrinsic dimension of a wide range of datasets whose samples lie on either linear or nonlinear manifolds. Beyond estimating the intrinsic dimension, IDEA is also able to reconstruct the original dataset after projecting it onto the corresponding latent space, which is structured using re-weighted double CancelOut layers. Our key contribution is the introduction of the projected reconstruction loss term, guiding the training of the model by continuously assessing the reconstruction quality under the removal of an additional latent dimension. We first assess the performance of IDEA on a series of theoretical benchmarks to validate its robustness. These experiments allow us to test its reconstruction ability and compare its performance with state-of-the-art intrinsic dimension estimators. The benchmarks show good accuracy and high versatility of our approach. Subsequently, we apply our model to data generated from the numerical solution of a vertically resolved one-dimensional free-surface flow, following a pointwise discretization of the vertical velocity profile in the horizontal direction, vertical direction, and time. IDEA succeeds in estimating the dataset’s intrinsic dimension and then reconstructs the original solution by working directly within the projection space identified by the network.

[AI-24] Evaluation of Black-Box XAI Approaches for Predictors of Values of Boolean Formulae ECAI

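[Quick Read]: This paper addresses the challenge that the subjectivity of explanations poses for evaluating explainable AI (XAI) methods, focusing on variable-importance assessment for tabular data and models that predict the values of Boolean functions. The key to the solution is a formal and precise measure of variable importance based on actual causality, used as a baseline for evaluating state-of-the-art XAI tools. The authors also propose B-ReX, a new XAI tool based on the existing ReX tool, which outperforms other black-box XAI tools on a large-scale benchmark, achieving a Jensen-Shannon divergence of 0.072 ± 0.012 on random 10-valued Boolean formulae.

Link: https://arxiv.org/abs/2509.09982
Authors: Stav Armoni-Friedmann, Hana Chockler, David A. Kelly
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted to ECAI-EXCD Workshop, 8 pages, 2 figures, 5 tables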

Abstract:Evaluating explainable AI (XAI) approaches is a challenging task in general, due to the subjectivity of explanations. In this paper, we focus on tabular data and the specific use case of AI models predicting the values of Boolean functions. We extend the previous work in this domain by proposing a formal and precise measure of importance of variables based on actual causality, and we evaluate state-of-the-art XAI tools against this measure. We also present a novel XAI tool B-ReX, based on the existing tool ReX, and demonstrate that it is superior to other black-box XAI tools on a large-scale benchmark. Specifically, B-ReX achieves a Jensen-Shannon divergence of 0.072 ± 0.012 on random 10-valued Boolean formulae.
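
The abstract reports results as a Jensen-Shannon divergence. As a reminder of what that quantity measures, here is a minimal sketch that compares two variable-importance vectors after normalizing them into distributions. The logarithm base (2, so the value lies in [0, 1]), the normalization step, and the example scores are assumptions made for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def normalize(scores):
    """Turn non-negative importance scores into a probability distribution."""
    total = sum(scores)
    return [s / total for s in scores]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up importance scores for four variables.
truth = normalize([4, 2, 1, 1])      # reference importance
explainer = normalize([3, 3, 1, 1])  # importance reported by an XAI tool
score = js_divergence(truth, explainer)
```

Lower values mean the tool's importance ranking is closer to the reference; identical distributions give 0 and distributions with disjoint support give 1 under the base-2 convention assumed here.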

[AI-25] Securing LLM-Generated Embedded Firmware through AI Agent-Driven Validation and Patching

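[Quick Read]: This paper addresses the security vulnerabilities and unmet real-time performance requirements of embedded firmware generated by Large Language Models (LLMs). The key to the solution is a three-phase methodology. First, structured prompts drive an LLM (such as GPT-4) to generate firmware for networking and control tasks, deployed on FreeRTOS in a QEMU virtual environment. Second, fuzzing, static analysis, and runtime monitoring validate the firmware automatically, detecting vulnerabilities such as buffer overflows (CWE-120), race conditions (CWE-362), and denial-of-service threats (CWE-400). Third, specialized AI agents (Threat Detection, Performance Optimization, and Compliance Verification) collaborate to improve detection and remediation, and the CWE-categorized findings prompt targeted patches in an iterative closed loop. Experiments show that the method substantially improves firmware security and real-time performance.

Link: https://arxiv.org/abs/2509.09970
Authors: Seyed Moein Abtahi, Akramul Azim
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: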

Abstract:Large Language Models (LLMs) show promise in generating firmware for embedded systems, but often introduce security flaws and fail to meet real-time performance constraints. This paper proposes a three-phase methodology that combines LLM-based firmware generation with automated security validation and iterative refinement in a virtualized environment. Using structured prompts, models like GPT-4 generate firmware for networking and control tasks, deployed on FreeRTOS via QEMU. These implementations are tested using fuzzing, static analysis, and runtime monitoring to detect vulnerabilities such as buffer overflows (CWE-120), race conditions (CWE-362), and denial-of-service threats (CWE-400). Specialized AI agents for Threat Detection, Performance Optimization, and Compliance Verification collaborate to improve detection and remediation. Identified issues are categorized using CWE, then used to prompt targeted LLM-generated patches in an iterative loop. Experiments show a 92.4% Vulnerability Remediation Rate (37.3% improvement), 95.8% Threat Model Compliance, and 0.87 Security Coverage Index. Real-time metrics include 8.6 ms worst-case execution time and 195 μs jitter. This process enhances firmware security and performance while contributing an open-source dataset for future research.

[AI-26] Limited Reference Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes

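[Quick Read]: This paper addresses the limited performance of existing synthetic tabular data generation methods in domain-specific databases where reference data is scarce; in particular, prompt-based Large Language Models (LLMs) struggle to capture dataset-specific feature-label dependencies and produce redundant samples, which harms downstream task performance. The key to the solution is the ReFine framework, which first extracts symbolic "if-then" rules from interpretable models and embeds them in prompts to guide generation explicitly toward the domain-specific feature distribution, and then applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples, reducing distributional imbalance.

Link: https://arxiv.org/abs/2509.09960
Authors: Mingxuan Jiang, Yongxin Wang, Ziyue Dai, Yicun Liu, Hongyi Nie, Sen Liu, Hongfeng Chai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: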

Abstract:Synthetic tabular data generation is increasingly essential in data management, supporting downstream applications when real-world and high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs), diffusion models, and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific databases with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often fail to capture dataset-specific feature-label dependencies and generate redundant data, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) derives symbolic “if-then” rules from interpretable models and embeds them into prompts to explicitly guide generation toward domain-specific feature distribution, and (ii) applies a dual-granularity filtering strategy that suppresses over-sampling patterns and selectively refines rare but informative samples to reduce distributional imbalance. Extensive experiments on various regression and classification benchmarks demonstrate that ReFine consistently outperforms state-of-the-art methods, achieving up to 0.44 absolute improvement in R-squared for regression and 10.0 percent relative improvement in F1 score for classification tasks.

[AI-27] SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization

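[Quick Read]: This paper addresses two core problems in smart contract generation: models act as "black boxes" that lack explainability and are hard to audit, and the generated code commonly contains serious security vulnerabilities that can cause large financial losses. The key to the solution is a new framework, SmartCoder-R1, built in three stages: Continual Pre-training (CPT) specializes the model; Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on expert-validated reasoning-and-code samples trains it to emulate human security analysis; and Security-Aware Group Relative Policy Optimization (S-GRPO) refines the generation policy with a weighted reward signal covering compilation success, security compliance, and format correctness. On a benchmark of real-world functions the method outperforms 17 baselines, with a 45.79% relative improvement in FullRate, and its generated reasoning also receives high ratings in human evaluation.

Link: https://arxiv.org/abs/2509.09942
Authors: Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes: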

Abstract:Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. This challenge is amplified in Large Language Models (LLMs) by two interconnected failures: they operate as unauditable “black boxes” lacking a transparent reasoning process, and consequently, generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 (based on Qwen2.5-Coder-7B), a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the model. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 expert-validated reasoning-and-code samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 17 baselines on a benchmark of 756 real-world functions, SmartCoder-R1 establishes a new state of the art, achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
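
The abstract describes S-GRPO as optimizing a weighted reward over compilation success, security compliance, and format correctness. The sketch below shows one plausible shape of that computation: a weighted sum per sample, followed by the group-relative normalization that gives Group Relative Policy Optimization its name. The weights, the boolean checks, and the use of the population standard deviation are assumptions for illustration; the paper's actual reward design is not specified in the abstract.

```python
from statistics import mean, pstdev

# Hypothetical weights; the abstract does not state the real values.
WEIGHTS = {"compiles": 0.4, "secure": 0.4, "well_formatted": 0.2}


def reward(sample):
    """Weighted sum of pass/fail checks for one generated contract."""
    return sum(weight * float(sample[name]) for name, weight in WEIGHTS.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three made-up generations for the same prompt.
group = [
    {"compiles": True, "secure": True, "well_formatted": True},
    {"compiles": True, "secure": False, "well_formatted": True},
    {"compiles": False, "secure": False, "well_formatted": False},
]
rewards = [reward(sample) for sample in group]
advantages = group_relative_advantages(rewards)
```

Samples that beat their group's average receive a positive advantage and are reinforced; a group in which every sample scores the same yields zero advantages and therefore no learning signal.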

[AI-28] A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments

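[Quick Read]: This paper addresses the problem in Procedural Content Generation (PCG) of satisfying designer-specified objectives together with the adjacency constraints implicitly imposed by the underlying tile set. Traditional methods optimize jointly, adjusting global metrics and local tile placement within an evolutionary algorithm, but their performance degrades as task complexity grows. The key to the solution is to reformulate WaveFunctionCollapse (WFC) as a Markov Decision Process (MDP), so that WFC's propagation mechanism enforces the local constraints while an external optimization algorithm focuses solely on maximizing the global objective, decoupling the two. Experiments across multiple domains of varying difficulty show that this approach consistently outperforms joint optimization.

Link: https://arxiv.org/abs/2509.09919
Authors: Franklin Yiu, Mohan Lu, Nina Li, Kevin Joseph, Tianxu Zhang, Julian Togelius, Timothy Merino, Sam Earle
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: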

Abstract:Procedural content generation often requires satisfying both designer-specified objectives and adjacency constraints implicitly imposed by the underlying tile set. To address the challenges of jointly optimizing both constraints and objectives, we reformulate WaveFunctionCollapse (WFC) as a Markov Decision Process (MDP), enabling external optimization algorithms to focus exclusively on objective maximization while leveraging WFC’s propagation mechanism to enforce constraint satisfaction. We empirically compare optimizing this MDP to traditional evolutionary approaches that jointly optimize global metrics and local tile placement. Across multiple domains with various difficulties, we find that joint optimization not only struggles as task complexity increases, but consistently underperforms relative to optimization over the WFC-MDP, underscoring the advantages of decoupling local constraint satisfaction from global objective optimization.
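
To make the decoupling in the abstract concrete, here is a toy one-dimensional sketch: the state is the set of tiles still possible in each cell, an action collapses the next cell to one of its remaining tiles, and propagation prunes neighbors so every finished layout satisfies the adjacency rule. The tile set, the rule that water and grass may not touch, the cell-selection heuristic, the objective, and the random-search optimizer are all hypothetical choices for illustration, not the paper's setup.

```python
import random

TILES = ("water", "sand", "grass")
# Hypothetical adjacency rule: water and grass may never be neighbors.
ALLOWED = {(a, b) for a in TILES for b in TILES if {a, b} != {"water", "grass"}}


def propagate(domains):
    """Prune tiles that have no compatible option in a neighboring cell."""
    changed = True
    while changed:
        changed = False
        for i in range(len(domains)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(domains):
                    keep = {t for t in domains[i]
                            if any((t, u) in ALLOWED for u in domains[j])}
                    if keep != domains[i]:
                        domains[i] = keep
                        changed = True
    return domains


def reset(n):
    """Initial state: every tile is still possible in every cell."""
    return [set(TILES) for _ in range(n)]


def next_cell(domains):
    """Index of the open cell with the fewest options, or None when done."""
    open_cells = [i for i, d in enumerate(domains) if len(d) > 1]
    return min(open_cells, key=lambda i: len(domains[i])) if open_cells else None


def step(domains, tile):
    """MDP transition: collapse the next cell to `tile`, then propagate."""
    new = [set(d) for d in domains]
    new[next_cell(domains)] = {tile}
    return propagate(new)


def rollout(n, choose):
    """Run one episode; `choose` is the policy that picks among legal tiles."""
    domains = reset(n)
    while next_cell(domains) is not None:
        domains = step(domains, choose(sorted(domains[next_cell(domains)])))
    return [next(iter(d)) for d in domains]


def objective(layout):
    """Hypothetical designer goal: include both water and grass."""
    return min(layout.count("water"), layout.count("grass"))


# The outer optimizer (plain random search here) only looks at the objective.
rng = random.Random(0)
best = max((rollout(6, rng.choice) for _ in range(50)), key=objective)
```

The optimizer never checks adjacency itself: illegal tiles are removed from the action set by `propagate`, so any search method can be swapped in for the random search without touching the constraint logic.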

[AI-29] WALL: A Web Application for Automated Quality Assurance using Large Language Models

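[Quick Read]: This paper addresses the surge in the volume and variety of issues in code files as software projects grow more complex, and in particular how to detect, fix, and evaluate code defects efficiently. The key to the solution is WALL, a web application that integrates the SonarQube static analysis tool with Large Language Models (LLMs) such as GPT-3.5 Turbo and GPT-4o in three modules: an issue extraction tool, a code issues reviser, and a code comparison tool. Together they form a pipeline from issue detection through automated revision to evaluation of revision accuracy. Experiments show that a hybrid approach combining cost-effective and advanced LLMs reduces human effort substantially while maintaining high-quality revisions.

Link: https://arxiv.org/abs/2509.09918
Authors: Seyed Moein Abtahi, Akramul Azim
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: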

Abstract:As software projects become increasingly complex, the volume and variety of issues in code files have grown substantially. Addressing this challenge requires efficient issue detection, resolution, and evaluation tools. This paper presents WALL, a web application that integrates SonarQube and large language models (LLMs) such as GPT-3.5 Turbo and GPT-4o to automate these tasks. WALL comprises three modules: an issue extraction tool, code issues reviser, and code comparison tool. Together, they enable a seamless pipeline for detecting software issues, generating automated code revisions, and evaluating the accuracy of revisions. Our experiments, conducted on 563 files with over 7,599 issues, demonstrate WALL’s effectiveness in reducing human effort while maintaining high-quality revisions. Results show that employing a hybrid approach of cost-effective and advanced LLMs can significantly lower costs and improve revision rates. Future work aims to enhance WALL’s capabilities by integrating open-source LLMs and eliminating human intervention, paving the way for fully automated code quality management.

[AI-30] The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science

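[Quick Read]: This paper addresses the problem that coordinating distributed facilities and heterogeneous resources forces researchers to act as manual workflow coordinators rather than concentrate on science. The key to the solution is a conceptual framework in which workflows evolve along two dimensions, intelligence (from static to intelligent) and composition (from single to swarm), charting a path from current workflow management systems to fully autonomous, distributed scientific laboratories. The paper also presents an architectural blueprint intended to help the community move toward autonomous science, with the potential for 100x discovery acceleration and transformational scientific workflows.

Link: https://arxiv.org/abs/2509.09915
Authors: Woong Shin, Renan Souza, Daniel Rosendo, Frédéric Suter, Feiyi Wang, Prasanna Balaprakash, Rafael Ferreira da Silva
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: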

Abstract:Modern scientific discovery increasingly requires coordinating distributed facilities and heterogeneous resources, forcing researchers to act as manual workflow coordinators rather than scientists. Advances in AI leading to AI agents show exciting new opportunities that can accelerate scientific discovery by providing intelligence as a component in the ecosystem. However, it is unclear how this new capability would materialize and integrate in the real world. To address this, we propose a conceptual framework where workflows evolve along two dimensions which are intelligence (from static to intelligent) and composition (from single to swarm) to chart an evolutionary path from current workflow management systems to fully autonomous, distributed scientific laboratories. With these trajectories in mind, we present an architectural blueprint that can help the community take the next steps towards harnessing the opportunities in autonomous science with the potential for 100x discovery acceleration and transformational scientific workflows.

[AI-31] Tackling One Health Risks: How Large Language Models are leveraged for Risk Negotiation and Consensus-building

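[Quick Read]: This paper addresses the difficulty of tackling current global challenges, which are characterized by complex interdependencies, with conventional risk analysis frameworks: these frameworks reduce complex systems to manageable modules, creating silos between sectors and hindering coordination among stakeholders. The key to the solution is an AI-assisted negotiation framework that integrates Large Language Models (LLMs) and AI-based autonomous agents into a negotiation-centered risk analysis workflow. The framework lets stakeholders simulate negotiations, model dynamics systematically, anticipate compromises, and evaluate the impact of solutions, while the semantic analysis capabilities of LLMs mitigate information overload and support decision-making under time constraints.

Link: https://arxiv.org/abs/2509.09906
Authors: Alexandra Fetsch, Iurii Savvateev, Racem Ben Romdhane, Martin Wiedmann, Artemiy Dimov, Maciej Durkalec, Josef Teichmann, Jakob Zinsstag, Konstantinos Koutsoumanis, Andreja Rajkovic, Jason Mann, Mauro Tonolla, Monika Ehling-Schulz, Matthias Filter, Sophia Johler
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Notes: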

Abstract:Key global challenges of our times are characterized by complex interdependencies and can only be effectively addressed through an integrated, participatory effort. Conventional risk analysis frameworks often reduce complexity to ensure manageability, creating silos that hinder comprehensive solutions. A fundamental shift towards holistic strategies is essential to enable effective negotiations between different sectors and to balance the competing interests of stakeholders. However, achieving this balance is often hindered by limited time, vast amounts of information, and the complexity of integrating diverse perspectives. This study presents an AI-assisted negotiation framework that incorporates large language models (LLMs) and AI-based autonomous agents into a negotiation-centered risk analysis workflow. The framework enables stakeholders to simulate negotiations, systematically model dynamics, anticipate compromises, and evaluate solution impacts. By leveraging LLMs’ semantic analysis capabilities, we could mitigate information overload and augment the decision-making process under time constraints. Proof-of-concept implementations were conducted in two real-world scenarios: (i) prudent use of a biopesticide, and (ii) targeted wild animal population control. Our work demonstrates the potential of AI-assisted negotiation to address the current lack of tools for cross-sectoral engagement. Importantly, the solution’s open-source, web-based design suits application by a broader audience with limited resources and enables users to tailor and develop it for their own needs.

[AI-32] Self-Augmented Robot Trajectory: Efficient Imitation Learning via Safe Self-augmentation with Demonstrator-annotated Precision

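[Quick Read]: This paper addresses the low training efficiency of imitation learning for robot tasks, which results from costly and unsafe data collection. In clearance-limited tasks such as peg-in-hole, conventional methods rely on numerous demonstrations or random exploration, which leads to collisions and manual environment resets and adds to the human burden. The key to the solution is the Self-Augmented Robot Trajectory (SART) framework, which works in two stages: a human first provides a single demonstration and annotates precision boundaries, represented as spheres, around key waypoints; the robot then autonomously generates diverse, collision-free trajectories within these boundaries and reconnects them to the original demonstration. This expands the training dataset efficiently with minimal human intervention and substantially improves policy success rates.

Link: https://arxiv.org/abs/2509.09893
Authors: Hanbit Oh, Masaki Murooka, Tomohiro Motoda, Ryoichi Nakajo, Yukiyasu Domae
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: Under review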

Abstract:Imitation learning is a promising paradigm for training robot agents; however, standard approaches typically require substantial data acquisition – via numerous demonstrations or random exploration – to ensure reliable performance. Although exploration reduces human effort, it lacks safety guarantees and often results in frequent collisions – particularly in clearance-limited tasks (e.g., peg-in-hole) – thereby necessitating manual environmental resets and imposing additional human burden. This study proposes Self-Augmented Robot Trajectory (SART), a framework that enables policy learning from a single human demonstration, while safely expanding the dataset through autonomous augmentation. SART consists of two stages: (1) human teaching only once, where a single demonstration is provided and precision boundaries – represented as spheres around key waypoints – are annotated, followed by one environment reset; (2) robot self-augmentation, where the robot generates diverse, collision-free trajectories within these boundaries and reconnects to the original demonstration. This design improves the data collection efficiency by minimizing human effort while ensuring safety. Extensive evaluations in simulation and real-world manipulation tasks show that SART achieves substantially higher success rates than policies trained solely on human-collected demonstrations. Video results available at this https URL.
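
The spherical precision boundaries in stage (1) suggest a simple way to picture stage (2): draw alternative waypoints uniformly from the ball around each annotated waypoint. The sketch below does only that. The demonstration data, radii, and function name are hypothetical, and collision checking and reconnection to the original demonstration, which the abstract mentions, are not modeled.

```python
import math
import random


def sample_in_sphere(center, radius, rng):
    """Draw a point uniformly from the ball of the given radius around center."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(d * d for d in direction))
    # The cube root makes the samples uniform in volume, not bunched at the center.
    r = radius * rng.random() ** (1.0 / 3.0)
    return tuple(c + r * d / norm for c, d in zip(center, direction))


# Hypothetical demonstration: (waypoint xyz in metres, annotated radius in metres).
demo = [
    ((0.00, 0.00, 0.30), 0.050),  # free-space approach, loose tolerance
    ((0.20, 0.00, 0.10), 0.005),  # near the hole, tight tolerance
]

rng = random.Random(0)
augmented = [[sample_in_sphere(c, r, rng) for c, r in demo] for _ in range(5)]
```

A tight radius near the insertion point and a loose one in free space mirror the idea that the demonstrator, not the robot, decides where precision matters.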

[AI-33] From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem

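[Quick Read]: This paper addresses the lack of a data-driven understanding of hidden license conflicts in the open-source AI ecosystem: how often they occur, where they originate, and which communities are most affected. The key to its solution is an end-to-end audit of licenses for datasets and models on Hugging Face and their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects, together with a prototype extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts. The analysis finds that 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms.

Link: https://arxiv.org/abs/2509.09873
Authors: James Jewitt, Hao Li, Bram Adams, Gopi Krishnan Rajbahadur, Ahmed E. Hassan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: 9 pages, 4 figures, 5 tables, pre-print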

Abstract:Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance in which 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts, which can solve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.

[AI-34] SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints

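[Quick Read]: This paper addresses the problem that evaluations of AI systems for software engineering consider only solution accuracy and ignore effectiveness under resource constraints. Existing benchmarks such as SWE-bench do not measure efficiency in token consumption or time, so accurate but wasteful systems can be mistaken for good ones. The authors therefore propose the SWE-Effi metrics, which define effectiveness as the balance between outcome accuracy (for example, issue resolve rate) and the resources consumed (tokens and time). A key finding is that an AI system's effectiveness depends not only on the scaffold itself but also on how well it integrates with the base large language model (LLM); good integration yields strong performance at low resource cost. The study also identifies systematic challenges such as the "token snowball" effect and "expensive failures", in which agents keep consuming resources on tasks they cannot solve, limiting practical deployment and raising the cost of reinforcement learning training. Finally, the paper reports a clear trade-off between effectiveness under a token budget and effectiveness under a time budget, which matters for project cost control and scalable reinforcement learning.

Link: https://arxiv.org/abs/2509.09853
Authors: Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, Ahmed E. Hassan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: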

Abstract:The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct - it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., token and time). In this paper, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We found that AI system’s effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identified systematic challenges such as the “token snowball” effect and, more significantly, a pattern of “expensive failures”. In these cases, agents consume excessive resources while stuck on unsolvable tasks - an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade-off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.
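
The abstract defines effectiveness as a balance between outcome accuracy and resources consumed, and singles out "expensive failures". The sketch below shows two simple quantities in that spirit: the resolve rate when only runs within a budget count, and the share of total cost spent on unresolved tasks. Both formulas, the field names, and the run data are hypothetical illustrations; they are not the SWE-Effi metric definitions, which the abstract does not spell out.

```python
def resolve_rate_under_budget(runs, budget, key="tokens"):
    """Fraction of all tasks that were resolved without exceeding the budget."""
    return sum(r["resolved"] and r[key] <= budget for r in runs) / len(runs)


def failure_cost_share(runs, key="tokens"):
    """Share of the total cost that was spent on tasks left unresolved."""
    total = sum(r[key] for r in runs)
    return sum(r[key] for r in runs if not r["resolved"]) / total


# Made-up runs of one agent on four issues.
runs = [
    {"resolved": True, "tokens": 10_000, "seconds": 60},
    {"resolved": True, "tokens": 80_000, "seconds": 300},
    {"resolved": False, "tokens": 200_000, "seconds": 900},
    {"resolved": False, "tokens": 10_000, "seconds": 45},
]

rate_50k = resolve_rate_under_budget(runs, budget=50_000)
rate_unlimited = resolve_rate_under_budget(runs, budget=float("inf"))
wasted = failure_cost_share(runs)
```

In this made-up data the agent resolves half the issues with no budget but only a quarter within 50,000 tokens, and most of its tokens go to the failures. Passing `key="seconds"` gives the time-budget view of the same runs.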

[AI-35] Towards an AI-based knowledge assistant for goat farmers based on Retrieval-Augmented Generation

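[Quick Read]: This paper addresses the limited application of Large Language Models (LLMs) in livestock farming, specifically goat health management, where the main obstacles are the availability, diversity, and complexity of knowledge sources. The key to the solution is an intelligent knowledge assistant based on Retrieval-Augmented Generation (RAG) that uses two structured knowledge processing methods, table textualization and decision-tree textualization, to improve the LLM's understanding of heterogeneous data formats. The authors also build a domain-specific goat farming knowledge base spanning five key domains and integrate an online search module for real-time retrieval of up-to-date information, improving the model's cross-scenario generalization and accuracy.

Link: https://arxiv.org/abs/2509.09848
Authors: Nana Han, Dong Liu, Tomas Norton
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: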

Abstract:Large language models (LLMs) are increasingly being recognised as valuable knowledge communication tools in many industries. However, their application in livestock farming remains limited, constrained by several factors, not least the availability, diversity, and complexity of knowledge sources. This study introduces an intelligent knowledge assistant system designed to support health management in farmed goats. Building on Retrieval-Augmented Generation (RAG), the study proposes two structured knowledge processing methods, table textualization and decision-tree textualization, to enhance large language models’ (LLMs) understanding of heterogeneous data formats. Based on these methods, a domain-specific goat farming knowledge base was established to improve LLM’s capacity for cross-scenario generalization. The knowledge base spans five key domains: Disease Prevention and Treatment, Nutrition Management, Rearing Management, Goat Milk Management, and Basic Farming Knowledge. Additionally, an online search module is integrated to enable real-time retrieval of up-to-date information. To evaluate system performance, six ablation experiments were conducted to examine the contribution of each component. The results demonstrated that the heterogeneous knowledge fusion method achieved the best results, with mean accuracies of 87.90% on the validation set and 84.22% on the test set. Across the text-based, table-based, and decision-tree-based QA tasks, accuracy consistently exceeded 85%, validating the effectiveness of structured knowledge fusion within a modular design. Error analysis identified omission as the predominant error category, highlighting opportunities to further improve retrieval coverage and context integration. In conclusion, the results highlight the robustness and reliability of the proposed system for practical applications in goat farming.
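
Table textualization and decision-tree textualization, as named in the abstract, can be pictured as turning structured records into plain sentences that a retriever can index. The sketch below is one possible reading: the sentence templates, function names, and the placeholder table and tree are hypothetical, and the values are not real husbandry guidance.

```python
def textualize_table(caption, header, rows):
    """Turn each table row into one self-contained sentence."""
    sentences = []
    for row in rows:
        facts = "; ".join(f"{col} is {val}" for col, val in zip(header[1:], row[1:]))
        sentences.append(f"{caption}: for {header[0]} {row[0]}, {facts}.")
    return sentences


def textualize_tree(node, conditions=()):
    """Turn each root-to-leaf path of a decision tree into an IF-THEN sentence."""
    if isinstance(node, str):  # a leaf holds the recommended action
        return [f"If {' and '.join(conditions)}, then {node}."]
    sentences = []
    for answer, child in node["branches"].items():
        condition = f"{node['question']} is {answer}"
        sentences += textualize_tree(child, conditions + (condition,))
    return sentences


# Placeholder data, not real husbandry guidance.
table_text = textualize_table(
    "Feeding plan",
    ["stage", "ration A", "ration B"],
    [["one", "x1", "y1"], ["two", "x2", "y2"]],
)
tree_text = textualize_tree({
    "question": "symptom A",
    "branches": {
        "present": {
            "question": "symptom B",
            "branches": {"present": "follow protocol 1", "absent": "follow protocol 2"},
        },
        "absent": "no action",
    },
})
```

Each output sentence carries its own context (caption, row key, or full condition path), so a retrieved chunk stays interpretable even when it is separated from the rest of the table or tree.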

[AI-36] HGEN: Heterogeneous Graph Ensemble Networks IJCAI

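[Quick Read]: This paper addresses the challenges of ensemble learning on heterogeneous graphs, in particular the difficulty of combining different graph learners given the heterogeneity of node types, node features, and local neighborhood topology. The key to the solution is the HGEN framework, with two core components: 1) Allele Graph Neural Networks (GNNs) built from meta-paths combined with random dropping and calibrated across meta-paths by a residual-attention mechanism, so that node embeddings focus on more informative graphs and base learner accuracy improves; 2) a correlation-regularization term that enlarges the disparity among embedding matrices generated from different meta-paths, enriching base learner diversity. The method consistently outperforms state-of-the-art competitors on five heterogeneous networks.

Link: https://arxiv.org/abs/2509.09843
Authors: Jiajun Shen, Yufei Jin, Yi He, Xingquan Zhu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: The paper is in proceedings of the 34th IJCAI Conference, 2025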

Abstract:This paper presents HGEN, which pioneers ensemble learning for heterogeneous graphs. We argue that the heterogeneity in node types, nodal features, and local neighborhood topology poses significant challenges for ensemble learning, particularly in accommodating diverse graph learners. Our HGEN framework ensembles multiple learners through a meta-path and transformation-based optimization pipeline to uplift classification accuracy. Specifically, HGEN uses meta-path combined with random dropping to create Allele Graph Neural Networks (GNNs), whereby the base graph learners are trained and aligned for later ensembling. To ensure effective ensemble learning, HGEN presents two key components: 1) a residual-attention mechanism to calibrate allele GNNs of different meta-paths, thereby enforcing node embeddings to focus on more informative graphs to improve base learner accuracy, and 2) a correlation-regularization term to enlarge the disparity among embedding matrices generated from different meta-paths, thereby enriching base learner diversity. We analyze the convergence of HGEN and attest its higher regularization magnitude over simple voting. Experiments on five heterogeneous networks validate that HGEN consistently outperforms its state-of-the-art competitors by a substantial margin.

[AI-37] Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning

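[Quick Read]: This paper addresses why policy-based reinforcement learning methods such as SAC underperform value-based methods such as DQN in discrete-action environments. The key step is identifying the coupling between the actor and critic entropy as the main cause of the poor performance of discrete SAC (DSAC); decoupling the two components alone makes DSAC comparable to DQN. Building on this, the authors propose a flexible off-policy actor-critic framework that supports an m-step Bellman operator for the critic update and allows standard policy optimization methods to be combined with entropy regularization to construct the actor objective. The resulting methods provably converge to the optimal regularized value function in the tabular setting and empirically approach the performance of DQN on standard Atari games, even without entropy regularization or explicit exploration.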

Link: https://arxiv.org/abs/2509.09838
Authors: Reza Asad, Reza Babanezhad, Sharan Vaswani
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Value-based approaches such as DQN are the default methods for off-policy reinforcement learning with discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not effectively learn from off-policy data (e.g. PPO), or have poor empirical performance in the discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC can achieve performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
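
For readers unfamiliar with the m-step Bellman operator mentioned above, the following is a minimal sketch of the textbook m-step bootstrapped target. The function name and the sample numbers are assumptions; any entropy terms or off-policy corrections the paper may use are not shown, and `bootstrap_value` simply stands in for the critic's estimate at the state reached after m steps.

```python
def m_step_target(rewards, bootstrap_value, gamma):
    """Return sum_{k<m} gamma^k * r_k + gamma^m * V(s_m), with m = len(rewards)."""
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for r in reversed(rewards):
        target = r + gamma * target
    return target

# m = 1 recovers the usual one-step target r + gamma * V(s').
one_step = m_step_target([1.0], bootstrap_value=2.0, gamma=0.5)

# m = 3 uses three observed rewards before bootstrapping.
three_step = m_step_target([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
```

Larger m relies more on observed rewards and less on the critic's own estimate, which is the usual trade-off this kind of operator exposes.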

[AI-38] CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio

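[Quick Read]: This paper addresses two challenges that existing audio autoencoders face with compressed representations: supporting both continuous embeddings and discrete tokens, and achieving high compression ratios while preserving audio fidelity. The key to the solution is CoDiCodec, a new audio autoencoder that efficiently encodes global features through summary embeddings and combines Finite Scalar Quantization (FSQ) with a novel FSQ-dropout technique, so that a single trained model produces both continuous embeddings at about 11 Hz and discrete tokens at 2.38 kbps. It trains end to end without additional loss terms beyond a single consistency loss, and supports both autoregressive decoding and a new parallel decoding strategy, with the latter giving better audio quality and faster decoding.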

Link: https://arxiv.org/abs/2509.09836
Authors: Marco Pasini, Stefan Lattner, George Fazekas
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to ISMIR 2025

Abstract:Efficiently representing audio signals in a compressed latent space is critical for latent generative modelling. However, existing autoencoders often force a choice between continuous embeddings and discrete tokens. Furthermore, achieving high compression ratios while maintaining audio fidelity remains a challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes these limitations by both efficiently encoding global features via summary embeddings, and by producing both compressed continuous embeddings at ~ 11 Hz and discrete tokens at a rate of 2.38 kbps from the same trained model, offering unprecedented flexibility for different downstream generative tasks. This is achieved through Finite Scalar Quantization (FSQ) and a novel FSQ-dropout technique, and does not require additional loss terms beyond the single consistency loss used for end-to-end training. CoDiCodec supports both autoregressive decoding and a novel parallel decoding strategy, with the latter achieving superior audio quality and faster decoding. CoDiCodec outperforms existing continuous and discrete autoencoders at similar bitrates in terms of reconstruction audio quality. Our work enables a unified approach to audio compression, bridging the gap between continuous and discrete generative modelling paradigms.
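
As a rough illustration of Finite Scalar Quantization, the sketch below bounds each latent dimension and rounds it to a small set of integer levels, then packs the per-dimension codes into one token id. The level configuration, the restriction to odd level counts, and the function names are assumptions for illustration; the paper's FSQ-dropout and the straight-through gradient used when training FSQ are not shown.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of levels[i] integer values (odd counts only)."""
    codes = []
    for value, n in zip(z, levels):
        half = (n - 1) / 2                  # e.g. n = 5 gives integers in [-2, 2]
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(round(bounded))
    return codes

def codes_to_index(codes, levels):
    """Pack the per-dimension codes into a single token id (mixed-radix encoding)."""
    index = 0
    for code, n in zip(codes, levels):
        index = index * n + (code + (n - 1) // 2)
    return index

levels = [5, 5, 3]                           # assumed toy setup: 5 * 5 * 3 = 75 tokens
codes = fsq_quantize([0.1, -3.0, 0.9], levels)
token = codes_to_index(codes, levels)
```

The bounded but unrounded values play the role of a continuous representation, while the rounded codes give discrete tokens, which is one way to read the "both from the same model" claim in the abstract.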

[AI-39] SoilSound: Smartphone-based Soil Moisture Estimation

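[Quick Read]: This paper addresses the problem that existing soil moisture monitoring methods rely on invasive probes or specialized equipment, which limits wide adoption, especially in resource-limited settings. Its core solution is SoilSound, a smartphone-based acoustic sensing system that uses the phone's built-in speaker and microphone to record acoustic reflections from the soil during a vertical scan, and models acoustic reflection in soil through the surface roughness effect, enabling moisture estimation without calibration and without disturbing the soil. The key innovations are the use of reflection-based rather than the usual transmission-based acoustic sensing, and a lightweight convolutional neural network (CNN) that runs inference on the device with negligible computational, memory, or power overhead. Across multiple soil types and outdoor settings, the system achieves a mean absolute error (MAE) of 2.39%.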

Link: https://arxiv.org/abs/2509.09823
Authors: Yixuan Gao, Tanvir Ahmed, Shuang He, Zhongqi Cheng, Rajalakshmi Nandakumar
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 12 pages, 8 figures

Abstract:Soil moisture monitoring is essential for agriculture and environmental management, yet existing methods require either invasive probes disturbing the soil or specialized equipment, limiting access to the public. We present SoilSound, a ubiquitously accessible smartphone-based acoustic sensing system that can measure soil moisture without disturbing the soil. We leverage the built-in speaker and microphone to perform a vertical scan mechanism to accurately measure moisture without any calibration. Unlike existing work that uses transmissive properties, we propose an alternate model for acoustic reflections in soil based on the surface roughness effect to enable moisture sensing without disturbing the soil. The system works by sending acoustic chirps towards the soil and recording the reflections during a vertical scan, which are then processed and fed to a convolutional neural network for on-device soil moisture estimation with negligible computational, memory, or power overhead. We evaluated the system by training with curated soils in boxes in the lab and testing in the outdoor fields and show that SoilSound achieves a mean absolute error (MAE) of 2.39% across 10 different locations. Overall, the evaluation shows that SoilSound can accurately track soil moisture levels ranging from 15.9% to 34.0% across multiple soil types, environments, and users, without requiring any calibration or disturbing the soil, enabling widespread moisture monitoring for home gardeners, urban farmers, citizen scientists, and agricultural communities in resource-limited settings.

[AI-40] Towards a Common Framework for Autoformalization

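[Quick Read]: This paper addresses the fragmentation of current research on autoformalization and the lack of a unified framework. Although areas such as mathematical formalization, reasoning, planning, and knowledge representation all involve translating informal input into formal logical representations, they have largely developed independently, which makes methodologies, benchmarks, and theoretical frameworks hard to share and slows overall progress. The key contribution is a systematic review of explicit and implicit instances of autoformalization and a proposed unified framework intended to encourage exchange across fields and advance the next generation of AI systems.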

Link: https://arxiv.org/abs/2509.09810
Authors: Agnieszka Mensfelt, David Tena Cucala, Santiago Franco, Angeliki Koutsoukou-Argyraki, Vince Trencsenyi, Kostas Stathis
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Autoformalization has emerged as a term referring to the automation of formalization - specifically, the formalization of mathematics using interactive theorem provers (proof assistants). Its rapid development has been driven by progress in deep learning, especially large language models (LLMs). More recently, the term has expanded beyond mathematics to describe the broader task of translating informal input into formal logical representations. At the same time, a growing body of research explores using LLMs to translate informal language into formal representations for reasoning, planning, and knowledge representation - often without explicitly referring to this process as autoformalization. As a result, despite addressing similar tasks, the largely independent development of these research areas has limited opportunities for shared methodologies, benchmarks, and theoretical frameworks that could accelerate progress. The goal of this paper is to review - explicit or implicit - instances of what can be considered autoformalization and to propose a unified framework, encouraging cross-pollination between different fields to advance the development of next generation AI systems.

[AI-41] A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes

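[Quick Read]: This paper addresses the dependence of energy modeling research on large amounts of data, some of which is inaccessible, expensive, or raises privacy concerns. The key to the solution is a modular multimodal framework that uses generative AI to produce realistic, labeled data from publicly accessible residential information and images, reducing reliance on costly or restricted data sources and making research more accessible and reproducible.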

Link: https://arxiv.org/abs/2509.09794
Authors: Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 44 pages; 2 appendices; 9 figures; 1 table. Code available at this https URL

Abstract:Computational models have emerged as powerful tools for energy modeling research, touting scalability and quantitative results. However, these models require a plethora of data, some of which is inaccessible, expensive, or raises privacy concerns. We introduce a modular multimodal framework to produce this data from publicly accessible residential information and images using generative artificial intelligence (AI). Additionally, we provide a pipeline demonstrating this framework, and we evaluate its generative AI components. Our experiments show that our framework’s use of AI avoids common issues with generative models. Our framework produces realistic, labeled data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible and reproducible research.

[AI-42] How well can LLMs provide planning feedback in grounded environments?

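[Quick Read]: This paper addresses the dependence of learning to plan in grounded environments on carefully designed reward functions or high-quality annotated demonstrations. The key idea is to use pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), whose background knowledge can reduce the amount of reward design and demonstration data needed for policy learning. The study finds that foundation models can provide diverse, high-quality feedback across symbolic, language, and continuous control environments, including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. Larger and reasoning models give more accurate and less biased feedback and benefit more from enhanced inference methods such as in-context learning and chain-of-thought.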

Link: https://arxiv.org/abs/2509.09790
Authors: Yuxuan Li, Victor Zhong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstrations needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse high-quality feedback across domains. Moreover, larger and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades for environments with complex dynamics or continuous state spaces and action spaces.

[AI-43] ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version) CCS2025

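[Quick Read]: This paper targets backdoor attacks in Split Learning (SL), where malicious clients inject backdoors by sending poisoned intermediate gradients, and proposes ZORRO, a private, verifiable, and robust defense scheme. The key is the use of interactive zero-knowledge proofs (ZKPs), which let clients prove that they correctly executed a client-side defense algorithm, yielding proofs of computational integrity that attest to the benign nature of the locally trained model portions. In addition, the frequency representation of model partitions allows in-depth inspection of locally trained models in an untrusted environment, ensuring that each client forwards a benign checkpoint to the next client.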

Link: https://arxiv.org/abs/2509.09787
Authors: Nojan Sheybani, Alessandro Pegoraro, Jonathan Knauer, Phillip Rieger, Elissa Mollakuqe, Farinaz Koushanfar, Ahmad-Reza Sadeghi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Full version of CCS 2025 paper

Abstract:Split Learning (SL) is a distributed learning approach that enables resource-constrained clients to collaboratively train deep neural networks (DNNs) by offloading most layers to a central server while keeping input and output layers on the client side. This setup enables SL to leverage server computation capacities without sharing data, making it highly effective in resource-constrained environments dealing with sensitive data. However, the distributed nature enables malicious clients to manipulate the training process. By sending poisoned intermediate gradients, they can inject backdoors into the shared DNN. Existing defenses are limited by often focusing on server-side protection and introducing additional overhead for the server. A significant challenge for client-side defenses is forcing malicious clients to execute the defense algorithm correctly. We present ZORRO, a private, verifiable, and robust SL defense scheme. Through our novel design and application of interactive zero-knowledge proofs (ZKPs), clients prove their correct execution of a client-located defense algorithm, resulting in proofs of computational integrity attesting to the benign nature of locally trained DNN portions. Leveraging the frequency representation of model partitions enables ZORRO to conduct an in-depth inspection of the locally trained models in an untrusted environment, ensuring that each client forwards a benign checkpoint to its succeeding client. In our extensive evaluation, covering different model architectures as well as various attack strategies and data scenarios, we show ZORRO's effectiveness, as it reduces the attack success rate to less than 6% while causing an overhead of less than 10 seconds, even for models storing 1,000,000 parameters on the client side.

[AI-44] LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

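[Quick Read]: This paper addresses the performance bottleneck caused by the high memory consumption of the KV cache when large language models (LLMs) run inference over long contexts; existing cache compression methods are largely heuristic and lack dynamic budget allocation. The key to the solution is LAVa, a unified cache compression framework that guides compression decisions by minimizing information loss in the Transformer residual streams. Building on this, the authors derive a new metric from the layer attention output loss that allows cache entries to be compared across heads, giving layer-wise compression with dynamic head budgets, and obtain dynamic layer budgets by contrasting cross-layer information. The method unifies cache eviction and dynamic budget allocation without training or a combination of multiple strategies. The experiments indicate that dynamic layer budgets matter most for generation tasks (e.g., code completion), while dynamic head budgets are key for extraction tasks (e.g., extractive QA), so the method maintains top performance across task types.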

Link: https://arxiv.org/abs/2509.09754
Authors: Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at this https URL.
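
The idea of layer-wise eviction with dynamic head budgets can be sketched as follows: if cache entries carry scores that are comparable across heads, a layer can keep its top entries overall, and each head's budget falls out of that ranking. The scores below are hypothetical placeholders; the metric LAVa derives from the layer attention output loss is not reproduced here, and the function name is an assumption.

```python
def evict_layer(scores_per_head, layer_budget):
    """Keep the layer_budget highest-scoring cache entries across all heads of one layer.

    scores_per_head[h][t] is a (hypothetical) importance score for cached token t
    in head h. Returns, for each head, the sorted token positions that are kept.
    """
    pooled = [(score, h, t)
              for h, scores in enumerate(scores_per_head)
              for t, score in enumerate(scores)]
    pooled.sort(key=lambda item: item[0], reverse=True)

    kept = [[] for _ in scores_per_head]
    for _, h, t in pooled[:layer_budget]:
        kept[h].append(t)
    return [sorted(tokens) for tokens in kept]

scores = [
    [0.9, 0.8, 0.7, 0.1],    # head 0: several important entries
    [0.2, 0.05, 0.6, 0.1],   # head 1: mostly unimportant entries
]
kept = evict_layer(scores, layer_budget=4)
```

With a fixed per-head budget of two entries, head 0 would lose an entry scored 0.7 while head 1 kept one scored 0.2; pooling the ranking avoids that.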

[AI-45] Meta-Learning Reinforcement Learning for Crypto-Return Prediction

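[Quick Read]: This paper addresses the difficulty of predicting cryptocurrency returns: price movements are driven by fast-changing factors such as on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. The key to the solution is Meta-RL-Crypto, a unified Transformer-based architecture that combines meta-learning and reinforcement learning (RL). In a closed loop and without additional human supervision, the agent iteratively alternates between three roles (actor, judge, and meta-judge), refining both the trading policy and the evaluation criteria and thereby improving itself. The method can use multimodal market inputs and internal preference feedback, and it outperforms other large language model (LLM) based baselines on technical indicators across diverse market regimes.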

Link: https://arxiv.org/abs/2509.09751
Authors: Junqiao Wang, Zhaoyang Guan, Guanyu Liu, Tianze Xia, Xianzhi Li, Shuo Yin, Xinyuan Song, Chuhan Cheng, Tianyu Shi, Alex Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a unified transformer-based architecture that combines meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles (actor, judge, and meta-judge) in a closed-loop architecture. This learning process requires no additional human supervision. It can leverage multimodal market inputs and internal preference feedback. The agent in the system continuously refines both the trading policy and evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto performs well on the technical indicators of the real market and outperforms other LLM-based baselines.

[AI-46] D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference

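[Quick Read]: This paper addresses the difficulty of deploying multi-modal classification models in resource-constrained environments. Existing cross-modal transfer learning methods require paired sensor data (e.g., IMU, video, and audio) at both training and inference, which adds hardware redundancy and cost. The key to the solution is the Decoupled Cross-Attention Transfer (D-CAT) framework, which introduces a novel cross-attention alignment loss to align modality-specific representations without coupling the classification pipelines. Cross-modal knowledge transfer therefore does not require all sensor modalities at inference, which enables single-sensor inference while maintaining accuracy and reduces the hardware dependence of perception systems. This suits settings with variable sensor availability, such as assistive robots in homes.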

Link: https://arxiv.org/abs/2509.09747
Authors: Leen Daher, Zhaobo Wang, Malcolm Mielle
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically and technically usable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modality during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of sensors’ feature spaces without requiring the coupling of the classification pipelines of both modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model isn’t overfitted on the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at this https URL.

[AI-47] Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis

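[Quick Read]: This paper addresses the limited accuracy and interpretability of psychiatric diagnosis caused by the scarcity of labeled brain network data. Existing self-supervised learning (SSL) methods often rely on augmentation strategies that can disrupt the structural semantics of brain graphs, which limits model performance. The key to the solution is SAM-BG, a two-stage framework: in the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics; in the SSL stage, the extracted structural priors guide a structure-aware augmentation process, so the model learns more semantically meaningful and robust brain graph representations. This improves diagnostic performance in small-labeled-data settings and enhances clinical interpretability.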

Link: https://arxiv.org/abs/2509.09744
Authors: Mujie Liu, Chenze Wang, Liping Chen, Nguyen Linh Dan Le, Niharika Tewari, Ting Dang, Jiangang Ma, Feng Xia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The limited availability of labeled brain network data makes it challenging to achieve accurate and interpretable psychiatric diagnoses. While self-supervised learning (SSL) offers a promising solution, existing methods often rely on augmentation strategies that can disrupt crucial structural semantics in brain graphs. To address this, we propose SAM-BG, a two-stage framework for learning brain graph representations with structural semantic preservation. In the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics. In the SSL stage, the extracted structural priors guide a structure-aware augmentation process, enabling the model to learn more semantically meaningful and robust representations. Experiments on two real-world psychiatric datasets demonstrate that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability. Our code is available at this https URL.

[AI-48] Human-AI Collaboration Increases Efficiency in Regulatory Writing

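[Quick Read]: This paper addresses the time-intensive, expertise-dependent preparation of regulatory submissions such as Investigational New Drug (IND) applications, with the aim of accelerating early clinical development. The key to the solution is AutoIND, a large language model (LLM) platform that generates first drafts of the IND nonclinical written summaries (eCTD modules 2.6.2, 2.6.4, 2.6.6), cutting drafting time by about 97% relative to manual drafting. A blinded quality assessment found no critical regulatory errors, that is, none likely to alter regulatory interpretation, but identified systematic deficiencies in emphasis, conciseness, and clarity, which point to directions for model improvement.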

Link: https://arxiv.org/abs/2509.09738
Authors: Umut Eser, Yael Gozin, L. Jay Stallons, Ari Caroline, Martin Preusse, Brandon Rice, Scott Wright, Andrew Robertson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Background: Investigational New Drug (IND) application preparation is time-intensive and expertise-dependent, slowing early clinical development. Objective: To evaluate whether a large language model (LLM) platform (AutoIND) can reduce first-draft composition time while maintaining document quality in regulatory submissions. Methods: Drafting times for IND nonclinical written summaries (eCTD modules 2.6.2, 2.6.4, 2.6.6) generated by AutoIND were directly recorded. For comparison, manual drafting times for IND summaries previously cleared by the U.S. FDA were estimated from the experience of regulatory writers (≥ 6 years) and used as industry-standard benchmarks. Quality was assessed by a blinded regulatory writing assessor using seven pre-specified categories: correctness, completeness, conciseness, consistency, clarity, redundancy, and emphasis. Each sub-criterion was scored 0-3 and normalized to a percentage. A critical regulatory error was defined as any misrepresentation or omission likely to alter regulatory interpretation (e.g., incorrect NOAEL, omission of mandatory GLP dose-formulation analysis). Results: AutoIND reduced initial drafting time by ~97% (from ~100 h to 3.7 h for 18,870 pages/61 reports in IND-1; and to 2.6 h for 11,425 pages/58 reports in IND-2). Quality scores were 69.6% and 77.9% for IND-1 and IND-2. No critical regulatory errors were detected, but deficiencies in emphasis, conciseness, and clarity were noted. Conclusions: AutoIND can dramatically accelerate IND drafting, but expert regulatory writers remain essential to mature outputs to submission-ready quality. Systematic deficiencies identified provide a roadmap for targeted model improvements.
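
The headline reduction can be checked directly from the figures quoted in the abstract. The short calculation below assumes the approximate 100 h manual baseline applies to both submissions, which the abstract suggests but does not state separately for each.

```python
def reduction_percent(baseline_hours, new_hours):
    """Relative time saving, in percent, compared with the baseline."""
    return 100.0 * (baseline_hours - new_hours) / baseline_hours

baseline = 100.0                            # approximate manual drafting time (from the abstract)
ind1 = reduction_percent(baseline, 3.7)     # IND-1 drafted in 3.7 h
ind2 = reduction_percent(baseline, 2.6)     # IND-2 drafted in 2.6 h

print(f"IND-1: {ind1:.1f}%  IND-2: {ind2:.1f}%")
```

Both values, roughly 96.3% and 97.4%, are consistent with the reported reduction of about 97%.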

[AI-49] Wave-Based Semantic Memory with Resonance-Based Retrieval: A Phase-Aware Alternative to Vector Embedding Stores

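[Quick Read]: This paper addresses the limitations of conventional vector-based memory systems, which ignore phase information in semantic representation and therefore perform poorly on complex cases such as phase shifts, negations, and compositional queries. The key to the solution is Wave-Based Semantic Memory, a framework that models knowledge as complex-valued wave functions $\psi(x) = A(x)e^{i\phi(x)}$ and retrieves it through resonance-based interference, preserving both amplitude and phase information and making semantic similarity more expressive and robust.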

Link: https://arxiv.org/abs/2509.09691
Authors: Aleksandr Listopad
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 9 pages, 6 figures

Abstract:Conventional vector-based memory systems rely on cosine or inner product similarity within real-valued embedding spaces. While computationally efficient, such approaches are inherently phase-insensitive and limited in their ability to capture resonance phenomena crucial for meaning representation. We propose Wave-Based Semantic Memory, a novel framework that models knowledge as wave patterns $\psi(x) = A(x)e^{i\phi(x)}$ and retrieves it through resonance-based interference. This approach preserves both amplitude and phase information, enabling more expressive and robust semantic similarity. We demonstrate that resonance-based retrieval achieves higher discriminative power in cases where vector methods fail, including phase shifts, negations, and compositional queries. Our implementation, ResonanceDB, shows scalability to millions of patterns with millisecond latency, positioning wave-based memory as a viable alternative to vector stores for AGI-oriented reasoning and knowledge representation.
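
To see why phase can matter, the sketch below compares two patterns with identical amplitudes but opposite phase. A similarity computed on magnitudes alone cannot tell them apart, while a score based on the complex inner product can. The scoring rule used here (normalized real part of the inner product) is an assumption chosen for illustration; the abstract does not give ResonanceDB's actual resonance formula.

```python
import cmath
import math

def wave(amplitudes, phases):
    """Build a complex pattern psi_k = A_k * exp(i * phi_k)."""
    return [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]

def norm(p):
    return math.sqrt(sum(abs(a) ** 2 for a in p))

def amplitude_cosine(p, q):
    """Cosine similarity of the magnitudes only (phase is discarded)."""
    return sum(abs(a) * abs(b) for a, b in zip(p, q)) / (norm(p) * norm(q))

def interference_score(p, q):
    """Normalized real part of the complex inner product (assumed scoring rule)."""
    inner = sum(a * b.conjugate() for a, b in zip(p, q))
    return inner.real / (norm(p) * norm(q))

amps = [1.0, 0.5, 0.25]
phases = [0.0, 0.3, 1.0]
base = wave(amps, phases)
flipped = wave(amps, [p + math.pi for p in phases])   # same amplitudes, phase shifted by pi

same_by_amplitude = amplitude_cosine(base, flipped)   # close to +1
opposed_by_phase = interference_score(base, flipped)  # close to -1
```

In this toy case the amplitude-only view calls the two patterns identical, whereas the phase-aware score treats them as opposites, which is the kind of distinction the abstract attributes to phase shifts and negations.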

[AI-50] GeoGPT.RAG Technical Report

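[Quick Read]: This paper addresses the insufficient accuracy and weak contextual relevance of general-purpose large language models in the geosciences, which limits their reliable use in specialized research. The key to the solution is GeoGPT, an open large language model system for the geosciences, augmented with Retrieval Augmented Generation (RAG). It retrieves from a specialized corpus, the GeoGPT Library, and lets users upload their own publications to build personalized knowledge bases. To improve retrieval quality and domain alignment, the embedding model and the ranking model are fine-tuned, so the system can produce more accurate and trustworthy answers to geoscience questions.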

Link: https://arxiv.org/abs/2509.09686
Authors: Fei Huang, Fan Wu, Zeqing Zhang, Qihao Wang, Long Zhang, Grant Michael Boquet, Hongyang Chen
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 19 pages, 10 figures, 10 tables

Abstract:GeoGPT is an open large language model system built to advance research in the geosciences. To enhance its domain-specific capabilities, we integrated Retrieval Augmented Generation (RAG), which augments model outputs with relevant information retrieved from an external knowledge source. GeoGPT uses RAG to draw from the GeoGPT Library, a specialized corpus curated for geoscientific content, enabling it to generate accurate, context-specific answers. Users can also create personalized knowledge bases by uploading their own publication lists, allowing GeoGPT to retrieve and respond using user-provided materials. To further improve retrieval quality and domain alignment, we fine-tuned both the embedding model and a ranking model that scores retrieved passages by relevance to the query. These enhancements optimize RAG for geoscience applications and significantly improve the system's ability to deliver precise and trustworthy outputs. GeoGPT reflects a strong commitment to open science through its emphasis on collaboration, transparency, and community-driven development. As part of this commitment, we have open-sourced two core RAG components, GeoEmbedding and GeoReranker, to support geoscientists, researchers, and professionals worldwide with powerful, accessible AI tools.
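
The two-stage retrieval described above (embed and retrieve, then rerank) can be sketched with toy stand-ins. The bag-of-words "embedding", the term-coverage "reranker", and the sample passages are all assumptions for illustration; they are not GeoEmbedding, GeoReranker, or content from the GeoGPT Library.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k):
    """Stage 1: keep the k passages closest to the query in embedding space."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def rerank(query, candidates):
    """Stage 2: reorder candidates by a toy relevance score (query-term coverage)."""
    terms = set(query.lower().split())
    def coverage(passage):
        return len(terms & set(passage.lower().split())) / len(terms)
    return sorted(candidates, key=coverage, reverse=True)

passages = [
    "basalt is a volcanic rock rich in iron",
    "granite forms from slowly cooling magma",
    "rock music festival tickets on sale",
]
query = "volcanic rock"
top = rerank(query, retrieve(query, passages, k=2))
```

In a real system the top passages would then be placed in the model's prompt; the point of the second stage is that a more careful scorer only has to look at the few candidates the cheap first stage returns.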

[AI-51] TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

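[Quick Read]: This paper addresses the lack of high-quality multimodal conversational data for generative recommendation models in music recommendation, which limits their ability to understand user intent and interact in a personalized way. The key to the solution is TalkPlayData 2, a synthetic dataset produced by an agentic data pipeline in which multiple large language model (LLM) agents take on different roles to simulate diverse music recommendation conversations. The agents accept multimodal input (audio and images), and each conversation is conditioned on a finetuned conversation goal. LLM-as-a-judge and subjective evaluation experiments support the dataset's usefulness for training generative music recommendation models.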

Link: https://arxiv.org/abs/2509.09685
Authors: Keunwoo Choi, Seungheon Doh, Juhan Nam
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In the TalkPlayData 2 pipeline, multiple large language model (LLM) agents are created under various roles with specialized prompts and access to different parts of information, and the chat data is acquired by logging the conversation between the Listener LLM and the Recsys LLM. To cover various conversation scenarios, for each conversation, the Listener LLM is conditioned on a finetuned conversation goal. Finally, all the LLMs are multimodal with audio and images, allowing a simulation of multimodal recommendation and conversation. In the LLM-as-a-judge and subjective evaluation experiments, TalkPlayData 2 achieved the proposed goal in various aspects related to training a generative recommendation model for music. TalkPlayData 2 and its generation code are open-sourced at this https URL.

[AI-52] Forecasting Clicks in Digital Advertising: Multimodal Inputs and Interpretable Outputs

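[Quick Read]: This paper addresses the accuracy of click volume forecasting in digital advertising. Traditional time series models rely solely on numerical data and overlook the rich contextual information in text, such as keyword updates. The key to the solution is a multimodal forecasting framework that combines click data with textual logs from real-world ad campaigns and uses reinforcement learning to improve comprehension of the text and the fusion of modalities. On a large-scale industry dataset, the method outperforms baselines in both accuracy and reasoning quality, and it produces human-interpretable explanations alongside numeric predictions.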

Link: https://arxiv.org/abs/2509.09683
Authors: Briti Gangopadhyay, Zhao Wang, Shingo Takamatsu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Forecasting click volume is a key task in digital advertising, influencing both revenue and campaign strategy. Traditional time series models rely solely on numerical data, often overlooking rich contextual information embedded in textual elements, such as keyword updates. We present a multimodal forecasting framework that combines click data with textual logs from real-world ad campaigns and generates human-interpretable explanations alongside numeric predictions. Reinforcement learning is used to improve comprehension of textual information and enhance fusion of modalities. Experiments on a large-scale industry dataset show that our method outperforms baselines in both accuracy and reasoning quality.

[AI-53] AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings

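[Quick Read]: This paper addresses the time cost and efficiency bottleneck that researchers, funding bodies, and academic societies face in scholarly discovery as the literature grows rapidly. The key to the solution is an end-to-end automated system that goes from data discovery to direct action: a specialized AI agent, Agent-E, identifies conference papers from specific geographic regions and then uses Robotic Process Automation (RPA) to carry out a predefined action such as submitting a nomination form. On 586 papers from five conferences, the system achieved 100% recall and 99.4% accuracy, which highlights the potential of task-oriented AI agents to take an active part in and accelerate scholarly workflows.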

Link: https://arxiv.org/abs/2509.09470
Authors: Om Vishesh, Harshad Khadilkar, Deepak Akkil
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures

Abstract:Keeping pace with the rapid growth of academic literature presents a significant challenge for researchers, funding bodies, and academic societies. To address the time-consuming manual effort required for scholarly discovery, we present a novel, fully automated system that transitions from data discovery to direct action. Our pipeline demonstrates how a specialized AI agent, 'Agent-E', can be tasked with identifying papers from specific geographic regions within conference proceedings and then executing a Robotic Process Automation (RPA) to complete a predefined action, such as submitting a nomination form. We validated our system on 586 papers from five different conferences, where it successfully identified every target paper with a recall of 100% and a near-perfect accuracy of 99.4%. This demonstration highlights the potential of task-oriented AI agents to not only filter information but also to actively participate in and accelerate the workflows of the academic community.

[AI-54] Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective

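[Quick Read]: This paper addresses how standardized metadata can improve the AI-readiness of biomedical data, so that AI and machine learning (AI/ML) methods can be applied effectively to complex biomedical research. The key to the solution is to define and apply a set of multidimensional criteria covering FAIRness (findability, accessibility, interoperability, reusability), provenance, degree of characterization, explainability, sustainability, and computability, together with documentation of ethical data practices, so that datasets are well structured, semantically clear, and compliant when used for AI/ML modeling and prediction.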

Link: https://arxiv.org/abs/2509.10432
Authors: Harry Caufield, Satrajit Ghosh, Sek Wong Kong, Jillian Parker, Nathan Sheffield, Bhavesh Patel, Andrew Williams, Timothy Clark, Monica C. Munoz-Torres
Affiliation: Unknown
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments:

Abstract: AI-readiness describes the degree to which data may be optimally and ethically used for subsequent AI and Machine Learning (AI/ML) methods, where those methods may involve some combination of model training, data classification, and ethical, explainable prediction. The Bridge2AI consortium has defined the particular criteria a biomedical dataset may possess to render it AI-ready: in brief, a dataset’s readiness is related to its FAIRness, provenance, degree of characterization, explainability, sustainability, and computability, in addition to its accompaniment with documentation about ethical data practices. To ensure AI-readiness and to clarify data structure and relationships within Bridge2AI’s Grand Challenges (GCs), particular types of metadata are necessary. The GCs within the Bridge2AI initiative include four data-generating projects focusing on generating AI/ML-ready datasets to tackle complex biomedical and behavioral research problems. These projects develop standardized, multimodal data, tools, and training resources to support AI integration, while addressing ethical data practices. Examples include using voice as a biomarker, building interpretable genomic tools, modeling disease trajectories with diverse multimodal data, and mapping cellular and molecular health indicators across the human body. This report assesses the state of metadata creation and standardization in the Bridge2AI GCs, provides guidelines where required, and identifies gaps and areas for improvement across the program. New projects, including those outside the Bridge2AI consortium, would benefit from what we have learned about creating metadata as part of efforts to promote AI readiness.

[AI-55] Reinforcement learning for spin torque oscillator tasks

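[Quick read]: This paper addresses automatic synchronisation of the spintronic oscillator (STO): how reinforcement learning (RL) can drive an STO to synchronise with a target frequency within a fixed number of steps. The key to its solution is to simulate the STO by numerically solving the macrospin Landau-Lifschitz-Gilbert-Slonczewski equation and to train two types of RL agents on the synchronisation task; modifications to this base task improve both the convergence and the energy efficiency of the synchronisation, and these improvements are easy to achieve in the simulated environment.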

Link: https://arxiv.org/abs/2509.10057
Authors: Jakub Mojsiejuk,Sławomir Ziętek,Witold Skowroński
Affiliation: unknown
Categories: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 3 figures, 6 pages

Abstract: We address the problem of automatic synchronisation of the spintronic oscillator (STO) by means of reinforcement learning (RL). A numerical solution of the macrospin Landau-Lifschitz-Gilbert-Slonczewski equation is used to simulate the STO, and we train two types of RL agents to synchronise with a target frequency within a fixed number of steps. We explore modifications to this base task and show an improvement in both convergence and energy efficiency of the synchronisation that can be easily achieved in the simulated environment.

Machine Learning

[LG-0] Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration

Link: https://arxiv.org/abs/2509.10439
Authors: Ahmed Khaled,Satyen Kale,Arthur Douillard,Chi Jin,Rob Fergus,Manzil Zaheer
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract: Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in such settings but methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing this additional communication overhead. Local SGD consists of three parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses the aggregated updates from the nodes to produce a new model. While there exists an extensive literature on understanding the impact of hyperparameters in the local optimization process, the choice of outer optimizer and its hyperparameters is less clear. We study the role of the outer optimizer in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than 1. We extend our results to settings where we use momentum in the outer optimizer, and we show a similar role for the momentum-adjusted outer learning rate. We also study acceleration in the outer optimizer and show that it improves the convergence rate as a function of the number of communication rounds, improving upon the convergence rate of prior algorithms that apply acceleration locally. Finally, we also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory.
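
The three parts of Local SGD named in the abstract (local optimization, aggregation, outer optimizer) can be sketched on a toy problem. This is only an illustrative sketch, not the paper's algorithm or analysis: the function names, the quadratic per-worker objectives, and the noise-free gradients are all assumptions made here.

```python
def local_sgd(grad, x0, workers=4, rounds=20, local_steps=5,
              inner_lr=0.1, outer_lr=1.0):
    """Local SGD with a plain outer gradient step (no momentum or acceleration)."""
    x = x0
    for _ in range(rounds):
        deltas = []
        for w in range(workers):
            y = x  # every worker starts the round from the shared model
            for _ in range(local_steps):
                y -= inner_lr * grad(y, w)  # local optimization process
            deltas.append(y - x)
        avg_delta = sum(deltas) / workers  # aggregation mechanism
        x = x + outer_lr * avg_delta       # outer optimizer step
    return x


def toy_grad(x, w):
    """Gradient of worker w's objective 0.5 * (x - w)**2 (hypothetical toy problem)."""
    return x - float(w)


# With workers 0..3 the average objective is minimised at x = 1.5.
err_outer_1 = abs(local_sgd(toy_grad, 0.0, rounds=5, outer_lr=1.0) - 1.5)
err_outer_2 = abs(local_sgd(toy_grad, 0.0, rounds=5, outer_lr=2.0) - 1.5)
```

On this toy problem the inner phase is short and conservative, so an outer learning rate of 2 ends closer to the optimum after five rounds than an outer learning rate of 1, which is the kind of compensation for inner learning rate tuning that the abstract describes.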

[LG-1] Run-Time Monitoring of ERTMS/ETCS Control Flow by Process Mining

Link: https://arxiv.org/abs/2509.10419
Authors: Francesco Vitale,Tommaso Zoppi,Francesco Flammini,Nicola Mazzocca
Categories: Machine Learning (cs.LG)
Comments: Accepted to the 6th International Conference on Reliability, Safety, and Security of Railway Systems (RSSRail2025)

Abstract: Ensuring the resilience of computer-based railways is increasingly crucial to account for uncertainties and changes due to the growing complexity and criticality of those systems. Although their software relies on strict verification and validation processes following well-established best practices and certification standards, anomalies can still occur at run-time due to residual faults, system and environmental modifications that were unknown at design-time, or other emergent cyber-threat scenarios. This paper explores run-time control-flow anomaly detection using process mining to enhance the resilience of ERTMS/ETCS L2 (European Rail Traffic Management System / European Train Control System Level 2). Process mining allows learning the actual control flow of the system from its execution traces, thus enabling run-time monitoring through online conformance checking. In addition, anomaly localization is performed through unsupervised machine learning to link relevant deviations to critical system components. We test our approach on a reference ERTMS/ETCS L2 scenario, namely the RBC/RBC Handover, to show its capability to detect and localize anomalies with high accuracy, efficiency, and explainability.

[LG-2] Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

Link: https://arxiv.org/abs/2509.10406
Authors: Rupert Mitchell,Kristian Kersting
Categories: Machine Learning (cs.LG)

Abstract: We present Multipole Semantic Attention (MuSe), an efficient approximation of softmax attention that combines semantic clustering with multipole expansions from computational physics. Our method addresses the quadratic computational complexity of transformers in the context length by clustering queries and keys separately in their learned representation spaces, enabling a hierarchical two-stage attention mechanism. Unlike prior clustering approaches that group only keys or use unified clustering, we maintain separate clusterings that respect attention’s asymmetric treatment of these spaces. We augment centroid-based (monopole) approximations with dipole corrections that capture directional variance within clusters, preserving richer information during training. The method operates as a drop-in replacement for standard attention, requiring only hyperparameter specification without architectural modifications. Our approach achieves $\mathcal{O}(NCD)$ complexity for acausal attention with $C$ clusters and $\mathcal{O}(NCD \log N)$ for causal attention. On isolated attention layers, we demonstrate $3\times$ speedup over CUDNN Flash Attention at 8k context length, with relative squared errors below 20%. For causal attention, we develop a hierarchical block decomposition that combines exact local computation with efficient long-range approximation. In end-to-end pretraining of a 30M parameter model on book-length texts with 16k context, we achieve 12.2% runtime reduction with only 0.36% loss degradation, establishing the viability of multipole approximations for efficient transformer pretraining.

[LG-3] Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Link: https://arxiv.org/abs/2509.10396
Authors: Siyan Zhao,Mengchen Liu,Jing Huang,Miao Liu,Chenyu Wang,Bo Liu,Yuandong Tian,Guan Pang,Sean Bell,Aditya Grover,Feiyu Chen
Categories: Machine Learning (cs.LG)
Comments: preprint; 21 pages

Abstract:Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity–their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks–GSM8K, Math500, and AMC–achieving new state-of-the-art results for full-attention masked dLLMs.
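
The remark that exploration failures "cause zero advantages and gradients" in group-based methods can be illustrated with a small sketch. It assumes the common GRPO-style recipe of standardising rewards within a sampled group; the function name, the epsilon, and the reward values are hypothetical, and the sketch does not implement IGPO's inpainting step itself.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one sampled group (GRPO-style, assumed here)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sample in the group failed: all advantages are zero, so no learning signal.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One sample reaches a correct answer (for example after a partial hint):
# advantages are non-zero again, so the policy gradient is informative.
one_success = group_advantages([0.0, 0.0, 0.0, 1.0])
```

Any mechanism that raises the chance of at least one success per group, such as the hint insertion the abstract describes, turns the first case into the second.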

[LG-4] Vendi Information Gain for Active Learning and its Application to Ecology

Link: https://arxiv.org/abs/2509.10390
Authors: Quan Nguyen,Adji Bousso Dieng
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Populations and Evolution (q-bio.PE)

Abstract:While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning – a machine learning paradigm that selects the most informative data to label and train a predictive model – offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. Applied to the Snapshot Serengeti dataset, VIG achieves impressive predictive accuracy close to full supervision using less than 10% of the labels. It consistently outperforms standard baselines across metrics and batch sizes, collecting more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.

[LG-5] Flow Straight and Fast in Hilbert Space: Functional Rectified Flow

Link: https://arxiv.org/abs/2509.10384
Authors: Jianxin Zhang,Clayton Scott
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Many generative models originally developed in finite-dimensional Euclidean space have functional generalizations in infinite-dimensional settings. However, the extension of rectified flow to infinite-dimensional spaces remains unexplored. In this work, we establish a rigorous functional formulation of rectified flow in an infinite-dimensional Hilbert space. Our approach builds upon the superposition principle for continuity equations in an infinite-dimensional space. We further show that this framework extends naturally to functional flow matching and functional probability flow ODEs, interpreting them as nonlinear generalizations of rectified flow. Notably, our extension to functional flow matching removes the restrictive measure-theoretic assumptions in the existing theory of Kerrigan et al. (2024). Furthermore, we demonstrate experimentally that our method achieves superior performance compared to existing functional generative models.

[LG-6] Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective

Link: https://arxiv.org/abs/2509.10371
Authors: Seokjin Go,Joongun Park,Spandan More,Hanjiang Wu,Irene Wang,Aaron Jezghani,Tushar Krishna,Divya Mahajan
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper, we present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms, including NVIDIA H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various parallelism strategies – tensor, pipeline, data, and expert – and evaluate their effects on hardware utilization, power consumption, and thermal behavior. We further evaluate the effectiveness of optimizations such as activation recomputation and compute-communication overlap. Our findings show that performance is not determined solely by scaling hardware capacity. Scale-up systems with fewer, higher-memory GPUs can outperform scale-out systems in communication-bound regimes, but only under carefully tuned configurations; in other cases, scale-out deployments achieve superior throughput. We also show that certain parallelism combinations, such as tensor with pipeline, lead to bandwidth underutilization due to inefficient data chunking, while increasing microbatch sizes beyond a certain point induces bursty execution and peak power excursions that worsen thermal throttling. These insights reveal how training performance is shaped by complex interactions between hardware, system topology, and model execution. We conclude by offering recommendations for system and hardware design to improve the scalability and reliability of future LLM systems and workloads. The source code of this project is available at this https URL.

[LG-7] A Discrepancy-Based Perspective on Dataset Condensation

Link: https://arxiv.org/abs/2509.10367
Authors: Tong Chen,Raghavendra Selvan
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 4 tables, 1 figure

Abstract: Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i=1}^{N}$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j=1}^{M}$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
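
To make the idea of a discrepancy between the full dataset and a condensed set concrete, here is a minimal sketch using the maximum mean discrepancy (MMD) with an RBF kernel. MMD is chosen here only as a familiar example of a discrepancy; the abstract does not say which discrepancies the paper uses, and the data points, bandwidth, and function names are assumptions.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mmd_squared(T, S, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between point sets T and S."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, bandwidth) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(T, T) + mean_kernel(S, S) - 2.0 * mean_kernel(T, S)


# Hypothetical 1-D dataset with two clusters, and two candidate condensed sets.
T = [(0.0,), (0.1,), (0.2,), (2.0,), (2.1,), (2.2,)]
covers_both = [(0.1,), (2.1,)]   # one synthetic point per cluster
covers_one = [(0.1,), (0.2,)]    # both synthetic points in the same cluster

d_good = mmd_squared(T, covers_both)
d_bad = mmd_squared(T, covers_one)
```

The condensed set that covers both clusters has the smaller discrepancy to the full dataset, which matches the intuition of approximating the data distribution with a reduced set of points.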

[LG-8] Physics-informed sensor coverage through structure preserving machine learning

Link: https://arxiv.org/abs/2509.10363
Authors: Benjamin David Shaffer,Brooks Kinch,Joseph Klobusicky,M. Ani Hsieh,Nathaniel Trask
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:We present a machine learning framework for adaptive source localization in which agents use a structure-preserving digital twin of a coupled hydrodynamic-transport system for real-time trajectory planning and data assimilation. The twin is constructed with conditional neural Whitney forms (CNWF), coupling the numerical guarantees of finite element exterior calculus (FEEC) with transformer-based operator learning. The resulting model preserves discrete conservation, and adapts in real time to streaming sensor data. It employs a conditional attention mechanism to identify: a reduced Whitney-form basis; reduced integral balance equations; and a source field, each compatible with given sensor measurements. The induced reduced-order environmental model retains the stability and consistency of standard finite-element simulation, yielding a physically realizable, regular mapping from sensor data to the source field. We propose a staggered scheme that alternates between evaluating the digital twin and applying Lloyd’s algorithm to guide sensor placement, with analysis providing conditions for monotone improvement of a coverage functional. Using the predicted source field as an importance function within an optimal-recovery scheme, we demonstrate recovery of point sources under continuity assumptions, highlighting the role of regularity as a sufficient condition for localization. Experimental comparisons with physics-agnostic transformer architectures show improved accuracy in complex geometries when physical constraints are enforced, indicating that structure preservation provides an effective inductive bias for source identification.

[LG-9] ARMA Block: A CNN-Based Autoregressive and Moving Average Module for Long-Term Time Series Forecasting

Link: https://arxiv.org/abs/2509.10324
Authors: Myung Jin Kim,YeongHyeon Park,Il Dong Yun
Categories: Machine Learning (cs.LG)

Abstract:This paper proposes a simple yet effective convolutional module for long-term time series forecasting. The proposed block, inspired by the Auto-Regressive Integrated Moving Average (ARIMA) model, consists of two convolutional components: one for capturing the trend (autoregression) and the other for refining local variations (moving average). Unlike conventional ARIMA, which requires iterative multi-step forecasting, the block directly performs multi-step forecasting, making it easily extendable to multivariate settings. Experiments on nine widely used benchmark datasets demonstrate that our method ARMA achieves competitive accuracy, particularly on datasets exhibiting strong trend variations, while maintaining architectural simplicity. Furthermore, analysis shows that the block inherently encodes absolute positional information, suggesting its potential as a lightweight replacement for positional embeddings in sequential models.

[LG-10] Robot guide with multi-agent control and automatic scenario generation with LLM

Link: https://arxiv.org/abs/2509.10317
Authors: Elizaveta D. Moskovskaya,Anton D. Moscowsky
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 2 tables, 1 demo-video and repository link

Abstract:The work describes the development of a hybrid control architecture for an anthropomorphic tour guide robot, combining a multi-agent resource management system with automatic behavior scenario generation based on large language models. The proposed approach aims to overcome the limitations of traditional systems, which rely on manual tuning of behavior scenarios. These limitations include manual configuration, low flexibility, and lack of naturalness in robot behavior. The process of preparing tour scenarios is implemented through a two-stage generation: first, a stylized narrative is created, then non-verbal action tags are integrated into the text. The multi-agent system ensures coordination and conflict resolution during the execution of parallel actions, as well as maintaining default behavior after the completion of main operations, contributing to more natural robot behavior. The results obtained from the trial demonstrate the potential of the proposed approach for automating and scaling social robot control systems.

[LG-11] GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction

Link: https://arxiv.org/abs/2509.10308
Authors: Joshua Dimasaka,Christian Geiß,Robert Muir-Wood,Emily So
Categories: Machine Learning (cs.LG)
Comments: Accepted full paper at the 8th International Disaster and Risk Conference, IDRC 2025 | Keywords: weakly supervised, graph deep learning, categorical distribution, physical vulnerability, remote sensing, spatiotemporal disaster risk, transition matrix | The data and code are respectively available at this https URL and this https URL

Abstract:In the aftermath of disasters, many institutions worldwide face challenges in continually monitoring changes in disaster risk, limiting the ability of key decision-makers to assess progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a novel probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and prior expert belief systems. We introduce a weakly supervised first-order transition matrix that reflects the changes in the spatiotemporal distribution of physical vulnerability in two disaster-stricken and socioeconomically disadvantaged areas: (1) the cyclone-impacted coastal Khurushkul community in Bangladesh and (2) the mudslide-affected city of Freetown in Sierra Leone. Our work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.

[LG-12] Proof of AutoML: SDN based Secure Energy Trading with Blockchain in Disaster Case

Link: https://arxiv.org/abs/2509.10291
Authors: Salih Toprak,Muge Erel-Ozcevik
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 3 figures, 7th International Conference on Blockchain Computing and Applications (BCCA 2025), ©2025 IEEE

Abstract:In disaster scenarios where conventional energy infrastructure is compromised, secure and traceable energy trading between solar-powered households and mobile charging units becomes a necessity. To ensure the integrity of such transactions over a blockchain network, robust and unpredictable nonce generation is vital. This study proposes an SDN-enabled architecture where machine learning regressors are leveraged not for their accuracy, but for their potential to generate randomized values suitable as nonce candidates. Therefore, it is newly called Proof of AutoML. Here, SDN allows flexible control over data flows and energy routing policies even in fragmented or degraded networks, ensuring adaptive response during emergencies. Using a 9000-sample dataset, we evaluate five AutoML-selected regression models - Gradient Boosting, LightGBM, Random Forest, Extra Trees, and K-Nearest Neighbors - not by their prediction accuracy, but by their ability to produce diverse and non-deterministic outputs across shuffled data inputs. Randomness analysis reveals that Random Forest and Extra Trees regressors exhibit complete dependency on randomness, whereas Gradient Boosting, K-Nearest Neighbors and LightGBM show strong but slightly lower randomness scores (97.6%, 98.8% and 99.9%, respectively). These findings highlight that certain machine learning models, particularly tree-based ensembles, may serve as effective and lightweight nonce generators within blockchain-secured, SDN-based energy trading infrastructures resilient to disaster conditions.

[LG-13] Targeted Test Selection Approach in Continuous Integration

Link: https://arxiv.org/abs/2509.10279
Authors: Pavel Plyusnin,Aleksey Antonov,Vasilii Ermakov,Aleksandr Khaybriev,Margarita Kikot,Ilseyar Alimova,Stanislav Moiseev
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Accepted at ICSME 2025

Abstract: In modern software development, change-based testing plays a crucial role. However, as codebases expand and test suites grow, efficiently managing the testing process becomes increasingly challenging, especially given the high frequency of daily code commits. We propose Targeted Test Selection (T-TS), a machine learning approach for industrial test selection. Our key innovation is a data representation that represents commits as Bags-of-Words of changed files, incorporates cross-file and additional predictive features, and notably avoids the use of coverage maps. Deployed in production, T-TS was comprehensively evaluated against industry standards and recent methods using both internal and public datasets, measuring time efficiency and fault detection. On live industrial data, T-TS selects only 15% of tests, reduces execution time by $5.9\times$, accelerates the pipeline by $5.6\times$, and detects over 95% of test failures. The implementation is publicly available to support further research and practical adoption.
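
One possible reading of "commits as Bags-of-Words of changed files" is sketched below: each changed path is split into tokens and the commit becomes a token-count vector. The tokenisation rule, function name, and file paths are assumptions for illustration; the abstract does not specify how T-TS builds its representation.

```python
import re
from collections import Counter


def commit_bag_of_words(changed_files):
    """Turn a commit's changed file paths into a bag of lowercase path tokens."""
    tokens = []
    for path in changed_files:
        # Split on path separators, dots, underscores and hyphens (assumed rule).
        tokens.extend(t.lower() for t in re.split(r"[/._\-]+", path) if t)
    return Counter(tokens)


# Hypothetical commit touching two files in the same package.
bag = commit_bag_of_words(["src/payments/api.py", "src/payments/models.py"])
```

A representation like this needs only the list of changed files, which is consistent with the abstract's point that coverage maps are not required.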

[LG-14] Property prediction for ionic liquids without prior structural knowledge using limited experimental data: A data-driven neural recommender system leveraging transfer learning

Link: https://arxiv.org/abs/2509.10273
Authors: Sahil Sethi,Kai Sundmacher,Caroline Ganzer
Categories: Machine Learning (cs.LG)

Abstract:Ionic liquids (ILs) have emerged as versatile replacements for traditional solvents because their physicochemical properties can be precisely tailored to various applications. However, accurately predicting key thermophysical properties remains challenging due to the vast chemical design space and the limited availability of experimental data. In this study, we present a data-driven transfer learning framework that leverages a neural recommender system (NRS) to enable reliable property prediction for ILs using sparse experimental datasets. The approach involves a two-stage process: first, pre-training NRS models on COSMO-RS-based simulated data at fixed temperature and pressure to learn property-specific structural embeddings for cations and anions; and second, fine-tuning simple feedforward neural networks using these embeddings with experimental data at varying temperatures and pressures. In this work, five essential IL properties are considered: density, viscosity, surface tension, heat capacity, and melting point. The framework supports both within-property and cross-property knowledge transfer. Notably, pre-trained models for density, viscosity, and heat capacity are used to fine-tune models for all five target properties, achieving improved performance by a substantial margin for four of them. The model exhibits robust extrapolation to previously unseen ILs. Moreover, the final trained models enable property prediction for over 700,000 IL combinations, offering a scalable solution for IL screening in process design. This work highlights the effectiveness of combining simulated data and transfer learning to overcome sparsity in the experimental data.

[LG-15] Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications

Link: https://arxiv.org/abs/2509.10248
Authors: Janis Keuper
Categories: Machine Learning (cs.LG)

Abstract: The ongoing, intense discussion of rising LLM usage in the scientific peer-review process has recently been complicated by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such “attacks” (seen by some commentators as “self-defense”) would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation, which uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs, shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores; II) LLM reviews are generally biased toward acceptance (95% in many models). Both results have a great impact on the ongoing discussions on LLM usage in peer review.

[LG-16] Model-agnostic post-hoc explainability for recommender systems

Link: https://arxiv.org/abs/2509.10245
Authors: Irina Arévalo,Jose L Salmeron
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Recommender systems often benefit from complex feature embeddings and deep learning algorithms, which deliver sophisticated recommendations that enhance user experience, engagement, and revenue. However, these methods frequently reduce the interpretability and transparency of the system. In this research, we develop a systematic application, adaptation, and evaluation of deletion diagnostics in the recommender setting. The method compares the performance of a model to that of a similar model trained without a specific user or item, allowing us to quantify how that observation influences the recommender, either positively or negatively. To demonstrate its model-agnostic nature, the proposal is applied to both Neural Collaborative Filtering (NCF), a widely used deep learning-based recommender, and Singular Value Decomposition (SVD), a classical collaborative filtering technique. Experiments on the MovieLens and Amazon Reviews datasets provide insights into model behavior and highlight the generality of the approach across different recommendation paradigms.
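
The deletion-diagnostics idea (compare a model with a similar model trained without one user) can be sketched with a deliberately tiny recommender. The item-mean model, the ratings, and all names below are hypothetical stand-ins; the paper applies the idea to NCF and SVD, which are not reproduced here.

```python
def fit_item_means(ratings):
    """Toy recommender: predict each item's mean observed rating."""
    sums, counts = {}, {}
    for _user, item, r in ratings:
        sums[item] = sums.get(item, 0.0) + r
        counts[item] = counts.get(item, 0) + 1
    return {item: sums[item] / counts[item] for item in sums}


def mse(model, test, default=3.0):
    """Mean squared error on held-out (user, item, rating) rows."""
    return sum((model.get(item, default) - r) ** 2 for _u, item, r in test) / len(test)


def user_influence(train, test, user):
    """Test MSE without `user` minus test MSE with them (positive: the user helps)."""
    full = mse(fit_item_means(train), test)
    reduced = mse(fit_item_means([row for row in train if row[0] != user]), test)
    return reduced - full


# Hypothetical ratings for a single item "m1".
train = [("a", "m1", 5.0), ("b", "m1", 4.0), ("outlier", "m1", 1.0)]
test = [("c", "m1", 4.5)]

influence_outlier = user_influence(train, test, "outlier")
influence_a = user_influence(train, test, "a")
```

Because the procedure only needs a way to retrain and score a model, the same loop works for any recommender, which is the sense in which the approach is model-agnostic.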

[LG-17] A Certifiable Machine Learning-Based Pipeline to Predict Fatigue Life of Aircraft Structures

Link: https://arxiv.org/abs/2509.10227
Authors: Ángel Ladrón,Miguel Sánchez-Domínguez,Javier Rozalén,Fernando R. Sánchez,Javier de Vicente,Lucas Lacasa,Eusebio Valero,Gonzalo Rubio
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 29 pages, 15 figures

Abstract: Fatigue life prediction is essential in both the design and operational phases of any aircraft, and in this sense safety in the aerospace industry requires early detection of fatigue cracks to prevent in-flight failures. Robust and precise fatigue life predictors are thus essential to ensure safety. Traditional engineering methods, while reliable, are time-consuming and involve complex workflows, including steps such as conducting several Finite Element Method (FEM) simulations, deriving the expected loading spectrum, and applying cycle counting techniques like peak-valley or rainflow counting. These steps often require collaboration between multiple teams and tools, added to the computational time and effort required to achieve fatigue life predictions. Machine learning (ML) offers a promising complement to traditional fatigue life estimation methods, enabling faster iterations and generalization, providing quick estimates that guide decisions alongside conventional simulations. In this paper, we present an ML-based pipeline that aims to estimate the fatigue life of different aircraft wing locations given the flight parameters of the different missions that the aircraft will be operating throughout its operational life. We validate the pipeline in a realistic use case of fatigue life estimation, yielding accurate predictions alongside a thorough statistical validation and uncertainty quantification. Our pipeline constitutes a complement to traditional methodologies by reducing the amount of costly simulations and, thereby, lowering the required computational and human resources.

[LG-18] RFSeek and Ye Shall Find

Link: https://arxiv.org/abs/2509.10216
Authors: Noga H. Rotman,Tiago Ferreira,Hila Peleg,Mark Silberstein,Alexandra Silva
Categories: Networking and Internet Architecture (cs.NI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:Requests for Comments (RFCs) are extensive specification documents for network protocols, but their prose-based format and their considerable length often impede precise operational understanding. We present RFSeek, an interactive tool that automatically extracts visual summaries of protocol logic from RFCs. RFSeek leverages large language models (LLMs) to generate provenance-linked, explorable diagrams, surfacing both official state machines and additional logic found only in the RFC text. Compared to existing RFC visualizations, RFSeek’s visual summaries are more transparent and easier to audit against their textual source. We showcase the tool’s potential through a series of use cases, including guided knowledge extraction and semantic diffing, applied to protocols such as TCP, QUIC, PPTP, and DCCP. In practice, RFSeek not only reconstructs the RFC diagrams included in some specifications, but, more interestingly, also uncovers important logic such as nodes or edges described in the text but missing from those diagrams. RFSeek further derives new visualization diagrams for complex RFCs, with QUIC as a representative case. Our approach, which we term Summary Visualization, highlights a promising direction: combining LLMs with formal, user-customized visualizations to enhance protocol comprehension and support robust implementations.

[LG-19] Investigating Feature Attribution for 5G Network Intrusion Detection

Link: https://arxiv.org/abs/2509.10206
Authors: Federica Uccello,Simin Nadjm-Tehrani
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:With the rise of fifth-generation (5G) networks in critical applications, it is urgent to move from detection of malicious activity to systems capable of providing a reliable verdict suitable for mitigation. In this regard, understanding and interpreting machine learning (ML) models’ security alerts is crucial for enabling actionable incident response orchestration. Explainable Artificial Intelligence (XAI) techniques are expected to enhance trust by providing insights into why alerts are raised. A dominant approach statistically associates feature sets that can be correlated to a given alert. This paper starts by questioning whether such attribution is relevant for future generation communication systems, and investigates its merits in comparison with an approach based on logical explanations. We extensively study two methods, SHAP and VoTE-XAI, by analyzing their interpretations of alerts generated by an XGBoost model in three different use cases with several 5G communication attacks. We identify three metrics for assessing explanations: sparsity, how concise they are; stability, how consistent they are across samples from the same attack type; and efficiency, how fast an explanation is generated. As an example, in a 5G network with 92 features, 6 were deemed important by VoTE-XAI for a Denial of Service (DoS) variant, ICMPFlood, while SHAP identified over 20. More importantly, we found a significant divergence between features selected by SHAP and VoTE-XAI. However, none of the top-ranked features selected by SHAP were missed by VoTE-XAI. When it comes to efficiency of providing interpretations, we found that VoTE-XAI is significantly more responsive, e.g. it provides a single explanation in under 0.002 seconds, in a high-dimensional setting (478 features).
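
The sparsity and stability metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how stability is computed; mean pairwise Jaccard similarity between explanation feature sets is assumed here purely for illustration. The function names and the toy feature names are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def sparsity(explanation):
    """Sparsity as the number of features an explanation marks as important."""
    return len(set(explanation))


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(explanations):
    """Assumed stability score: mean pairwise Jaccard similarity across
    explanations of samples that share the same attack type."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy explanations for three samples of one attack type (made-up feature names).
EXPLANATIONS = [{"f1", "f2"}, {"f1", "f2"}, {"f1", "f3"}]
```

A score of 1.0 would mean every sample of the attack type received the same feature set; lower values indicate explanations that vary from sample to sample.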

[LG-20] Hadamard-Riemannian Optimization for Margin-Variance Ensemble

Link: https://arxiv.org/abs/2509.10189
Authors: Zexu Jin
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Ensemble learning has been widely recognized as a pivotal technique for boosting predictive performance by combining multiple base models. Nevertheless, conventional margin-based ensemble methods predominantly focus on maximizing the expected margin while neglecting the critical role of margin variance, which inherently restricts the generalization capability of the model and heightens its vulnerability to overfitting, particularly in noisy or imbalanced datasets. Additionally, the conventional approach of optimizing ensemble weights within the probability simplex often introduces computational inefficiency and scalability challenges, complicating its application to large-scale problems. To tackle these limitations, this paper introduces a novel ensemble learning framework that explicitly incorporates margin variance into the loss function. Our method jointly optimizes the negative expected margin and its variance, leading to enhanced robustness and improved generalization performance. Moreover, by reparameterizing the ensemble weights onto the unit sphere, we substantially simplify the optimization process and improve computational efficiency. Extensive experiments conducted on multiple benchmark datasets demonstrate that the proposed approach consistently outperforms traditional margin-based ensemble techniques, underscoring its effectiveness and practical utility.
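
The abstract does not spell out the parameterization. One common way to map the unit sphere onto the probability simplex, assumed here, is the Hadamard square w = v ⊙ v with ||v||_2 = 1. The sketch below uses hypothetical names (`sphere_to_simplex`, `margin_variance_loss`, the trade-off weight `lam`) and binary labels in {-1, +1}. It only evaluates the objective and is not the authors' optimizer.

```python
import math


def sphere_to_simplex(v):
    """Normalize v onto the unit sphere, then square elementwise (w = v * v).

    The squared entries are non-negative and sum to 1, so w lies on the simplex.
    """
    norm = math.sqrt(sum(x * x for x in v))
    return [(x / norm) ** 2 for x in v]


def margin_variance_loss(v, base_preds, labels, lam=1.0):
    """Negative mean margin plus lam times the population variance of the margin.

    base_preds[n][i] is the {-1, +1} vote of base model i on sample n.
    """
    w = sphere_to_simplex(v)
    margins = [
        y * sum(wi * h for wi, h in zip(w, votes))
        for votes, y in zip(base_preds, labels)
    ]
    mean = sum(margins) / len(margins)
    var = sum((m - mean) ** 2 for m in margins) / len(margins)
    return -mean + lam * var


# Toy data (made up for illustration): 3 samples, 2 base models.
BASE_PREDS = [[1, 1], [1, -1], [-1, -1]]
LABELS = [1, 1, -1]
```

Because the simplex constraint is absorbed into the reparameterization, an optimizer only has to keep `v` on the sphere, for example by renormalizing after each step.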

[LG-21] P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context

Link: https://arxiv.org/abs/2509.10186
Authors: Benjamin Holzschuh,Georg Kohl,Florian Redinger,Nils Thuerey
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present a scalable framework for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations. We introduce a hybrid CNN-Transformer backbone architecture targeted for 3D physics simulations, which significantly outperforms existing architectures in terms of speed and accuracy. Our proposed network can be pretrained on small patches of the simulation domain, which can be fused to obtain a global solution, optionally guided via a fast and scalable sequence-to-sequence model to include long-range dependencies. This setup allows for training large-scale models with reduced memory and compute requirements for high-resolution datasets. We evaluate our backbone architecture against a large set of baseline methods with the objective to simultaneously learn the dynamics of 14 different types of PDEs in 3D. We demonstrate how to scale our model to high-resolution isotropic turbulence with spatial resolutions of up to 512^3. Finally, we demonstrate the versatility of our network by training it as a diffusion model to produce probabilistic samples of highly turbulent 3D channel flows across varying Reynolds numbers, accurately capturing the underlying flow statistics.

[LG-22] The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams

Link: https://arxiv.org/abs/2509.10167
Authors: Lénaïc Chizat
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that with a diverging depth L, a fixed embedding dimension D, and an arbitrary hidden width M, the training dynamics converges to a Neural Mean ODE training dynamics. Remarkably, the limit is independent of the scaling of M, covering practical cases of, say, Transformers, where M (the number of hidden units or attention heads per layer) is typically of the order of D. For a residual scale \Theta_D\big(\frac{\alpha}{LM}\big), we obtain the error bound O_D\big(\frac{1}{L} + \frac{\alpha}{\sqrt{LM}}\big) between the model’s output and its limit after a fixed number of gradient steps, and we verify empirically that this rate is tight. When \alpha = \Theta(1), the limit exhibits complete feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that \alpha \to \infty yields a "lazy ODE" regime where the Mean ODE is linearly parameterized. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension D. We show that for this model, the only residual scale that leads to complete feature learning is \Theta\big(\frac{\sqrt{D}}{LM}\big). In this regime, we prove the error bound O\big(\frac{1}{L} + \frac{\sqrt{D}}{\sqrt{LM}}\big) between the ResNet and its limit after a fixed number of gradient steps, which is also empirically tight. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics.

[LG-23] A Symmetry-Integrated Approach to Surface Code Decoding

Link: https://arxiv.org/abs/2509.10164
Authors: Hoshitaro Ohnishi,Hideo Mukai
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 12 pages, 6 figures

Abstract:Quantum error correction, which utilizes logical qubits that are encoded as redundant multiple physical qubits to find and correct errors in physical qubits, is indispensable for practical quantum computing. The surface code is considered to be a promising encoding method with a high error threshold that is defined by stabilizer generators. However, previous methods have suffered from the problem that the decoder acquires solely the error probability distribution because of the non-uniqueness of the correct prediction obtained from the input. To circumvent this problem, we propose a technique to reoptimize the decoder model by approximating syndrome measurements with a continuous function that is mathematically interpolated by a neural network. We evaluated the improvement in accuracy of a multilayer-perceptron-based decoder for code distances of 5 and 7 as well as for decoders based on convolutional and recurrent neural networks and transformers for a code distance of 5. In all cases, the reoptimized decoder gave better accuracy than the original models, demonstrating the universal effectiveness of the proposed method that is independent of code distance or network architecture. These results suggest that re-framing the problem of surface code decoding into a regression problem that can be tackled by deep learning is a useful strategy.

[LG-24] Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks

Link: https://arxiv.org/abs/2509.10163
Authors: Francisco Javier Esono Nkulu Andong,Qi Min
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:

Abstract:As sixth-generation (6G) networks move toward ultra-dense, intelligent edge environments, efficient resource management under stringent privacy, mobility, and energy constraints becomes critical. This paper introduces a novel Federated Multi-Agent Reinforcement Learning (Fed-MARL) framework that incorporates cross-layer orchestration of both the MAC layer and application layer for energy-efficient, privacy-preserving, and real-time resource management across heterogeneous edge devices. Each agent uses a Deep Recurrent Q-Network (DRQN) to learn decentralized policies for task offloading, spectrum access, and CPU energy adaptation based on local observations (e.g., queue length, energy, CPU usage, and mobility). To protect privacy, we introduce a secure aggregation protocol based on elliptic curve Diffie Hellman key exchange, which ensures accurate model updates without exposing raw data to semi-honest adversaries. We formulate the resource management problem as a partially observable multi-agent Markov decision process (POMMDP) with a multi-objective reward function that jointly optimizes latency, energy efficiency, spectral efficiency, fairness, and reliability under 6G-specific service requirements such as URLLC, eMBB, and mMTC. Simulation results demonstrate that Fed-MARL outperforms centralized MARL and heuristic baselines in task success rate, latency, energy efficiency, and fairness, while ensuring robust privacy protection and scalability in dynamic, resource-constrained 6G edge networks.

[LG-25] FedBiF: Communication-Efficient Federated Learning via Bits Freezing

Link: https://arxiv.org/abs/2509.10161
Authors: Shiwei Li,Qunwei Li,Haozhao Wang,Ruixuan Li,Jianbin Lin,Wenliang Zhong
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by TPDS

Abstract:Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this paper, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication. The code is available at this https URL.

[LG-26] Cost-Free Personalization via Information-Geometric Projection in Bayesian Federated Learning

Link: https://arxiv.org/abs/2509.10132
Authors: Nour Jamoussi,Giuseppe Serra,Photios A. Stavrou,Marios Kountouris
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Bayesian Federated Learning (BFL) combines uncertainty modeling with decentralized training, enabling the development of personalized and reliable models under data heterogeneity and privacy constraints. Existing approaches typically rely on Markov Chain Monte Carlo (MCMC) sampling or variational inference, often incorporating personalization mechanisms to better adapt to local data distributions. In this work, we propose an information-geometric projection framework for personalization in parametric BFL. By projecting the global model onto a neighborhood of the user’s local model, our method enables a tunable trade-off between global generalization and local specialization. Under mild assumptions, we show that this projection step is equivalent to computing a barycenter on the statistical manifold, allowing us to derive closed-form solutions and achieve cost-free personalization. We apply the proposed approach to a variational learning setup using the Improved Variational Online Newton (IVON) optimizer and extend its application to general aggregation schemes in BFL. Empirical evaluations under heterogeneous data distributions confirm that our method effectively balances global and local performance with minimal computational overhead.

[LG-27] KAN-SR: A Kolmogorov-Arnold Network Guided Symbolic Regression Framework

Link: https://arxiv.org/abs/2509.10089
Authors: Marco Andrea Bühler,Gonzalo Guillén-Gosálbez
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We introduce a novel symbolic regression framework, namely KAN-SR, built on Kolmogorov Arnold Networks (KANs) which follows a divide-and-conquer approach. Symbolic regression searches for mathematical equations that best fit a given dataset and is commonly solved with genetic programming approaches. We show that by using deep learning techniques, more specific KANs, and combining them with simplification strategies such as translational symmetries and separabilities, we are able to recover ground-truth equations of the Feynman Symbolic Regression for Scientific Discovery (SRSD) dataset. Additionally, we show that by combining the proposed framework with neural controlled differential equations, we are able to model the dynamics of an in-silico bioprocess system precisely, opening the door for the dynamic modeling of other engineering systems.

[LG-28] Prototypical Contrastive Learning For Improved Few-Shot Audio Classification

Link: https://arxiv.org/abs/2509.10074
Authors: Christos Sgouropoulos,Christos Nikou,Stefanos Vlachos,Vasileios Theiou,Christos Foukanelis,Theodoros Giannakopoulos
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted and Presented at IEEE International Workshop on Machine Learning for Signal Processing, Aug. 31–Sep. 3, 2025, Istanbul, Turkey, 6 pages, 2 figures, 1 table

Abstract:Few-shot learning has emerged as a powerful paradigm for training models with limited labeled data, addressing challenges in scenarios where large-scale annotation is impractical. While extensive research has been conducted in the image domain, few-shot learning in audio classification remains relatively underexplored. In this work, we investigate the effect of integrating supervised contrastive loss into prototypical few-shot training for audio classification. In detail, we demonstrate that angular loss further improves the performance compared to the standard contrastive loss. Our method leverages SpecAugment followed by a self-attention mechanism to encapsulate diverse information of augmented input versions into one unified embedding. We evaluate our approach on MetaAudio, a benchmark including five datasets with predefined splits, standardized preprocessing, and a comprehensive set of few-shot learning models for comparison. The proposed approach achieves state-of-the-art performance in a 5-way, 5-shot setting.

[LG-29] Uncertainty-Aware Tabular Prediction: Evaluating VBLL-Enhanced TabPFN in Safety-Critical Medical Data

Link: https://arxiv.org/abs/2509.10048
Authors: Madhushan Ramalingam
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Predictive models are being increasingly used across a wide range of domains, including safety-critical applications such as medical diagnosis and criminal justice. Reliable uncertainty estimation is a crucial task in such settings. Tabular Prior-data Fitted Network (TabPFN) is a recently proposed machine learning foundation model for tabular datasets, which uses a generative transformer architecture. Variational Bayesian Last Layers (VBLL) is a state-of-the-art lightweight variational formulation that effectively improves uncertainty estimation with minimal computational overhead. In this work we aim to evaluate the performance of VBLL integrated with the recently proposed TabPFN in uncertainty calibration. Our experiments, conducted on three benchmark medical tabular datasets, compare the performance of the original TabPFN and the VBLL-integrated version. Contrary to expectations, we observed that the original TabPFN consistently outperforms the VBLL-integrated TabPFN in uncertainty calibration across all datasets.

[LG-30] FedRP: A Communication-Efficient Approach for Differentially Private Federated Learning Using Random Projection

Link: https://arxiv.org/abs/2509.10041
Authors: Mohammad Hasan Narimani,Mostafa Tavassolipour
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Federated learning (FL) offers an innovative paradigm for collaborative model training across decentralized devices, such as smartphones, balancing enhanced predictive performance with the protection of user privacy in sensitive areas like Internet of Things (IoT) and medical data analysis. Despite its advantages, FL encounters significant challenges related to user privacy protection against potential attacks and the management of communication costs. This paper introduces a novel federated learning algorithm called FedRP, which integrates random projection techniques with the Alternating Direction Method of Multipliers (ADMM) optimization framework. This approach enhances privacy by employing random projection to reduce the dimensionality of model parameters prior to their transmission to a central server, reducing the communication cost. The proposed algorithm offers a strong (\epsilon, \delta)-differential privacy guarantee, demonstrating resilience against data reconstruction attacks. Experimental results reveal that FedRP not only maintains high model accuracy but also outperforms existing methods, including conventional differential privacy approaches and FedADMM, in terms of both privacy preservation and communication efficiency.

[LG-31] Symbolic Feedforward Networks for Probabilistic Finite Automata: Exact Simulation and Learnability

Link: https://arxiv.org/abs/2509.10034
Authors: Sahil Rajesh Dhayalkar
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 2 figures

Abstract:We present a formal and constructive theory showing that probabilistic finite automata (PFAs) can be exactly simulated using symbolic feedforward neural networks. Our architecture represents state distributions as vectors and transitions as stochastic matrices, enabling probabilistic state propagation via matrix-vector products. This yields a parallel, interpretable, and differentiable simulation of PFA dynamics using soft updates, without recurrence. We formally characterize probabilistic subset construction, \varepsilon-closure, and exact simulation via layered symbolic computation, and prove equivalence between PFAs and specific classes of neural networks. We further show that these symbolic simulators are not only expressive but learnable: trained with standard gradient-descent-based optimization on labeled sequence data, they recover the exact behavior of ground-truth PFAs. This learnability, formalized in Proposition 5.1, is the crux of this work. Our results unify probabilistic automata theory with neural architectures under a rigorous algebraic framework, bridging the gap between symbolic computation and deep learning.
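
The state propagation described in the abstract (distributions as vectors, transitions as stochastic matrices) can be sketched in a few lines of plain Python. This is only an illustration of the matrix-vector view, with one propagation step per input symbol. The two-state automaton, its probabilities, and all names below are hypothetical and do not come from the paper.

```python
def step(dist, matrix):
    """Propagate a row-vector state distribution through a row-stochastic matrix."""
    n_next = len(matrix[0])
    return [
        sum(dist[i] * matrix[i][j] for i in range(len(dist)))
        for j in range(n_next)
    ]


def acceptance_probability(initial, transitions, accepting, word):
    """Apply one transition matrix per symbol, then sum the mass on accepting states."""
    dist = list(initial)
    for symbol in word:
        dist = step(dist, transitions[symbol])
    return sum(p for p, is_accepting in zip(dist, accepting) if is_accepting)


# Hypothetical two-state PFA over the alphabet {a, b}; each matrix row sums to 1.
TRANSITIONS = {
    "a": [[0.5, 0.5], [0.0, 1.0]],
    "b": [[1.0, 0.0], [0.3, 0.7]],
}
INITIAL = [1.0, 0.0]        # start in state 0 with probability 1
ACCEPTING = [False, True]   # state 1 is the only accepting state
```

For a fixed input length, the loop unrolls into a fixed stack of linear layers, which is the sense in which no recurrent cell is needed in this sketch.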

[LG-32] Sparse Coding Representation of 2-way Data

Link: https://arxiv.org/abs/2509.10033
Authors: Boya Ma,Abram Magner,Maxwell McNeil,Petko Bogdanov
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Sparse dictionary coding represents signals as linear combinations of a few dictionary atoms. It has been applied to images, time series, graph signals and multi-way spatio-temporal data by jointly employing temporal and spatial dictionaries. Data-agnostic analytical dictionaries, such as the discrete Fourier transform, wavelets and graph Fourier, have seen wide adoption due to efficient implementations and good practical performance. On the other hand, dictionaries learned from data offer sparser and more accurate solutions but require learning of both the dictionaries and the coding coefficients. This becomes especially challenging for multi-dictionary scenarios since encoding coefficients correspond to all atom combinations from the dictionaries. To address this challenge, we propose a low-rank coding model for 2-dictionary scenarios and study its data complexity. Namely, we establish a bound on the number of samples needed to learn dictionaries that generalize to unseen samples from the same distribution. We propose a convex relaxation solution, called AODL, whose exact solution we show also solves the original problem. We then solve this relaxation via alternating optimization between the sparse coding matrices and the learned dictionaries, which we prove to be convergent. We demonstrate its quality for data reconstruction and missing value imputation in both synthetic and real-world datasets. For a fixed reconstruction quality, AODL learns up to 90% sparser solutions compared to non-low-rank and analytical (fixed) dictionary baselines. In addition, the learned dictionaries reveal interpretable insights into patterns present within the samples used for training.

[LG-33] Neural Scaling Laws for Deep Regression

Link: https://arxiv.org/abs/2509.10000
Authors: Tilen Cadez,Kyoung-Min Kim
Categories: Machine Learning (cs.LG); Other Condensed Matter (cond-mat.other)
Comments: Supplementary Information will be provided with the published manuscript

Abstract:Neural scaling laws, power-law relationships between generalization errors and characteristics of deep learning models, are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures, including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
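
A power law of the form loss ≈ c · N^(-alpha) becomes a straight line in log-log space, so a scaling exponent can be estimated with ordinary least squares on the logarithms. The sketch below shows that standard procedure in plain Python. The function name and the synthetic data points are assumptions for illustration only, not the paper's measurements or fitting code.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~ c * size**(-alpha) by least squares on (log size, log loss)."""
    log_x = [math.log(x) for x in sizes]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    alpha = -slope                        # scaling exponent
    c = math.exp(mean_y - slope * mean_x)  # prefactor
    return alpha, c


# Synthetic, noise-free data generated with alpha = 1.5 and c = 3.0.
SIZES = [100, 1000, 10000, 100000]
LOSSES = [3.0 * n ** -1.5 for n in SIZES]
ALPHA, C = fit_power_law(SIZES, LOSSES)
```

On real measurements the fit is usually restricted to the range where the log-log plot is visibly linear, since losses tend to plateau at very small or very large sizes.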

[LG-34] Data-Driven Energy Estimation for Virtual Servers Using Combined System Metrics and Machine Learning

Link: https://arxiv.org/abs/2509.09991
Authors: Amandip Sangha
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a machine learning-based approach to estimate the energy consumption of virtual servers without access to physical power measurement interfaces. Using resource utilization metrics collected from guest virtual machines, we train a Gradient Boosting Regressor to predict energy consumption measured via RAPL on the host. We demonstrate, for the first time, guest-only resource-based energy estimation without privileged host access with experiments across diverse workloads, achieving high predictive accuracy and variance explained (0.90 \leq R^2 \leq 0.97), indicating the feasibility of guest-side energy estimation. This approach can enable energy-aware scheduling, cost optimization and physical-host-independent energy estimates in virtualized environments. Our approach addresses a critical gap in virtualized environments (e.g. cloud) where direct energy measurement is infeasible.

[LG-35] DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition

Link: https://arxiv.org/abs/2509.09940
Authors: Yifei Wang,Wenbin Wang,Yong Luo
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 2 figures

Abstract:Though Multimodal Intent Recognition (MIR) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential for intent-irrelevant and conflicting information across modalities may hinder performance from being further improved. Most current models attempt to fuse modalities by applying mechanisms like multi-head attention to unimodal feature sequences and then adding the result back to the original representation. This process risks corrupting the primary linguistic features with noisy or irrelevant non-verbal signals, as it often fails to capture the fine-grained, token-level influence where non-verbal cues should modulate, not just augment, textual meaning. To address this, we introduce DyKen-Hyena, which reframes the problem from feature fusion to processing modulation. Our model translates audio-visual cues into dynamic, per-token convolutional kernels that directly modulate textual feature extraction. This fine-grained approach achieves state-of-the-art results on the MIntRec and MIntRec2.0 benchmarks. Notably, it yields a +10.46% F1-score improvement in out-of-scope detection, validating that our method creates a fundamentally more robust intent representation.

[LG-36] SciML Agents: Write the Solver Not the Solution

Link: https://arxiv.org/abs/2509.09936
Authors: Saarth Gaonkar,Xiang Zheng,Haocheng Xi,Rishabh Tiwari,Kurt Keutzer,Dmitriy Morozov,Michael W. Mahoney,Amir Gholami
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics-informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain-aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural-language ODE description, generate runnable code that is scientifically appropriate, selecting suitable solvers (stiff vs. non-stiff), and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial “misleading” problems; and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non-stiffness; and the large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction-following models achieve high accuracy on both criteria. In many cases, recent open-source systems perform strongly without fine-tuning, while older or smaller models still benefit from fine-tuning. Overall, our preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
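
Why the stiff versus non-stiff solver choice matters can be seen on the textbook test equation y' = -lam * y. With step size h, explicit (forward) Euler multiplies y by (1 - h*lam) each step and diverges when h*lam > 2, while implicit (backward) Euler multiplies by 1 / (1 + h*lam) and always decays. The sketch below is a generic numerical illustration in plain Python. The parameter values are arbitrary, and nothing here is taken from the paper's benchmark or generated code.

```python
def explicit_euler(lam, y0, h, steps):
    """Forward Euler for y' = -lam * y: y_next = y + h * (-lam * y)."""
    y = y0
    for _ in range(steps):
        y = y + h * (-lam * y)
    return y


def implicit_euler(lam, y0, h, steps):
    """Backward Euler for y' = -lam * y: solve y_next = y - h * lam * y_next."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * lam)
    return y


# Arbitrary stiff setting: h * lam = 10, far outside forward Euler's stability limit of 2.
LAM, Y0, H, STEPS = 1000.0, 1.0, 0.01, 100
Y_EXPLICIT = explicit_euler(LAM, Y0, H, STEPS)  # magnitude grows by a factor of 9 per step
Y_IMPLICIT = implicit_euler(LAM, Y0, H, STEPS)  # shrinks by a factor of 11 per step
```

The exact solution decays to essentially zero over this interval, so only the implicit result is qualitatively right. This is the kind of domain-aware choice the abstract asks the generated code to make.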

[LG-37] Multi-Play Combinatorial Semi-Bandit Problem

Link: https://arxiv.org/abs/2509.09933
Authors: Shintaro Nakamura,Yuko Kuroki,Wei Chen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems. To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains O(\log T) distribution-dependent regret in the stochastic regime, where T is the time horizon. The other is a best-of-both-worlds algorithm, which achieves O(\log T) variance-dependent regret in the stochastic regime and the worst-case \tilde{\mathcal{O}}\left(\sqrt{T}\right) regret in the adversarial regime. Moreover, its regret in the adversarial regime is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.

[LG-38] Variational Neural Networks for Observable Thermodynamics (V-NOTS)

Link: https://arxiv.org/abs/2509.09899
Authors: Christopher Eldred, François Gay-Balmaz, Vakhtang Putkaradze
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 6 figures

Abstract:Much attention has recently been devoted to data-based computing of evolution of physical systems. In such approaches, information about data points from past trajectories in phase space is used to reconstruct the equations of motion and to predict future solutions that have not been observed before. However, in many cases, the available data does not correspond to the variables that define the system’s phase space. We focus our attention on the important example of dissipative dynamical systems. In that case, the phase space consists of coordinates, momenta and entropies; however, the momenta and entropies cannot, in general, be observed directly. To address this difficulty, we develop an efficient data-based computing framework based exclusively on observable variables, by constructing a novel approach based on the \emph{thermodynamic Lagrangian}, and constructing neural networks that respect the thermodynamics and guarantee the non-decreasing entropy evolution. We show that our network can provide an efficient description of phase space evolution based on a limited number of data points and a relatively small number of parameters in the system.

[LG-39] Off Policy Lyapunov Stability in Reinforcement Learning

Link: https://arxiv.org/abs/2509.09863
Authors: Sarvan Gill, Daniela Constantinescu
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Conference on Robot Learning (CORL) 2025

Abstract:Traditional reinforcement learning lacks the ability to provide stability guarantees. More recent algorithms learn Lyapunov functions alongside the control policies to ensure stable learning. However, the current self-learned Lyapunov functions are sample inefficient due to their on-policy nature. This paper introduces a method for learning Lyapunov functions off-policy and incorporates the proposed off-policy Lyapunov function into the Soft Actor Critic and Proximal Policy Optimization algorithms to provide them with a data efficient stability certificate. Simulations of an inverted pendulum and a quadrotor illustrate the improved performance of the two algorithms when endowed with the proposed off-policy Lyapunov function.

[LG-40] Distinguishing Startle from Surprise Events Based on Physiological Signals

Link: https://arxiv.org/abs/2509.09799
Authors: Mansi Sharma, Alexandre Duchevet, Florian Daiber, Jean-Paul Imbert, Maurice Rekrut
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Unexpected events can impair attention and delay decision-making, posing serious safety risks in high-risk environments such as aviation. In particular, reactions like startle and surprise can impact pilot performance in different ways, yet are often hard to distinguish in practice. Existing research has largely studied these reactions separately, with limited focus on their combined effects or how to differentiate them using physiological data. In this work, we address this gap by distinguishing between startle and surprise events based on physiological signals using machine learning and multi-modal fusion strategies. Our results demonstrate that these events can be reliably predicted, achieving a highest mean accuracy of 85.7% with SVM and Late Fusion. To further validate the robustness of our model, we extended the evaluation to include a baseline condition, successfully differentiating between Startle, Surprise, and Baseline states with a highest mean accuracy of 74.9% with XGBoost and Late Fusion.

[LG-41] From the Gradient-Step Denoiser to the Proximal Denoiser and their associated convergent Plug-and-Play algorithms

Link: https://arxiv.org/abs/2509.09793
Authors: Vincent Herfeld, Baudouin Denis de Senneville, Arthur Leclaire, Nicolas Papadakis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In this paper we analyze the Gradient-Step Denoiser and its usage in Plug-and-Play algorithms. The Plug-and-Play paradigm of optimization algorithms uses off the shelf denoisers to replace a proximity operator or a gradient descent operator of an image prior. Usually this image prior is implicit and cannot be expressed, but the Gradient-Step Denoiser is trained to be exactly the gradient descent operator or the proximity operator of an explicit functional while preserving state-of-the-art denoising capabilities.

[LG-42] One Head Many Models: Cross-Attention Routing for Cost-Aware LLM Selection

Link: https://arxiv.org/abs/2509.09782
Authors: Roshini Pulishetty, Mani Kishan Ghantasala, Keerthy Kaushik Dasoju, Niti Mangwani, Vishal Garimella, Aditya Mate, Somya Chatterjee, Yue Kang, Ehi Nosakhare, Sadid Hasan, Soundar Srinivasan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The proliferation of large language models (LLMs) with varying computational costs and performance profiles presents a critical challenge for scalable, cost-effective deployment in real-world applications. We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings, enabling dynamic selection of the optimal LLM for each input query. Our approach is evaluated on RouterBench, a large-scale, publicly available benchmark encompassing diverse LLM pools and domains. By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers. To robustly balance performance and cost, we propose an exponential reward function that enhances stability across user preferences. The resulting architecture is lightweight, generalizes effectively across domains, and demonstrates improved efficiency compared to prior methods, establishing a new standard for cost-aware LLM routing.

[LG-43] Hybrid Adaptive Conformal Offline Reinforcement Learning for Fair Population Health Management

Link: https://arxiv.org/abs/2509.09772
Authors: Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, Rajaie Batniji
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 10 pages, 5 figures, 4 tables

Abstract:Population health management programs for Medicaid populations coordinate longitudinal outreach and services (e.g., benefits navigation, behavioral health, social needs support, and clinical scheduling) and must be safe, fair, and auditable. We present a Hybrid Adaptive Conformal Offline Reinforcement Learning (HACO) framework that separates risk calibration from preference optimization to generate conservative action recommendations at scale. In our setting, each step involves choosing among common coordination actions (e.g., which member to contact, by which modality, and whether to route to a specialized service) while controlling the near-term risk of adverse utilization events (e.g., unplanned emergency department visits or hospitalizations). Using a de-identified operational dataset from Waymark comprising 2.77 million sequential decisions across 168,126 patients, HACO (i) trains a lightweight risk model for adverse events, (ii) derives a conformal threshold to mask unsafe actions at a target risk level, and (iii) learns a preference policy on the resulting safe subset. We evaluate policies with a version-agnostic fitted Q evaluation (FQE) on stratified subsets and audit subgroup performance across age, sex, and race. HACO achieves strong risk discrimination (AUC ~0.81) with a calibrated threshold ( \tau ~0.038 at \alpha = 0.10), while maintaining high safe coverage. Subgroup analyses reveal systematic differences in estimated value across demographics, underscoring the importance of fairness auditing. Our results show that conformal risk gating integrates cleanly with offline RL to deliver conservative, auditable decision support for population health management teams.

[LG-44] Testing chatbots on the creation of encoders for audio conditioned image generation

Link: https://arxiv.org/abs/2509.09717
Authors: Jorge E. León, Miguel Carrasco
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:On one hand, recent advances in chatbots have led to a rising popularity in using these models for coding tasks. On the other hand, modern generative image models primarily rely on text encoders to translate semantic concepts into visual representations, even when there is clear evidence that audio can be employed as input as well. Given this, we explore whether state-of-the-art conversational agents can design effective audio encoders to replace the CLIP text encoder from Stable Diffusion 1.5, enabling image synthesis directly from sound. We prompted five publicly available chatbots to propose neural architectures to work as these audio encoders, with a set of well-explained shared conditions. Each valid suggested encoder was trained on over two million context-related audio-image-text observations, and evaluated on held-out validation and test sets using various metrics, together with a qualitative analysis of their generated images. Although almost all chatbots generated valid model designs, none achieved satisfactory results, indicating that their audio embeddings failed to align reliably with those of the original text encoder. Among the proposals, the Gemini audio encoder showed the best quantitative metrics, while the Grok audio encoder produced more coherent images (particularly when paired with the text encoder). Our findings reveal a shared architectural bias across chatbots and underscore the remaining coding gap that needs to be bridged in future versions of these models. We also created a public demo so everyone could study and try out these audio encoders. Finally, we propose research questions that should be tackled in the future, and encourage other researchers to perform more focused and highly specialized tasks like this one, so the respective chatbots cannot make use of well-known solutions and their creativity/reasoning is fully tested.

[LG-45] Powering Job Search at Scale: LLM-Enhanced Query Understanding in Job Matching Systems CIKM2025

Link: https://arxiv.org/abs/2509.09690
Authors: Ping Liu, Jianqiang Shen, Qianqi Shen, Chunnan Yao, Kevin Kao, Dan Xu, Rajat Arora, Baofen Zheng, Caleb Johnson, Liangjie Hong, Jingwei Wu, Wenjing Zhang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: CIKM2025

Abstract:Query understanding is essential in modern relevance systems, where user queries are often short, ambiguous, and highly context-dependent. Traditional approaches often rely on multiple task-specific Named Entity Recognition models to extract structured facets as seen in job search applications. However, this fragmented architecture is brittle, expensive to maintain, and slow to adapt to evolving taxonomies and language patterns. In this paper, we introduce a unified query understanding framework powered by a Large Language Model (LLM), designed to address these limitations. Our approach jointly models the user query and contextual signals such as profile attributes to generate structured interpretations that drive more accurate and personalized recommendations. The framework improves relevance quality in online A/B testing while significantly reducing system complexity and operational overhead. The results demonstrate that our solution provides a scalable and adaptable foundation for query understanding in dynamic web applications.

[LG-46] Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise

Link: https://arxiv.org/abs/2509.10385
Authors: Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:In this work, we explore differentially private synthetic data generation in a decentralized-data setting by building on the recently proposed Differentially Private Class-Centric Data Aggregation (DP-CDA). DP-CDA synthesizes data in a centralized setting by mixing multiple randomly-selected samples from the same class and injecting carefully calibrated Gaussian noise, ensuring (\epsilon, \delta)-differential privacy. When deployed in a decentralized or federated setting, where each client holds only a small partition of the data, DP-CDA faces new challenges. The limited sample size per client increases the sensitivity of local computations, requiring higher noise injection to maintain the differential privacy guarantee. This, in turn, leads to a noticeable degradation in the utility compared to the centralized setting. To mitigate this issue, we integrate the Correlation-Assisted Private Estimation (CAPE) protocol into the federated DP-CDA framework and propose the CAPE Assisted Federated DP-CDA algorithm. CAPE enables limited collaboration among the clients by allowing them to generate jointly distributed (anti-correlated) noise that cancels out in aggregate, while preserving privacy at the individual level. This technique significantly improves the privacy-utility trade-off in the federated setting. Extensive experiments on the MNIST and FashionMNIST datasets demonstrate that the proposed CAPE Assisted Federated DP-CDA approach can achieve utility comparable to its centralized counterpart in some parameter regimes, while maintaining rigorous differential privacy guarantees.

[LG-47] Matrix-free Neural Preconditioner for the Dirac Operator in Lattice Gauge Theory

Link: https://arxiv.org/abs/2509.10378
Authors: Yixuan Sun, Srinivas Eswar, Yin Lin, William Detmold, Phiala Shanahan, Xiaoye Li, Yang Liu, Prasanna Balaprakash
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
Comments:

Abstract:Linear systems arise in generating samples and in calculating observables in lattice quantum chromodynamics (QCD). Solving the Hermitian positive definite systems, which are sparse but ill-conditioned, involves using iterative methods, such as Conjugate Gradient (CG), which are time-consuming and computationally expensive. Preconditioners can effectively accelerate this process, with the state-of-the-art being multigrid preconditioners. However, constructing useful preconditioners can be challenging, adding additional computational overhead, especially in large linear systems. We propose a framework, leveraging operator learning techniques, to construct linear maps as effective preconditioners. The method in this work does not rely on explicit matrices from either the original linear systems or the produced preconditioners, allowing efficient model training and application in the CG solver. In the context of the Schwinger model (U(1) gauge theory in 1+1 spacetime dimensions with two degenerate-mass fermions), this preconditioning scheme effectively decreases the condition number of the linear systems and approximately halves the number of iterations required for convergence in relevant parameter ranges. We further demonstrate the framework learns a general mapping dependent on the lattice structure which leads to zero-shot learning ability for the Dirac operators constructed from gauge field configurations of different sizes.
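
As a rough illustration of what matrix-free preconditioning inside CG looks like, here is a minimal Python sketch. It is not the paper's method: the learned neural preconditioner is replaced by a plain Jacobi (diagonal) preconditioner, the toy operator is real symmetric positive definite rather than a Dirac operator, and the names `preconditioned_cg`, `apply_a`, and `apply_m_inv` are hypothetical. Both the operator and the preconditioner are passed as functions, so no matrix is ever formed.

```python
import numpy as np

def preconditioned_cg(apply_a, b, apply_m_inv, tol=1e-10, max_iters=1000):
    """Solve A x = b for a real SPD operator given only as a function.

    apply_a(v) returns A @ v; apply_m_inv(r) applies the preconditioner.
    Returns the solution and the number of iterations performed.
    """
    x = np.zeros_like(b)
    r = b - apply_a(x)          # residual
    z = apply_m_inv(r)          # preconditioned residual
    p = z.copy()                # search direction
    rz = float(r @ z)
    b_norm = float(np.linalg.norm(b))
    for k in range(max_iters):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        ap = apply_a(p)
        alpha = rz / float(p @ ap)
        x = x + alpha * p
        r = r - alpha * ap
        z = apply_m_inv(r)
        rz_new = float(r @ z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iters

# Toy operator (assumption): badly scaled diagonal plus weak nearest-neighbour coupling.
n = 200
diag = np.linspace(1.0, 1000.0, n)

def apply_a(v):
    out = diag * v
    out[:-1] -= 0.4 * v[1:]
    out[1:] -= 0.4 * v[:-1]
    return out

def identity(r):
    # No preconditioning: plain CG.
    return r.copy()

def jacobi(r):
    # Stand-in for a learned preconditioner: divide by the diagonal.
    return r / diag

b = np.ones(n)
x_plain, iters_plain = preconditioned_cg(apply_a, b, identity)
x_jacobi, iters_jacobi = preconditioned_cg(apply_a, b, jacobi)
print(iters_plain, iters_jacobi)
```

On this toy problem the preconditioned run needs fewer iterations than plain CG because dividing by the diagonal removes most of the ill-conditioning. A learned preconditioner would take the place of `jacobi` as another function of the residual.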

[LG-48] Why does your graph neural network fail on some graphs? Insights from exact generalisation error

Link: https://arxiv.org/abs/2509.10337
Authors: Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) are widely used in learning on graph-structured data, yet a principled understanding of why they succeed or fail remains elusive. While prior works have examined architectural limitations such as over-smoothing and over-squashing, these do not explain what enables GNNs to extract meaningful representations or why performance varies drastically between similar architectures. These questions are related to the role of generalisation: the ability of a model to make accurate predictions on unlabelled data. Although several works have derived generalisation error bounds for GNNs, these are typically loose, restricted to a single architecture, and offer limited insight into what governs generalisation in practice. In this work, we take a different approach by deriving the exact generalisation error for GNNs in a transductive fixed-design setting through the lens of signal processing. From this viewpoint, GNNs can be interpreted as graph filter operators that act on node features via the graph structure. By focusing on linear GNNs while allowing non-linearity in the graph filters, we derive the first exact generalisation error for a broad range of GNNs, including convolutional, PageRank-based, and attention-based models. The exact characterisation of the generalisation error reveals that only the aligned information between node features and graph structure contributes to generalisation. Furthermore, we quantify the effect of homophily on generalisation. Our work provides a framework that explains when and why GNNs can effectively leverage structural and feature information, offering practical guidance for model selection.

[LG-49] Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance

Link: https://arxiv.org/abs/2509.10166
Authors: Vladimir Petrovic, Rémi Bardenet, Agnès Desolneux
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein distance between two measures on \mathbb{R}^d, which is precisely an integral on the d-dimensional sphere. The sliced Wasserstein distance (SW) has gained momentum in machine learning either as a proxy to the less computationally tractable Wasserstein distance, or as a distance in its own right, due in particular to its built-in alleviation of the curse of dimensionality. There have been recent numerical benchmarks of quadratures for the sliced Wasserstein, and our viewpoint differs in that we concentrate on quadratures where the nodes are repulsive, i.e. negatively dependent. Indeed, negative dependence can bring variance reduction when the quadrature is adapted to the integration task. Our first contribution is to extract and motivate quadratures from the recent literature on determinantal point processes (DPPs) and repelled point processes, as well as repulsive quadratures from the literature specific to the sliced Wasserstein distance. We then numerically benchmark these quadratures. Moreover, we analyze the variance of the UnifOrtho estimator, an orthogonal Monte Carlo estimator. Our analysis sheds light on UnifOrtho’s success for the estimation of the sliced Wasserstein in large dimensions, as well as counterexamples from the literature. Our final recommendation for the computation of the sliced Wasserstein distance is to use randomized quasi-Monte Carlo in low dimensions and \emph{UnifOrtho} in large dimensions. DPP-based quadratures only shine when quasi-Monte Carlo also does, while repelled quadratures show moderate variance reduction in general, but more theoretical effort is needed to make them robust.
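
To make the idea of an orthogonal Monte Carlo estimator concrete, the following Python sketch estimates the squared sliced 2-Wasserstein distance between two equal-size samples using blocks of mutually orthogonal random directions. This is an illustration in the spirit of the abstract, not necessarily the paper's exact UnifOrtho construction; the function names and the QR-based sampler are assumptions for this example.

```python
import numpy as np

def random_orthogonal_directions(d, rng):
    """Draw d mutually orthogonal unit vectors, each uniform on the sphere."""
    gaussian = rng.standard_normal((d, d))
    q, r = np.linalg.qr(gaussian)
    # Fix column signs so that q is Haar-distributed on the orthogonal group.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return (q * signs).T  # rows are the directions

def sliced_wasserstein_sq(x, y, n_blocks, rng):
    """Estimate SW_2^2 between samples x and y, both of shape (n, d)."""
    d = x.shape[1]
    directions = np.vstack(
        [random_orthogonal_directions(d, rng) for _ in range(n_blocks)]
    )
    # In 1D, W_2^2 between equal-size samples is the mean squared
    # difference of the sorted projections.
    x_proj = np.sort(x @ directions.T, axis=0)
    y_proj = np.sort(y @ directions.T, axis=0)
    return float(np.mean((x_proj - y_proj) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 3))
shift = np.array([1.0, 2.0, 2.0])
y = x + shift
estimate = sliced_wasserstein_sq(x, y, n_blocks=4, rng=rng)
print(estimate)  # equals ||shift||^2 / d = 3.0 up to rounding
```

For a pure translation, each direction contributes the squared projection of the shift, and these sum to the squared norm of the shift over any orthonormal basis. The orthogonal estimator is therefore exact in this case, while independent directions would give a noisy estimate. This is a simple instance of the variance reduction that negative dependence can bring.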

[LG-50] FetalSleepNet: A Transfer Learning Framework with Spectral Equalisation Domain Adaptation for Fetal Sleep Stage Classification ALT

Link: https://arxiv.org/abs/2509.10082
Authors: Weitao Tang, Johann Vargas-Calixto, Nasim Katebi, Nhi Tran, Sharmony B. Kelly, Gari D. Clifford, Robert Galinsky, Faezeh Marzbanrad
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 13 pages, 4 tables, 5 figures, submitted to IEEE Journal of Biomedical and Health Informatics

Abstract:Introduction: This study presents FetalSleepNet, the first published deep learning approach to classifying sleep states from the ovine electroencephalogram (EEG). Fetal EEG is complex to acquire and difficult and laborious to interpret consistently. However, accurate sleep stage classification may aid in the early detection of abnormal brain maturation associated with pregnancy complications (e.g. hypoxia or intrauterine growth restriction). Methods: EEG electrodes were secured onto the ovine dura over the parietal cortices of 24 late gestation fetal sheep. A lightweight deep neural network originally developed for adult EEG sleep staging was trained on the ovine EEG using transfer learning from adult EEG. A spectral equalisation-based domain adaptation strategy was used to reduce cross-domain mismatch. Results: We demonstrated that while direct transfer performed poorly, full fine tuning combined with spectral equalisation achieved the best overall performance (accuracy: 86.6 percent, macro F1-score: 62.5), outperforming baseline models. Conclusions: To the best of our knowledge, FetalSleepNet is the first deep learning framework specifically developed for automated sleep staging from the fetal EEG. Beyond the laboratory, the EEG-based sleep stage classifier functions as a label engine, enabling large scale weak/semi supervised labeling and distillation to facilitate training on less invasive signals that can be acquired in the clinic, such as Doppler Ultrasound or electrocardiogram data. FetalSleepNet’s lightweight design makes it well suited for deployment in low power, real time, and wearable fetal monitoring systems. 

[LG-51] Engineering Spatial and Molecular Features from Cellular Niches to Inform Predictions of Inflammatory Bowel Disease

Link: https://arxiv.org/abs/2509.09923
Authors: Myles Joshua Toledo Tan, Maria Kapetanaki, Panayiotis V. Benos
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures, 7 tables. Submitted to the 25th BNAIC Conference, Namur, Belgium, November 19 - 21, 2025

Abstract:Differentiating between the two main subtypes of Inflammatory Bowel Disease (IBD), Crohn's disease (CD) and ulcerative colitis (UC), is a persistent clinical challenge due to overlapping presentations. This study introduces a novel computational framework that employs spatial transcriptomics (ST) to create an explainable machine learning model for IBD classification. We analyzed ST data from the colonic mucosa of healthy controls (HC), UC, and CD patients. Using Non-negative Matrix Factorization (NMF), we first identified four recurring cellular niches, representing distinct functional microenvironments within the tissue. From these niches, we systematically engineered 44 features capturing three key aspects of tissue pathology: niche composition, neighborhood enrichment, and niche-gene signals. A multilayer perceptron (MLP) classifier trained on these features achieved an accuracy of 0.774 +/- 0.161 for the more challenging three-class problem (HC, UC, and CD) and 0.916 +/- 0.118 in the two-class problem of distinguishing IBD from healthy tissue. Crucially, model explainability analysis revealed that disruptions in the spatial organization of niches were the strongest predictors of general inflammation, while the classification between UC and CD relied on specific niche-gene expression signatures. This work provides a robust, proof-of-concept pipeline that transforms descriptive spatial data into an accurate and explainable predictive tool, offering not only a potential new diagnostic paradigm but also deeper insights into the distinct biological mechanisms that drive IBD subtypes.

[LG-52] Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators

Link: https://arxiv.org/abs/2509.09894
Authors: Jiayun Wang, Yousuf Aborahama, Arya Khokhar, Yang Zhang, Chuwei Wang, Karteekeya Sastry, Julius Berner, Yilin Luo, Boris Bonev, Zongyi Li, Kamyar Azizzadenesheli, Lihong V. Wang, Anima Anandkumar
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Abstract:Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit. While three-dimensional PACT systems enable high-resolution volumetric imaging for applications spanning transcranial to breast imaging, current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation. We introduce Pano (PACT imaging neural operator), an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions. Unlike existing approaches (e.g. universal back-projection algorithm), Pano learns both physics and data priors while also being agnostic to the input data resolution. Pano employs spherical discrete-continuous convolutions to preserve hemispherical sensor geometry, incorporates Helmholtz equation constraints to ensure physical consistency and operates resolution-independently across varying sensor configurations. We demonstrate the robustness and efficiency of Pano in reconstructing high-quality images from both simulated and real experimental data, achieving consistent performance even with significantly reduced transducer counts and limited-angle acquisition configurations. The framework maintains reconstruction fidelity across diverse sparse sampling patterns while enabling real-time volumetric imaging capabilities. This advancement establishes a practical pathway for making 3D PACT more accessible and feasible for both preclinical research and clinical applications, substantially reducing hardware requirements without compromising image reconstruction quality.

[LG-53] An Information-Theoretic Framework for Credit Risk Modeling: Unifying Industry Practice with Statistical Theory for Fair and Interpretable Scorecards

Link: https://arxiv.org/abs/2509.09855
Authors: Agus Sudjianto, Denis Burakov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Credit risk modeling relies extensively on Weight of Evidence (WoE) and Information Value (IV) for feature engineering, and Population Stability Index (PSI) for drift monitoring, yet their theoretical foundations remain disconnected. We establish a unified information-theoretic framework revealing these industry-standard metrics as instances of classical information divergences. Specifically, we prove that IV exactly equals PSI (Jeffreys divergence) computed between good and bad credit outcomes over identical bins. Through the delta method applied to WoE transformations, we derive standard errors for IV and PSI, enabling formal hypothesis testing and probabilistic fairness constraints for the first time. We formalize credit modeling’s inherent performance-fairness trade-off as maximizing IV for predictive power while minimizing IV for protected attributes. Using automated binning with depth-1 XGBoost stumps, we compare three encoding strategies: logistic regression with one-hot encoding, WoE transformation, and constrained XGBoost. All methods achieve comparable predictive performance (AUC 0.82-0.84), demonstrating that principled, information-theoretic binning outweighs encoding choice. Mixed-integer programming traces Pareto-efficient solutions along the performance-fairness frontier with uncertainty quantification. This framework bridges theory and practice, providing the first rigorous statistical foundation for widely-used credit risk metrics while offering principled tools for balancing accuracy and fairness in regulated environments.
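
The stated identity between IV and PSI can be checked numerically with the textbook formulas. The sketch below computes per-bin WoE and IV from good/bad counts, then computes PSI between the same two binned distributions; the bin counts are made up for illustration and the function names are hypothetical, not taken from the paper.

```python
import math

def woe_and_iv(good_counts, bad_counts):
    """Return per-bin Weight of Evidence and the total Information Value."""
    total_good = sum(good_counts)
    total_bad = sum(bad_counts)
    woe, iv = [], 0.0
    for g, b in zip(good_counts, bad_counts):
        p_good = g / total_good
        p_bad = b / total_bad
        w = math.log(p_good / p_bad)      # WoE of this bin
        woe.append(w)
        iv += (p_good - p_bad) * w        # IV contribution of this bin
    return woe, iv

def psi(expected, actual):
    """Population Stability Index between two binned distributions (proportions)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Illustrative counts per bin (assumption, not data from the paper).
good = [400, 300, 200, 100]
bad = [20, 30, 60, 90]

woe, iv = woe_and_iv(good, bad)
good_share = [g / sum(good) for g in good]
bad_share = [b / sum(bad) for b in bad]
psi_value = psi(bad_share, good_share)
print(iv, psi_value)
```

Both quantities are the same sum of `(p_good - p_bad) * log(p_good / p_bad)` over bins, which is the symmetric (Jeffreys) divergence, so the two printed values agree and swapping the arguments of `psi` does not change the result.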

[LG-54] Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Link: https://arxiv.org/abs/2509.09802
Authors: Tianqi Qiao, Marie Maros
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We propose and study Sparse Polyak, a variant of Polyak’s adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision, even when the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.
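
For reference, the classical Polyak rule that the paper starts from sets the step size to (f(x_t) - f*) / ||∇f(x_t)||^2, where f* is the optimal value. The Python sketch below implements only this standard rule on a small, well-conditioned least-squares problem. It does not reproduce the Sparse Polyak modification, whose details are not given in the abstract; the problem setup and function names are assumptions for illustration.

```python
import numpy as np

def polyak_gradient_descent(f, grad, x0, f_star, n_iters=500):
    """Gradient descent with the classical Polyak step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        g_norm_sq = float(g @ g)
        if g_norm_sq == 0.0:
            break  # stationary point reached
        step = (f(x) - f_star) / g_norm_sq
        x = x - step * g
    return x

# Consistent least-squares problem, so the optimal value f* is exactly 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

def f(x):
    return 0.5 * float(np.sum((A @ x - b) ** 2))

def grad(x):
    return A.T @ (A @ x - b)

x_hat = polyak_gradient_descent(f, grad, np.zeros(5), f_star=0.0)
print(np.linalg.norm(x_hat - x_true))
```

The rule needs no knowledge of the smoothness constant, only the optimal value, which is what makes it adaptive. The abstract's point is that in high dimensions the smoothness this rule implicitly tracks is the wrong one, which a low-dimensional example like this cannot show.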

[LG-55] DCHO: A Decomposition-Composition Framework for Predicting Higher-Order Brain Connectivity to Enhance Diverse Downstream Applications

Link: https://arxiv.org/abs/2509.09696
Authors: Weibin Li, Wendu Li, Quanying Liu
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Abstract:Higher-order brain connectivity (HOBC), which captures interactions among three or more brain regions, provides richer organizational information than traditional pairwise functional connectivity (FC). Recent studies have begun to infer latent HOBC from noninvasive imaging data, but they mainly focus on static analyses, limiting their applicability in dynamic prediction tasks. To address this gap, we propose DCHO, a unified approach for modeling and forecasting the temporal evolution of HOBC based on a Decomposition-Composition framework, which is applicable to both non-predictive tasks (state classification) and predictive tasks (brain dynamics forecasting). DCHO adopts a decomposition-composition strategy that reformulates the prediction task into two manageable subproblems: HOBC inference and latent trajectory prediction. In the inference stage, we propose a dual-view encoder to extract multiscale topological features and a latent combinatorial learner to capture high-level HOBC information. In the forecasting stage, we introduce a latent-space prediction loss to enhance the modeling of temporal trajectories. Extensive experiments on multiple neuroimaging datasets demonstrate that DCHO achieves superior performance in both non-predictive tasks (state classification) and predictive tasks (brain dynamics forecasting), significantly outperforming existing methods.

[LG-56] Machine-learning competition to grade EEG background patterns in newborns with hypoxic-ischaemic encephalopathy

Link: https://arxiv.org/abs/2509.09695
Authors: Fabio Magarelli, Geraldine B. Boylan, Saeed Montazeri, Feargal O’Sullivan, Dominic Lightbody, Minoo Ashoori, Tamara Skoric Ceranic, John M. O’Toole
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 29 pages, supplementary materials: "supplementary materials ML this http URL "

Abstract:Machine learning (ML) has the potential to support and improve expert performance in monitoring the brain function of at-risk newborns. Developing accurate and reliable ML models depends on access to high-quality, annotated data, a resource in short supply. ML competitions address this need by providing researchers access to expertly annotated datasets, fostering shared learning through direct model comparisons, and leveraging the benefits of crowdsourcing diverse expertise. We compiled a retrospective dataset containing 353 hours of EEG from 102 individual newborns from a multi-centre study. The data was fully anonymised and divided into training, testing, and held-out validation datasets. EEGs were graded for the severity of abnormal background patterns. Next, we created a web-based competition platform and hosted a machine learning competition to develop ML models for classifying the severity of EEG background patterns in newborns. After the competition closed, the top 4 performing models were evaluated offline on a separate held-out validation dataset. Although a feature-based model ranked first on the testing dataset, deep learning models generalised better on the validation sets. All methods had a significant decline in validation performance compared to the testing performance. This highlights the challenges for model generalisation on unseen data, emphasising the need for held-out validation datasets in ML studies with neonatal EEG. The study underscores the importance of training ML models on large and diverse datasets to ensure robust generalisation. The competition’s outcome demonstrates the potential for open-access data and collaborative ML development to foster a collaborative research environment and expedite the development of clinical decision-support tools for neonatal neuromonitoring.

Information Retrieval

[IR-0] MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables

Link: https://arxiv.org/abs/2509.10448
Authors: Kausik Hira, Mohd Zaki, Mausam, N. M. Anoop Krishnan
Categories: Information Retrieval (cs.IR); Materials Science (cond-mat.mtrl-sci)

Abstract:Scientific progress increasingly depends on synthesizing knowledge across vast literature, yet most experimental data remains trapped in semi-structured formats that resist systematic extraction and analysis. Here, we present MatSKRAFT, a computational framework that automatically extracts and integrates materials science knowledge from tabular data at unprecedented scale. Our approach transforms tables into graph-based representations processed by constraint-driven GNNs that encode scientific principles directly into model architecture. MatSKRAFT significantly outperforms state-of-the-art large language models, achieving F1 scores of 88.68 for property extraction and 71.35 for composition extraction, while processing data 19-496× faster than them (compared to the slowest and the fastest models, respectively) with modest hardware requirements. Applied to nearly 69,000 tables from more than 47,000 research publications, we construct a comprehensive database containing over 535,000 entries, including 104,000 compositions that expand coverage beyond major existing databases, pending manual validation. This systematic approach reveals previously overlooked materials with distinct property combinations and enables data-driven discovery of composition-property relationships forming the cornerstone of materials and scientific discovery.

[IR-1] RecoWorld: Building Simulated Environments for Agentic Recommender Systems

Link: https://arxiv.org/abs/2509.10397
Authors: Fei Liu, Xinyu Lin, Hanchao Yu, Mingyuan Wu, Jianyu Wang, Qiang Zhang, Zhuokai Zhao, Yinglong Xia, Yao Zhang, Weiwei Li, Mingze Gao, Qifan Wang, Lizhu Zhang, Benyu Zhang, Xiangjun Fan
Categories: Information Retrieval (cs.IR)

Abstract:We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. Such environments give agents a proper training space where they can learn from errors without impacting real users. RecoWorld distinguishes itself with a dual-view architecture: a simulated user and an agentic recommender engage in multi-turn interactions aimed at maximizing user retention. The user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop that actively engages users. This process leverages the exceptional reasoning capabilities of modern LLMs. We explore diverse content representations within the simulator, including text-based, multimodal, and semantic ID modeling, and discuss how multi-turn RL enables the recommender to refine its strategies through iterative interactions. RecoWorld also supports multi-agent simulations, allowing creators to simulate the responses of targeted user populations. It marks an important first step toward recommender systems where users and agents collaboratively shape personalized information streams. We envision new interaction paradigms where “user instructs, recommender responds,” jointly optimizing user retention and engagement.

[IR-2] A Research Vision for Web Search on Emerging Topics

Link: https://arxiv.org/abs/2509.10212
Authors: Alisa Rieger, Stefan Dietze, Ran Yu
Categories: Information Retrieval (cs.IR)

Abstract:We regularly encounter information on novel, emerging topics for which the body of knowledge is still evolving, which can be linked, for instance, to current events. A primary way to learn more about such topics is through web search. However, information on emerging topics is sparse and evolves dynamically as knowledge grows, making it uncertain and variable in quality and trustworthiness and prone to deliberate or accidental manipulation, misinformation, and bias. In this paper, we outline a research vision towards search systems and interfaces that support effective knowledge acquisition, awareness of the dynamic nature of topics, and responsible opinion formation among people searching the web for information on emerging topics. To realize this vision, we propose three overarching research questions, aimed at understanding the status quo, determining requirements of systems aligned with our vision, and building these systems. For each of the three questions, we highlight relevant literature, including pointers on how they could be addressed. Lastly, we discuss the challenges that will potentially arise in pursuing the proposed vision.

[IR-3] Demonstrating Narrative Pattern Discovery from Biomedical Literature

Link: https://arxiv.org/abs/2509.09687
Authors: Hermann Kroll, Pascal Sackhoff, Bill Matthias Thang, Christin Katharina Kreutz, Wolf-Tilo Balke
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: Accepted Demo at TPDL2025, 10 pages, 3 figures

Abstract:Digital libraries maintain extensive collections of knowledge and need to provide effective access paths for their users. For instance, PubPharm, the specialized information service for Pharmacy in Germany, provides and develops access paths to their underlying biomedical document collection. In brief, PubPharm supports traditional keyword-based search, search for chemical structures, as well as novel graph-based discovery workflows, e.g., listing or searching for interactions between different pharmaceutical entities. This paper introduces a new search functionality, called narrative pattern mining, allowing users to explore context-relevant entities and entity interactions. We performed interviews with five domain experts to verify the usefulness of our prototype.

[IR-4] Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs

Link: https://arxiv.org/abs/2509.09682
Authors: Maxim Zhelnin, Dmitry Redko, Volkov Daniil, Anna Volodkevich, Petr Sokerin, Valeriy Shevchenko, Egor Shvetsov, Alexey Vasilev, Darya Denisova, Ruslan Izmailov, Alexey Zaytsev
Categories: Information Retrieval (cs.IR)

Abstract:Sequential recommendations (SR) with transformer-based architectures are widely adopted in real-world applications, where SR models require frequent retraining to adapt to ever-changing user preferences. However, training transformer-based SR models often encounters a high computational cost associated with scoring extensive item catalogs, often exceeding thousands of items. This occurs mainly due to the use of cross-entropy loss, where peak memory scales proportionally to catalog size, batch size, and sequence length. Recognizing this, practitioners in the field of recommendation systems typically address memory consumption by integrating the cross-entropy (CE) loss with negative sampling, thereby reducing the explicit memory demands of the final layer. However, a small number of negative samples would degrade model performance, and as we demonstrate in our work, increasing the number of negative samples and the batch size further improves the model’s performance, but rapidly starts to exceed industrial GPUs’ size (~40Gb). In this work, we introduce the CCE- method, which offers a GPU-efficient implementation of the CE loss with negative sampling. Our method accelerates training by up to two times while reducing memory consumption by more than 10 times. Leveraging the memory savings afforded by using CCE- for model training, it becomes feasible to enhance its accuracy on datasets with a large item catalog compared to those trained with original PyTorch-implemented loss functions. Finally, we perform an analysis of key memory-related hyperparameters and highlight the necessity of a delicate balance among these factors. We demonstrate that scaling both the number of negative samples and batch size leads to better results rather than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method. 
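
To make the memory argument concrete, here is a small sketch of the baseline the abstract describes: cross-entropy computed over the positive item plus a handful of sampled negatives, alongside a rough estimate of the logits tensor size. This is not the CCE- method or its Triton kernel; the helper names `logits_bytes` and `sampled_ce_loss`, the uniform sampling scheme, and every size below are assumptions chosen purely for illustration.

```python
import math
import random


def logits_bytes(batch_size, seq_len, num_scored, bytes_per_value=4):
    """Bytes for one logits tensor of shape (batch_size, seq_len, num_scored)."""
    return batch_size * seq_len * num_scored * bytes_per_value


def sampled_ce_loss(user_vec, item_embs, positive_id, num_negatives, rng):
    """Cross-entropy over the positive item plus uniformly sampled negatives."""
    candidates = [i for i in range(len(item_embs)) if i != positive_id]
    negatives = rng.sample(candidates, num_negatives)
    scored = [positive_id] + negatives  # the positive sits at index 0
    logits = [sum(u * e for u, e in zip(user_vec, item_embs[i])) for i in scored]
    # Numerically stable log-sum-exp over the scored items only.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


# Hypothetical sizes: a one-million-item catalog versus 512 sampled negatives.
full = logits_bytes(batch_size=256, seq_len=200, num_scored=1_000_000)
sampled = logits_bytes(batch_size=256, seq_len=200, num_scored=1 + 512)
print(f"full catalog: {full / 1e9:.1f} GB, sampled: {sampled / 1e9:.3f} GB")

# Tiny toy catalog to exercise the loss function.
rng = random.Random(0)
item_embs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
user_vec = [rng.gauss(0.0, 1.0) for _ in range(8)]
loss = sampled_ce_loss(user_vec, item_embs, positive_id=3, num_negatives=10, rng=rng)
```

The estimate covers only the float32 logits and ignores gradients and activations, but it shows why the size of the scored set multiplies directly with batch size and sequence length.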

Attachment Download

Click to download today's full paper list