Arxiv今日论文 | 2025-10-13

本篇博文主要内容为 2025-10-13 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models, VLMs）在处理无限视频流时面临的计算复杂度高、内存占用大以及推理延迟高的问题。传统方法如全注意力机制会导致二次方级计算开销，而简单的滑动窗口策略则难以维持语义连贯性或引发冗余计算。其解决方案的关键在于提出StreamingVLM框架，通过训练与推理对齐的设计，在推理阶段利用注意力缓存（KV cache）重用机制，结合短期视觉token窗口和长期文本token窗口来实现高效稳定的流式理解；同时采用一种简化的监督微调（Supervised Fine-Tuning, SFT）策略，通过对短片段重叠视频块施加全注意力训练，模拟推理时的注意力模式，从而无需训练超长上下文即可获得优异性能。

链接: https://arxiv.org/abs/2510.09608
作者: Ruyi Xu,Guangxuan Xiao,Yukang Chen,Liuning He,Kelly Peng,Yao Lu,Song Han
机构: MIT (麻省理工学院); NVIDIA (英伟达); First Intelligence
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The first two authors contributed equally to this work

点击查看摘要

Abstract:Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at this https URL.
zh

[NLP-1] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在推理能力提升过程中对大规模标注推理数据集的依赖问题，此类数据集的构建通常耗时且资源密集。解决方案的关键在于提出一种名为“测试时提示缩放”（Prompting Test-Time Scaling, P-TTS）的推理时数据增强策略：通过仅使用90个手工精选的推理示例，结合系统性地调整提示强度（instruction prompting intensities）来合成多样化的推理路径上下文，从而在推理阶段动态生成丰富的训练样本；随后利用这些合成数据微调Qwen-2.5系列模型，显著提升了数学推理任务上的性能，同时实现了零样本泛化能力的增强，为资源受限或快速演进场景下的LLM推理能力释放提供了一种低成本、高效的可行路径。

链接: https://arxiv.org/abs/2510.09599
作者: Sondos Mahmoud Bsharat,Zhiqiang Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Our code and data are available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. Then we finetune the various sizes of Qwen-2.5 models on P-TTS data. Across a suite of mathematical reasoning AIME2024 25, MATH500, and GPQA-Diamond, our P-TTS-7B and 32B models outperform the prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME’24 (7B), and +13.34% and +6.67% on AIME’25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME’24, and +26.63% and +3.33% on AIME’25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on out-of-domain reasoning benchmarks of Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
zh

[NLP-2] LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

【速读】：该论文旨在解决当前用于评估大语言模型（Large Language Models, LLMs）编码能力的竞赛编程基准存在的三大问题：缺乏极具挑战性的题目、测试用例覆盖不足，以及对在线平台API的依赖导致可访问性受限。其解决方案的核心是提出LiveOIBench——一个包含403道由专家精心筛选的奥林匹克级别竞赛编程题目的基准，每题平均配有60个专家设计的测试用例，并直接来源于2023至2025年间72场官方信息学奥林匹克竞赛。该基准的关键创新在于：（1）高质量任务与细粒度子任务评分规则及私有测试集；（2）集成顶尖选手的实际表现数据以实现与人类顶级水平的直观对比；（3）持续、无污染地更新新发布的奥赛题目；（4）自包含的离线评估系统，确保实验可复现性。

链接: https://arxiv.org/abs/2510.09595
作者: Kaijian Zou,Aaron Xiong,Yunxiang Zhang,Frederick Zhang,Yueqi Ren,Jirong Yang,Ayoung Lee,Shitanshu Bhushan,Lu Wang
机构: University of Michigan, Ann Arbor(密歇根大学，安娜堡分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as lack of exceptionally challenging problems, insufficient test case coverage, reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestant performance, who usually place above 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.
zh

[NLP-3] Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

【速读】：该论文旨在解决实时口语语言模型（Real-time Spoken Language Models, SLMs）在采用链式思维（Chain-of-Thought, CoT）推理时因需串行生成完整思考过程而导致的高延迟问题。现有方法难以实现类似人类“边思考边说话”的实时交互能力。解决方案的关键在于提出了一种受大脑启发的“心速说话”（Mind-Paced Speaking, MPS）框架，其核心是双脑架构：一个“构想脑”（Formulation Brain）负责高层级推理以控制思考节奏，另一个“表达脑”（Articulation Brain）专注于流利语音生成。这种分工机制避免了模式切换，保障了推理完整性，同时显著降低延迟，在数学推理和语音对话任务中均实现了接近预计算CoT模型的准确性与实时性表现。

链接: https://arxiv.org/abs/2510.09592
作者: Donghang Wu,Haoyang Zhang,Jun Chen,Xiangyu(Tony)Zhang,Hexin Liu,Eng Siong Chng,Fei Tian,Xuerui Yang,Xiangyu Zhang,Daxin Jiang,Gang Yu
机构: StepFun; Nanyang Technological University (南洋理工大学); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a “Formulation Brain” for high-level reasoning to pace and guide a separate “Articulation Brain” for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.
zh

[NLP-4] Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

【速读】：该论文旨在解决当前AI代理在长时程、交互式任务（如网页导航和计算机/手机操作）中表现不佳的问题，尽管其在数学和编程等特定领域已展现出专家级能力。问题的核心在于缺乏对未来的心理模拟能力，即“替代性试错”（vicarious trial and error），这限制了代理在复杂环境中的规划与决策效率。解决方案的关键是提出一个两阶段训练框架——Dyna-Mind：第一阶段引入ReSim（Reasoning with Simulations），通过从真实环境交互中构建扩展搜索树来生成结构化推理轨迹，使代理的推理基于忠实的世界动态并具备预测未来状态的能力；第二阶段采用Dyna-GRPO，一种在线强化学习方法，利用结果奖励和中间状态反馈进一步优化代理的模拟与决策能力。该方案有效提升了代理在长周期、高规划强度任务中的性能。

链接: https://arxiv.org/abs/2510.09577
作者: Xiao Yu,Baolin Peng,Michel Galley,Hao Cheng,Qianhui Wu,Janardhan Kulkarni,Suman Nath,Zhou Yu,Jianfeng Gao
机构: Columbia University (哥伦比亚大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need ‘‘vicarious trial and error’’ - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent’s reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent’s simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in the ever more challenging environments.
zh

[NLP-5] AutoPR: Lets Automate Your Academic Promotion!

【速读】：该论文旨在解决科研成果在社交媒体平台上的推广效率低下问题，即学者需投入大量人力进行内容创作与传播以提升论文可见度和引用率。为应对这一挑战，作者提出“自动推广”（Automatic Promotion, AutoPR）任务，其核心解决方案是构建一个名为PRAgent的多智能体框架，该框架通过三个阶段实现自动化：首先进行多模态内容提取与准备，其次通过协作式合成生成高质量文本，最后针对不同平台特性优化格式、语气与标签以最大化传播效果。实验表明，相较于直接使用大语言模型（LLM）的流水线方法，PRAgent在观看时长、点赞数及整体互动性上显著提升，验证了平台建模与精准投放策略的关键作用。

链接: https://arxiv.org/abs/2510.09558
作者: Qiguang Chen,Zheng Yan,Mingda Yang,Libo Qin,Yixin Yuan,Hanjing Li,Jinhao Liu,Yiyan Ji,Dengyun Peng,Jiannan Guan,Mengkang Hu,Yantao Du,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学); The University of Hong Kong (香港大学); ByteDance China (字节跳动中国)
类目: Computation and Language (cs.CL)
备注: Preprint. Code: this https URL . Benchmark: this https URL

点击查看摘要

Abstract:As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.
zh

[NLP-6] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives

【速读】：该论文试图解决的问题是：语言模型（Language Models, LMs）能否通过话语连接词（discourse connectives）推断出关于新实体的世界知识，即连接词是否能作为语言线索帮助模型获得对现实世界的推理能力。传统研究关注的是利用世界知识预测连接词，而本文则反向考察连接词是否能够引导模型进行有效的世界知识推理。解决方案的关键在于构建了一个名为 WUGNECTIVES 的新数据集，包含 8,880 个测试用例，其中连接词将新实体与其特定属性关联，从而评估模型在不同连接词类型下的推理表现。实验表明，对模型进行针对性微调可显著提升其在多数连接词上的推理能力，但所有模型在表示让步意义（concessive）的连接词上仍存在系统性困难，这揭示了连接词在驱动模型世界知识推理中的功能差异，为深入理解语言线索与认知推理之间的关系提供了新路径。

链接: https://arxiv.org/abs/2510.09556
作者: Daniel Brubaker,William Sheffield,Junyi Jessy Li,Kanishka Misra
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages total, 9 pages main; 7 figures total, 4 figures main; 8 tables total, 4 tables main

点击查看摘要

Abstract:The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at this https URL.
zh

[NLP-7] A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning : Performance Consistency and Faithfulness Across Languages

【速读】：该论文旨在解决多语言链式思维（Chain-of-Thought, CoT）推理中三个关键维度——性能（performance）、一致性（consistency）和忠实性（faithfulness）的系统性评估问题，尤其是针对生成式 AI 在非英语语言中的中间推理步骤（thinking traces）缺乏深入研究的现状。解决方案的关键在于：首先通过显式指令或提示劫持（prompt-hacking）引导模型在目标语言中进行推理，量化其语言合规性、答案准确率与一致性；其次，通过跨语言交换思考轨迹（interchanging thinking traces）来评估跨语言推理一致性；最后，采用扰动技术（如截断和错误注入）探测不同语言下模型对推理路径的依赖程度，从而揭示模型在多语言场景中推理过程的可信度差异。

链接: https://arxiv.org/abs/2510.09555
作者: Raoyuan Zhao,Yihong Liu,Hinrich Schütze,Michael A. Hedderich
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques – i.e., truncation and error injection – to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
zh

[NLP-8] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval NLPCC2025

【速读】：该论文旨在解决跨语言医学视频语料库中复杂多跳问题的精准检索难题，即如何从多语言医学视频档案中高效获取与查询相关的细粒度视频片段。现有方法要么将时长数小时的视频压缩为粗粒度嵌入，丧失细节信息，要么采用高成本的细粒度匹配策略，难以规模化应用。其解决方案的关键在于提出一种多阶段框架：首先将视频字幕切分为语义连贯的块，并通过知识图谱（Knowledge Graph, KG）增强关键术语；随后构建层次化树结构索引，利用语言无关的多语言编码器生成节点嵌入；在查询时，使用相同编码器对问题进行嵌入，通过粗到精的树搜索快速剪枝无关分支，仅对Top-K候选块由轻量级大语言模型（Large Language Model, LLM）进行重排序，从而避免全量交叉编码器评分的同时保持块级别精度。该设计实现了高准确率与计算效率的平衡，适用于专业化、多语言的医学视频检索场景。

链接: https://arxiv.org/abs/2510.09553
作者: Yu Wang,Tianhao Tan,Yifei Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to NLPCC 2025 (Springer), to appear November 2025

点击查看摘要

Abstract:Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
zh

[NLP-9] Beyond Surface Reasoning : Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models

【速读】：该论文旨在解决扩散大语言模型（Diffusion Large Language Models, DLLMs）在推理过程中存在的“并行-序列矛盾”（Parallel-Sequential Contradiction, PSC）问题，即并行解码机制与严格因果顺序推理需求之间的冲突。研究表明，DLLMs仅在直接可判定的输出中表现出真正的并行性，而随着任务复杂度提升，其行为趋近于自回归模型（autoregressive LLMs, ALLMs），且自回归提示方式会显著增加解码步数而不提升质量。PSC还限制了模型的自我反思、推理深度和探索广度。解决方案的关键在于识别并量化PSC的三个尺度维度（并行、扩散、序列），并通过三种策略缓解：面向并行的提示设计（parallel-oriented prompting）、扩散早期停止（diffusion early stopping）以及并行扩展（parallel scaling），从而减少因PSC导致的无效性和低效性。

链接: https://arxiv.org/abs/2510.09544
作者: Qiguang Chen,Hanjing Li,Libo Qin,Dengyun Peng,Jinhao Liu,Jiangyi Wang,Chengyue Wu,Xie Chen,Yantao Du,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学); The University of Hong Kong (香港大学); Shanghai Jiao Tong University (上海交通大学); ByteDance Seed (字节跳动种子项目)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs’ self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations, parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiencies.
zh

[NLP-10] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

【速读】：该论文旨在解决扩散大语言模型（Diffusion Large Language Models, dLLMs）在通过强化学习（Reinforcement Learning, RL）对齐人类偏好或特定任务奖励时面临的挑战，即由于其不可计算的对数似然（log-likelihood）导致标准策略梯度方法无法直接应用的问题。现有方法如证据下界（Evidence Lower Bound, ELBO）虽被用作替代方案，但因其单侧近似可能引入显著的策略梯度偏差。为此，作者提出了一种“夹层策略梯度”（Sandwiched Policy Gradient, SPG）方法，其关键在于同时利用真实对数似然的上下界来构造更精确的梯度估计，从而有效降低偏差并提升训练稳定性与性能。实验表明，SPG在多个基准任务上显著优于基于ELBO或单步估计的基线方法。

链接: https://arxiv.org/abs/2510.09541
作者: Chengyu Wang,Paria Rashidinejad,DiJia Su,Song Jiang,Sid Wang,Siyan Zhao,Cai Zhou,Shannon Zejiang Shen,Feiyu Chen,Tommi Jaakkola,Yuandong Tian,Bo Liu
机构: MIT (麻省理工学院); Meta (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
zh

[NLP-11] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多语言实际应用中面对用户输入中的拼写错误（typo）时鲁棒性不足的问题。当前多数基准测试假设输入为干净文本，忽略了真实场景下自然出现的打字错误，导致对LLMs在多语言环境下抗噪能力的评估严重不足。解决方案的关键在于提出MulTypo——一种基于语言特定键盘布局和打字行为模拟人类拼写错误的多语言错别字生成算法，并在此基础上系统评估了18个开源LLM在五类下游任务中的表现，揭示了拼写错误对不同任务类型、模型架构及语言资源水平的差异化影响，从而强调了噪声感知训练与多语言鲁棒性评测的重要性。

链接: https://arxiv.org/abs/2510.09536
作者: Yihong Liu,Raoyuan Zhao,Lena Altinger,Hinrich Schütze,Michael A. Hedderich
机构: Center for Information and Language Processing, LMU Munich (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs – naturally introducing typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation tasks. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning – while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
zh

[NLP-12] Mitigating Overthinking through Reasoning Shaping

【速读】：该论文旨在解决大推理模型（Large Reasoning Models, LRM）在强化学习从验证器奖励（Reinforcement Learning from Verifier Reward, RLVR）训练过程中出现的“过度思考”（overthinking）问题，即推理过程冗长、低效且消耗大量计算资源，而现有基于token级惩罚的方法虽能减少Token消耗，却常损害模型性能。解决方案的关键在于提出一种分组相对段落惩罚机制（Group Relative Segment Penalization, GRSP），这是一种基于步骤（step-level）的正则化方法，其核心创新在于：通过分析发现推理段落（reasoning segments）与Token消耗及模型性能强相关，并设计了一个长度感知的跨段落聚类加权机制，从而在不显著牺牲准确性的前提下实现更优的Token效率，尤其在复杂问题上优势明显，同时提升了RL训练的稳定性并具备良好的模型规模扩展性。

链接: https://arxiv.org/abs/2510.09535
作者: Feifan Song,Shaohang Wei,Bofei Gao,Yejie Wang,Wen Luo,Wei Li,Linli Yao,Weimin Xiong,Liang Chen,Tianyu Liu,Houfeng Wang
机构: Peking University (北京大学); Moonshot AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often cause overthinking: excessive, meandering reasoning that inflates computational cost. Prior designs of penalization in RLVR manage to reduce token consumption while often harming model performance, which arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, especially the advantages with harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
zh

[NLP-13] Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking ICASSP2026

【速读】：该论文旨在解决预训练的基于Transformer的自动语音识别（ASR）模型在面对口音和方言差异时敏感性高、导致词错误率（WER）上升的问题，尤其是在英语和波斯语等语言中。其解决方案的关键在于提出一种口音不变的ASR框架，通过引入口音与方言分类模块，训练一个基于频谱图（spectrogram）的分类器以捕捉口音特异性线索，随后对分类器预测影响最大的频谱区域进行掩码处理，并利用这些掩码后的频谱图进行数据增强，从而提升ASR模型对口音变异的鲁棒性。实验表明，该方法在英语和波斯语场景下均显著降低了WER，验证了其有效性。

链接: https://arxiv.org/abs/2510.09528
作者: Mohammad Hossein Sameti,Sepehr Harfi Moridani,Ali Zarean,Hossein Sameti
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: this https URL
zh

[NLP-14] Can We Reliably Rank Model Performance across Domains without Labeled Data?

【速读】：该论文旨在解决无标签情况下对自然语言处理（Natural Language Processing, NLP）模型性能估计的可靠性问题，特别是如何在跨域场景下获得可信的性能排名。其解决方案的关键在于利用大型语言模型（Large Language Models, LLMs）作为误差预测器，通过两阶段评估框架验证其相较于基于数据分布漂移或零样本基线方法在预测真实准确率排名上的更强一致性与稳定性。实验表明，LLM-based误差预测器能更可靠地捕捉模型在不同领域中的失败模式，且当各领域间性能差异较大时，排名可靠性显著提升。

链接: https://arxiv.org/abs/2510.09519
作者: Veronica Rammouz,Aaron Gonzalez,Carlos Cruzportillo,Adrian Tan,Nicole Beebe,Anthony Rios
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校); Illinois Institute of Technology (伊利诺伊理工学院)
类目: Computation and Language (cs.CL)
备注: 8 pages + references and Appendix

点击查看摘要

Abstract:Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model’s predictions align with the base model’s true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
zh

[NLP-15] StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

【速读】：该论文旨在解决当前大型语言模型（Large Language Models, LLMs）在统计学领域推理能力评估不足的问题，即现有基准测试多聚焦于数学与逻辑推理，而忽视了统计学作为一门兼具理论深度与应用广度的学科的独特性。为填补这一空白，作者提出首个专注于统计学的综合性基准测试工具StatEval，其关键在于构建了一个可扩展的多智能体（multi-agent）流水线系统，结合人工验证（human-in-the-loop validation），实现大规模问题抽取、重写与质量控制，并设计了针对计算型任务与证明型任务的精细化评估框架，从而确保数据的学术严谨性和评测的细粒度合理性。

链接: https://arxiv.org/abs/2510.09517
作者: Yuchen Lu,Run Yang,Yichen Zhang,Shuguang Yu,Runpeng Dai,Ziwei Wang,Jiayi Xiang,Wenxin E,Siran Gao,Xinyao Ruan,Yirui Huang,Chenjing Xi,Haibo Hu,Yueming Fu,Qinglan Yu,Xiaobing Wei,Jiani Gu,Rui Sun,Jiaxuan Jia,Fan Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \textbfStatEval, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control, while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that while closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: this https URL.
zh

[NLP-16] Multimodal Policy Internalization for Conversational Agents

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在构建对话代理时，因依赖预定义策略（policy）进行行为控制而面临的复杂性高、推理成本大以及多模态策略难以内化的挑战。当前系统通常通过上下文提示（in-context prompts）实现策略约束，但随着任务多样化，此类策略日益冗长且难以准确遵循，同时带来固定计算开销；尤其在视觉与多模态场景下，相关策略研究仍属空白。解决方案的关键在于提出多模态策略内化（Multimodal Policy Internalization, MPI）——将复杂的、推理密集型的多模态策略知识通过三阶段训练框架TriMPI注入模型参数中，从而在推理阶段无需携带原始策略即可实现强策略遵循能力。TriMPI包括：持续预训练注入策略知识、监督微调优化响应一致性、PolicyRollout（基于GRPO的强化学习扩展）增强探索过程中的策略感知反馈，显著提升了端到端准确性、泛化能力和遗忘鲁棒性。

链接: https://arxiv.org/abs/2510.09474
作者: Zhenhailong Wang,Jiateng Liu,Amin Fazel,Ritesh Sarkhel,Xing Fan,Xiang Li,Chenlei Guo,Heng Ji,Ruhi Sarikaya
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: this https URL.
zh

[NLP-17] Hybrid Models for Natural Language Reasoning : The Case of Syllogistic Logic

【速读】：该论文旨在解决当前预训练大语言模型（Large Language Models, LLMs）在逻辑推理中普遍存在的泛化能力不足问题，特别是区分并应对两个关键维度：组合性（compositionality），即抽象复杂推理背后原子逻辑规则的能力；以及递归性（recursiveness），即通过迭代应用推理规则构建复杂表示的能力。研究表明，LLMs在递归性上表现良好，但在组合性上存在显著短板。为克服这一局限，论文提出一种融合符号推理与神经计算的混合架构（hybrid architecture），其核心在于通过符号推理保障推理的完备性，同时利用神经组件提升处理效率，从而实现高效且可靠的逻辑推理系统。实验表明，即使神经模块规模较小，该方法仍能保持高效率，验证了混合模型在突破神经推理系统泛化瓶颈方面的潜力。

链接: https://arxiv.org/abs/2510.09472
作者: Manuel Vargas Guzmán,Jakub Szymanik,Maciej Malicki
机构: University of Warsaw (华沙大学); University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often confounded together under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference, neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. As part of our proposed methodology, this analysis gives a rationale and highlights the potential of hybrid models to effectively address key generalization barriers in neural reasoning systems.
zh

[NLP-18] Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）训练数据可访问性不足的问题，尤其是针对开源LLM训练数据规模庞大、难以被科学界有效利用的现状。其核心挑战在于如何实现对海量文本数据（如15.2T tokens）的高效索引与检索，从而提升模型安全性并促进开放研究。解决方案的关键在于构建一个基于Elasticsearch的全文索引管道，并部署在高性能、低功耗的arm64架构超算集群（Alps基础设施）上，成功实现了对8.6T tokens的并行索引，不仅验证了Elasticsearch在下一代ARM架构上的可行性，还首次展示了大规模开放网络文本数据的可访问性与可用性，为实现无需依赖特定攻击方式的LLM安全检测提供了新工具。

链接: https://arxiv.org/abs/2510.09471
作者: Ines Altemir Marinas,Anastasiia Kucherenko,Alexander Sternfeld,Andrei Kucharavy
机构: EPFL (瑞士联邦理工学院洛桑分校); HES-SO Valais-Wallis (瑞士高等教育学院瓦莱-瓦利斯分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2510.09471 [cs.CL] (or arXiv:2510.09471v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.09471 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-19] Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives

【速读】：该论文旨在解决真实世界交通事故描述文本（free-text crash narratives）中难以进行大规模、高精度结构化提取的问题，尤其针对碰撞方式（Manner of Collision）和车辆级别事故类型（Crash Type）的识别任务。传统方法受限于非结构化文本的多样性与缺乏标准化处理工具，而大型闭源语言模型（LLMs）虽具备一定能力，但在推理密集型任务中表现下降，且存在数据隐私风险及领域知识不足的问题。解决方案的关键在于：采用轻量级开源预训练语言模型（PLMs），结合低秩适应（Low-Rank Adaption, LoRA）与BERT微调技术，注入特定领域知识以缩小领域差距，从而在仅需少量训练资源的情况下显著提升模型性能，并优于如GPT-4o等闭源大模型。

链接: https://arxiv.org/abs/2510.09434
作者: Xixi Wang,Jordanka Kovaceva,Miguel Costa,Shuai Wang,Francisco Camara Pereira,Robert Thomson
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement as there are no documented tools that can batch process the unstructured, non standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks in, for example, Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data. Additionally, these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) Crash Type for each vehicle involved in the crash event from real-world crash narratives. To bridge domain gaps, we apply fine-tuning techniques to inject task-specific knowledge to LLMs with Low-Rank Adaption (LoRA) and BERT. Experiments on the authoritative real-world dataset Crash Investigation Sampling System (CISS) demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.
zh

[NLP-20] KORMo: Korean Open Reasoning Model for Everyone

【速读】：该论文旨在解决在低资源语言（以韩语为例）中构建完全开源的双语大语言模型（Bilingual Large Language Model, BLM）所面临的训练数据稀缺问题。传统方法依赖于有限的真实语料，难以支撑高质量模型训练，尤其在非英语语境下表现受限。解决方案的关键在于：通过大规模合成数据（synthetic data）驱动从头训练（trained from scratch），并辅以精心设计的多语言指令微调（instruction tuning），从而实现稳定且高性能的预训练与推理能力。研究发现，当合成数据具备平衡的语言覆盖度和多样化的指令风格时，不会引发模型崩溃或性能退化，反而使模型在推理、知识掌握和指令遵循等任务上达到与主流开源多语言基线相当的水平，同时显著提升韩语下的近母语级推理与话语连贯性。

链接: https://arxiv.org/abs/2510.09426
作者: Minjun Kim,Hyeonseok Lim,Hangyeol Yoo,Inho Won,Seungwoo Song,Minkyung Cho,Junhun Yuk,Changsu Choi,Dongjae Shin,Huige Lee,Hoyun Song,Alice Oh,Kyungtae Lim
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
zh

[NLP-21] he Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

【速读】：该论文旨在解决端到端语音对话状态追踪（Spoken Dialog State Tracking）中上下文管理策略的有效性问题，即如何更高效地利用多模态对话历史信息以提升模型性能。其解决方案的关键在于系统性地比较三种上下文处理方式：传统多模态上下文（结合文本历史与当前语音轮次）、完整语音历史输入以及压缩后的语音历史输入；实验表明，使用完整的语音对话历史作为输入可显著提升模型性能，而基于注意力池化（attention-pooling）的压缩方法则在保持较高准确率的同时大幅减少上下文长度，实现性能与效率的平衡。

链接: https://arxiv.org/abs/2510.09424
作者: Nizar El Ghazal,Antoine Caubrière,Valentin Vielzeuf
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
zh

[NLP-22] On the Representations of Entities in Auto-regressive Large Language Models EMNLP2025

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）内部如何表征和操作命名实体（Named Entities）这一关键问题，特别是针对多标记实体（multi-token entities）的编码机制及其是否蕴含关系知识尚不明确的问题。其解决方案的关键在于提出了一种名为“实体提及重建”（entity mention reconstruction）的新框架，并引入基于任务向量（task vectors）的方法，从LLM隐藏状态中一致地生成多标记实体提及；同时提出了“实体透镜”（Entity Lens），扩展了原有的“logit-lens”方法以预测多标记提及，从而揭示LLMs具备针对未见实体的特定表征与操作机制。

链接: https://arxiv.org/abs/2510.09421
作者: Victor Morand,Josiane Mothe,Benjamin Piwowarski
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at BlackBoxNLP@EMNLP2025

点击查看摘要

Abstract:Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging task vectors, allows to consistently generate multi-token mentions from various entity representations derived from the LLMs hidden states. We thus introduce the Entity Lens, extending the logit-lens to predict multi-token mentions. Our results bring new evidence that LLMs develop entity-specific mechanisms to represent and manipulate any multi-token entities, including those unseen during training. Our code is avalable at this https URL .
zh

[NLP-23] Active Model Selection for Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在实际应用中因缺乏高效、低成本的模型选择机制而导致的资源浪费问题。传统评估方法依赖于全标注数据集，成本高昂且不适用于大规模模型筛选场景。其解决方案的关键在于提出首个主动模型选择框架——LLM SELECTOR，该框架通过自适应选择最具信息量的小规模查询样本进行标注，并结合基于评判者（judge-based）的注释模型，显著降低人工标注成本。实验表明，在6个基准测试和151个LLM上的验证显示，该方法可将标注成本减少高达59.62%，同时准确识别出最优及次优模型。

链接: https://arxiv.org/abs/2510.09418
作者: Yavuz Durmazkeser,Patrik Okanovic,Andreas Kirsch,Torsten Hoefler,Nezihe Merve Gürel
机构: TU Delft (代尔夫特理工大学); ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
zh

[NLP-24] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph

【速读】：该论文旨在解决当前图提示调优（graph prompt-tuning）方法局限于单一粒度（如节点级或子图级）进行提示生成的问题，从而忽视了图数据中固有的多尺度结构信息，限制了提示语义的多样性。解决方案的关键在于提出一种多尺度图思维链（Multi-Scale Graph Chain-of-Thought, MSGCOT）框架，其核心创新是设计了一个轻量级、低秩的粗化网络（coarsening network），以高效提取多层次结构特征作为提示生成的分层基础向量，并模仿人类从粗到细的认知过程，在每个推理步骤中动态整合多尺度信息，形成渐进式的粗到细提示链，从而显著提升模型在少样本场景下的性能表现。

链接: https://arxiv.org/abs/2510.09394
作者: Ziyu Zheng,Yaming Yang,Ziyu Guan,Wei Zhao,Xinyan Huang,Weigang Lu
机构: Xidian University (西安电子科技大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:The "pre-train, prompt’’ paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to single-granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompt and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse-to-fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms the state-of-the-art single-granularity graph prompt-tuning method, particularly in few-shot scenarios, showcasing superior performance.
zh

[NLP-25] Identifying Interactively Refining Ambiguous User Goals for Data Visualization Code Generation

【速读】：该论文旨在解决自然语言指令中存在歧义时，生成式AI在数据可视化任务中产出代码与用户真实意图不一致的问题。其关键解决方案在于构建一个针对数据可视化场景的歧义类型分类体系（taxonomy of ambiguity），并提出可量化的歧义度量指标（ambiguity metrics），以更准确地捕捉和评估自然语言描述中的模糊性。通过在DS-1000数据集上的实证分析，该方法相较于传统不确定性基线更能匹配人类标注结果；同时，论文进一步探索多轮对话机制如何借助语用学理论（如Gricean Cooperativity、Discourse Representation Theory和Questions under Discussion）减少歧义，从而提升代码生成的准确性，凸显了交互式对话在对齐用户目标中的核心价值。

链接: https://arxiv.org/abs/2510.09390
作者: Mert İnan,Anthony Sicilia,Alex Xie,Saujas Vaduguru,Daniel Fried,Malihe Alikhani
机构: Northeastern University (东北大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker’s intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views on the contextual (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity, therefore, improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.
zh

[NLP-26] HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）在提升大语言模型（Large Language Models, LLMs）长链式思维（chain-of-thought, CoT）推理能力时，因任务难度超出模型能力而导致的奖励稀疏与训练效率低下问题。现有方法如GRPO在高难度任务中表现不佳，且依赖外部引导（如SFT混合或提示）时常因分布不匹配引发策略更新误导。论文识别出这一现象的根本原因在于“低训练亲和性”（low training affinity），即外部指导信号与模型策略之间的分布差异过大。解决方案的关键在于提出HINT框架——一种自适应提示机制，通过提供启发式提示（heuristic hints）而非直接答案，引导模型自主探索有效解法，从而提升训练亲和性、增强探索效率与训练稳定性，并在数学推理任务上实现更优且稳定的性能表现。

链接: https://arxiv.org/abs/2510.09388
作者: Xinyi Wang,Jinyi Han,Zishang Jiang,Tingyun Li,Jiaqing Liang,Sihang Jiang,Zhaoqian Dai,Shuguang Ma,Fei Yu,Yanghua Xiao
机构: Fudan University (复旦大学); East China Normal University (华东师范大学); Antgroup (蚂蚁集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model’s capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model’s policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data this http URL is available on Github.
zh

[NLP-27] oken-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

【速读】：该论文旨在解决当前基于组相对策略优化（Group Relative Policy Optimization, GRPO）及其相关熵正则化方法在链式思维（Chain-of-Thought, CoT）推理中因稀疏的token级奖励而导致的熵崩溃（entropy collapse）或模型崩溃（model collapse）问题。现有方法通常采用统一的token级熵调整策略，难以有效利用群体奖励信号来指导token级别的学习。解决方案的关键在于提出TEPO（Token-level Entropy Regularization with Markov Likelihood），其核心创新是通过马尔可夫似然（Markov Likelihood）机制，将序列级别的奖励信息通过token级聚合映射到每个token上，从而实现更稳定且高效的策略更新，显著提升数学推理性能与训练稳定性。

链接: https://arxiv.org/abs/2510.09369
作者: Xingyu Lin,Yilin Wen,En Wang,Du Su,Wenbin Liu,Chenfu Bao,Zhonghou Lv
机构: Baidu Inc; College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of MOE, Jilin University; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) links group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
zh

[NLP-28] Understanding the Effects of Domain Finetuning on LLM s

【速读】：该论文旨在解决领域特定微调（domain-specific fine-tuning）如何重塑大型语言模型（Large Language Models, LLMs）参数空间的机制不明确问题，尤其针对医疗领域等专业场景下的微调行为缺乏系统研究。其关键解决方案是提出“调优向量”（tuning vectors），这一框架受任务向量（task vectors）启发，能够显式捕捉微调过程中参数方向上的变化，揭示出微调主要在多层感知机（MLP）层中引入新的方向信息，并在注意力头中增强已有方向，从而实现对模型特化过程的可解释性分析与优化。

链接: https://arxiv.org/abs/2510.09359
作者: Eshaan Tanwar,Deepak Nathani,William Yang Wang,Tanmoy Chakraborty
机构: University of Copenhagen (哥本哈根大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Indian Institute of Technology, Delhi (德里印度理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) fine-tuned for specific domains exhibit strong performance; however, the underlying mechanisms by which this fine-tuning reshapes their parametric space are not well understood. Prior works primarily focus on auto-regressive or general-purpose instruct models, leaving domain-specialised LLMs under-explored. We present the first systematic study of domain-specific fine-tuning in large medical language models. Our analysis reveals that fine-tuning modifies only a small subset of the representational subspace, essentially preserving the pre-trained model’s representation. To interpret these changes in subspaces, we propose tuning vectors, a novel framework inspired by task vectors, which explicitly capture the directional parameter shifts induced by fine-tuning. We demonstrate that these vectors are critical for enhancing both instruction-following and generation quality. Furthermore, combining tuning vectors across different domains yields improved generalisation. Upon closer inspection of directional alignment, we find these vectors primarily write new directional information into the MLP layers of the model, while amplifying existing directions in attention heads. Our findings offer new insights into LLM adaptation and provide a general, interpretable framework for analysing specialisation in large language models.
zh

[NLP-29] NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models

【速读】：该论文旨在解决符号主义认知架构SOAR在实际应用中因手动规则编码费时费力而受限的问题，同时填补当前研究多停留在概念框架层面、缺乏实证验证的空白。其核心解决方案是提出NL2GenSym框架，通过将大语言模型（LLM）与SOAR系统深度融合，实现从自然语言到生成式符号规则的自动化转换；该框架的关键创新在于引入“执行驱动的生成-批评”机制：首先由基于检索增强生成（RAG）的LLM生成器根据自演化领域知识库提出候选规则，随后在SOAR环境中立即执行以严格验证规则正确性，再由反思式LLM批评者依据执行反馈迭代优化规则，从而形成闭环改进流程。实验表明，该方法在水壶问题（WJP）数据集上成功率达86%以上，并能生成启发式规则显著提升求解效率。

链接: https://arxiv.org/abs/2510.09355
作者: Fang Yuan,Junjie Zeng,Yue Hu,Zhengqiu Zhu,Quanjun Yin,Yuxiang Xie
机构: National University of Defense Technology (国防科技大学); State Key Laboratory of Digital Intelligent Modeling and Simulation (数字智能建模与仿真国家重点实验室); Test Center (测试中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:SOAR, a classic symbol-based cognitive architecture, has been fostering the development of general, human-like intelligent agents. Nevertheless, its practical adoption is hindered by the laborious manual rule coding. Emerging Large Language Models (LLMs) present the immense potential for efficient rules generation. However, there is a critical gap that current research predominantly focuses on conceptual frameworks and lacks robust experimental validation. To bridge this gap, we propose \textitNatural \textitLanguage to \textitGenerative \textitSymbolic Rules (NL2GenSym), a novel framework that integrates LLMs with SOAR to autonomously produce generative symbolic rules from natural language. Specifically, our framework introduces a novel Execution-Grounded Generator-Critic mechanism. The LLM-based Generator, guided by a Retrieval-Augmented Generation-accessed self-evolving domain knowledge base, proposes rules from natural language. Subsequently, these rules are immediately executed within the SOAR environment to rigorously validate their correctness. Based on this execution-grounded feedback, a reflective LLM-based Critic drives the iterative refinement of these rules. Experiments on our specialized Water Jug Problem (WJP) dataset, utilizing both Gemini and Qwen series models, validate the efficacy of our framework. It achieves a success rate over 86% in generating rules from natural language. Crucially, the framework also generates novel heuristic rules, reducing average decision cycles for solving the WJP to 1.98 times the optimal solution and 1/1000 of baseline methods. Additionally, our initial experiments show that NL2GenSym enables smaller-parameter models to achieve better performance than larger counterparts.
zh

[NLP-30] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

【速读】：该论文旨在解决大模型在推理任务中缺乏长链思维（chain-of-thought reasoning）能力的问题，尤其是如何在不进行额外训练的情况下激发模型的回溯（backtracking）与自我修正（self-correction）等复杂推理行为。其解决方案的关键在于提出一种解码时（decoding-time）的方法——ThinkLogit，该方法通过逻辑值（logit）运算，利用一个较小的推理模型作为引导器（guider），动态调整目标大模型的输出分布，从而实现无需训练即可增强长链推理能力；进一步地，通过偏好优化（preference optimization）对引导器进行训练，使用来自目标模型和引导器自身的正确/错误推理样本对，形成ThinkLogit-DPO，显著提升性能，且该方法具有跨模型家族的通用性及与小模型后训练方法的正交性，为无需昂贵后训练即可解锁大规模模型的长推理能力提供了实用路径。

链接: https://arxiv.org/abs/2510.09354
作者: Yunxiang Zhang,Muhammad Khalifa,Lechen Zhang,Xin Liu,Ayoung Lee,Xinliang Frederick Zhang,Farima Fatahi Bayat,Lu Wang
机构: University of Michigan, Ann Arbor (密歇根大学，安娜堡分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively, over five reasoning benchmarks using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
zh

[NLP-31] ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

【速读】：该论文旨在解决当前对小型语言模型（Small Language Models, SLMs）在常识推理任务中性能评估过于依赖最终答案准确性，而忽视推理过程有效性的局限性问题。其解决方案的关键在于提出ReTraceQA这一新型基准，通过专家标注的推理过程数据实现对模型推理路径的细粒度评估，从而揭示出部分SLMs虽能给出正确答案但推理逻辑存在缺陷的现象（占比14-24%），并证明使用强大型语言模型（Large Language Models, LLMs）作为自动化评判者进行基于推理过程的评估时，SLMs的性能显著下降（最高达25%），表明传统评价指标可能高估了SLMs的实际能力。

链接: https://arxiv.org/abs/2510.09351
作者: Francesco Maria Molfese,Luca Moroni,Ciro Porcaro,Simone Conia,Roberto Navigli
机构: Sapienza University of Rome (罗马大学)
类目: Computation and Language (cs.CL)
备注: Submitted to ARR October 2025

点击查看摘要

Abstract:While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
zh

[NLP-32] LLP: LLM -based Product Pricing in E-commerce

【速读】：该论文旨在解决二手商品定价中因市场动态变化（如二手iPhone价格随时间下降）导致的静态回归模型泛化能力差的问题。解决方案的关键在于提出首个基于大语言模型（Large Language Models, LLMs）的生成式定价框架LLP，其核心包括：通过检索相似商品增强对市场动态的理解；利用LLMs对自由文本中关键定价信息的细粒度理解生成精准价格建议；采用双向推理构建数据集，并通过监督微调（Supervised Fine-Tuning, SFT）与组相对策略优化（Group Relative Policy Optimization, GRPO）两阶段优化提升领域推理能力；最后引入置信度过滤机制以剔除不可靠预测。实验证明，该方法在未见品类上仍具强泛化性，并已在闲鱼平台部署，显著优于原有方法。

链接: https://arxiv.org/abs/2510.09347
作者: Hairu Wang,Sheng You,Qiheng Zhang,Xike Xie,Shuguang Han,Yuchen Wu,Fei Huang,Jufeng Chen
机构: University of Science and Technology of China (中国科学技术大学); Xianyu of Alibaba (阿里巴巴闲鱼)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs’ nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs’ domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu\footnote\Xianyu is China’s largest second-hand e-commerce platform., significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.
zh

[NLP-33] FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference EMNLP2025

【速读】：该论文旨在解决大语言模型（Large Language Models, LLM）在资源受限硬件上部署时面临的挑战，即模型参数量巨大导致的内存占用高和计算需求大问题。现有低秩压缩方法通常对所有层采用统一压缩比例，易造成性能显著下降，且在解码阶段表现不佳。其解决方案的关键在于提出细粒度低秩压缩器（Fine-grained Low-Rank Compressor, FLRC），通过为每一层动态分配最优秩（rank），实现更高效的压缩；同时引入渐进式低秩解码机制，在保持文本生成质量的同时提升推理效率。实验表明，FLRC在多种基准测试中显著优于当前最优低秩压缩方法，尤其在摘要任务中ROUGE-L指标提升达17%。

链接: https://arxiv.org/abs/2510.09332
作者: Yu-Chen Lu,Chong-Yan Chen,Chi-Chih Chang,Yu-Fang Hu,Kai-Chiang Wu
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Macronix International Co., Ltd. (美光科技公司); Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025

点击查看摘要

Abstract:Although large language models (LLM) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
zh

[NLP-34] Large Language Model Prompt Datasets: An In-depth Analysis and Insights

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）在实际应用中因提示（prompt）设计质量参差不齐而导致的输出效果不稳定问题。当前虽有大量来自GitHub和社交媒体等平台的提示数据集涌现，但缺乏系统性的整理与分析，且现有提示构造方式尚未充分挖掘语法结构信息以提升模型响应的相关性和语义合理性。解决方案的关键在于：首先构建一个涵盖多种下游任务、语言、工程技巧、属性及模态的综合性提示数据集集合，并通过系统性分析揭示不同类别提示的共性与差异；其次提出一种基于词性标注（Part-of-Speech, POS）和依存句法结构的语法嵌入优化方法，利用提示的中心表示（centroid representation）引导LLM重写提示，从而增强模型输出的意义性与一致性。

链接: https://arxiv.org/abs/2510.09316
作者: Yuanming Zhang,Yan Lin,Arijit Khan,Huaiyu Wan
机构: Beijing Jiaotong University (北京交通大学); Aalborg University (奥尔堡大学); Bowling Green State University (布林格林州立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we–for the first time–have compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora like literature and web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
zh

[NLP-35] Verifying Chain-of-Thought Reasoning via Its Computational Graph

【速读】：该论文旨在解决当前链式思维（Chain-of-Thought, CoT）验证方法在推理错误分析中缺乏可解释性的问题，即现有黑盒或灰盒方法仅能预测推理正确性，但无法提供计算失败的根本原因。其解决方案的关键在于提出一种白盒方法——基于电路的推理验证（Circuit-based Reasoning Verification, CRV），通过分析模型内部推理步骤的归因图（attribution graphs）结构特征，识别出正确与错误推理步骤在计算图结构上的显著差异。该方法不仅揭示了错误的结构指纹具有高度预测性，还发现这些指纹具有领域特异性，并通过针对性干预特定编解码器（transcoder）特征成功纠正模型错误推理，从而实现了从误差检测到因果理解的跃迁。

链接: https://arxiv.org/abs/2510.09312
作者: Zheng Zhao,Yeskendir Koishekenov,Xianjun Yang,Naila Murray,Nicola Cancedda
机构: University of Edinburgh (爱丁堡大学); Meta (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model’s latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model’s faulty reasoning. Our work shows that, by scrutinizing a model’s computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
zh

[NLP-36] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference

【速读】：该论文旨在解决扩散式大语言模型（Diffusion Large Language Models, dLLMs）在长文本处理中因双向注意力机制导致的缓存（KV cache）占用内存过大、资源受限环境下难以高效运行的问题。现有针对自回归模型（Autoregressive Models, ARMs）设计的缓存淘汰策略不适用于dLLMs，因其忽略了mask token在生成过程中的独特作用。解决方案的关键在于提出MaskKV——一个无需训练的缓存淘汰框架，其核心创新包括：(1) 基于mask-query的评分机制，利用注意力权重识别并剔除各注意力头中较不重要的提示token；(2) 自适应缓存预算分配策略，通过减少中间层缓存分配、集中资源到偏好提示的注意力头，显著提升缓存使用效率。实验表明，在仅保留256对缓存（<5% token数）的情况下，MaskKV仍可保持94%的完整缓存性能，并在32k提示长度下实现最高达31倍加速。

链接: https://arxiv.org/abs/2510.09309
作者: Jianuo Huang,Yaojie Zhang,Yicun Yang,Benhao Huang,Biqing Qi,Dongrui Liu,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huazhong University of Science and Technology (华中科技大学); University of Electronic Science and Technology of China (电子科技大学); Carnegie Mellon University (卡内基梅隆大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs) by the ability of parallel decoding at the expense of substantial computation and memory costs. Specifically, the cache mechanism for bidirectional attention in dLLMs demands large memory footprint, restricting their ability to handle long contexts under resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, thus leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs, focusing on the effect of mask tokens in dLLMs. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length. The code is publicly available at: this https URL
zh

[NLP-37] CapGeo: A Caption-Assisted Approach to Geometric Reasoning

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在几何推理任务中表现不佳的问题，尤其关注视觉理解与文本推理之间的鸿沟。研究表明，尽管当前最先进的闭源模型如GPT-O3和Gemini-2.5-Pro在国际数学奥林匹克（IMO）等文本类推理任务中表现出色，但在处理几何问题时仍存在显著瓶颈，其根源在于对几何图形的准确理解而非推理能力本身。解决方案的关键在于引入CapGeo框架——一个基于图像描述（caption）辅助的几何推理方法，通过将视觉内容转化为高质量的文本描述来桥接视觉与语言模态，从而显著提升模型性能；实验表明，Qwen2.5-VL-72B和Claude-Opus-4在使用Caption后准确率分别从8.6%和44.8%提升至59.0%和73.0%，验证了该策略的有效性。

链接: https://arxiv.org/abs/2510.09302
作者: Yuying Li,Siyi Qian,Hao Liang,Leqi Zheng,Ruichuan An,Yongzhen Guo,Wentao Zhang
机构: THU(清华大学); PKU(北京大学); Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint, under review

点击查看摘要

Abstract:Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
zh

[NLP-38] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation

【速读】：该论文旨在解决刑事判决书中的“法院观点”（Court View）自动生成问题，该任务在法律人工智能中具有基础性意义。由于案件事实的多样性和复杂性，直接从原始事实生成法院观点往往效果受限。解决方案的关键在于构建了一个包含超过11万份中文案例的专用数据集CCVG（Chinese Court View Generation dataset），并基于此训练了首个专用于法院观点生成的大语言模型ShiZhi。实验表明，即使模型规模较小，只要在高质量领域数据上训练，也能生成合理且符合法律逻辑的法院观点，从而显著提升生成质量和法律一致性。

链接: https://arxiv.org/abs/2510.09297
作者: Zhitian Hou,Kun Zeng
机构: Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the “Court View” section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieving 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at \hrefthis https URLthis https URL.
zh

[NLP-39] MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）预训练过程中评估不稳定的问题，这种不稳定性掩盖了真实的模型学习动态。作者系统诊断出两种噪声来源：参数不稳定性（Parameter Instability），源于训练的随机性；以及评估不稳定性（Evaluation Instability），源于测量协议的噪声。为应对这两种噪声，论文提出了一种双管齐下的框架 MaP（Merging and Pass@k），其关键在于协同整合检查点合并（Checkpoint Merging）与 Pass@k 指标：通过平均近期模型权重来平滑参数空间，同时利用 Pass@k 提供低方差、鲁棒的模型能力统计估计，从而显著改善性能曲线的平滑性、降低运行间方差并提升模型排名的一致性，为 LLM 训练动态提供更可靠、忠实的观测视角。

链接: https://arxiv.org/abs/2510.09295
作者: Jiapeng Wang,Changxin Tian,Kunlong Chen,Ziqi Liu,Jiaxin Mao,Wayne Xin Zhao,Zhiqiang Zhang,Jun Zhou
机构: Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: \textitParameter Instability from training stochasticity and \textitEvaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce \textbfMaP, a dual-pronged framework that synergistically integrates checkpoint \underlineMerging \underlineand the \underlinePass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
zh

[NLP-40] One Sentence Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

【速读】：该论文旨在解决传统句子嵌入方法仅能为每个句子分配单一向量表示，从而难以捕捉句子中隐含语义（implicit semantics）的问题。其解决方案的关键在于提出DualCSE方法，通过为每个句子生成两个嵌入向量——一个表征显式语义（explicit semantics），另一个表征隐含语义——并在共享嵌入空间中共存，从而支持根据下游任务需求灵活选择所需语义表示，显著提升了信息检索和文本分类等任务的性能。

链接: https://arxiv.org/abs/2510.09293
作者: Kohei Oda,Po-Min Chuang,Kiyoaki Shirai,Natthawut Kertkeidkachorn
机构: Japan Advanced Institute of Science and Technology (日本高级科学与技术研究院); Toshiba Corporation (东芝公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.
zh

[NLP-41] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

【速读】：该论文旨在解决在数据稀缺领域训练专家大语言模型（Large Language Models, LLMs）时，仅依赖多选题（Multiple-Choice Questions, MCQs）进行标准基于结果的强化学习（Reinforcement Learning, RL）会引发推理质量下降的问题，例如逻辑一致性受损。现有提升推理质量的方法如大规模过程奖励模型（Process Reward Models, PRMs）成本过高。解决方案的关键在于提出一种名为CLARity的低成本强化学习框架：其核心包括一个感知一致性的奖励机制、两阶段“精炼-监控”训练流程以增强推理一致性，以及一种动态数据重构策略以更高效利用有限数据。实验表明，CLARity在保持准确率提升的同时，将响应一致性提高16.5%，并实现7.5%的准确率增益，且人类评估验证了整体连贯性和专业性的改善。

链接: https://arxiv.org/abs/2510.09278
作者: Jiuheng Lin,Cong Jiang,Zirui Wu,Jiarui Sun,Yansong Feng
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models by reasoning this http URL code is open sourced at: this https URL
zh

[NLP-42] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）在医学诊断领域评估中存在的严重偏差问题，即现有静态基准测试多基于公开考试题目的固定样本，导致模型性能被高估，且无法反映真实临床实践中病例的模糊性与多样性。其解决方案的关键在于提出DyReMe——一个动态基准测试框架，通过生成具有咨询风格的新颖病例，引入鉴别诊断和常见误诊因素等干扰项，并模拟不同表达方式以贴近真实的临床问诊习惯；同时，除准确率外，还从真实性（veracity）、有用性（helpfulness）和一致性（consistency）三个临床相关维度综合评估模型表现，从而实现更贴近实际医疗场景的、更具挑战性的评估体系。

链接: https://arxiv.org/abs/2510.09275
作者: Xiangxu Zhang,Lei Li,Yanyun Zhou,Xiao Zhou,Yingying Zhang,Xian Wu
机构: GSAI, Renmin University of China (中国人民大学高瓴人工智能学院); Tencent Jarvis Lab (腾讯优图实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.
zh

[NLP-43] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

【速读】：该论文旨在解决当前多模态检索增强生成（Multimodal Retrieval-Augmented Generation, MRAG）基准测试在模态覆盖范围和格式多样性上的局限性，尤其是现有数据集多集中于单一或有限模态任务及粗粒度场景理解，难以全面评估多模态大语言模型（Multimodal Large Language Models, MLLMs）在复杂视频内容中进行细粒度跨模态信息检索与推理的能力。为此，作者构建了CFVBench——一个大规模、人工验证的视频基准，涵盖图表密集型报告、新闻播报和软件教程等高密度格式与多样化领域，要求模型在长时程视频跨度中保持细粒度的多模态感知能力。实验揭示当前主流MLLMs（包括GPT-5和Gemini）在捕捉瞬时但关键的细粒度多模态细节方面存在显著瓶颈。解决方案的关键在于提出自适应视觉精化（Adaptive Visual Refinement, AVR）框架：该框架通过动态增加帧采样密度并在必要时调用外部工具，实现对细粒度视觉信息的精准捕获与利用，从而系统性提升所有评测MLLMs在复杂多模态理解任务中的性能表现。

链接: https://arxiv.org/abs/2510.09266
作者: Kaiwen Wei,Xiao Liu,Jie Zhang,Zijian Wang,Ruida Liu,Yuming Yang,Xin Xiao,Xiao Sun,Haoyang Zeng,Changzai Pan,Yidan Zhang,Jiang Zhong,Peijin Wang,Yingchao Feng
机构: Chongqing University (重庆大学); University of the Chinese Academy of Sciences (中国科学院大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs
zh

[NLP-44] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）后训练阶段中数据污染（data contamination）的检测难题，这一问题在当前大语言模型（Large Language Models, LLMs）推理能力提升过程中尤为关键。现有方法主要针对预训练和监督微调阶段的数据污染检测，但对RL阶段缺乏有效手段，导致模型性能评估可能失真。解决方案的核心在于提出名为Self-Critique的方法，其关键创新是基于一个观察：经过RL训练后，模型输出的熵分布趋于坍缩为高度特定且稀疏的模式，这反映了策略坍缩（policy collapse）现象——即模型收敛至狭窄的推理路径。Self-Critique通过探测这种熵减少背后的策略坍缩机制，实现了对RL阶段数据污染的有效识别，显著优于基线方法，在多个模型和任务上AUC提升达30%。

链接: https://arxiv.org/abs/2510.09259
作者: Yongding Tao,Tian Wang,Yihong Dong,Huanyu Liu,Kechi Zhang,Xiaolong Hu,Ge Li
机构: Peking University (北京大学); New H3C Technologies Co., Ltd
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model’s convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
zh

[NLP-45] DSPO: Stable and Efficient Policy Optimization for Agent ic Search and Reasoning

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在复杂现实任务中缺乏主动获取外部知识能力的问题，现有方法要么依赖提示工程激发模型的内在代理能力，要么在应用于复杂交互任务时因强化学习（Reinforcement Learning, RL）性能瓶颈或崩溃而难以发挥其真正的代理潜力。解决方案的关键在于提出一种名为动态滤波序列级策略优化（Dynamic-filter Sequence-level Policy Optimization, DSPO）的改进型强化学习算法，该算法通过序列级优化与动态样本过滤机制实现鲁棒的智能体训练，使模型仅通过强化学习即可完成多轮搜索与推理的交替执行，无需监督示范数据；实验表明，基于DSPO训练的7B模型在多个问答基准上相比先前工作提升34.1%，并在HotpotQA等复杂多跳问答任务中相对优于此前的14B模型近9%，同时保持优异的训练稳定性。

链接: https://arxiv.org/abs/2510.09255
作者: Chenyang Gu,Yewen Pu,Bruce Yang,Xiaofan Li,Huan Gao
机构: Sapiens AI; Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce \textbfDynamic-filter \textbfSequence-level \textbfPolicy \textbfOptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by \textbf34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly \textbf9% relative, maintaining exceptional training stability.
zh

[NLP-46] CrisiText: A dataset of warning messages for LLM training in emergency communication

【速读】：该论文旨在解决危机情境下警告信息生成的自动化问题，即如何利用自然语言生成（Natural Language Generation, NLG）技术高效、准确地生成针对不同类型的危机场景（如自然灾害或暴力袭击）的预警消息。其解决方案的关键在于构建了首个大规模数据集 CrisiText，该数据集包含超过 40 万条警告消息（覆盖近 18,000 个危机事件），每条消息均基于专家制定的指南生成，确保术语准确性和事实一致性，并附带三种次优警告类型以支持多种 NLG 方法的研究。此外，作者通过对比监督微调、偏好对齐、零样本和少样本学习等策略，系统评估了模型在分布外场景下的表现及自动后编辑器的有效性，从而推动了生成式 AI 在应急响应中的实用化发展。

链接: https://arxiv.org/abs/2510.09243
作者: Giacomo Gonella,Gian Maria Campedelli,Stefano Menini,Marco Guerini
机构: Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts’ written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.
zh

[NLP-47] Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras

【速读】：该论文旨在解决肩部疾病（如冻结肩，即粘连性囊炎）在医疗资源匮乏地区难以实现早期、准确诊断的问题，尤其针对老年人群和重复性肩部作业人群的高发病率需求。其解决方案的关键在于提出了一种基于消费级设备拍摄视频的混合运动视频诊断框架（Hybrid Motion Video Diagnosis framework, HMVDx），创新性地利用多模态大语言模型（Multimodal Large Language Models, MLLMs）将动作理解与疾病诊断任务分离，并由两个独立的MLLM分别完成，从而提升诊断准确性。此外，论文引入了基于医学决策逻辑流程（动作识别→运动诊断→最终诊断）的可用性指数（Usability Index），从全流程视角评估MLLM在医疗场景中的有效性，显著提升了诊断准确率（较直接视频诊断提高79.6%），为低成本、可扩展的辅助诊断工具提供了技术路径。

链接: https://arxiv.org/abs/2510.09230
作者: Jindong Hong,Wencheng Zhang,Shiqin Qiao,Jianhai Chen,Jianing Qiu,Chuanyang Zheng,Qian Xu,Yun Ji,Qianyue Wen,Weiwei Sun,Hao Li,Huizhen Li,Huichao Wang,Kai Wu,Meng Li,Yijun He,Lingjie Luo,Jiankai Sun
机构: Bytedance(字节跳动); Peking University(北京大学); Peking University People’s Hospital(北京大学人民医院); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); The Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Shoulder disorders, such as frozen shoulder (a.k.a., adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric called Usability Index by the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries has increased by 79.6% compared with direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
zh

[NLP-48] IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data ACL2025

【速读】：该论文旨在解决传统因果发现方法在科学研究所面临的三大挑战：昂贵的数据收集成本、对已知因果关系的冗余计算以及不切实际的假设。同时，现有基于大语言模型（Large Language Models, LLMs）的方法虽能识别常见因果关系，却难以发现新的因果关联。其解决方案的关键在于提出IRIS（Iterative Retrieval and Integrated System for Real-Time Causal Discovery）框架，该框架通过迭代检索与集成系统实现从初始变量出发的实时因果发现；其核心创新在于融合统计算法与LLM方法以同时挖掘已知和未知因果关系，并引入缺失变量提议机制自动识别并整合潜在缺失变量，从而扩展因果图结构，无需依赖预设数据集即可完成动态因果建模。

链接: https://arxiv.org/abs/2510.09217
作者: Tao Feng,Lizhen Qu,Niket Tandon,Gholamreza Haffari
机构: Monash University (莫纳什大学); Microsoft Research, India (微软研究院，印度)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2025

点击查看摘要

Abstract:Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
zh

[NLP-49] DICE: Structured Reasoning in LLM s through SLM-Guided Chain-of-Thought Correction

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在执行用户特定要求的推理任务时，常因优先考虑推理过程而忽视严格输出格式等问题。传统方法通过监督微调LLMs来改进输出规范性，但面临计算成本高和参数访问受限的挑战。解决方案的关键在于提出DICE框架，该框架通过解耦LLM的自然语言生成与结构化输出修正两个阶段实现高效优化：首先利用LLMs生成自然语言响应，随后由轻量级小语言模型（Small Language Models, SLMs）基于链式思维（Chain-of-Thought, CoT）进行分析并修正输出以满足结构化规范。DICE采用两阶段数据构建与双调优策略，使SLMs能够以“分析-回答”模式生成符合用户格式要求的内容，从而在不损害LLM知识能力的前提下显著提升输出的格式准确率（+35.4%）和内容正确率（+29.4%）。

链接: https://arxiv.org/abs/2510.09211
作者: Yiqi Li,Yusheng Liao,Zhe Chen,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.
zh

[NLP-50] Multimodal Prompt Optimization: Why Not Leverag e Multiple Modalities for MLLM s

【速读】：该论文试图解决的问题是：当前提示优化（prompt optimization）方法仍局限于纯文本范畴，无法充分发挥多模态大语言模型（Multimodal Large Language Models, MLLMs）在图像、视频乃至分子等非文本模态上的潜力。为填补这一空白，作者提出“多模态提示优化”（multimodal prompt optimization）这一新问题，即在由文本与非文本提示组成的联合空间中进行提示优化。解决方案的关键在于提出一种统一框架——多模态提示优化器（Multimodal Prompt Optimizer, MPO），其核心机制包括两点：一是通过保持对齐关系的更新策略实现多模态提示的联合优化；二是基于贝叶斯选择策略，利用前期评估结果作为先验信息引导候选提示的选择过程。实验表明，MPO在多种非文本模态任务中显著优于现有纯文本优化方法，验证了多模态提示优化对于释放MLLMs潜能的重要性。

链接: https://arxiv.org/abs/2510.09201
作者: Yumin Choi,Dongki Kim,Jinheon Baek,Sung Ju Hwang
机构: KAIST; DeepAuto.ai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
zh

[NLP-51] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

【速读】：该论文旨在解决通用大语言模型（Large Language Models, LLMs）在翻译任务中表现优异，但在推理任务上能力不足的问题。其核心挑战在于如何在不牺牲原有推理能力的前提下，有效提升模型的多语言翻译性能，尤其是低资源语言的翻译质量。解决方案的关键在于提出一种新颖的“翻译增强”方法：首先以指令微调（instruct tuning）模型为基础，仅对平行语料（parallel data）进行层选择性微调（layer-selective tuning），从而实现对翻译能力的定向增强，同时保留原始模型在推理任务上的性能。实验表明，该方法显著提升了高、低资源语言的翻译效果（如Swahili达到15+ spBLEU和40+ xComet），并在7个多语言任务上平均提升1+点，同时保持与Qwen3 instruct模型相当的推理能力。

链接: https://arxiv.org/abs/2510.09189
作者: Changjiang Gao,Zixian Huang,Jingyang Gong,Shujian Huang,Lei Li,Fei Yuan
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); The University of Hong Kong (香港大学); Carnegie Mellon University (卡内基梅隆大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translationenhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and lowresource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.
zh

[NLP-52] Stronger Re-identification Attacks through Reasoning and Aggregation

【速读】：该论文旨在解决文本去标识化（Text De-identification）技术在实际应用中难以量化其隐私保护效果的问题，尤其是如何有效评估去标识化方法对个人身份信息（PII）的隐藏能力。解决方案的关键在于构建更强大的重识别（Re-identification）攻击策略：一方面发现重识别过程中PII片段的识别顺序具有重要影响，通过聚合多种排序下的预测结果可显著提升攻击性能；另一方面引入推理模型（reasoning models），尤其当攻击者具备丰富背景知识时，能进一步增强重识别准确率，从而更真实地模拟对抗场景并衡量去标识化方法的鲁棒性。

链接: https://arxiv.org/abs/2510.09184
作者: Lucas Georges Gabriel Charpentier,Pierre Lison
机构: University of Oslo (奥斯陆大学); Norwegian Computing Center (挪威计算中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of re-identification, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the order in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
zh

[NLP-53] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

【速读】：该论文旨在解决如何通过增强对话数据来提升大语言模型（Large Language Model, LLM）对个体人格特质的建模能力，尤其是基于五大性格特质（Big Five）框架中的宜人性（Agreeableness）和神经质（Neuroticism）维度。其解决方案的关键在于引入“思考 aloud 话语”（Think-Aloud Utterances, TAUs），即说话者在表达前的心理活动语句，作为训练数据的一部分，从而帮助LLM更准确地模仿说话者的个性特征。实验表明，使用TAU增强的数据训练出的“人格化LLM”（persona LLM）在宜人性和神经质维度上比仅使用原始对话数据训练的模型更贴近人类被试的人格测评结果。

链接: https://arxiv.org/abs/2510.09158
作者: Seiya Ishikura,Hiroaki Yamada,Tatsuya Hiraoka,Hiroaki Yamada,Takenobu Tokunaga
机构: Institute of Science Tokyo (东京科学大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Nara Institute of Science and Technology (奈良科学技术大学院大学); Fujitsu Limited (富士通有限公司)
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLM. TAU is a verbalization of a speaker’s thought before articulating the utterance. We expect “persona LLMs” trained with TAU-augmented data can mimic the speaker’s personality trait better. We tested whether the trained persona LLMs obtain the human personality with respect to Big Five, a framework characterizing human personality traits from five aspects. The results showed that LLMs trained with TAU-augmented data more closely align to the speakers’ Agreeableness and Neuroticism of Big Five than those trained with original dialog data. We also found that the quality of TAU-augmentation impacts persona LLM’s performance.
zh

[NLP-54] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）在网文（web novel）翻译任务中缺乏系统性评估标准的问题，现有基准多依赖表面级指标，无法有效衡量网文特有的叙事连贯性与文化适配性。解决方案的关键在于提出DITING框架，首次构建涵盖六维核心维度（成语翻译、词义歧义处理、术语本地化、时态一致性、零代词指代消解和文化安全）的综合性评价体系，并配套18K对专家标注的中英句子对；同时引入AgentEval多智能体推理评估机制，模拟专家讨论过程以超越词汇重叠度的判断，实现与人工评分高度一致的自动评估；此外还开发MetricAlign元评估数据集用于统一比较不同指标性能，从而建立可复现、可对比的网文翻译质量评估新范式。

链接: https://arxiv.org/abs/2510.09116
作者: Enze Zhang,Jiaying Wang,Mengxi Xiao,Jifei Liu,Ziyan Kuang,Rui Dong,Youzhong Dong,Sophia Ananiadou,Min Peng,Qianqian Xie
机构: Wuhan University (武汉大学); Jiangxi Normal University (江西师范大学); The University of Manchester (曼彻斯特大学); Yunnan Trrans Technology Co., Ltd (云南特润科技有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.
zh

[NLP-55] When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLM s

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）因训练数据静态化而导致的在处理快速演变信息或领域特定查询时表现受限的问题。其解决方案的关键在于检索增强生成（Retrieval-Augmented Generation, RAG），通过将LLMs与外部检索机制集成，使模型能够访问最新且上下文相关的知识，从而弥补纯LLM在时效性和专业性上的不足。

链接: https://arxiv.org/abs/2510.09106
作者: Yongjie Wang,Yue Yu,Kaisong Song,Jun Lin,Zhiqi Shen
机构: Alibaba-NTU Global e-Sustainability CorpLab, Nanyang Technological University, Singapore; Tongyi Lab, Alibaba Group, Hang zhou, China; College of Computing & Data Science, Nanyang Technological University, Singapore
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) have enabled a wide range of applications through their powerful capabilities in language understanding and generation. However, as LLMs are trained on static corpora, they face difficulties in addressing rapidly evolving information or domain-specific queries. Retrieval-Augmented Generation (RAG) was developed to overcome this limitation by integrating LLMs with external retrieval mechanisms, allowing them to access up-to-date and contextually relevant knowledge. However, as LLMs themselves continue to advance in scale and capability, the relative advantages of traditional RAG frameworks have become less pronounced and necessary. Here, we present a comprehensive review of RAG, beginning with its overarching objectives and core components. We then analyze the key challenges within RAG, highlighting critical weakness that may limit its effectiveness. Finally, we showcase applications where LLMs alone perform inadequately, but where RAG, when combined with LLMs, can substantially enhance their effectiveness. We hope this work will encourage researchers to reconsider the role of RAG and inspire the development of next-generation RAG systems.
zh

[NLP-56] FrameEOL: Semantic Frame Induction using Causal Language Models EMNLP

【速读】：该论文旨在解决语义框架诱导（semantic frame induction）任务中传统方法依赖大规模标注数据和预训练语言模型（如掩码语言模型 MLMs）的局限性，尤其是在资源匮乏语言（如日语）中的性能瓶颈问题。其核心解决方案是提出一种基于因果语言模型（causal language model, CLM）的新方法 FrameEOL，通过提示学习（prompt-based learning）引导 CLM 输出单一框架名称作为标签，从而生成适用于框架诱导的嵌入表示；关键创新在于结合上下文学习（in-context learning, ICL）与深度度量学习（deep metric learning, DML），在仅用少量示例（如5个）的情况下即可获得高质量的框架嵌入，显著提升了跨语言场景下的框架诱导效果，尤其在日语 FrameNet 数据集上实现了与 MLM 方法相当甚至更优的性能。

链接: https://arxiv.org/abs/2510.09097
作者: Chihiro Yano,Kosuke Yamada,Hayato Tsukagoshi,Ryohei Sasano,Koichi Takeda
机构: Nagoya University (名古屋大学); CyberAgent; National Institute of Informatics (日本信息研究所)
类目: Computation and Language (cs.CL)
备注: Accepted in EMNLP Findings 2025. This version corrects the model size of Table 3

点击查看摘要

Abstract:Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.
zh

[NLP-57] Exploiting Web Search Tools of AI Agents for Data Exfiltration

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在接入外部数据源（如企业敏感信息）时，因间接提示注入（indirect prompt injection）攻击所暴露的安全漏洞问题。此类攻击通过操纵输入内容诱导模型执行非预期行为，即使在防御机制存在的情况下仍具高度有效性。解决方案的关键在于：强化训练流程以提升模型内在鲁棒性、建立集中化的已知攻击向量数据库以实现主动防御，并构建统一的测试框架以持续验证安全性。这三项措施共同推动将安全设计内嵌于LLM开发的核心流程中，从而有效应对长期存在的安全威胁。

链接: https://arxiv.org/abs/2510.09093
作者: Dennis Rall,Bernhard Bauer,Mohit Mittal,Thomas Fraunholz
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 9 pages, 6 figures, conference article

点击查看摘要

Abstract:Large language models (LLMs) are now routinely used to autonomously execute complex tasks, from natural language processing to dynamic workflows like web searches. The usage of tool-calling and Retrieval Augmented Generation (RAG) allows LLMs to process and retrieve sensitive corporate data, amplifying both their functionality and vulnerability to abuse. As LLMs increasingly interact with external data sources, indirect prompt injection emerges as a critical and evolving attack vector, enabling adversaries to exploit models through manipulated inputs. Through a systematic evaluation of indirect prompt injection attacks across diverse models, we analyze how susceptible current LLMs are to such attacks, which parameters, including model size and manufacturer, specific implementations, shape their vulnerability, and which attack methods remain most effective. Our results reveal that even well-known attack patterns continue to succeed, exposing persistent weaknesses in model defenses. To address these vulnerabilities, we emphasize the need for strengthened training procedures to enhance inherent resilience, a centralized database of known attack vectors to enable proactive defense, and a unified testing framework to ensure continuous security validation. These steps are essential to push developers toward integrating security into the core design of LLMs, as our findings show that current models still fail to mitigate long-standing threats.
zh

[NLP-58] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability Faithfulness and Interpretability

【速读】：该论文旨在解决当前长链式思维（long chain-of-thought, CoT）推理模型过度关注答案准确性和令牌效率，而忽视可信赖性（trustworthiness）的问题。作者指出，可用的推理系统必须具备三个关键属性：可解释性（interpretability）、忠实性（faithfulness）和可靠性（reliability）。为此，论文提出 ReFIne 训练框架，其核心在于融合监督微调（supervised fine-tuning）与 GRPO（Generalized Reward Policy Optimization），通过三项机制提升模型可信度：(i) 生成结构化、标签化的推理轨迹以增强可解释性；(ii) 显式披露每一步决策依据并保持跨段落一致性以提高忠实性；(iii) 提供对推导过程合理性及最终答案置信度的自评估以强化可靠性。实验表明，ReFIne 在 Qwen3 系列模型上显著提升了各项可信度指标，验证了从多维角度优化推理模型的重要性。

链接: https://arxiv.org/abs/2510.09062
作者: Chung-En Sun,Ge Yan,Akshay Kulkarni,Tsui-Wei Weng
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation’s soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: this https URL
zh

[NLP-59] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation EMNLP2025

【速读】：该论文旨在解决低资源语言（如乌尔都语）在构建高性能大语言模型（Large Language Models, LLMs）时面临的三大挑战：高质量数据稀缺、多语言一致性不足以及安全合规问题。现有方法依赖大规模数据翻译，但存在质量低、文化适配差且成本高昂的问题。其解决方案的关键在于提出一种改进的自指导（self-instruct）技术，用于生成高质量、多语言合成数据集（Urdu-Instruct），该数据集包含乌尔都语原生的思维链推理（chain-of-thought reasoning）、双语翻译、文化相关性及伦理安全对齐等特性；通过使用针对每项任务的独特提示和种子值，并结合全局任务池，显著提升了模型对乌尔都语特定任务的理解能力。基于此方法训练的Alif-1.0-8B-Instruct模型在性能上优于多个主流多语言LLM，在训练预算低于100的情况下实现了高效、文化契合的低资源语言建模。

链接: https://arxiv.org/abs/2510.09051
作者: Muhammad Ali Shafique,Kanwal Mehreen,Muhammad Arham,Maaz Amjad,Sabur Butt,Hamza Farooq
机构: Traversaal.ai; University of British Columbia (不列颠哥伦比亚大学); Texas Tech University (德克萨斯理工大学); Institute for the Future of Education, Tecnológico de Monterrey (未来教育研究所，蒙特雷科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the EMNLP 2025 Workshop on Multilingual Representation Learning (MRL)

点击查看摘要

Abstract:Developing a high-performing large language models (LLMs) for low-resource languages such as Urdu, present several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model, that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu specific-tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under 100. Our results demonstrate that high-performance and low-resource language LLMs can be developed efficiently and culturally aligned using our modified self-instruct approach. All datasets, models, and code are publicly available at: this https URL.
zh

[NLP-60] Auto-scaling Continuous Memory for GUI Agent

【速读】：该论文旨在解决GUI代理在面对陌生界面和长时程任务时缺乏可扩展记忆机制的问题，从而影响其泛化能力。现有方法通过将历史轨迹压缩为文本标记来构建记忆，导致上下文长度激增并丢失关键视觉线索（如控件的具体尺寸与位置）。解决方案的关键在于提出一种连续记忆机制：利用视觉语言模型（VLM）自身作为编码器，将每条GUI轨迹映射为固定长度的连续嵌入向量，并直接注入骨干网络输入层，显著降低上下文开销的同时保留细粒度视觉信息。该设计使得随着记忆规模和检索深度增加，性能单调提升，克服了文本记忆因提示过长而退化的局限性。

链接: https://arxiv.org/abs/2510.09038
作者: Wenyi Wu,Kun Zhou,Ruoxin Yuan,Vivian Yu,Stephen Wang,Zhiting Hu,Biwei Huang
机构: University of California, San Diego (加州大学圣地亚哥分校); Fudan University (复旦大学); Abel AI
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study how to endow GUI agents with scalable memory that help generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone’s input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about \ 4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
zh

[NLP-61] Large Language Models Do NOT Really Know What They Dont Know

【速读】：该论文试图解决的问题是：大型语言模型（Large Language Models, LLMs）在生成事实性内容时，其内部表征是否能够可靠地区分真实回答与幻觉（hallucination）输出。尽管先前研究表明LLMs的隐藏状态、注意力权重或词概率可能编码了事实性信号，暗示模型“知道它不知道什么”，但实际中模型仍会因捷径学习或虚假关联产生错误，而这些错误由相同的训练目标驱动，使得区分真假输出变得困难。论文的关键解决方案在于通过机制分析（mechanistic analysis），对比两类不同来源的幻觉——一类依赖于主题知识（subject knowledge）、另一类脱离主题知识——发现：当幻觉与主题知识相关时，模型采用与正确回答相同的内部回忆过程，导致隐藏状态几何结构重叠且不可区分；而脱离主题知识的幻觉则产生独特且聚集的表征，从而可被检测。这一发现揭示了一个根本局限：LLMs并未在其内部状态中编码真实性（truthfulness），仅编码知识回忆的模式（pattern of knowledge recall），因此“LLMs并不真正知道它们不知道什么”。

链接: https://arxiv.org/abs/2510.09033
作者: Chi Seng Cheang,Hou Pong Chan,Wenxuan Zhang,Yang Deng
机构: Singapore Management University (新加坡管理大学); DAMO Academy, Alibaba Group (阿里达摩院); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may “know what they don’t know”. However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These error are driven by the same training objective that encourage correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that “LLMs don’t really know what they don’t know”.
zh

[NLP-62] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

【速读】：该论文旨在解决Chakma语（一种印欧语系语言）在语言模型中代表性不足的问题，特别是在低资源语言场景下缺乏高质量训练数据的困境。其解决方案的关键在于构建了一个经过母语者验证的、基于孟加拉语转写的上下文连贯的Chakma语料库，并在此基础上对六种编码器型多语言和区域Transformer模型（包括mBERT、XLM-RoBERTa、DistilBERT、DeBERTaV3、BanglaBERT和IndicBERT）进行掩码语言建模（MLM）微调。实验表明，微调后的多语言模型在Bangla-transliterated Chakma上性能显著优于预训练模型，最高达到73.54%的token准确率和2.90的困惑度，证明了该语料库在迁移学习中的有效性，同时也揭示了光学字符识别（OCR）管道在处理形态丰富的印度文字时的局限性。

链接: https://arxiv.org/abs/2510.09032
作者: Adity Khisa,Nusrat Jahan Lia,Tasnim Mahfuz Nafis,Zarif Masud,Tanzir Pial,Shebuti Rayana,Ahmedul Kabir
机构: IIT, University of Dhaka (印度理工学院，达卡大学); Toronto Metropolitan University (多伦多都会大学); Stony Brook University (石溪大学); State University of New York, Old Westbury (纽约州立大学奥尔德韦斯特伯里分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
zh

[NLP-63] Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在自动作文评分（Automated Essay Scoring, AES）中性能对提示（prompt）高度敏感的问题，尤其关注如何提升LLM评分结果与人类评分的一致性。其解决方案的关键在于引入一种迭代式评分量表（rubric）优化机制：通过让模型基于自身评分理由和与人工评分的差异进行反思，逐步迭代优化评分标准，从而显著提升评分一致性指标——QWK（Quadratic Weighted Kappa）。实验表明，即使初始量表简单，该方法也能达到甚至超越使用详细人工编写的量表的效果。

链接: https://arxiv.org/abs/2510.09030
作者: Keno Harada,Lui Yoshida,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models’ own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
zh

[NLP-64] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

【速读】：该论文旨在解决当前基于大语言模型（Large Language Models, LLMs）的Text-to-SQL方法在部署可行性与数据隐私方面的挑战，尤其是在资源受限或对隐私要求较高的场景下难以应用的问题。其解决方案的关键在于提出一个轻量高效框架LitE-SQL，包含两个核心组件：一是利用预计算的schema嵌入向量数据库实现高效的schema链接（Schema Retriever），二是采用两阶段微调策略（监督微调+执行引导的强化学习）训练SQL生成器，从而实现无需昂贵的多候选生成即可自纠正的高效SQL生成能力。该方案在BIRD和Spider数据集上分别达到72.10%和88.45%的执行准确率，性能媲美甚至超越主流LLM方法，同时参数量仅为后者的1/2至1/30。

链接: https://arxiv.org/abs/2510.09014
作者: Shengmin Piao,Jieun Lee,Sanghyun Park
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raise concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages-supervised fine-tuning followed by execution-guided reinforcement-enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
zh

[NLP-65] On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

【速读】：该论文旨在解决大型视觉语言模型（Large Vision-Language Models, LVLMs）中存在的对象幻觉（object hallucination）问题，即模型生成描述中包含输入图像中并不存在的物体。研究表明，视觉编码器（Vision Encoder, VE）中高认知不确定性（epistemic uncertainty）的视觉标记（visual tokens）是导致幻觉的关键因素；通过统计分析发现，这些高不确定性的视觉标记与幻觉发生具有正相关性。解决方案的关键在于：首先利用对抗扰动代理方法高效识别VE早期层中对微小扰动敏感、表示偏差大的视觉标记作为高不确定性标记；随后在VE中间层的自注意力机制中掩码这些标记，从而抑制其对视觉表征的影响，有效缓解幻觉现象。该方法仅修改视觉编码器，且可与其他现有技术协同增强效果。

链接: https://arxiv.org/abs/2510.09008
作者: Hoigi Seo,Dong Un Kang,Hyunjin Cho,Joohoon Lee,Se Young Chun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE is a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
zh

[NLP-66] Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

【速读】：该论文旨在解决生成式 AI（Generative AI）在安全对齐（safety alignment）过程中难以兼顾模型安全性与通用性能的问题。当前方法通常需要昂贵的计算资源来搜索最优的安全关键数据与通用数据比例，但效果有限。论文提出基于低秩适配（LoRA）的拒绝训练（Refusal-training）方案，其核心创新在于：通过LoRA将安全能力解耦到一个与模型固有变换空间高度正交的低秩子空间中，从而实现仅用安全数据训练即可在不损害模型通用性能的前提下提升安全性，具备成本低、性能保持好、可即插即用的优势。

链接: https://arxiv.org/abs/2510.09004
作者: Yutao Mou,Xiaoling Zhou,Yuxiao Luo,Shikun Zhang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学软件工程国家工程研究中心)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model’s intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
zh

[NLP-67] DARO: Difficulty-Aware Reweighting Policy Optimization

【速读】：该论文旨在解决当前基于可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）方法在数学推理任务中因静态或过于简化的权重分配机制导致的训练失衡问题。具体而言，现有主流方法如Group Relative Policy Optimization (GRPO) 及其变体依赖固定难度分组的权重策略，无法随模型能力的动态演化自适应调整损失贡献，从而引发显著的损失尺度不均衡现象，抑制了整体性能提升。解决方案的关键在于提出Difficulty-Aware Reweighting Policy Optimization (DARO)，其核心创新是设计了一种基于模型学习状态动态调节各难度组损失权重的机制，使训练过程能够持续聚焦于当前最具挑战性的样本区间，从而实现更快收敛与更优最终表现。

链接: https://arxiv.org/abs/2510.09001
作者: Jingyu Zhou,Lu Ma,Hao Liang,Chengyu Shen,Bin Cui,Wentao Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model’s evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce \textbfDifficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model’s learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
zh

[NLP-68] MASA: LLM -Driven Multi-Agent Systems for Autoformalization EMNLP2025

【速读】：该论文旨在解决自然语言与形式化推理之间缺乏高效、可靠转换机制的问题，即如何将自然语言表述自动转化为可被形式化验证的数学表达（autoformalization）。其解决方案的关键在于提出一种名为MASA的多智能体系统框架，该框架基于大语言模型（LLMs）驱动，通过协作式智能体实现自然语言到形式化表示的映射。MASA架构强调模块化、灵活性和可扩展性，支持动态集成新智能体与工具，从而适应快速演进的形式化数学研究需求，并在真实数学定义和形式化数学数据集上验证了其有效性，展示了LLM与定理证明器交互在提升autoformalization效率与可靠性方面的潜力。

链接: https://arxiv.org/abs/2510.08988
作者: Lan Zhang,Marco Valentino,André Freitas
机构: University of Manchester (曼彻斯特大学); University of Sheffield (谢菲尔德大学); Idiap Research Institute (Idiap 研究所); CRUK Manchester Institute (英国癌症研究基金会曼彻斯特研究所)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注: EMNLP 2025 Demo camera-ready. Code and data are available at: this https URL

点击查看摘要

Abstract:Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.
zh

[NLP-69] Creation of the Chinese Adaptive Policy Communication Corpus

【速读】：该论文旨在解决中文政策文本中语言清晰度与模糊性难以量化分析的问题，从而支持政策沟通的自动化建模与多语言自然语言处理（Natural Language Processing, NLP）研究。其解决方案的关键在于构建首个开放的中文政策指令语料库CAPC-CG，该语料库基于Ang的适应性政策沟通理论，采用五色分类法对政策文本中的语言清晰度进行标注，并通过两轮标注框架和专家校验确保标注质量；同时提供330万段落级别的标注数据、详尽元数据及标注手册，实现高信度的监督学习基础，Fleiss’s kappa达K = 0.86，为后续政策语义理解、跨语言政策比较等任务提供可靠的数据支撑。

链接: https://arxiv.org/abs/2510.08986
作者: Bolun Sun,Charles Chang,Yuen Yuen Ang,Pingxu Hao,Ruotong Mu,Yuchen Xu,Zhengxin Zhang
机构: Johns Hopkins University (约翰霍普金斯大学); Northwestern University (西北大学); Duke Kunshan University (昆山杜克大学)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang’s theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China’s top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss’s kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
zh

[NLP-70] Diagnosing and Mitigating System Bias in Self-Rewarding RL

【速读】：该论文旨在解决强化学习中基于可验证奖励（Reinforcement Learning with Verifiable Rewards, RLVR）与基于内在奖励（Reinforcement Learning with Intrinsic Rewards, RLIR）之间的性能差距问题。RLVR虽能有效提升大语言模型（Large Language Models, LLMs）的推理能力，但受限于标注样本稀缺；而RLIR虽可在无标签数据下实现可持续扩展，却因模型对高置信度轨迹的过度估计导致奖励估计偏差和训练不稳定。作者指出这一系统性偏差源于三个指标：噪声影响（ $\rho_\text{noise}$ ）、自偏倚（ $\rho_\text{selfbias}$ ）和符号偏倚（ $\rho_\text{symbias}$ ），其中 $\rho_\text{selfbias}$ 会放大正确与错误更新，引发不稳定性。为此，论文提出基于集成奖励的强化学习（Reinforcement Learning with Ensembled Rewards, RLER），通过聚合多个模型、动态调整奖励插值与轨迹选择策略来缓解偏倚，显著提升了RLIR的稳定性和性能，最终在无标签场景下达到接近RLVR的效果（仅低3.6%），实现了高效且稳定的扩展。

链接: https://arxiv.org/abs/2510.08977
作者: Chuyi Tan,Peiwen Yuan,Xinglin Wang,Yiwei Li,Shaoxiong Feng,Yueqi Zhang,Jiayi Shi,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
机构: Beijing Institute of Technology (北京理工大学); Xiaohongshu Inc (小红书公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: \rho_\textnoise , \rho_\textselfbias , and \rho_\textsymbias . We find that \rho_\textnoise and \rho_\textsymbias impact convergence, while \rho_\textselfbias amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
zh

[NLP-71] Semantic-Condition Tuning: Fusing Graph Context with Large Language Models for Knowledge Graph Completion

【速读】：该论文旨在解决当前知识图谱（Knowledge Graph, KG）与大语言模型（Large Language Model, LLM）融合方法中存在的一系列问题，尤其是主流的前缀调优（prefix-tuning）策略因浅层融合导致对KG中丰富关系语义利用不足，且迫使LLM在推理阶段承担过重的隐式关联任务。解决方案的关键在于提出一种新的知识注入范式——语义条件调优（Semantic-condition Tuning, SCT），其核心由两个模块构成：首先，语义图模块（Semantic Graph Module）通过图神经网络从局部图邻域中提取受知识增强关系引导的上下文感知语义条件；其次，条件自适应融合模块（Condition-Adaptive Fusion Module）利用两个参数化投影器对文本嵌入进行特征级、知识感知的动态调制，实现深度交互。该方法在预融合阶段即引入结构化语义信息，从而为LLM提供更直接、更强的知识推理信号，显著提升知识密集型任务的性能。

链接: https://arxiv.org/abs/2510.08966
作者: Ruitong Liu,Yan Wen,Te Sun,Yunjia Wu,Pingyang Huang,Zihang Yu,Siyuan Li
机构: Dalian University of Technology (大连理工大学); Beijing Institute of Technology (北京理工大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 3 figures, conference

点击查看摘要

Abstract:Fusing Knowledge Graphs with Large Language Models is crucial for knowledge-intensive tasks like knowledge graph completion. The prevailing paradigm, prefix-tuning, simply concatenates knowledge embeddings with text inputs. However, this shallow fusion overlooks the rich relational semantics within KGs and imposes a significant implicit reasoning burden on the LLM to correlate the prefix with the text. To address these, we propose Semantic-condition Tuning (SCT), a new knowledge injection paradigm comprising two key modules. First, a Semantic Graph Module employs a Graph Neural Network to extract a context-aware semantic condition from the local graph neighborhood, guided by knowledge-enhanced relations. Subsequently, this condition is passed to a Condition-Adaptive Fusion Module, which, in turn, adaptively modulates the textual embedding via two parameterized projectors, enabling a deep, feature-wise, and knowledge-aware interaction. The resulting pre-fused embedding is then fed into the LLM for fine-tuning. Extensive experiments on knowledge graph benchmarks demonstrate that SCT significantly outperforms prefix-tuning and other strong baselines. Our analysis confirms that by modulating the input representation with semantic graph context before LLM inference, SCT provides a more direct and potent signal, enabling more accurate and robust knowledge reasoning.
zh

[NLP-72] Unleashing Perception-Time Scaling to Multimodal Reasoning Models

【速读】：该论文旨在解决当前大型视觉语言模型（Large Vision-Language Models, LVLMs）在视觉估计任务中感知精度不足的问题，尤其是在推理时缩放（inference-time scaling）策略下提升有限的现状。其核心问题在于现有LVLMs采用“快速感知”范式（fast perception paradigm），将视觉理解视为一次性输出，未建模底层的感知过程，从而限制了感知能力的提升。解决方案的关键是提出感知时间缩放（Perception-Time Scaling, PTS）新范式，通过鼓励高密度token的感知过程，并将复杂感知任务分解为可处理的中间子问题，使感知机制与推理时缩放技术对齐并受益。实验表明，结合强化学习技术后，PTS显著提升了感知准确性，在DisTANCE基准上高精度性能从8.0%提升至64.7%，且具备良好的跨域泛化能力。

链接: https://arxiv.org/abs/2510.08964
作者: Yifan Li,Zhenghao Chen,Ziheng Wu,Kun Zhou,Ruipu Luo,Can Zhang,Zhentao He,Yufei Zhan,Wayne Xin Zhao,Minghui Qiu
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Beijing Key Laboratory of Research on Large Models and Intelligent Governance (北京市大模型与智能治理重点实验室); ByteDance (字节跳动); University of California, San Diego (加州大学圣地亚哥分校); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model’s attention to image tokens. Our code and data will be publicly released.
zh

[NLP-73] A Human Behavioral Baseline for Collective Governance in Software Projects NEURIPS2025

【速读】：该论文旨在解决开源社区中治理结构如何随时间演变的问题，特别是参与机制与控制权分配的变化模式。其核心问题是：在版本控制系统下的治理文档中，参与者角色、规则、行动和对象等要素是如何动态调整的？解决方案的关键在于构建一个可复现的分析框架，通过解析治理文本提取出“行为者（actors）”、“规则（rules）”、“行动（actions）”和“对象（objects）”，并利用熵（entropy）衡量分布均匀性、丰富度（richness）评估多样性、Jensen-Shannon散度（Jensen Shannon divergence）检测演化偏移，从而量化治理结构的演化特征。研究发现，项目随时间扩展了更多角色与行动且分布更均衡，而规则组成保持稳定，表明治理演化的本质是参与维度的扩展与平衡，而非规范效力的根本重构。

链接: https://arxiv.org/abs/2510.08956
作者: Mobina Noori,Mahasweta Chakraborti,Amy X Zhang,Seth Frey
机构: University of California Davis (加州大学戴维斯分校); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Algorithmic Collective Action Workshop @ NeurIPS 2025. arXiv admin note: text overlap with arXiv:2509.16295

点击查看摘要

Abstract:We study how open source communities describe participation and control through version controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI mediated workflows concentrate or redistribute authority.
zh

[NLP-74] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在复杂标准操作流程（Standard Operating Procedures, SOPs）场景下评估不足的问题。当前多数基准测试聚焦于指令遵循和简单决策，难以刻画真实业务中多分支、高逻辑复杂度的SOP任务。解决方案的关键在于提出SOP-Maze基准，其基于真实业务数据构建了397个来自23个复杂SOP场景的任务，并将任务系统性地划分为两类：横向根系（Lateral Root System, LRS）——强调宽选项中的精确选择能力；纵向根系（Heart Root System, HRS）——侧重深层逻辑推理与复杂分支处理。实验表明，主流LLM在该基准上表现普遍不佳，暴露出路径盲区（route blindness）、对话脆弱性（conversational fragility）和计算错误（calculation errors）三大核心问题，从而为提升模型在真实业务场景中的泛化与鲁棒性提供了明确方向。

链接: https://arxiv.org/abs/2510.08942
作者: Jiaming Wang,Zhe Tang,Yilin Jin,Peng Ding,Xiaoyu Li,Xuezhi Cao
机构: Meituan M17 (美团); Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on this https URL.
zh

[NLP-75] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions EMNLP2025

【速读】：该论文试图解决的问题是：大型语言模型（Large Language Models, LLMs）在处理提示（prompts）时，其内部表征中是否存在类似人类印象和刻板印象的模式，以及这些“人工印象”（artificial impressions）如何影响模型的下游行为。解决方案的关键在于使用线性探测器（linear probes）对LLMs生成提示的隐藏表示进行建模，以预测基于双维度刻板印象内容模型（Stereotype Content Model, SCM）的人类印象，并进一步分析这些人工印象与模型响应质量、模糊表达（hedging）使用以及提示中的内容、风格和方言特征之间的关联。研究发现，尽管LLMs在输出层面表现出对印象的不一致性，但其内部表征中却能更稳定地解码出这些印象，表明人工印象是模型内部结构中可识别且具预测力的成分。

链接: https://arxiv.org/abs/2510.08915
作者: Nicholas Deas,Kathleen McKeown
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Camera Ready

点击查看摘要

Abstract:We introduce and study artificial impressions–patterns in LLMs’ internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.
zh

[NLP-76] Autoencoding-Free Context Compression for LLM s via Contextual Semantic Anchors

【速读】：该论文旨在解决当前上下文压缩（Context Compression）方法在训练过程中存在的任务不匹配问题：现有方法主要依赖自编码（Autoencoding）任务来训练与上下文无关的压缩标记（Compression Tokens），虽然能赋予模型压缩能力，但其优化目标（重建精度）与下游实际任务存在偏差，导致压缩后的特征对真实应用场景的适配性下降。解决方案的关键在于提出语义锚点压缩（Semantic-Anchor Compression, SAC），该方法摒弃了自编码训练范式，转而直接从原始上下文中选取关键的锚点标记（Anchor Tokens），并将上下文信息聚合到其键值（Key-Value, KV）表示中，从而实现无需额外训练的压缩能力。SAC通过两个核心设计保障性能：一是引入锚点嵌入（Anchor Embeddings）以识别重要上下文标记，二是采用双向注意力机制修改（Bidirectional Attention Modification）使锚点标记能够捕获全局上下文信息，最终在多种压缩比下均显著优于现有方法，尤其在分布外评估（MRQA）中表现突出。

链接: https://arxiv.org/abs/2510.08907
作者: Xin Liu,RunSong Zhao,PengCheng Huang,XinYu Liu,JunYi Xiao,ChunYang Xiao,Tong Xiao,Shengxiang Gao,Zhengtao Yu,JingBo Zhu
机构: NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China; Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
类目: Computation and Language (cs.CL)
备注: 18 pages,9 figures

点击查看摘要

Abstract:Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for reconstruction that diverge from actual downstream tasks, thereby weakening the features more beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability \textita priori. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
zh

[NLP-77] A Unified Biomedical Named Entity Recognition Framework with Large Language Models

【速读】：该论文旨在解决生物医学命名实体识别（BioNER）中面临的三个核心挑战：嵌套实体（nested entities）识别困难、实体边界模糊以及跨语言泛化能力弱的问题。其解决方案的关键在于提出一个基于大语言模型（LLMs）的统一框架，首先将BioNER任务重构为文本生成任务，并设计了一种符号化标记策略（symbolic tagging strategy），能够显式标注实体边界并同时处理扁平与嵌套实体；其次通过多语种联合微调（bilingual joint fine-tuning）提升模型在中英文数据上的多任务泛化能力；最后引入基于对比学习的实体选择器（contrastive learning-based entity selector），利用边界敏感的正负样本过滤错误或冗余预测，从而显著提升识别精度与零样本跨语言迁移性能。

链接: https://arxiv.org/abs/2510.08902
作者: Tengxiao Lv,Ling Luo,Juntao Li,Yanhua Wang,Yuchen Pan,Chao Liu,Yanan Wang,Yan Jiang,Huiyi Lv,Yuanyuan Sun,Jian Wang,Hongfei Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a short paper at BIBM2025

点击查看摘要

Abstract:Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at this https URL.
zh

[NLP-78] HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance

【速读】：该论文旨在解决Text-to-SQL生成任务中语义准确性与计算效率难以兼顾的问题，即如何在保证生成SQL查询语法和逻辑正确性的同时，提升其在数据库管理系统（DBMS）中的执行效率。解决方案的关键在于提出了一种名为HES-SQL的混合训练框架，该框架融合了思维模式融合的监督微调（SFT）与组相对策略优化（GRPO），并引入三项核心创新：(1) 基于结构完整性的评分机制以增强生成查询与最优SQL结构之间的偏好对齐；(2) 查询延迟感知的奖励函数以激励生成计算高效的SQL语句；(3) 自蒸馏的思维模式补全过程以防止模型推理能力退化。该方法使模型能够在推理与非推理模式间动态切换，在保持高语义准确率的同时显著降低查询延迟（相比监督基线提升11%–20%）。

链接: https://arxiv.org/abs/2510.08896
作者: Suming Qiu,Jing Li,Zhicheng Zhou,Junjie Huang,Linyuan Qiu,Zhijie Sun
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces three key innovations: (1) a skeleton-completeness scoring mechanism that enhances preference alignment between generated queries and optimal SQL structures; (2) a query-latency-aware reward system that incentivizes the generation of computationally efficient SQL queries; (3) a self-distillation process for thinking-mode completion that prevents degradation of the model’s reasoning capabilities. This framework enables hybrid thinking models to switch between reasoning and non-reasoning modes while improving SQL query accuracy and execution efficiency. Experimental evaluation, conducted on MySQL 8.0 and SQLite 3.42 under controlled single-user conditions, demonstrates that HES-SQL achieves competitive performance with execution accuracies of 79.14% and 54.9% on the BIRD and KaggleDBQA benchmarks, respectively. Query latency is measured as the end-to-end execution time of generated queries on the DBMS, averaged over multiple runs to mitigate variance. Efficiency gains range from 11% to 20% relative to supervised baselines. Our results establish a new paradigm for Text-to-SQL systems that effectively balances semantic accuracy with computational efficiency through execution-informed reinforcement learning (RL). The proposed methodology has significant implications for developing robust natural language interfaces to databases and can be extended to broader structured generation tasks requiring both correctness and efficiency optimization. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2510.08896 [cs.DB] (or arXiv:2510.08896v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2510.08896 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-79] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在强化学习（Reinforcement Learning）驱动下的推理能力提升问题，尤其关注生成过程中不同token类型对探索行为的影响。现有方法通常通过限制参数更新来间接鼓励探索，但未在采样阶段显式促进探索性行为。解决方案的关键在于：区分推理token（high-entropy reasoning tokens）与知识token（low-entropy knowledge tokens），并采用不同的温度设置进行采样——对推理token使用较高温度以主动增强探索性，同时对知识token保持较低温度以保障事实准确性。该多温度调度策略在多个推理基准测试中显著提升了LLMs的推理性能。

链接: https://arxiv.org/abs/2510.08892
作者: Haomin Zhuang,Yujun Zhou,Taicheng Guo,Yue Huang,Fangxu Liu,Kai Song,Xiangliang Zhang
机构: University of Notre Dame (圣母大学); ByteDance (字节跳动)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at this https URL.
zh

[NLP-80] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLM s

【速读】：该论文旨在解决当前生成式 AI（Generative AI）在处理结构化、依赖性强且受分类体系（taxonomy）驱动的财务文档时，推理能力不足的问题。具体而言，GAAP（美国通用会计准则）的复杂性和XBRL（eXtensible Business Reporting Language）文件的层级结构使得财务审计自动化和验证变得困难，而现有大语言模型（LLMs）在面对此类任务时表现不稳定。解决方案的关键在于提出首个与分类体系对齐、结构感知的多文档基准测试集FinAuditing，其包含三个互补子任务：FinSM（语义一致性）、FinRE（关系一致性）和FinMR（数值一致性），并设计了一个融合检索、分类与推理指标的统一评估框架，从而系统性地评估LLMs在金融审计中的结构化推理能力。

链接: https://arxiv.org/abs/2510.08886
作者: Yan Wang,Keyi Wang,Shanshan Yang,Jaisal Patel,Jeff Zhao,Fengran Mo,Xueqing Peng,Lingfei Qian,Jimin Huang,Guojun Xiong,Xiao-Yang Liu,Jian-Yun Nie
机构: The Fin AI(金融科技公司); Columbia University (哥伦比亚大学); Stevens Institute of Technology (史蒂文斯理工学院); Rensselaer Polytechnic Institute (伦斯勒理工学院); UT Austin (德克萨斯大学奥斯汀分校); University of Montreal (蒙特利尔大学); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
zh

[NLP-81] ControlAudio: Tackling Text-Guided Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

【速读】：该论文旨在解决可控文本到音频（Text-to-Audio, TTA）生成中因数据稀缺导致的大规模生成性能受限问题，尤其关注对时间精度和语音内容可理解性的精细控制。其解决方案的关键在于将可控TTA生成建模为多任务学习问题，并提出一种渐进式扩散建模方法ControlAudio：首先通过标注与仿真相结合的数据构建策略增强文本、时间戳和音素特征的条件信息；其次在模型训练阶段，先在大规模文本-音频对上预训练扩散Transformer（Diffusion Transformer, DiT），再逐步引入时间与音素特征并融合统一语义表示以提升可控性；最后在推理阶段采用渐进引导生成策略，按粗粒度到细粒度顺序强调条件信息，充分利用DiT的粗到精采样特性，从而显著提升生成音频的时间准确性与语音清晰度，在客观和主观评估上均优于现有方法。

链接: https://arxiv.org/abs/2510.08878
作者: Yuxuan Jiang,Zehua Chen,Zeqian Ju,Yusheng Dai,Weibei Dou,Jun Zhu
机构: Tsinghua University (清华大学); Shengshu AI; University of Science and Technology of China (中国科学技术大学); Monash University (莫纳什大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 18 pages, 8 tables, 5 figures

点击查看摘要

Abstract:Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: this https URL.
zh

[NLP-82] Quality Estimation Reranking for Document-Level Translation

【速读】：该论文旨在解决文档级机器翻译（document-level machine translation, MT）中译文质量提升的问题，尤其关注在生成候选译文池后如何通过质量估计（Quality Estimation, QE）重排序技术实现更优的翻译结果。其解决方案的关键在于：利用不同类型的QE指标（包括基于学习的模型SLIDE和大型语言模型（Large Language Model, LLM）驱动的GEMBA-DA）对候选译文进行评分与重排序，并在文档级别验证其有效性。实验表明，使用SLIDE或GEMBA-DA进行重排序可在仅需少量候选译文（如2个）时即显著提升BLEURT-20分数（最高达+5.09），且在长文档场景下仍保持稳定增益（如+2.34），证明了文档级QE重排序在实际应用中的高效性与可行性，同时具备较低的运行时开销。

链接: https://arxiv.org/abs/2510.08870
作者: Krzysztof Mrozinski,Minji Kang,Ahmed Khota,Vincent Michael Sutanto,Giovanni Gatti De Giacomo
机构: Yaraku, Inc
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
zh

[NLP-83] ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

【速读】：该论文旨在解决科学出版中同行评审（peer review）存在的不一致性、评审者主观性以及可扩展性挑战等问题。其解决方案的关键在于提出并验证了一个名为ReviewerToo的模块化框架，该框架支持AI辅助同行评审，通过构建特定的评审角色（reviewer personas）和结构化的评估标准，实现对论文的系统性、一致性的初步判断，并可部分或完全集成到真实会议工作流中。实验表明，基于gpt-oss-120b模型的AI评审在分类接受/拒绝任务上达到81.8%准确率（接近人类平均的83.9%），且生成的评审质量被LLM裁判评价高于人类平均水平，尤其在事实核查和文献覆盖方面表现优异，但在方法新颖性和理论贡献等复杂判断上仍需人类专家介入。因此，ReviewerToo为构建可扩展、公平且高效的混合型同行评审体系提供了实证基础与实践指南。

链接: https://arxiv.org/abs/2510.08867
作者: Gaurav Sahu,Hugo Larochelle,Laurent Charlin,Christopher Pal
机构: Mila – Quebec AI Institute (魁北克人工智能研究所); HEC Montréal (蒙特利尔高等商学院); ServiceNow Research (ServiceNow 研究院); Université de Montréal (蒙特利尔大学); Canada CIFAR Chair (加拿大 CIFAR 讲座); Polytechnique Montréal (蒙特利尔理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieves 81.8% accuracy for the task of categorizing a paper as accept/reject compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though still trailing the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.
zh

[NLP-84] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在多轮对话中易受“渐进式越狱攻击”（multi-turn jailbreaking attacks）的问题，此类攻击利用对话上下文逐步绕过安全约束，且针对不同危害类别（如恶意软件生成、骚扰或欺诈）采用差异化的对话策略。现有方法多依赖启发式探索，缺乏对模型脆弱性与对话模式之间关系的系统理解。其解决方案的关键在于提出Pattern Enhanced Chain of Attack (PE-CoA) 框架，通过五种自然对话模式构建高效多轮越狱攻击，揭示了模型在不同对话模式下的特定弱点及行为特征：即模型对某一模式的鲁棒性无法泛化至其他模式，且不同模型家族表现出相似的失败模式，从而凸显当前安全训练的局限性，并强调需发展基于对话模式感知的防御机制。

链接: https://arxiv.org/abs/2510.08859
作者: Ragib Amin Nihal,Rui Wen,Kazuhiro Nakadai,Jun Sakuma
机构: Institute of Science Tokyo (东京科学研究所); RIKEN AIP (理化学研究所人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: this https URL
zh

[NLP-85] me-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training ICLR2025

【速读】：该论文旨在解决稀疏自编码器（Sparse Autoencoders, SAEs）在训练过程中存在的特征吸收（feature absorption）问题，即多个特征或神经元相互吸收以最小化 L₁ 正则化惩罚，导致难以稳定识别和分析大语言模型中的可解释特征。解决方案的关键在于提出自适应时间掩码（Adaptive Temporal Masking, ATM），其通过动态追踪激活幅度、频率及重构贡献来计算随时间演化的特征重要性得分，并基于这些得分的统计阈值实施概率掩码机制，从而实现更自然且稳定的特征选择过程。实验表明，ATM 在 Gemma-2-2b 模型上显著降低了吸收分数，同时保持了优异的重建质量，为学习稳定、可解释的神经网络特征提供了原则性的方法。

链接: https://arxiv.org/abs/2510.08855
作者: T. Ed Li,Junyu Ren
机构: Yale University(耶鲁大学); University of Chicago(芝加哥大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First submitted on February 10th, 2025 to ICLR 2025 Workshop (XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge). The paper was accepted but the workshop does not generate proceedings. Now uploading to arXiv to make the paper publicly available

点击查看摘要

Abstract:Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize L_1 penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
zh

[NLP-86] Everyone prefers human writers including AI

【速读】：该论文旨在解决生成式 AI (Generative AI) 在文学风格评估中是否存在 Attribution Bias（归属偏见）的问题，即人类与AI在判断文本作者身份时是否因来源不同而产生系统性评价差异。研究通过控制实验设计，利用雷蒙·昆诺（Raymond Queneau）的《风格练习》（Exercises in Style）作为基准文本，对比人类与AI模型在盲测、准确标注和反事实标注三种条件下对原始文本与GPT-4生成版本的评分差异。关键解决方案在于构建一个严谨的实验框架，量化了人类（+13.7百分点）和AI（+34.3百分点）均存在显著的亲人类归属偏见，并进一步验证该偏见在不同AI架构间具有普遍性（+25.8百分点），揭示出AI不仅复制了人类文化中对人工创作的歧视倾向，还将其放大，且标签本身可导致评估标准反转——同一内容因署名不同而被赋予相反评价，表明AI在训练过程中吸收并强化了人类社会的文化偏见。

链接: https://arxiv.org/abs/2510.08831
作者: Wouter Haverals,Meredith Martin
机构: Princeton University (普林斯顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 46 pages, 18 figures (5 main text + 13 supplementary), 5 tables

点击查看摘要

Abstract:As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau’s Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14 \times 14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen’s h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P 0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as “AI-generated” regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
zh

[NLP-87] McMining: Automated Discovery of Misconceptions in Student Code

【速读】：该论文旨在解决学生在编程学习过程中因对编程语言概念理解偏差而产生的编程误解（misconceptions）问题，这类误解可能导致代码错误或效率低下，并阻碍后续知识的掌握。解决方案的关键在于提出“McMining”任务——从学生编写的代码样本中挖掘潜在的编程误解，并构建了一个可扩展的谬误基准数据集及大量体现这些误解的代码样本，从而支持McMining系统的训练与评估；同时设计了基于大语言模型（LLM）的两种McMiner方法，实证表明Gemini、Claude和GPT系列模型在识别学生代码中的编程误解方面具有较高有效性。

链接: https://arxiv.org/abs/2510.08827
作者: Erfan Al-Hossami,Razvan Bunescu
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:When learning to code, students often develop misconceptions about various programming language concepts. These can not only lead to bugs or inefficient code, but also slow down the learning of related concepts. In this paper, we introduce McMining, the task of mining programming misconceptions from samples of code from a student. To enable the training and evaluation of McMining systems, we develop an extensible benchmark dataset of misconceptions together with a large set of code samples where these misconceptions are manifested. We then introduce two LLM-based McMiner approaches and through extensive evaluations show that models from the Gemini, Claude, and GPT families are effective at discovering misconceptions in student code.
zh

[NLP-88] Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在知识密集型多跳问答任务中表现不可靠的问题，具体表现为遗漏长尾事实、产生幻觉以及内部知识滞后于现实世界变化。现有知识图谱问答（Knowledge Graph Question Answering, KGQA）方法存在根本性权衡：预定义完整SPARQL查询易受关系缺失影响，子图检索引入噪声，复杂代理框架则导致搜索空间指数级膨胀。解决方案的关键在于提出一种名为“图上搜索”（Search-on-Graph, SoG）的简单而有效的框架，其核心是通过一个精心设计的单一 \textscSearch 函数实现LLM对知识图谱的迭代式智能导航；SoG遵循“观察后导航”原则——每一步仅基于当前实体的实际可用关系进行决策，从而避免预规划路径或大规模子图检索，并能自适应不同知识图谱模式及高阶节点，无需微调即可在六个KGQA基准测试（涵盖Freebase和Wikidata）上达到最优性能，尤其在Wikidata上提升达16%。

链接: https://arxiv.org/abs/2510.08825
作者: Jia Ao Sun,Hao Yu,Fabrizio Gotti,Fengran Mo,Yihong Wu,Yuchen Hui,Jian-Yun Nie
机构: Université de Montréal (蒙特利尔大学); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions – they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed \textscSearch function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an ``observe-then-navigate’’ principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
zh

[NLP-89] he Models Language Matters: A Comparative Privacy Analysis of LLM s

【速读】：该论文试图解决多语言大语言模型（Large Language Models, LLMs）在处理敏感数据时因语言结构差异导致的隐私泄露问题，尤其是在不同语种（如英语、西班牙语、法语和意大利语）医学语料库训练下隐私风险的量化评估。其解决方案的关键在于通过量化六种语言特征指标并分析三种攻击向量（提取攻击、反事实记忆攻击和成员推理攻击），首次提供了语言因素对隐私泄露影响的定量证据，揭示了意大利语因高冗余度和细粒度分词导致最强泄露，而法语和西班牙语因形态复杂性表现出更强的抗泄露能力，从而强调了在LLM部署中采用语言感知型隐私保护机制的必要性。

链接: https://arxiv.org/abs/2510.08813
作者: Abhishek K. Mishra,Antoine Boutet,Lucas Magnana
机构: Inria(法国国家信息与自动化研究所); INSA Lyon (里昂国立应用科学学院); CITI (通信与信息技术中心)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Mostly evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.
zh

[NLP-90] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

【速读】：该论文旨在解决科学计算任务中生成高质量、可解释且鲁棒的代码难题，这类任务通常要求算法严谨、与领域知识深度耦合，并通过多步骤子问题求解实现最终目标，而传统通用编码大模型难以满足此类复杂需求。其解决方案的关键在于提出一种无需训练的多智能体框架MOSAIC，该框架基于“学生-教师”范式设计了具备自我反思、推理构建、代码生成与调试能力的专用代理（agent），并结合 Consolidated Context Window (CCW) 技术有效缓解链式子问题中的大语言模型（LLM）幻觉问题，从而实现分步问题分解、精准错误修正和高可解释性输出。

链接: https://arxiv.org/abs/2510.08804
作者: Siddeshwar Raghavan,Tanwi Mallick
机构: Argonne National Laboratory (阿贡国家实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
zh

[NLP-91] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在通用中文语境下综合评估不足的问题，特别是其在整合中文特定事实知识与多步逻辑推理能力方面的局限性。解决方案的关键在于构建了一个名为“中文常识多跳推理”（Chinese Commonsense Multi-hop Reasoning, CCMOR）的新基准，该基准通过从现有问答数据集中构建领域平衡的种子集，并利用LLM驱动的流水线生成基于事实单元链锚定的多跳问题，同时引入人机协同验证机制由领域专家系统性地审核和优化生成的问题，从而确保数据质量。实验表明，该基准揭示了LLMs在处理长尾知识和执行知识密集型推理上的持续短板，而检索增强生成（Retrieval-Augmented Generation, RAG）能显著缓解这些知识缺口并提升性能。

链接: https://arxiv.org/abs/2510.08800
作者: Wangjie You,Xusheng Wang,Xing Wang,Wenxiang Jiao,Chao Feng,Juntao Li,Min Zhang
机构: Douyin Content Group, ByteDance (抖音内容团队，字节跳动); School of Computer Science and Technology, Soochow University (苏州大学计算机科学与技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs’ ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs’ ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
zh

[NLP-92] Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models EMNLP2025

【速读】：该论文旨在解决Transformer模型中注意力机制随序列长度呈二次增长（O(n²)）导致的长序列处理效率低下问题，从而限制了其在长文本任务中的应用。解决方案的关键在于提出一种名为“自适应保留”（Adaptive Retention）的概率性、逐层标记选择机制，该机制通过学习在全局预算M内保留哪些表示来优化计算资源分配；其核心创新在于使用伯努利门控（Bernoulli gates）结合Hard-Concrete/变分松弛进行训练，并在推理阶段采用简单的Top-M规则强制执行，使方法具备可微性和对标准编码器的即插即用特性，从而在保持95%模型性能的同时显著降低峰值内存消耗（约35–45%）并提升吞吐量（最高达1.8倍）。

链接: https://arxiv.org/abs/2510.08798
作者: S M Rafiuddin,Muntaha Nujat Khan
机构: Oklahoma State University (俄克拉荷马州立大学)
类目: Computation and Language (cs.CL)
备注: 14 Pages, 2 Figures, 6 Table, Accepted at EMNLP 2025 Findings as a Short Paper

点击查看摘要

Abstract:Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves = 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.
zh

[NLP-93] Measuring Moral LLM Responses in Multilingual Capacities

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在多语言场景下响应准确性与一致性不足的问题，尤其是在低资源和高资源语言之间的差异性表现。其解决方案的关键在于构建一个涵盖五个维度（如同意、自主权、危害预防等）的多语言评估框架，结合五级评分量表与判别型语言模型（judge LLM）对前沿及主流开源模型进行系统评测，从而量化不同模型在跨语言语境中的性能波动，识别出如GPT-5在多数类别中表现最优而Gemini 2.5 Pro在安全类任务中显著落后等关键发现，为提升LLM多语言安全性与鲁棒性提供实证依据与改进方向。

链接: https://arxiv.org/abs/2510.08776
作者: Kimaya Basu,Savi Kolari,Allison Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures; referenced articles: arXiv:2303.08774 , arXiv:2303.12528 , arXiv:2308.14132 , arXiv:2505.12201 , arXiv:2406.04428 , arXiv:2407.02273 , arXiv:2404.01268 , arXiv:2502.09747 , arXiv:2507.13474 , arXiv:2505.21479 , arXiv:2306.05685

点击查看摘要

Abstract:With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent Autonomy and Harm Prevention Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.
zh

[NLP-94] Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生成文本嵌入（text embeddings）时忽视结构信息（如超链接或引用关系）的问题，而这些结构信息在许多现实世界数据集中提供了关键上下文。其解决方案的关键在于将结构关系直接整合进LLM的内部编码过程，而非依赖传统的后处理聚合方法，具体采用两种在过程中（in-process）的方法：顺序拼接（sequential concatenation）和并行缓存（parallel caching），从而实现更结构感知的文本嵌入生成。

链接: https://arxiv.org/abs/2510.08774
作者: Shikun Liu,Haoyu Wang,Mufei Li,Pan Li
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM’s internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
zh

[NLP-95] Robust Heuristic Algorithm Design with LLM s

【速读】：该论文旨在解决现有启发式算法在鲁棒性和性能上的局限性问题，尤其是在面对复杂输入空间时难以保持稳定表现的挑战。其解决方案的关键在于将大语言模型（Large Language Models, LLMs）用于启发式设计的过程中，引入可解释性工具以识别启发式失效的具体实例、分析失效原因，并据此对输入空间中的特定区域进行针对性优化。通过这一机制，所生成的启发式策略相较于现有方法（如FunSearch）展现出约28倍的最差情况性能提升，同时改善平均性能并维持原有运行时间，从而显著增强算法的整体鲁棒性与实用性。

链接: https://arxiv.org/abs/2510.08755
作者: Pantea Karimi,Dany Rouhana,Pooria Namyar,Siva Kesava Reddy Kakarla,Venkat Arun,Behnaz Arzani
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and suggestions about how to fix them. We find even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques~ – ~the heuristics we produce have a \sim28\times better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.
zh

[NLP-96] Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning

【速读】：该论文旨在解决联邦学习（Federated Learning, FL）中训练数据记忆化（memorization）风险的检测难题，尤其是现有方法仅能逐样本检测，难以捕捉跨样本的细微记忆现象。其解决方案的关键在于提出一个细粒度的跨客户端记忆测量框架，能够量化FL中的客户端内（intra-client）和客户端间（inter-client）记忆化程度，从而更全面地评估模型对原始数据的记忆风险。该框架基于对所有客户端训练数据的交叉样本记忆分析，使原本仅适用于集中式学习（Centralized Learning, CL）的精细记忆检测方法可适配于FL场景。

链接: https://arxiv.org/abs/2510.08750
作者: Tinnakit Udsa,Can Udomcharoenchaikit,Patomporn Payoungkhamdee,Sarana Nutanong,Norrathep Rattanavipanon
机构: School of Information Science and Technology, VISTEC (VISTEC信息科学与技术学院); College of Computing, Prince of Songkla University (宋卡王子大学计算机学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.
zh

[NLP-97] Coordinates from Context: Using LLM s to Ground Complex Location References

【速读】：该论文旨在解决复合位置引用的地理编码（geocoding）问题，即如何将由多个语义成分构成的位置描述（如“纽约市中央公园附近的星巴克”）准确映射到具体的地理坐标。传统方法在处理此类复杂、非结构化的文本时表现有限，而本文基于大语言模型（LLM）在空间知识推理方面的潜力，提出了一种结合模型预训练知识与轻量微调的策略。其解决方案的关键在于：首先系统评估LLM对地理空间信息的理解与推理能力，进而设计一种针对复合位置参考的分步推理框架，使得一个相对较小的微调后LLM即可达到甚至超越更大规模现成模型的性能，从而提升了效率与实用性。

链接: https://arxiv.org/abs/2510.08741
作者: Tessa Masis,Brendan O’Connor
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review at ARR

点击查看摘要

Abstract:Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs’ abilities to reason over geospatial data, we evaluate LLMs’ geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.
zh

[NLP-98] When to Reason : Semantic Router for vLLM NEURIPS2025

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在推理过程中因启用复杂推理模式（如思维链 Chain-of-Thought 和推理时缩放 Inference-time Scaling）而导致的显著延迟和资源消耗问题，尤其是在处理简单任务时这种开销是不必要的。解决方案的关键在于提出一种语义路由（Semantic Router）机制，能够根据输入查询的语义特征自动判断其是否需要推理增强，并仅在必要时激活高精度推理模式，从而在保证性能提升的同时大幅降低延迟与token消耗。实验表明，该方法在MMLU-Pro基准上实现准确率提升10.2个百分点，同时相比直接推理（vLLM）将响应延迟减少47.1%、token使用量降低48.5%。

链接: https://arxiv.org/abs/2510.08731
作者: Chen Wang,Xunzhuo Liu,Yuhan Liu,Yue Zhu,Xiangxi Mo,Junchen Jiang,Huamin Chen
机构: IBM Research (IBM 研究院); Tencent (腾讯); University of Chicago (芝加哥大学); UC Berkeley (加州大学伯克利分校); Red Hat (红帽)
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
备注: 5 pages, excluding references and appendix. To be appeared at Workshop on ML for Systems at NeurIPS 2025, December 6, 2025 this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems
zh

[NLP-99] How Reliable is Language Model Micro-Benchmarking?

【速读】：该论文旨在解决微基准测试（micro-benchmarking）在语言模型评估中可靠性不足的问题，即：微基准测试能否像完整基准测试一样稳定地对模型进行排序？其解决方案的关键在于引入一种元评估指标（meta-evaluation measure），用于衡量微基准测试在不同模型性能差异下正确排序的能力。该方法可识别哪些模型对能被微基准测试准确排名，从而实现对微基准测试规模与可靠性之间权衡的细粒度分析。研究发现，仅用少量样本（如10个示例）无法可靠区分性能相近的模型（如MMLU-Pro上相差3.5分或BIG-bench Hard上相差4分），而通常需至少250个样本才能保证一致性，此时随机采样已具备与现有微基准方法相当的可靠性。

链接: https://arxiv.org/abs/2510.08730
作者: Gregory Yauney,Shahzaib Saqib Warraich,Swabha Swayamdipta
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
zh

[NLP-100] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）自动生成测试用例（test cases）时缺乏有效评估基准的问题。现有基准存在计算成本高、评分虚高以及对简单错误敏感而忽略罕见但关键缺陷的偏倚，导致无法真实反映生成方法的诊断能力。其解决方案的关键在于将基准构建形式化为在二进制代码-测试矩阵中寻找最优诊断基底（diagnostic basis），通过矩阵的秩确定最小独立错误模式数量，并以此作为完整故障覆盖所需测试用例数的紧致上界；进一步提出 WrongSelect 算法，以近似求解 NP-hard 问题，选择内部多样性最大的错误代码子集，从而构造出紧凑、多样且抗评分膨胀的 TC-Bench 基准。实验表明，当前最先进的测试用例生成方法在 TC-Bench 上仅能实现约 60% 的错误排除率，揭示了其诊断能力的显著不足。

链接: https://arxiv.org/abs/2510.08720
作者: Xianzhen Luo,Jinyang Huang,Wenzhen Zheng,Qingfu Zhu,Mingzheng Xu,Yiheng Xu,Yuantao Fan,Libo Qin,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); Central South University (中南大学); StepFun; Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: this https URL and our code is at: this https URL.
zh

[NLP-101] hinking Longer Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在法律领域中进行案例推理（Case-based Reasoning）时，对案件间关键差异识别与论证分析能力不足的问题。其核心挑战在于如何将复杂的法律推理过程形式化并分层评估，以揭示模型在不同推理层级上的表现差异。解决方案的关键在于提出一个三阶段的形式化框架：首先通过事实谓词（factual predicates）建模案件要素，构建法律知识层次结构；其次定义可验证的规则用于识别区分点、分析其论证支持度，并最终评估其法律重要性。该框架使研究者能够对LLM在表面推理（Task 1）、层级推理（Task 2）和整合分析（Task 3）三个维度上进行细粒度测评，从而发现模型虽在浅层任务表现良好，但在深层法律推理中性能显著下降，且存在“计算资源投入越多越易出错”的悖论现象，为提升法律AI系统的可靠性提供了关键方法论基础。

链接: https://arxiv.org/abs/2510.08710
作者: Li Zhang,Matthias Grabmair,Morgan Gray,Kevin Ashley
机构: University of Pittsburgh (匹兹堡大学); Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that “thinking longer” does not always mean “thinking smarter.” Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
zh

[NLP-102] Scaling Laws for Code: A More Data-Hungry Regime

【速读】：该论文旨在解决代码大语言模型（Code Large Language Models, Code LLMs）的训练效率优化问题，特别是现有基于自然语言（Natural Language, NL）的缩放定律（scaling laws）是否适用于代码这一具有严格语法结构的特殊模态。其关键解决方案是通过大规模实证研究（117次实验，模型规模0.2B–3.8B，训练token数2B–128B），系统性地验证并比较Chinchilla缩放定律与Farseer缩放定律在代码场景下的适用性，结果表明：Farseer定律更具表达力且预测更准确；代码LLM在模型规模上具有良好扩展性，但需要更高的数据-参数比（data-to-parameter ratio），即代码训练更依赖大规模数据；此外，在代码与自然语言混合训练中，自然语言可缓解计算资源受限场景下的性能瓶颈，但在高算力预算下反而会降低模型表现。

链接: https://arxiv.org/abs/2510.08702
作者: Xianzhen Luo,Wenzhen Zheng,Qingfu Zhu,Rongyi Zhang,Houyi Li,Siming Huang,YuanTao Fan,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学); StepFun; Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide the efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences like strict syntax between code and NL, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farsser law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
zh

[NLP-103] BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在代码生成任务中缺乏高效、可靠且可扩展的人类评估机制的问题。传统方法依赖人工逐行审查代码质量，不仅耗时费力，还难以准确判断代码功能正确性，尤其在多轮交互和复杂执行环境下的评估尤为困难。解决方案的关键在于构建一个名为BigCodeArena的开源人类评估平台，其核心创新是集成实时代码执行环境，使评估者能够直接观察LLM生成代码的运行结果并据此进行偏好判断。该平台基于Chatbot Arena开发，支持跨10种编程语言和8类执行环境的动态验证，并收集了超过14,000个代码导向对话会话及4,700个多轮配对偏好样本。在此基础上，研究进一步提出两个基准：BigCodeReward用于衡量奖励模型与人类偏好的一致性，以及AutoCodeArena——一种无需人工参与的自动Elo评分系统，用于客观量化前沿LLMs的代码生成能力。实验表明，具备执行反馈的LLMs在编码偏好判断上表现更优，且GPT-5、Claude-Sonnet-4等专有模型仍保持领先优势。

链接: https://arxiv.org/abs/2510.08697
作者: Terry Yue Zhuo,Xiaolong Jin,Hange Liu,Juyong Jiang,Tianyang Liu,Chen Gong,Bhupesh Bishnoi,Vaisakhi Mishra,Marek Suppa,Noah Ziems,Saiteja Utpala,Ming Xu,Guangyu Song,Kaixin Li,Yuhan Cao,Bo Liu,Zheng Liu,Sabina Abdurakhmanova,Wenhao Yu,Mengzhao Jia,Jihan Yao,Kenneth Hamilton,Kumar Shridhar,Minh Chien Vu,Dingmin Wang,Jiawei Liu,Zijian Wang,Qian Liu,Binyuan Hui,Meg Risdal,Ahsen Khaliq,Atin Sood,Zhenchang Xing,Wasi Uddin Ahmad,John Grundy,David Lo,Banghua Zhu,Xiaoning Du,Torsten Scholak,Leandro von Werra
机构: Monash University (莫纳什大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61); Purdue University (普渡大学); Independent (独立); HKUST (Guangzhou) (香港科技大学（广州）); UCSD (加州大学圣地亚哥分校); UVA (弗吉尼亚大学); CNRS, France (法国国家科学研究中心); IBM (国际商业机器公司); Cisco (思科); Comenius University in Bratislava (布拉迪斯拉发考门斯基大学); University of Notre Dame (圣母大学); Uber (优步); Tano Labs (Tano实验室); NUS (新加坡国立大学); Institute of Automation, CAS (中国科学院自动化研究所); Tencent AI Lab (腾讯人工智能实验室); University of Washington (华盛顿大学); Nevsky Collective (涅夫斯基集体); ETH Zurich (苏黎世联邦理工学院); Detomo Inc (Detomo公司); University of Oxford (牛津大学); UIUC (伊利诺伊大学厄巴纳-香槟分校); Google (谷歌); NVIDIA (英伟达); Singapore Management University (新加坡管理大学); ServiceNow Research (ServiceNow研究院); Hugging Face (Hugging Face)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Built with love by the BigCode community :)

点击查看摘要

Abstract:Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.
zh

[NLP-104] Optimizing delivery for quick commerce factoring qualitative assessment of generated routes

【速读】：该论文旨在解决印度电商市场中末端配送（last-mile delivery）成本高企的问题，特别是传统车辆路径问题（Vehicle Routing Problem, VRP）求解器在实际应用中因地址不规范、地图信息不完整及距离估算计算约束等因素导致的效率低下问题。其解决方案的关键在于引入大语言模型（Large Language Models, LLMs）作为评估层，对VRP生成的路线进行基于政策规则的批判性分析，从而帮助物流运营商识别并优先选择更高效、可靠且可持续的配送方案。实验表明，开源LLM可实现79%的路由问题识别准确率，而专有推理模型可达86%，验证了LLM在超越传统距离与时间指标基础上提升末端物流优化能力的有效性与可扩展性。

链接: https://arxiv.org/abs/2510.08671
作者: Milon Bhattacharya,Milan Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Indias e-commerce market is projected to grow rapidly, with last-mile delivery accounting for nearly half of operational expenses. Although vehicle routing problem (VRP) based solvers are widely used for delivery planning, their effectiveness in real-world scenarios is limited due to unstructured addresses, incomplete maps, and computational constraints in distance estimation. This study proposes a framework that employs large language models (LLMs) to critique VRP-generated routes against policy-based criteria, allowing logistics operators to evaluate and prioritise more efficient delivery plans. As a illustration of our approach we generate, annotate and evaluated 400 cases using large language models. Our study found that open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models achieved reach upto 86%. The results demonstrate that LLM-based evaluation of VRP-generated routes can be an effective and scalable layer of evaluation which goes beyond beyond conventional distance and time based metrics. This has implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially for developing countries like India.
zh

[NLP-105] dInfer: An Efficient Inference Framework for Diffusion Language Models

【速读】：该论文旨在解决扩散模型类大语言模型（diffusion-based large language models, dLLMs）在实际应用中因缺乏标准化且高效的推理框架而导致的广泛采用受限问题。其解决方案的关键在于提出了一种名为dInfer的高效可扩展推理框架，该框架将推理流程分解为四个模块化组件——模型、扩散迭代管理器、解码策略和KV缓存管理器，并在每个组件中集成新颖的算法与系统级优化技术。通过算法创新与系统优化的协同作用，dInfer在不牺牲输出质量的前提下实现了显著的性能提升，在8×H800 GPU环境下，单批次推理速度超过1,100 tokens/秒（HumanEval基准），并在六个基准测试上平均达到800 tokens/秒以上，相较Fast-dLLM提速达10倍，同时较高度优化的自回归模型QWen2.5-3B仍具2–3倍速度优势。

链接: https://arxiv.org/abs/2510.08666
作者: Yuxin Ma,Lun Du,Lanning Wei,Kun Chen,Qian Xu,Kangyu Wang,Guofeng Feng,Guoshan Lu,Lin Liu,Xiaojing Qi,Xinyuan Zhang,Zhen Tao,Haibo Feng,Ziyun Jiang,Ying Xu,Zenan Huang,Yihong Zhuang,Haokai Xu,Jiaqi Hu,Zhenzhong Lan,Junbo Zhao,Jianguo Li,Da Zheng
机构: Ant Group(蚂蚁集团); Zhejiang University (浙江大学); Westlake University (西湖大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Even more and more open-sourced dLLM models emerge, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components-model, diffusion iteration manager, decoding strategy, and KV-cache manager-and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8\times H800 GPUs. Compared to prior systems, dInfer delivers 10\times speedup over Fast-dLLM while maintaining similar model performance. Even compared with AR models (with a comparable number of activation parameters and performance) QWen2.5-3B, which is highly optimized with latest vLLM inference engine, dInfer still deliverers 2 - 3\times speedup. The implementation of dInfer is open-sourced at this https URL.
zh

[NLP-106] A Novel Framework for Augmenting Rating Scale Tests with LLM -Scored Text Data

【速读】：该论文旨在解决传统心理评估中依赖结构化评分量表无法捕捉受试者自然语言丰富语义信息的问题。其解决方案的关键在于提出一种新颖的增强测试框架，通过结合大型语言模型（Large Language Model, LLM）对文本内容的评分与传统量表项目，实现对心理特质（以抑郁为例）的更精准测量。该方法不依赖预标注数据或复杂专家制定的评分标准，而是基于项目信息量计算，实证选择最优LLM评分指令，从而显著提升测量精度和准确性，相当于在原19项量表基础上增加6.3至16.0个等效项目。

链接: https://arxiv.org/abs/2510.08663
作者: Joe Watson,Ivan O’Conner,Chia-Wen Chen,Luning Sun,Fang Luo,David Stillwell
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent’s natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.
zh

[NLP-107] Formalizing Style in Personal Narratives

【速读】：该论文试图解决个人叙事中风格（style）缺乏系统分析框架的问题，即如何形式化地捕捉和解释作者在表达主观体验时的语言选择模式。其解决方案的关键在于构建一个整合功能语言学、计算机科学与心理学的跨学科框架：通过语言模型自动提取叙事中的语言特征（如过程、参与者和情境），识别出语言使用的序列模式，并将其与心理状态关联起来，从而实现对个人叙事风格的量化与意义阐释。

链接: https://arxiv.org/abs/2510.08649
作者: Gustave Cortal(ENS Paris Saclay, LISN),Alain Finkel(ENS Paris Saclay)
机构: Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF; Université Paris-Saclay, CNRS, LISN
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.
zh

[NLP-108] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression ACL2026

【速读】：该论文旨在解决长链式思维（Chain-of-Thought, CoT）在大型语言模型（Large Language Models, LLMs）推理过程中存在的计算成本高和延迟大的问题，其核心挑战在于如何在不显著损失推理能力的前提下实现CoT的高效压缩。解决方案的关键在于提出一种名为“前置思维嵌入”（Upfront CoT, UCoT）的协作式推理框架：该框架由一个小型压缩模型（compressor）和一个大型执行模型（executor）组成，第一阶段通过训练压缩模型生成富含推理信息的前置思维嵌入（upfront thought embeddings），从而避免人工设计提示词的繁琐过程；第二阶段利用奖励机制优化执行模型，使其能够基于这些嵌入以更短的推理路径得出正确答案。实验表明，UCoT在保持强大推理能力的同时显著缩短了CoT长度，在Qwen2.5-7B-Instruct模型上于GSM8K数据集上token使用量减少50%，且性能优于当前最优方法3.08%。

链接: https://arxiv.org/abs/2510.08647
作者: Chengzhengxu Li,Xiaoming Liu,Zhaohan Zhang,Shaochu Zhang,Shengchao Liu,Guoxin Ma,Yu Lan,Chao Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2026 Under Review

点击查看摘要

Abstract:Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency losses owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works trade reasoning efficiency by either laborious discrete prompt designing or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes executor to utilize upfront thought embeddings to derive the correct answer with short reasoning, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in supplementary material.
zh

[NLP-109] Energy-Driven Steering: Reducing False Refusals in Large Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）安全对齐中的一个核心问题：现有对齐方法通常仅关注提升对有害提示的安全性，导致模型过度谨慎，频繁拒绝良性请求（即“误拒”），从而损害可用性。为此，作者提出了一种无需微调的新型框架——能量驱动引导（Energy-Driven Steering, EDS），其关键在于引入一个轻量级外部能量模型（Energy-Based Model, EBM），该模型通过定义能量景观来区分理想状态（如安全拒绝或有用响应）与不良状态（如误拒或越狱攻击）。在推理阶段，EBM基于LLM内部激活计算能量梯度，并动态调整隐藏状态以引导模型进入低能量区域，从而实时修正输出行为，而无需修改原始模型权重。这一机制实现了行为控制与模型知识的解耦，显著降低误拒率的同时保持原有安全性，实验证明在ORB-H基准上合规率从57.3%提升至82.6%。

链接: https://arxiv.org/abs/2510.08646
作者: Eric Hanchen Jiang,Weixuan Ou,Run Liu,Shengyuan Pang,Guancheng Wan,Ranjie Duan,Wei Dong,Kai-Wei Chang,XiaoFeng Wang,Ying Nian Wu,Xinfeng Li
机构: University of California, Los Angeles (加州大学洛杉矶分校); National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often only focus on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. Therefore, a key objective of safe alignment is to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, EBM maps the LLM’s internal activations to an “energy landscape”. We use the gradient of the energy function to dynamically steer the LLM’s hidden states to low energy regions, correcting the model to generate a desirable response in real-time without modifying its weights. This method decouples behavioral control from the model’s core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show our method successfully achieves this objective: it substantially lowers false refusal rates. For example, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
zh

[NLP-110] Next Semantic Scale Prediction via Hierarchical Diffusion Language Models NEURIPS2025

【速读】：该论文旨在解决传统语言模型在生成高质量文本时面临的复杂语义建模难题，尤其是如何更有效地捕捉和生成具有层次化语义结构的文本。其解决方案的关键在于提出了一种分层扩散语言模型（Hierarchical Diffusion Language Models, HDLM），该模型基于一个分层词汇表结构，将低层级的细粒度语义token映射到高层级的粗粒度语义token；在前向扩散过程中，每个token独立地被扰动为其更高层级的祖先token，而在反向过程则逐步预测更细节化的语义表示。这一机制构建了一个可变时间步长的语义尺度预测流程，从而实现了对语言建模中语义层次的灵活控制与优化。

链接: https://arxiv.org/abs/2510.08632
作者: Cai Zhou,Chenyu Wang,Dinghuai Zhang,Shangyuan Tong,Yifei Wang,Stephen Bates,Tommi Jaakkola
机构: Massachusetts Institute of Technology (麻省理工学院); Microsoft Research (微软研究院); Mila - Quebec AI Institute (蒙特利尔人工智能研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:In this paper we introduce Hierarchical Diffusion Language Models (HDLM) – a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
zh

[NLP-111] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

【速读】：该论文旨在解决仇恨表情包（hateful memes）自动检测系统缺乏解释性与上下文支持的问题，传统二分类模型仅输出简单判断，无法满足实际内容审核中对原因分析和政策依据的需求。其关键解决方案是提出ExPO-HM（Explain-then-Detect Policy Optimization for Hateful Memes），通过模拟人工标注者训练流程，融合监督微调（SFT）预热、带课程学习的GRPO强化学习以及条件决策熵（Conditional Decision Entropy, CDE）作为推理质量的度量与奖励信号，从而显著提升检测精度、细粒度分类能力和解释合理性，在多个基准上相较GRPO和DPO基线分别实现最高达15%和17%的F1分数提升。

链接: https://arxiv.org/abs/2510.08630
作者: Jingbiao Mei,Mingsheng Sun,Jinghong Chen,Pengda Qin,Yuhong Li,Da Chen,Bill Byrne
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
zh

[NLP-112] From What to Why: Thought-Space Recommendation with Small Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在推荐系统中推理能力强大但推理成本过高、难以部署，而小语言模型（Small Language Models, SLMs）虽具高效性却缺乏推荐推理能力的问题。现有方法通常将自然语言理由仅作为无监督描述文本使用，未能充分挖掘其作为学习信号的潜力。解决方案的关键在于提出PULSE（Preference Understanding by Latent Semantic Embeddings）框架，通过构建跨域统一的“思维空间”（Thought Space），利用SLM生成的rationales作为指导性学习信号，并结合用户交互历史联合建模用户行为（what）及其语义驱动因素（why），从而实现更鲁棒和可迁移的推荐表示。此设计首次将rationales视为第一类信号，显著提升了模型在多场景下的性能与泛化能力。

链接: https://arxiv.org/abs/2510.08626
作者: Prosenjit Biswas,Pervez Shaik,Abhinav Thorat,Ravi Kolla,Niranjan Pedanekar
机构: Sony Research India (索尼研究印度)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work our main idea is to create a common understanding of user and items across multiple domains called Thought Space with SLMs instead of using LLMs’ distilled knowledge. To that end we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as director learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals, this novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering. Our code is available \hrefthis https URLhere.
zh

[NLP-113] Do LLM s Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT -OSS-20B

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）评估中存在“评价气味”（evaluation scent）的问题，即评测基准常依赖于引导模型展示推理过程和严格格式的提示词，而这种设计可能人为夸大模型性能，却未带来实际部署所需的简洁、合规且可靠输出能力的提升。其解决方案的关键在于构建一套可复现的A/B对比框架，通过固定任务内容与解码策略，仅改变提示框架（评价导向 vs. 实际应用场景）和推理深度（中/高），系统性地量化不同条件下模型输出在准确性、答案合规性、链式思维（Chain-of-Thought, CoT）长度、拒绝响应比例及结构化输出质量等方面的差异。实证结果表明，评价导向提示显著增加CoT长度但降低答案独立合规性，且对多语言场景（如乌尔都语标题）可能引入额外误差，从而揭示了现有基准指标与真实部署需求之间的脱节，并提出包括中立提示、双框架校验、合同意识评分、风格差异报告、置信度治理和多语言仪表盘在内的实践指南，以确保评测结果真正反映可部署能力。

链接: https://arxiv.org/abs/2510.08624
作者: Nisar Ahmed,Muhammad Imran Zaman,Gulshan Saleem,Ali Hassan
机构: SparkVerse; University of Central Punjab (UCP)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such “evaluation scent” inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to 1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.
zh

[NLP-114] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction EMNLP2025

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在从非结构化文本中提取结构化信息时，因依赖静态JSON Schema而导致的提取准确率低、幻觉频发及代理行为不可靠的问题。其核心挑战在于现有JSON Schema作为人类开发者设计的“契约”，缺乏对LLM的适配性，尤其当Schema存在歧义或不完整时，会显著影响生成式AI（Generative AI）系统的稳定性与性能。解决方案的关键在于提出PARSE系统，包含两个协同模块：ARCHITECT通过自动优化JSON Schema以提升LLM可理解性并保持向后兼容性（借助RELAY代码生成机制），SCOPE则引入基于反思的抽取策略，结合静态规则与LLM驱动的约束机制实现更可靠的结构化输出。实验证明，该框架在多个数据集上显著提升了抽取准确率（最高达64.7%），并将首次重试错误率降低92%，同时维持实用延迟。

链接: https://arxiv.org/abs/2510.08623
作者: Anubhav Shrimal,Aryan Jain,Soumyajit Chowdhury,Promod Yenigalla
机构: RBS Tech Sciences, Amazon(亚马逊)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: EMNLP 2025 Industry Track

点击查看摘要

Abstract:Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and and maintaining practical latency.
zh

[NLP-115] xt2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories

【速读】：该论文旨在解决如何自动化评估由大语言模型（Large Language Models, LLMs）生成的软件需求（以用户故事形式呈现）与原始访谈文本之间的一致性问题，即确保生成的需求忠实反映利益相关者的真实需求。传统上，此类评估依赖人工判断，效率低下且难以扩展。其解决方案的关键在于提出Text2Stories任务及相应的量化指标体系：通过将访谈文本切分为语义块（text chunks），构建块与用户故事之间的匹配问题，从而分别衡量“正确性”（故事是否被原文支持）和“完整性”（原文是否被至少一个故事覆盖）。实验表明，基于LLM的匹配器在独立标注数据集上达到0.86宏F1分数，嵌入模型虽性能略低但可有效用于预筛选（blocking），从而为比较不同来源（如人工 vs. 生成）的用户故事提供可扩展、源文忠实的评估框架。

链接: https://arxiv.org/abs/2510.08622
作者: Francesco Dente,Fabiano Dalpiaz,Paolo Papotti
机构: EURECOM; Utrecht University
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 8 pages

点击查看摘要

Abstract:Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders’ needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
zh

[NLP-116] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents

【速读】：该论文旨在解决销售导向型对话系统在实际应用中因缺乏对用户个体差异敏感性而导致的对话效率与成功率不足的问题。其解决方案的关键在于引入一种轻量级的、基于职业（occupation）条件的策略，通过分析用户职业特征对对话意图的影响，使代理模型能够优先选择符合用户偏好的交互意图，从而显著缩短对话轮次并提升整体对话效果。这一方法强调了丰富模拟器用户画像的重要性，并验证了简单的人格化策略可有效增强销售导向型对话系统的性能。

链接: https://arxiv.org/abs/2510.08621
作者: Wen-Yu Chang,Tzu-Hung Huang,Chih-Ho Chen,Yun-Nung Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.
zh

[NLP-117] JAI-1: A Thai-Centric Large Language Model

【速读】：该论文旨在解决现有泰国语（Thai）语言模型在迁移学习过程中因参数空间冲突而导致通用知识流失的问题。当前主流方法通常基于开源大语言模型（Large Language Model, LLM）进行增量训练以适配泰语，但这种做法可能破坏原模型中已优化的通用能力，因为通用任务与泰语特异性需求之间存在参数优化冲突。解决方案的关键在于采用“上扩策略”（upscaling strategy）：从一个高性能英文开源LLM出发，通过扩展参数规模并系统性地注入泰语知识，从而在不损害原始通用智能的前提下构建专用于泰语的模型架构。此方法不仅保留了原模型的泛化能力，还实现了可扩展的未来升级路径，最终在多个泰语基准测试中优于Typhoon2-70B模型，验证了其有效性。

链接: https://arxiv.org/abs/2510.08620
作者: Attapol T. Rutherford,Jullajak Karnjanaekarin,Narongkorn Panitsrisit,Pontakorn Trakuekul,Sumana Sumanakul,Natchanon Pollertlam
机构: Jasmine Technology Solution(贾斯明科技解决方案); Chulalongkorn University(朱拉隆功大学); Upstage(Upstage)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model’s parameter space during the injection of Thai-specific information, as optimized parameters for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and utilized the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model’s general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai language tokens. This was followed by post-training stages – supervised fine-tuning and alignment tuning – using more than 600K instruction-based examples. The final model demonstrated superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.
zh

[NLP-118] LLM s Show Surface-Form Brittleness Under Paraphrase Stress Tests NEURIPS2025

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在基准测试中可能出现的性能虚高问题，其根源在于模型对测试样本或近似重复内容的记忆（memorization）以及依赖表面形式的浅层捷径（surface-form shortcuts）。为评估模型的真实泛化能力，作者提出一种简单但有效的协议：通过重新评估模型在改写（paraphrased）版本的基准问题上的表现来探测其泛化性。该方案的关键在于构建一个包含严格解码控制、强制多选输出格式以及鲁棒的改写清洗步骤的流水线，确保语义不变的前提下有效扰动输入形式，从而揭示模型因数据污染或脆弱特征依赖导致的准确性下降。

链接: https://arxiv.org/abs/2510.08616
作者: Juan Miguel Navarro Carranza
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: NeurIPS 2025 Workshop. Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling. Selected for contributed talk

点击查看摘要

Abstract:Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple, protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
zh

[NLP-119] Iterative LLM -Based Generation and Refinement of Distracting Conditions in Math Word Problems

【速读】：该论文旨在解决当前数学应用题（Math Word Problems, MWPs）数据集中缺乏有效干扰条件（distracting conditions）的问题，尤其是现有数据集在干扰条件的设置上存在难度不足、语境不合理以及易被模型识别和忽略等缺陷，导致对大语言模型（Large Language Models, LLMs）数学推理能力的评估可信度降低。其解决方案的关键在于设计了一个迭代式框架，利用LLMs自动生成与原问题共享相同解法的干扰条件：通过精心设计的提示（prompts）从多角度和认知层次引导模型生成有意义的干扰项，并明确要求不改变原始问题的解答逻辑，从而避免了人工重写答案的繁琐过程，显著提升了生成效率与数据质量。

链接: https://arxiv.org/abs/2510.08615
作者: Kaiqi Yang,Hang Li,Yucheng Chu,Zitao Liu,Mi Tian,Hui Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions. To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2510.08615 [cs.CL] (or arXiv:2510.08615v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.08615 Focus to learn more arXiv-issued DOI via DataCite
zh

[NLP-120] Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在医疗场景中可能因性别身份分配不一致而引入的隐性偏见问题，特别是这种偏见如何影响临床诊断判断和对患者性别信息的评估。解决方案的关键在于通过系统性地为多个开源与专有LLM分配不同性别（女性、男性或未指定），并基于《新英格兰医学杂志》挑战案例进行实证分析，发现尽管多数模型在诊断一致性上表现稳定，但在判断患者性别是否具有临床相关性或必要性时表现出显著的性别偏差，尤其在“相关性”评估中更为突出，部分模型甚至呈现系统性的女性-男性差异。这揭示了LLM在医疗应用中潜在但此前被忽视的身份敏感性偏见，强调需建立常规化的身份赋值一致性检测机制，以保障AI辅助临床决策的可靠性与公平性。

链接: https://arxiv.org/abs/2510.08614
作者: Mingxuan Liu,Yuhe Ke,Wentao Zhu,Mayli Mertens,Yilin Ning,Jingchi Liao,Chuan Hong,Daniel Shu Wei Ting,Yifan Peng,Danielle S. Bitterman,Marcus Eng Hock Ong,Nan Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models’ judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender’s relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgements. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
zh

[NLP-121] GraphGhost: Tracing Structures Behind Large Language Models

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）中推理能力的结构机制尚不明确的问题，即如何从神经元激活与信号传播的角度揭示其结构语义捕捉和输出生成的内在原理。解决方案的关键在于提出GraphGhost框架，将神经元激活及其信号传播建模为图结构，从而利用图算法（如PageRank）量化模型的推理行为，并通过结构干预识别关键神经元节点——实验证明对这些节点的编辑可导致推理崩溃，进而破坏逻辑流与语义理解，这为解析、干预乃至理解LLMs推理结构基础提供了新范式。

链接: https://arxiv.org/abs/2510.08613
作者: Xinnan Dai,Kai Guo,Chung-Hsiang Lo,Shenglai Zeng,Jiayuan Ding,Dongsheng Luo,Subhabrata Mukherjee,Jiliang Tang
机构: Michigan State University (密歇根州立大学); Northeastern University (东北大学); Florida International University (佛罗里达国际大学); Hippocratic AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain under explored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
zh

[NLP-122] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在非西方、低资源语境下多模态理解与推理能力显著下降的问题，尤其关注亚洲文化背景下的文化敏感性缺失。其解决方案的关键在于提出MMA-ASIA框架，该框架包含一个由人工标注的多语言、多模态对齐的多项选择基准数据集，覆盖8个亚洲国家和10种语言，共27,000道题目，其中超过79%需基于文化语境进行多步推理；该数据集首次在输入层面实现文本、图像（视觉问答）和语音三模态对齐，支持跨模态迁移测试。此外，研究构建了五维评估协议以量化文化意识差异、跨语言一致性、跨模态一致性、文化知识泛化能力和基础有效性，并引入文化意识基础验证模块检测“捷径学习”现象，最终通过注意力追踪和创新的视觉消融前缀重放（Vision-ablated Prefix Replay, VPR）方法揭示模型在不同语言和模态间表现差异的原因，为构建文化可靠的多模态大模型提供可操作洞见。

链接: https://arxiv.org/abs/2510.08608
作者: Weihua Zheng,Zhengyuan Liu,Tanmoy Chakraborty,Weiwen Xu,Xiaoxue Gao,Bryan Chen Zhengyu Tan,Bowei Zou,Chang Liu,Yujia Hu,Xing Xie,Xiaoyuan Yi,Jing Yao,Chaojun Wang,Long Li,Rui Liu,Huiyao Liu,Koji Inoue,Ryuichi Sumida,Tatsuya Kawahara,Fan Xu,Lingyu Ye,Wei Tian,Dongjun Kim,Jimin Jung,Jaehyung Seo,Nadya Yuki Wangsajaya,Pham Minh Duc,Ojasva Saxena,Palash Nandi,Xiyan Tao,Wiwik Karlina,Tuan Luong,Keertana Arun Vasan,Roy Ka-Wei Lee,Nancy F. Chen
机构: Singapore University of Technology and Design (新加坡科技设计大学); Agency for Science, Technology and Research, Singapore (新加坡科技研究局); Indian Institute of Technology Delhi (印度理工学院德里分校); Alibaba DAMO Academy (阿里巴巴达摩院); Microsoft Research Asia (微软亚洲研究院); Shanghai University of Finance and Economics (上海财经大学); Inner Mongolia University (内蒙古大学); Kyoto University (京都大学); Jiangxi Normal University (江西师范大学); Korea University (韩国科学技术院); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs’ cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects “shortcut learning” by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.
zh

[NLP-123] Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations ICASSP2026

【速读】：该论文旨在解决多模态情感识别（Emotion Recognition in Conversations, ERC）中因情感判别证据稀疏、局部化且跨模态异步导致的建模困难问题。其解决方案的关键在于提出一种以“情感热点”（emotion hotspots）为中心的统一模型：首先在文本、音频和视频中检测每句utterance的情感热点区域，随后通过热点门控融合（Hotspot-Gated Fusion, HGF）机制将局部热点与全局特征进行融合，并利用路由混合对齐器（routed Mixture-of-Aligners, MoA）实现跨模态对齐；同时引入跨模态图结构编码对话结构。该设计聚焦于显著片段建模，缓解模态错位问题并保留上下文信息，在标准ERC基准上显著优于强基线模型，验证了热点驱动范式在多模态情感识别中的有效性。

链接: https://arxiv.org/abs/2510.08606
作者: Yu Liu,Hanlei Shi,Haoxun Li,Yuqing Sun,Yuxuan Ding,Linlin Gong,Leyuan Qu,Taihao Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review for ICASSP 2026

点击查看摘要

Abstract:Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.
zh

[NLP-124] oward a Safer Web: Multilingual Multi-Agent LLM s for Mitigating Adversarial Misinformation Attacks

【速读】：该论文旨在解决数字平台上虚假信息（misinformation）快速传播对公共话语、情绪稳定性和决策过程造成的威胁，尤其是针对现有研究中尚未系统探讨的语言切换（language-switching）、查询长度膨胀（query length inflation）及结构重排（structural reformatting）等新型对抗性攻击手段。其解决方案的关键在于提出一个基于检索增强生成（retrieval-augmented generation, RAG）的多语言、多智能体大语言模型框架，并可作为网页插件部署于在线平台，从而实现对多样化攻击方式的高效检测与防护，验证了插件化部署在真实网络环境中的可行性与有效性。

链接: https://arxiv.org/abs/2510.08605
作者: Nouar Aldahoul,Yasir Zaki
机构: New York University Abu Dhabi (纽约大学阿布扎比分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.
zh

[NLP-125] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

【速读】：该论文旨在解决现有对抗性越狱攻击（jailbreak）易被基于困惑度（perplexity）的过滤机制检测的问题。当前主流方法通过添加高困惑度的对抗后缀或冗长的提示模板来诱导大语言模型生成有害内容，但这些特征容易被检测。其解决方案的关键在于提出LatentBreak——一种白盒越狱攻击方法，该方法不依赖于添加额外文本，而是通过在输入提示中替换语义等价词（semantically-equivalent words），使对抗提示在潜在空间（latent space）中尽可能接近无害请求的表示，从而保持低困惑度并有效规避防御机制。实验表明，LatentBreak生成的提示更短、困惑度更低，在多个安全对齐模型上均优于现有攻击算法。

链接: https://arxiv.org/abs/2510.08604
作者: Raffaele Mura,Giorgio Piras,Kamilė Lukošiūtė,Maura Pintor,Amin Karbasi,Battista Biggio
机构: University of Cagliari (卡利亚里大学); Centre for AI Governance (人工智能治理中心); Foundation AI – Cisco Systems Inc. (Foundation AI – 思科系统公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
zh

[NLP-126] YpathRAG :A Retrieval-Augmented Generation Framework and Benchmark for Pathology

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在病理学等高门槛领域中仍存在幻觉（hallucination）的问题，现有方法如领域微调（domain fine-tuning）无法扩展知识边界或施加基于证据的约束。其解决方案的关键在于构建一个覆盖28个子领域、包含153万段落的病理向量数据库，并提出YpathRAG框架，该框架采用双通道混合检索机制（BGE-M3密集检索与词汇引导的稀疏检索相结合），并引入基于LLM的支持性证据判断模块，以闭环式整合检索、判断与生成过程，从而提升检索质量和事实可靠性。

链接: https://arxiv.org/abs/2510.08603
作者: Deshui Yu,Yizhi Wang,Saihui Jin,Taojie Zhu,Fanyi Zeng,Wen Qian,Zirui Huang,Jingli Ouyang,Jiameng Li,Zhen Song,Tian Guan,Yonghong He
机构: Tsinghua University Shenzhen International Graduate School (清华大学深圳国际研究生院); China Unicom Guangdong Branch (中国联通广东分公司); Department of Network Intelligence, Peng Cheng Laboratory (鹏城实验室网络智能部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
zh

[NLP-127] Human Texts Are Outliers: Detecting LLM -generated Texts via Out-of-distribution Detection

【速读】：该论文旨在解决当前AI生成文本检测方法在跨领域和跨模型场景下泛化能力差的问题。现有方法多将检测任务视为二分类问题，但这种设定忽略了人类写作文本分布的异质性，导致模型易对特定数据特征过拟合而非学习到本质的“非ID”（即非机器生成）行为模式。论文的关键解决方案是将检测任务重新定义为分布外检测（Out-of-Distribution, OOD）检测问题，即把人类写作视为分布外样本，而机器生成文本作为分布内（In-Distribution, ID）样本。基于此，作者提出结合单类学习（如DeepSVDD、HRN）与基于得分的学习技术（如能量模型），构建具有鲁棒性和通用性的检测框架，在多个数据集及多语言、对抗攻击、未见模型和领域等复杂设置下均表现出优越性能。

链接: https://arxiv.org/abs/2510.08602
作者: Cong Zeng,Shengkun Tang,Yuanzhou Chen,Zhiqiang Shen,Wenchao Yu,Xujiang Zhao,Haifeng Chen,Wei Cheng,Zhiqiang Xu
机构: MBZUAI; UCLA; NEC Lab
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches both zero-shot methods and supervised classifiers largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of `non-ID’ behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning method including DeepSVDD and HRN, and score-based learning techniques such as energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
zh

[NLP-128] Mnemosyne: An Unsupervised Human-Inspired Long-Term Memory Architecture for Edge-Based LLM s

【速读】：该论文旨在解决当前大语言模型（Large Language Model, LLM）在边缘设备上长期记忆能力不足的问题，尤其是现有方法依赖暴力扩展上下文或静态检索管道，难以在资源受限环境下实现自然、连贯的长时间对话。其解决方案的关键在于提出一种受人类记忆机制启发的无监督长时记忆架构 Mnemosyne，该架构采用图结构存储、模块化物质与冗余过滤器、记忆提交与修剪机制，并引入具有时间衰减和刷新过程的概率性回忆策略。此外，Mnemosyne 还设计了一个固定长度子图提取的“核心摘要”，用于高效捕捉用户个性等长期特征，在纵向医疗对话场景中显著提升了真实感与记忆能力，实验表明其在盲评中胜率高达 65.8%，优于传统检索增强生成（Retrieval-Augmented Generation, RAG）基线（31.1%），并在 LoCoMo 基准测试中取得最优的时间推理与单跳检索性能。

链接: https://arxiv.org/abs/2510.08601
作者: Aneesh Jonelagadda,Christina Hahn,Haoze Zheng,Salvatore Penachio(Kaliber AI)
机构: Kaliber AI(卡尔伯AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated “core summary” efficiently derived from a fixed-length subset of the memory graph to capture the user’s personality and other domain-specific long-term details such as, using healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where repetitive and semantically similar but temporally distinct conversations are limited by naive retrieval. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other same-backboned techniques. Further, the average overall score of 54.6% was second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses can be feasible with an edge-compatible and easily transferable unsupervised memory architecture.
zh

[NLP-129] Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation EMNLP2025

【速读】：该论文旨在解决语言模型在部署过程中因推理优化（如量化、剪枝、格式转换等）导致的性能下降问题，尤其是针对由不当模型序列化等操作引起的权重损坏所引发的准确率损失。其解决方案的关键在于提出 Recover-LoRA 方法，该方法通过合成数据与 logit 蒸馏技术，在选定层上学习 LoRA（Low-Rank Adaptation）适配器，从而将受损模型对齐至全精度基准模型，实现无需特定数据集且轻量化的准确率恢复。

链接: https://arxiv.org/abs/2510.08600
作者: Devleena Das,Rajeev Patwari,Ashish Sirasao
机构: Advanced Micro Devices, Inc. (AMD)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2025 Industry Track

点击查看摘要

Abstract:Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
zh

[NLP-130] Confidence Not Perplexity: A Better Metric for the Creative Era of LLM s AACL

【速读】：该论文旨在解决现有无参考指标（如自困惑度，self-perplexity）在评估生成式 AI（Generative AI）文本时对创造性内容存在显著偏差的问题，即这些指标倾向于偏好熟悉、常规的输出，从而低估新颖性。解决方案的关键在于提出一种新的评估指标——置信度得分（Confidence Score, CS），该指标基于模型输出的概率分布计算得出，能够更公平地衡量生成文本的质量与创造性。实验表明，CS 在 99 个创意提示下能以 19% 的概率偏好新颖响应，显著优于传统指标（0%），且可有效区分任务难度层级，展现出更强的判别能力与平衡性。

链接: https://arxiv.org/abs/2510.08596
作者: V. S. Raghu Parupudi
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注: Submitted to AACL-IJCNLP 2025 (Eval4NLP)

点击查看摘要

Abstract:Reference-free metrics like self-perplexity are strongly biased against creative text generation. We propose the Confidence Score (CS), derived from a model’s output probability distribution, as a less biased alternative. Experiments on gpt-4o-mini show that while fluency-based metrics prefer novel responses in 0% of cases on 99 creative prompts, our CS does so 19% of the time, a statistically significant difference (95% CI for difference: [11.1%, 27.3%]). We also show that CS effectively distinguishes between easy, medium, and hard tasks, confirmed by non-overlapping confidence intervals. The Confidence Score thus mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
zh

[NLP-131] Systematic Diagnosis of Brittle Reasoning in Large Language Models NEURIPS-2025

【速读】：该论文旨在解决当前机器学习模型在数学推理能力评估中缺乏细粒度诊断的问题，即现有标准基准无法揭示模型在特定推理模式下的失败原因。其解决方案的关键在于提出一种新型框架，首先利用GPT-3.5-turbo在GSM8K数据集上生成结构化的分步推理过程，随后使用更强大的分析模型GPT-4o-mini对错误进行分类，并通过无监督聚类识别出隐含的“推理模式”（reasoning modes）。这一方法不仅量化了不同推理技能的可靠性，还揭示了模型在组合推理（combinatorial reasoning）等复杂任务上的显著脆弱性，从而为提升数学理解能力和开发更可靠的AI应用提供了精准的改进方向。

链接: https://arxiv.org/abs/2510.08595
作者: V. S. Raghu Parupudi
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注: Submitted to NEURIPS-2025 MATHAI workshop

点击查看摘要

Abstract:A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent “reasoning modes.” This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
zh

[NLP-132] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

【速读】：该论文旨在解决语音抑郁症检测（Speech-based Depression Detection, SDD）中因难以提取有意义特征以及难以捕捉随时间变化的稀疏且异质性抑郁线索而导致的性能瓶颈问题。现有方法通常仅使用预训练自监督学习（Self-Supervised Learning, SSL）模型（如WavLM）的最终层或单一最优层，忽略了其分层结构所蕴含的丰富语义信息，导致模型过拟合特定数据集且无法有效识别细微而持续的抑郁信号。解决方案的关键在于提出HAREN-CTC架构，该架构通过两个核心模块实现多层SSL特征的有效融合：一是层级自适应聚类模块（Hierarchical Adaptive Clustering），将SSL特征重组为互补嵌入；二是跨模态融合模块（Cross-Modal Fusion），利用交叉注意力机制建模层间依赖关系；同时引入连接主义时序分类（Connectionist Temporal Classification, CTC）损失函数，以对齐不规则的时间模式，增强模型对稀疏时序监督信号的敏感性。此设计显著提升了模型在DAIC-WOZ和MODMA数据集上的泛化能力和性能表现，宏观F1分数分别达到0.81和0.82，优于现有方法。

链接: https://arxiv.org/abs/2510.08593
作者: Yuxin Li,Eng Siong Chng,Cuntai Guan
机构: Nanyang Technological University (南洋理工大学); Center of AI in Medicine (人工智能医学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2510.08593 [cs.CL] (or arXiv:2510.08593v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2510.08593 Focus to learn more arXiv-issued DOI via DataCite
zh

[NLP-133] Less Diverse Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

【速读】：该论文旨在解决生成式 AI（Generative AI）在测试时扩展（Test-Time Scaling, TTS）过程中因候选响应多样性不足而导致的安全性下降问题。TTS通常通过探索多个候选输出并从中选择最优结果来提升大语言模型（Large Language Models, LLMs）的推理能力，其隐含假设是高多样性有助于增强可靠性。然而，本文揭示了一个此前未被识别的失效模式：当候选池多样性受限时，即使程度轻微，TTS也会显著增加产生不安全输出的概率。解决方案的关键在于提出一种参考引导的多样性降低协议（Reference-guided Diversity Reduction, RefDiv），作为诊断性攻击手段用于压力测试TTS流程。实验表明，RefDiv在多种开源和闭源模型中均能一致地放大TTS的安全风险，且现有安全护栏分类器（如Llama-Guard和OpenAI Moderation API）无法有效检测此类攻击输入，凸显了当前TTS策略在多样性敏感性上的脆弱性。

链接: https://arxiv.org/abs/2510.08592
作者: Shahriar Kabir Nahin,Hadi Askari,Muhao Chen,Anshuman Chhabra
机构: Bellini College of AI, Cybersecurity, and Computing, University of South Florida(南佛罗里达大学贝尔ini人工智能、网络安全与计算学院); Department of Computer Science, University of California, Davis(加州大学戴维斯分校计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and OpenAI Moderation API), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.
zh

[NLP-134] Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted Dictionary-Based Post-processing for BioASQ 2025 task 6

【速读】：该论文旨在解决生物医学命名实体识别（BioNER）中因实体类型相似而导致的误分类问题，例如基因与化学物质之间的混淆。其解决方案的关键在于引入一种基于词典的后处理策略，通过匹配预定义术语库对模型输出进行精细化修正，从而提升识别准确性。实验表明，该方法在开发集上将微平均F1分数从0.79提升至0.83，但未能在盲测集上实现泛化性能提升（微F1分数由0.79降至0.77），揭示了模型易过拟合开发数据的问题，并强调了提高泛化能力对于实际应用的重要性。

链接: https://arxiv.org/abs/2510.08588
作者: Ritesh Mehta
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: Paper published to CLEF 2025 CEUR-WS

点击查看摘要

Abstract:Biomedical Named Entity Recognition (BioNER), task6 in BioASQ (A challenge in large-scale biomedical semantic indexing and question answering), is crucial for extracting information from scientific literature but faces hurdles such as distinguishing between similar entity types like genes and chemicals. This study evaluates the GLiNER-BioMed model on a BioASQ dataset and introduces a targeted dictionary-based post-processing strategy to address common misclassifications. While this post-processing approach demonstrated notable improvement on our development set, increasing the micro F1-score from a baseline of 0.79 to 0.83, this enhancement did not generalize to the blind test set, where the post-processed model achieved a micro F1-score of 0.77 compared to the baselines 0.79. We also discuss insights gained from exploring alternative methodologies, including Conditional Random Fields. This work highlights the potential of dictionary-based refinement for pre-trained BioNER models but underscores the critical challenge of overfitting to development data and the necessity of ensuring robust generalization for real-world applicability.
zh

[NLP-135] Comparative Analysis of Large Language Models for the Machine-Assisted Resolution of User Intentions

【速读】：该论文试图解决当前基于大语言模型（Large Language Models, LLMs）的用户意图解析与任务编排依赖云端专有模型所带来的隐私、自主性和可扩展性限制问题。其解决方案的关键在于评估开源且开放获取的本地部署LLM在实现用户意图解析和工作流生成方面的可行性与性能表现，从而为下一代以意图为中心的操作系统提供可自主运行、隐私友好的智能基础组件。

链接: https://arxiv.org/abs/2510.08576
作者: Justus Flerlage,Alexander Acker,Odej Kao
机构: Technische Universität Berlin (柏林工业大学); logsight.ai GmbH
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as transformative tools for natural language understanding and user intent resolution, enabling tasks such as translation, summarization, and, increasingly, the orchestration of complex workflows. This development signifies a paradigm shift from conventional, GUI-driven user interfaces toward intuitive, language-first interaction paradigms. Rather than manually navigating applications, users can articulate their objectives in natural language, enabling LLMs to orchestrate actions across multiple applications in a dynamic and contextual manner. However, extant implementations frequently rely on cloud-based proprietary models, which introduce limitations in terms of privacy, autonomy, and scalability. For language-first interaction to become a truly robust and trusted interface paradigm, local deployment is not merely a convenience; it is an imperative. This limitation underscores the importance of evaluating the feasibility of locally deployable, open-source, and open-access LLMs as foundational components for future intent-based operating systems. In this study, we examine the capabilities of several open-source and open-access models in facilitating user intention resolution through machine assistance. A comparative analysis is conducted against OpenAI’s proprietary GPT-4-based systems to assess performance in generating workflows for various user intentions. The present study offers empirical insights into the practical viability, performance trade-offs, and potential of open LLMs as autonomous, locally operable components in next-generation operating systems. The results of this study inform the broader discussion on the decentralization and democratization of AI infrastructure and point toward a future where user-device interaction becomes more seamless, adaptive, and privacy-conscious through locally embedded intelligence.
zh

[NLP-136] Limitations of Normalization in Attention Mechanism

【速读】：该论文旨在解决注意力机制中归一化（normalization）方法的局限性问题，特别是softmax归一化在token选择过程中的几何分离能力和模型区分能力受限的问题。其解决方案的关键在于构建一个理论框架，用于量化分析注意力机制中token向量的距离边界和分离准则，并通过预训练的GPT-2模型实验证实：随着被选token数量增加，模型区分信息token的能力下降，趋于均匀选择；同时发现低温度设置下梯度对softmax归一化的敏感性会加剧训练困难。这些发现揭示了现有softmax注意力机制的内在缺陷，为未来设计更鲁棒的归一化与选择策略提供了理论依据。

链接: https://arxiv.org/abs/2508.17821
作者: Timur Mudarisov,Mikhail Burtsev,Tatiana Petrova,Radu State
机构: University of Luxembourg (卢森堡大学); London Institute for Mathematical Sciences (伦敦数学科学研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model’s selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model’s ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.
zh

[NLP-137] Estimating Brain Activity with High Spatial and Temporal Resolution using a Naturalistic MEG-fMRI Encoding Model

【速读】：该论文旨在解决当前非侵入性神经影像技术在空间分辨率和时间分辨率之间难以兼顾的问题，尤其是在单次试验的自然情境数据中，如何实现高时空分辨率的脑活动映射。现有源定位或MEG-fMRI融合方法无法有效整合两种模态的优势，导致对大脑动态过程的刻画不完整。其解决方案的关键在于构建一个基于Transformer架构的编码模型，该模型联合训练来自多受试者的MEG与fMRI数据，通过一个潜在层同时估计高时空分辨率的皮层源响应；该潜在层作为跨模态共享表示，不仅显著提升了对MEG的预测性能，还在模拟实验中优于传统最小范数解法，并展现出良好的跨受试者和跨模态泛化能力——验证了其在预测ECoG信号时优于专门训练于ECoG的数据模型，从而为实现毫秒级与毫米级脑图谱提供了一条可行路径。

链接: https://arxiv.org/abs/2510.09415
作者: Beige Jerry Jin,Leila Wehbe
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Current non-invasive neuroimaging techniques trade off between spatial resolution and temporal resolution. While magnetoencephalography (MEG) can capture rapid neural dynamics and functional magnetic resonance imaging (fMRI) can spatially localize brain activity, a unified picture that preserves both high resolutions remains an unsolved challenge with existing source localization or MEG-fMRI fusion methods, especially for single-trial naturalistic data. We collected whole-head MEG when subjects listened passively to more than seven hours of narrative stories, using the same stimuli in an open fMRI dataset (LeBel et al., 2023). We developed a transformer-based encoding model that combines the MEG and fMRI from these two naturalistic speech comprehension experiments to estimate latent cortical source responses with high spatiotemporal resolution. Our model is trained to predict MEG and fMRI from multiple subjects simultaneously, with a latent layer that represents our estimates of reconstructed cortical sources. Our model predicts MEG better than the common standard of single-modality encoding models, and it also yields source estimates with higher spatial and temporal fidelity than classic minimum-norm solutions in simulation experiments. We validated the estimated latent sources by showing its strong generalizability across unseen subjects and modalities. Estimated activity in our source space predict electrocorticography (ECoG) better than an ECoG-trained encoding model in an entirely new dataset. By integrating the power of large naturalistic experiments, MEG, fMRI, and encoding models, we propose a practical route towards millisecond-and-millimeter brain mapping.
zh

[NLP-138] arget speaker anonymization in multi-speaker recordings ICASSP2026

【速读】：该论文旨在解决多说话人对话音频中目标说话人匿名化（speaker anonymization）的问题，尤其关注仅需对单个目标说话人进行匿名化的场景，这在呼叫中心等实际应用中具有重要隐私保护意义。传统匿名化方法在此类复杂场景下效果不佳，且现有评估方法无法准确衡量隐私保护与语音保真度之间的平衡。解决方案的关键在于提出针对性的匿名化策略，并构建适用于多说话人场景的改进型评估体系，从而有效提升匿名化效果并实现更可靠的性能验证。

链接: https://arxiv.org/abs/2510.09307
作者: Natalia Tomashenko,Junichi Yamagishi,Xin Wang,Yun Liu,Emmanuel Vincent
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for such condition. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer’s voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.
zh

[NLP-139] Unsupervised lexicon learning from speech is limited by representations rather than clustering ICASSP2026

【速读】：该论文旨在解决零资源词切分与聚类系统中词典诱导性能不佳的问题，即在无文本标注条件下如何更准确地将语音切分为词级单元并将其聚类为词类。其关键解决方案在于通过受控实验分离表示（representation）与聚类方法的影响，发现限制性能的主要因素是同一词类内不同片段的表示变异性，而非聚类算法本身；因此，采用基于动态时间规整（DTW）的图聚类方法结合连续特征可获得最佳效果，而更快的替代方案则使用余弦距离对平均连续特征或编辑距离对离散单元序列进行图聚类，均能有效提升性能。

链接: https://arxiv.org/abs/2510.09225
作者: Danel Adendorff,Simon Malan,Herman Kamper
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Zero-resource word segmentation and clustering systems aim to tokenise speech into word-like units without access to text labels. Despite progress, the induced lexicons are still far from perfect. In an idealised setting with gold word boundaries, we ask whether performance is limited by the representation of word segments, or by the clustering methods that group them into word-like types. We combine a range of self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data. The best system uses graph clustering with dynamic time warping on continuous features. Faster alternatives use graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences. Through controlled experiments that isolate either the representations or the clustering method, we demonstrate that representation variability across segments of the same word type – rather than clustering – is the primary factor limiting performance.
zh

[NLP-140] A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large?

【速读】：该论文旨在解决社会科学研究中语言属性如何因果影响受众态度与行为的问题，尤其关注文本特征之间存在的潜在混杂因素（latent confounding）可能导致的估计偏差。传统方法利用大语言模型（LLM）学习文本的潜在表示以同时预测处理变量（treatment）和结果变量，但因处理变量是文本的一部分，这类深度学习方法可能无意中编码了处理本身，从而引发重叠偏差（overlap bias）。论文提出一种新的实验设计，能够有效控制潜在混杂、避免重叠问题，并无偏地估计处理效应；其关键在于通过结构化实验设计从源头上分离处理变量与潜在混杂因素，而非依赖事后调整。

链接: https://arxiv.org/abs/2510.08758
作者: Graham Tierney,Srikar Katta,Christopher Bail,Sunshine Hillygus,Alexander Volfovsky
机构: Duke University (杜克大学)
类目: Methodology (stat.ME); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Many social science questions ask how linguistic properties causally affect an audience’s attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text, these deep learning methods risk learning representations that actually encode the treatment itself, inducing overlap bias. Rather than depending on post-hoc adjustments, we introduce a new experimental design that handles latent confounding, avoids the overlap issue, and unbiasedly estimates treatment effects. We apply this design in an experiment evaluating the persuasiveness of expressing humility in political communication. Methodologically, we demonstrate that LLM-based methods perform worse than even simple bag-of-words models using our real text and outcomes from our experiment. Substantively, we isolate the causal effect of expressing humility on the perceived persuasiveness of political statements, offering new insights on communication effects for social media platforms, policy makers, and social scientists.
zh

[NLP-141] BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

【速读】：该论文旨在解决在低资源语言（如Bambara）场景下，对大规模预训练Transformer模型进行剪枝以适配边缘设备时面临的性能下降难题。由于这类语言缺乏足够的训练数据，传统剪枝方法（如词汇剪枝或层移除）难以恢复性能，且需大量重新训练数据。其解决方案的关键在于：避免使用不适用于频繁代码转换的词汇剪枝，转而采用低秩分解（low-rank decomposition）压缩嵌入层，并结合特征蒸馏（feature distillation）保留关键信息；同时通过合并网络层而非直接删除，有效控制性能损失。最终实现模型大小减少48%、推理速度提升2.15倍，同时保持原始性能的90%。

链接: https://arxiv.org/abs/2510.08599
作者: Yaya Sy,Christophe Cerisara,Irina Illina
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
zh

[NLP-142] Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech

【速读】：该论文旨在解决从语音中检测心理压力（psychological stress）的问题，传统方法通常将压力视为静态标签，忽略了其随时间演变的特性。为改进这一局限，作者提出了一种动态标注策略，通过情感标签推导出细粒度的压力标注，并引入基于交叉注意力（cross-attention）的时序模型——单向LSTM和Transformer编码器，以捕捉压力在时间维度上的演化过程。解决方案的关键在于将压力建模为一个受历史情绪状态影响的动态过程，并通过结构化的时间建模机制显著提升在MuSE和StressID数据集上的准确率（分别提高5%和18%），同时展现出良好的跨场景泛化能力。

链接: https://arxiv.org/abs/2510.08586
作者: Vishakha Lall,Yisi Liu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted at IEEE CogMI 2025

点击查看摘要

Abstract:Detecting psychological stress from speech is critical in high-pressure settings. While prior work has leveraged acoustic features for stress detection, most treat stress as a static label. In this work, we model stress as a temporally evolving phenomenon influenced by historical emotional state. We propose a dynamic labelling strategy that derives fine-grained stress annotations from emotional labels and introduce cross-attention-based sequential models, a Unidirectional LSTM and a Transformer Encoder, to capture temporal stress progression. Our approach achieves notable accuracy gains on MuSE (+5%) and StressID (+18%) over existing baselines, and generalises well to a custom real-world dataset. These results highlight the value of modelling stress as a dynamic construct in speech.
zh

[NLP-143] Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion

【速读】：该论文旨在解决传统自动语音识别（ASR）系统在低资源条件下性能受限的问题，同时探索如何有效利用发音特征（articulatory features）以提升模型表现。其解决方案的关键在于：将发音特征作为辅助任务（通过语音逆向重建实现）和伪输入（pseudo-input）双重引入深度神经网络架构中——具体而言，使用语音逆向作为辅助预测任务，并将预测得到的发音特征作为查询流（query stream）注入到基于交叉注意力机制（cross-attention module）的Transformer模型中，其中声学嵌入作为键（key）和值（value）。实验表明，该方法在LibriSpeech数据集上显著优于强基准模型，尤其在低资源场景下效果突出，证明了现代深度学习框架下重新引入发音特征可带来实质性收益。

链接: https://arxiv.org/abs/2510.08585
作者: Ahmed Adel Attia,Jing Liu,Carol Espy Wilson
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
zh

计算机视觉

[CV-0] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

【速读】：该论文旨在解决当前视觉-语言-动作（Vision-Language Action, VLA）模型在机器人操作任务中训练成本高、效率低的问题。现有方法通常需要从头训练大型VLA模型，导致资源消耗巨大且难以快速部署。解决方案的关键在于提出一种基于知识蒸馏的轻量级框架，通过将预训练的小型动作模型（small action model）的知识迁移至视觉-语言模型（Vision-Language Model, VLM），从而赋予VLM动作执行能力。该框架仅在原VLM结构中添加一个动作标记（action token）和状态编码器（state encoder），以保留原有架构并引入物理状态输入；同时采用两阶段训练策略：第一阶段通过轻量对齐将VLM隐藏状态映射到小型动作模型的动作空间，复用其预训练解码器避免昂贵的重新训练；第二阶段选择性微调语言模型、状态编码器与动作模块，实现多模态输入融合与精准动作生成。此设计显著降低了训练成本，并在多个基准数据集（LIBERO、LIBERO-LONG）及真实世界五项操作任务中均取得优于先前最优方法的性能提升。

链接: https://arxiv.org/abs/2510.09607
作者: Shaoqi Dong,Chaoyou Fu,Haihan Gao,Yi-Fan Zhang,Chi Yan,Chu Wu,Xiaoyu Liu,Yunhang Shen,Jing Huo,Deqiang Jiang,Haoyu Cao,Yang Gao,Xing Sun,Ran He,Caifeng Shan
机构: Nanjing University (南京大学); Tencent Youtu Lab (腾讯优图实验室); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL

点击查看摘要

Abstract:Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
zh

[CV-1] SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在跨尺度空间推理（all-scale spatial reasoning）能力上的局限性，具体包括两个核心问题：一是数据集构建严重依赖室内3D扫描和人工标注，导致成本高、覆盖场景有限；二是缺乏有效的跨尺度场景建模机制，易造成对单一场景的过拟合。解决方案的关键在于提出一个集成结构化空间推理知识体系、尺度感知建模（scale-aware modeling）与渐进式训练范式（progressive training paradigm）的综合性框架。通过任务导向的自动化流水线构建了包含38K视频场景和1M空间问答对的SpaceVista-1M数据集，并设计了以尺度为锚点的专家网络与渐进奖励机制来缓解知识冲突，最终提出的SpaceVista-7B模型在多个基准测试中展现出卓越的跨尺度泛化性能。

链接: https://arxiv.org/abs/2510.09606
作者: Peiwen Sun,Shiqiang Lang,Dongming Wu,Yi Ding,Kaituo Feng,Huadai Liu,Zhen Ye,Rui Liu,Yun-Hui Liu,Jianan Wang,Xiangyu Yue
机构: Multimedia Lab, Chinese University of Hong Kong (香港中文大学多媒体实验室); Astribot; Beijing University of Posts and Telecommunications (北京邮电大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt to broaden the all-scale spatial intelligence of MLLMs to the best of our knowledge. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on this https URL .
zh

[CV-2] STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging

【速读】：该论文旨在解决时间序列数据中隐含的时序结构（如局部平稳状态间的转换、重复模式和变异爆发）在标准表示学习流程中未被充分利用的问题。现有模型通常基于原始或固定窗口的序列进行处理，将所有时间步视为同等信息量，导致长序列或噪声序列中效率低下、鲁棒性差且扩展性受限。其解决方案的关键在于提出一种轻量级、无监督的结构感知时间序列摘要框架STaTS，通过BIC-based统计差异准则在多时间尺度上检测变化点，并采用均值等简单函数或高斯混合模型（GMM）对每个片段进行摘要，从而实现高达30倍的序列压缩比，同时保留核心时序动态。STaTS作为模型无关的预处理器可无缝集成至现有无监督时间序列编码器中，无需重新训练，在150+数据集上的实验证明其能在保持85–90%全模型性能的同时显著降低计算成本，并提升噪声下的鲁棒性和判别结构保真度。

链接: https://arxiv.org/abs/2510.09593
作者: Disharee Bhowmick,Ranjith Ramanathan,Sathyanarayanan N. Aakur
机构: Auburn University (奥本大学); Oklahoma State University (俄克拉荷马州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 4 tables. Under Review

点击查看摘要

Abstract:Time series data often contain latent temporal structure, transitions between locally stationary regimes, repeated motifs, and bursts of variability, that are rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
zh

[CV-3] Vision Language Models: A Survey of 26K Papers

【速读】：该论文旨在解决科研趋势量化分析中缺乏透明、可复现方法的问题，尤其针对计算机视觉与机器学习顶会（CVPR、ICLR、NeurIPS）中研究主题演进的系统性刻画。其解决方案的关键在于构建一个基于人工标注词典的标准化文本处理流程：对26,104篇论文的标题和摘要进行归一化、短语保护，并匹配至细粒度主题标签体系（最多35类），从而精准提取任务类型、模型架构、训练策略、优化目标、数据集及多模态共现等信息。此方法实现了跨年度、跨会议的宏观趋势识别，揭示了多模态视觉-语言大模型（VLMs）、生成式方法（Generative Methods）以及3D/视频理解三大领域的演变规律，同时确保结果的可审计性和扩展性。

链接: https://arxiv.org/abs/2510.09586
作者: Fengming Lin
机构: The University of Manchester (曼彻斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: VLM/LLM Learning Notes

点击查看摘要

Abstract:We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
zh

[CV-4] FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection

【速读】：该论文旨在解决生物医学场景中目标检测面临的两大核心挑战：标注数据稀缺性以及新类别或罕见类别的频繁出现。为此，作者提出FSP-DETR框架，其关键在于构建一个基于类无关DETR（DETR）骨干网络的统一模型，通过从支持图像中提取类原型（class prototypes），并利用增强视图和轻量级Transformer解码器学习嵌入空间，从而实现少样本检测、开放集识别及未见生物医学任务的泛化能力。训练过程中联合优化原型匹配损失、基于对齐的分离损失与KL散度正则项，以提升特征判别力与弱监督下的校准性能；该方法在推理阶段无需重训练即可灵活支持未见类别识别、背景过滤与跨任务迁移，显著优于现有少样本和原型驱动的目标检测方法，尤其在低样本和开放集场景下表现突出。

链接: https://arxiv.org/abs/2510.09583
作者: Shubham Trehan,Udhav Ramachandran,Akash Rao,Ruth Scimeca,Sathyanarayanan N. Aakur
机构: Auburn University (奥本大学); Oklahoma State University (俄克拉荷马州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 Figures, 5 Tables. Under Review

点击查看摘要

Abstract:Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
zh

[CV-5] C-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control NEURIPS2025

【速读】：该论文旨在解决当前可控扩散模型在多阶段去噪过程中因采用静态条件策略而导致的适应性不足问题，即固定架构仅通过修改中间激活值来注入条件引导，无法动态响应生成从粗略结构到精细细节的演变过程。解决方案的关键在于提出TC-LoRA（Temporally Modulated Conditional LoRA），其通过一个超网络（hypernetwork）实时生成LoRA适配器，直接基于时间步和用户条件对冻结主干模型的权重进行动态调整，从而实现参数化、时序感知的条件控制机制，使模型能够在整个生成流程中学习并执行显式的自适应引导策略，显著提升生成保真度与空间条件一致性。

链接: https://arxiv.org/abs/2510.09561
作者: Minkyoung Cho,Ruben Ohana,Christian Jacobsen,Adityan Jothi,Min-Hung Chen,Z. Morley Mao,Ethem Can
机构: University of Michigan (密歇根大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages; NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)

点击查看摘要

Abstract:Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model’s ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model’s weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user’s condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model’s conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
zh

[CV-6] FLOWING: Implicit Neural Flows for Structure-Preserving Morphing

【速读】：该论文旨在解决视觉与计算机图形学中长期存在的形态变换（morphing）问题，即如何实现特征对齐的时变形变（time-dependent warping）以及平滑插值（smooth interpolation）。传统基于多层感知机（MLP）的隐式神经表示（INRs）方法虽具备无网格和可微性优势，但通常依赖昂贵的正则化策略来获取一致且准确的形变结果，易导致训练不稳定并阻碍有效特征对齐。其解决方案的关键在于提出 FLOWING（FLOW morphing）框架，将形变建模为微分向量场（differential vector flow）的构建过程，通过在网络结构中直接编码流的结构性质（如连续性、可逆性和时间一致性），自然保障了变换的稳定性与物理合理性，从而实现了高效、结构保持的 2D 图像与 3D 形状形态变换。

链接: https://arxiv.org/abs/2510.09537
作者: Arthur Bizzi,Matias Grynberg,Vitor Matias,Daniel Perazzo,João Paulo Lima,Luiz Velho,Nuno Gonçalves,João Pereira,Guilherme Schardong,Tiago Novello
机构: EPFL(瑞士联邦理工学院); University of Buenos Aires(布宜诺斯艾利斯大学); University of São Paulo(圣保罗大学); IMPA(国家数学与计算机科学研究所); Universidade Federal Rural de Pernambuco(伯南布哥联邦农村大学); ISR-UC(里斯本大学研究所); INCM(国家医学研究中心); University of Georgia(佐治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages main paper; 9 pages references and appendix

点击查看摘要

Abstract:Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at this http URL.
zh

[CV-7] PRNet: Original Information Is All You Have

【速读】：该论文旨在解决航空图像中小目标检测中因特征提取过程中信息退化严重而导致的漏检和误检问题，其核心在于浅层空间细节难以与语义信息有效对齐。解决方案的关键在于提出PRNet框架，通过两个核心模块实现：一是基于主干网络复用和迭代精化的渐进式优化颈部（Progressive Refinement Neck, PRN），以提升空间与语义特征的对齐能力；二是增强型切片采样模块（Enhanced SliceSamp, ESSamp），通过优化重排和卷积策略在下采样过程中保留浅层空间信息，从而显著改善小目标表示能力。

链接: https://arxiv.org/abs/2510.09531
作者: PeiHuang Zheng,Yunlong Zhao,Zheng Cui,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Small object detection in aerial images suffers from severe information degradation during feature extraction due to limited pixel representations, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods attempt to mitigate these losses through post-processing enhancements, but the reconstructed details often deviate from the original image information, impeding their fusion with semantic content. To address this limitation, we propose PRNet, a real-time detection framework that prioritizes the preservation and efficient utilization of primitive shallow spatial features to enhance small object representations. PRNet achieves this via two modules:the Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution. Extensive experiments on the VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.
zh

[CV-8] Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions

【速读】：该论文旨在解决智能手机图像中因传感器或处理算法导致的对角线伪影（diagonal artifacts）对基于照片噪声指纹（Photo Response Non-Uniformity, PRNU）的相机源识别可靠性造成的干扰问题。其关键解决方案在于：对于支持PRO模式并可获取原始图像（raw images）的设备（如部分Galaxy S系列），通过使用未经处理的原始数据可规避引入伪影的图像处理流水线，从而保证PRNU验证的准确性；而对于不支持RAW输出的中端机型（如Galaxy A系列）或无法获取原始图像的法证场景，则提出利用这些对角线伪影本身作为辅助特征，用于提升HDR图像中的误检率控制和定位合成背景虚化（synthetic bokeh）区域，实现新的法证应用价值。

链接: https://arxiv.org/abs/2510.09509
作者: David Vázquez-Padín,Fernando Pérez-González,Alejandro Martín-Del-Río
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.
zh

[CV-9] PhysToolBench: Benchmarking Physical Tool Understanding for MLLM s

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在物理工具理解能力上的量化评估缺失问题，即尽管MLLMs在高阶规划和视觉-语言-动作（Vision-Language-Action, VLA）模型中展现出强大通用知识，但其对工具功能、原理及创造性的实际掌握程度尚不明确。解决方案的关键在于构建首个专门用于评估MLLMs物理工具理解能力的基准测试——PhysToolBench，该基准采用视觉问答（Visual Question Answering, VQA）形式，包含超过1000组图像-文本对，从三个难度层级系统性地考察模型能力：工具识别（Tool Recognition）、工具理解（Tool Understanding）和工具创造（Tool Creation）。通过在32个代表性MLLMs上的全面评测，研究揭示了现有模型在工具理解层面存在显著不足，并提出了初步改进方向。

链接: https://arxiv.org/abs/2510.09507
作者: Zixin Zhang,Kanghao Chen,Xingwang Lin,Lutao Jiang,Xu Zheng,Yuanhuiyi Lyu,Litao Guo,Yinchuan Li,Ying-Cong Chen
机构: HKUST(GZ); HKUST; Beihang University; Knowin
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool’s primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool’s operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs-spanning proprietary, open-source, specialized embodied, and backbones in VLAs-reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
zh

[CV-10] A methodology for clinically driven interactive segmentation evaluation

【速读】：该论文旨在解决当前交互式医学图像分割（Interactive Medical Image Segmentation）领域中评估方法不一致且临床现实性不足的问题，这导致算法性能比较不公平，并不能真实反映其在实际临床场景中的表现。为应对这一挑战，作者提出了一种基于临床实践的评估任务与指标定义方法，并构建了一个标准化评估流程的软件框架。解决方案的关键在于：(1) 强调在处理用户交互时最小化信息损失以提升模型鲁棒性；(2) 引入自适应缩放机制（adaptive-zooming）以增强模型稳健性和加速收敛；(3) 明确指出训练与验证阶段提示行为或预算（prompting budget）不一致会导致性能下降；(4) 揭示2D方法适用于切片状图像和粗略目标，而3D上下文对大尺寸或不规则形状目标更有效；(5) 非医学领域预训练模型（如SAM2）在低对比度和复杂形态下性能显著退化。这些发现共同构成了一个可复现、临床相关的评估体系，推动了交互式分割算法向更可靠、实用的方向发展。

链接: https://arxiv.org/abs/2510.09499
作者: Parhom Esmaeili,Virginia Fernandez,Pedro Borges,Eli Gibson,Sebastien Ourselin,M. Jorge Cardoso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, Medical Image Computing and Computed Assisted Intervention 2025

点击查看摘要

Abstract:Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
zh

[CV-11] Few-shot multi-token DreamBooth with LoRa for style-consistent character generation

【速读】：该论文旨在解决如何从少量人工设计的参考角色中生成大量新颖且保持一致艺术风格与视觉特征的角色问题，从而拓展动画、游戏等领域的创作可能性。其核心解决方案在于对DreamBooth（一种用于文本到图像扩散模型的微调技术）进行改进：首先采用多标记（multi-token）策略，通过聚类为每个角色及其整体风格分配独立标记；其次结合LoRA（Low-Rank Adaptation）参数高效微调方法，在训练中移除类别特定正则化集，并在生成阶段引入随机标记和嵌入，实现无限角色生成的同时保留学习到的艺术风格。这一方法显著提升了角色多样性与风格一致性，经定量指标与人类评估验证，效果优于现有基线。

链接: https://arxiv.org/abs/2510.09475
作者: Ruben Pascual,Mikel Sesma-Sara,Aranzazu Jurio,Daniel Paternain,Mikel Galar
机构: University of Navarra (纳瓦拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
zh

[CV-12] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models

【速读】：该论文旨在解决视觉-语言模型（Vision-Language Models, VLMs）在测试时提示调优（test-time prompt tuning）过程中因模态间特征维度主导性差异导致的校准性能下降问题。其关键解决方案是提出维度熵最大化（dimensional entropy maximization），通过正则化文本特征分布趋向均匀，抑制单一主导维度对模型预测的影响，从而提升模型在域偏移场景下的校准可靠性。

链接: https://arxiv.org/abs/2510.09473
作者: Jisu Han,Wonjun Hwang
机构: Korea University (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
zh

[CV-13] SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests

【速读】：该论文旨在解决森林管理中复杂自然环境下树种感知能力不足的问题，特别是针对遮挡严重、光照变化大和植被密集等挑战，这些因素限制了自动化系统在精准林业、生物多样性监测及林业设备自动化中的应用。其解决方案的关键在于构建了一个名为SilvaScenes的新数据集，该数据集包含来自加拿大魁北克省五个生物气候区的1476棵树木（24个物种）的林下图像，并由林业专家进行标注，用于实例分割任务。实验表明，尽管树体分割性能良好（mAP达67.65%），但细粒度树种分类仍具挑战性（mAP仅35.69%），凸显了该数据集对推动森林场景下生成式AI（Generative AI）感知模型发展的价值。

链接: https://arxiv.org/abs/2510.09458
作者: David-Alexandre Duclos,William Guimont-Martin,Gabriel Jeanson,Arthur Larochelle-Tremblay,Théo Defosse,Frédéric Moore,Philippe Nolet,François Pomerleau,Philippe Giguère
机构: Northern Robotics Laboratory (北方机器人实验室); Université Laval (拉瓦尔大学); Institut des Sciences de la Forêt tempérée (森林科学研究所); Université du Québec en Outaouais (魁北克省渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at this https URL.
zh

[CV-14] Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement

【速读】：该论文旨在解决低光照视频增强（Low-light Video Enhancement, LLVE）中因噪声、对比度低和色彩退化带来的挑战，尤其针对现有基于学习的方法在真实低光照场景下难以有效利用时间信息而导致的重噪声问题。其解决方案的关键在于提出一种两阶段框架DWTA-Net，第一阶段通过视觉状态空间块（Visual State-Space blocks）实现多帧对齐，恢复亮度、颜色与结构并保持局部一致性；第二阶段引入基于光流引导的动态权重时间聚合机制的循环精修模块，自适应平衡静态与动态区域，并结合纹理自适应损失函数，在保留细节的同时提升平坦区域的平滑性，从而显著抑制噪声与伪影，提升整体视觉质量。

链接: https://arxiv.org/abs/2510.09450
作者: Ruirui Lin,Guoxi Huang,Nantheera Anantrasirichai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
zh

[CV-15] Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians

【速读】：该论文旨在解决从单目视频中重建的4D场景（4D scene）进行基于文本提示的局部编辑问题，其核心挑战在于如何在复杂动态场景中实现语义精确的局部区域编辑，同时保持未编辑内容的完整性。解决方案的关键在于提出Mono4DEditor框架：首先通过将量化后的CLIP特征嵌入3D高斯表示（3D Gaussians），构建语言感知的动态场景表示，从而支持对任意空间区域的高效语义查询；其次设计两阶段点级定位策略，先利用CLIP相似度筛选候选高斯点，再精细调整其空间范围以提升定位精度；最后结合扩散模型与光流及涂鸦引导（flow and scribble guidance）对局部区域实施精准编辑，保障空间一致性与时间连贯性。

链接: https://arxiv.org/abs/2510.09438
作者: Jin-Chuan Shi,Chengye Su,Jiajun Wang,Ariel Shamir,Miao Wang
机构: Zhejiang University (浙江大学); Beihang University (北京航空航天大学); Reichman University (雷克曼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 9 figures

点击查看摘要

Abstract:Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
zh

[CV-16] Utilizing dynamic sparsity on pretrained DETR

【速读】：该论文旨在解决基于Transformer的视觉模型（如DETR）在推理阶段计算效率低的问题，特别是其MLP层中存在的固有稀疏性未被有效利用。解决方案的关键在于提出两种无需重新训练即可挖掘这种稀疏性的方法：一是静态指示器驱动的稀疏化（SIBS），通过固定激活模式预测神经元不活跃状态，但受限于输入依赖性导致收益有限；二是微门控稀疏化（MGS），引入一个轻量级门控机制（由一个小线性层实现）对预训练DETR进行微调，动态预测神经元激活状态，从而实现高达85–95%的激活稀疏度，在保持甚至提升检测性能的同时显著降低计算开销。MGS的核心优势在于其输入自适应性与部署友好性，为高效部署预训练视觉Transformer提供了实用路径。

链接: https://arxiv.org/abs/2510.09380
作者: Reza Sedghi,Anand Subramoney,David Kappel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages 4 figures and 4 tables , accepted for 2025 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, AUG. 31 to SEP. 3, 2025, ISTANBUL, TURKEY

点击查看摘要

Abstract:Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves up to 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
zh

[CV-17] Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification

【速读】：该论文旨在解决从机载激光雷达（LiDAR）点云中直接准确估算林木体积和地上生物量（Aboveground Biomass, AGB）的难题，其核心挑战在于建模长距离依赖关系以有效区分树木个体。解决方案的关键创新在于提出Minkowski-MambaNet框架，该框架将Mamba模型中的选择性状态空间模型（Selective State Space Model, SSM）嵌入到Minkowski网络结构中，从而高效编码全局上下文信息与长程依赖关系，显著提升树冠识别精度；同时引入跳跃连接（skip connections）增强特征表达并加速训练收敛。实验表明，该方法在丹麦国家森林清查LiDAR数据上优于现有最先进方法，且无需数字高程模型（Digital Terrain Model, DTM），对边界伪影具有鲁棒性。

链接: https://arxiv.org/abs/2510.09367
作者: Jinxiang Tu,Dayong Ren,Fei Shi,Zhenhong Jia,Yahong Ren,Jiwei Qin,Fang He
机构: Xinjiang University (新疆大学); Nanjing University (南京大学); Xinjiang Guanghui New Energy Co., Ltd (新疆广汇新能源有限公司); Hongyousoft Co., Ltd (宏友软件有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish this http URL propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model’s Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate this http URL on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.
zh

[CV-18] Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

【速读】：该论文旨在解决3D高斯散射（3D Gaussian splatting, 3DGS）在复杂城市场景中因初始点云不完整或不可靠而导致的几何重建失真问题，尤其是在无界、动态环境中难以获得均匀且完整的点覆盖。其关键解决方案在于提出VAD-GS框架，通过体素级可见性推理识别不可靠几何结构，结合多样性感知的视图选择策略筛选有效支撑视角，并利用基于图像块匹配的多视图立体重建方法恢复缺失结构，从而在缺乏初始点的区域生成由可靠几何先验引导的新高斯原型，显著提升静态与动态物体的重建质量。

链接: https://arxiv.org/abs/2510.09364
作者: Yikang Zhang,Rui Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
zh

[CV-19] BLINK-Twice: You see but do you observe? A Reasoning Benchmark on Visual Perception NEURIPS2025

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在视觉推理能力评估中存在偏重语言推理、忽视图像本体信息的问题。现有基准测试通常将视觉输入视为可替换的上下文，未能真正检验模型从图像内容出发进行细粒度观察与分析的能力。解决方案的关键在于提出一个以视觉为中心的推理基准BLINK-Twice，其核心创新包括：（1）设计七类视觉挑战任务，聚焦于图像本体的复杂感知；（2）引入自然对抗性图像对，强制模型依赖视觉内容而非外部知识；（3）提供标注的推理链（reasoning chains），实现对推理过程的细粒度评估，而非仅关注最终答案。这一方法推动了从“看见”到“观察”的范式转变，揭示了当前模型在视觉推理上的局限性，并指出重复图像观察和主动视觉交互（如o3模型所展示）是未来提升方向。

链接: https://arxiv.org/abs/2510.09361
作者: Junyan Ye,Dongzhi Jiang,Jun He,Baichuan Zhou,Zilong Huang,Zhiyuan Yan,Hongsheng Li,Conghui He,Weijia Li
机构: Sun Yat-sen University (中山大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); CUHK MMLab (香港中文大学多媒体实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception (“see”) and requires fine-grained observation and analytical reasoning (“observe”). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space-such as chain-of-thought or self-criticism can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at this https URL
zh

[CV-20] Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models EMNLP2025

【速读】：该论文旨在解决多模态关键词预测（Multi-modal keyphrase prediction, MMKP）任务中现有方法在处理缺失模态和未见场景时表现不佳的问题，以及现有基准测试因训练集与测试集存在显著重叠而高估模型性能的缺陷。解决方案的关键在于：首先利用视觉语言模型（Vision-language models, VLMs）通过零样本（zero-shot）和监督微调（supervised fine-tuning, SFT）策略评估其下界性能；其次引入Fine-tune-CoT方法，借助教师模型生成的高质量思维链（Chain-of-Thought, CoT）数据对小型模型进行微调，以增强其复杂推理能力；最后提出一种动态思维链（dynamic CoT）策略，在训练过程中自适应注入CoT数据，使模型在推理阶段能灵活调用推理能力，从而缓解“过度思考”（overthinking）现象。

链接: https://arxiv.org/abs/2510.09358
作者: Qihang Ma,Shengyu Li,Jie Tang,Dingkang Yang,Shaodong Chen,Yingyi Zhang,Chao Feng,Jiao Ran
机构: ByteDance Douyin Content Group (字节跳动抖音内容组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: EMNLP2025. Code is avaible at this https URL

点击查看摘要

Abstract:Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap in training tests. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely-used strategies, e.g., zero-shot and supervised fine-tuning (SFT) to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at this https URL.
zh

[CV-21] Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark NEURIPS2025

【速读】：该论文旨在解决热红外图像增强中耦合退化问题，即现有方法通常仅针对单一退化类型（如噪声、对比度低或模糊）进行优化，难以有效处理多种退化同时存在的情况；而通用的RGB图像增强方法因成像模型差异，在热红外场景下效果有限。解决方案的关键在于提出一种渐进式提示融合网络（Progressive Prompt Fusion Network, PPFN），通过重建热成像机制构建退化类型的提示对，并基于不同退化特征动态融合提示以自适应引导模型特征，从而实现对单退化与多退化场景的精准增强；同时引入选择性渐进训练（Selective Progressive Training, SPT）机制，逐步提升模型对复合退化情况的处理能力，显著改善图像信噪比与结构保真度，最终在多场景红外基准上实现8.76%的性能提升。

链接: https://arxiv.org/abs/2510.09343
作者: Jinyuan Liu,Zihang Chen,Zhu Liu,Zhiying Jiang,Long Ma,Xin Fan,Risheng Liu
机构: School of Software Engineering, Dalian University of Technology (大连理工大学软件工程学院); Information Science and Technology College, Dalian Martime University (大连海事大学信息科学技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by NeurIPS 2025

点击查看摘要

Abstract:We engage in the relatively underexplored task named thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In sight of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model’s features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model’s handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhancing the overall contrast of the thermal image. Furthermore, we introduce the most high-quality, multi-scenarios infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement. Code is available at this https URL.
zh

[CV-22] Efficient Bayesian Inference from Noisy Pairwise Comparisons

【速读】：该论文旨在解决生成式 AI（Generative AI）模型人类评估中因评分者差异导致的可靠性与可解释性不足的问题。现有基于 Bradley-Terry 模型的方法要么忽略评分者质量差异，要么缺乏收敛性保证，从而影响排名的稳健性和可信度。其解决方案的关键在于提出 BBQ（Bayesian Bradley-Terry with Quality modeling），通过引入贝叶斯框架显式建模评分者质量，自动降低不可靠评分者的权重甚至剔除其影响，并利用期望最大化（Expectation-Maximization, EM）算法确保似然函数单调收敛，从而在噪声或众包评分场景下仍能实现更快速收敛、校准良好的不确定性估计以及更具鲁棒性和可解释性的排序结果。

链接: https://arxiv.org/abs/2510.09333
作者: Till Aczel,Lucas Theis,Wattenhofer Roger
机构: ETH Zürich (苏黎世联邦理工学院); Mabyduck; ETH Zürich (苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
zh

[CV-23] Instance-Aware Robust Consistency Regularization for Semi-Supervised Nuclei Instance Segmentation

【速读】：该论文旨在解决病理图像中细胞核实例分割在半监督学习场景下的性能瓶颈问题，尤其是现有方法在实例级别一致性正则化不足、未能有效利用病理结构先验知识以及训练过程中易引入噪声伪标签等缺陷。其解决方案的关键在于提出一种基于实例感知的鲁棒一致性正则化网络（IRCR-Net），通过引入两种创新机制：匹配驱动的实例感知一致性（Matching-Driven Instance-Aware Consistency, MIAC）和先验驱动的实例感知一致性（Prior-Driven Instance-Aware Consistency, PIAC），以精细化教师与学生子网络的实例级分割结果，尤其针对密集分布和重叠细胞核；同时融合形态学先验知识对未标注数据生成的伪标签质量进行评估，剔除低质量伪标签并增强高质量预测，从而降低伪标签噪声，提升模型训练的鲁棒性。

链接: https://arxiv.org/abs/2510.09329
作者: Zenan Lin,Wei Li,Jintao Chen,Zihao Wu,Wenxiong Kang,Changxin Gao,Liansheng Wang,Jin-Gang Yu
机构: South China University of Technology (华南理工大学); University of South China (南华大学); Huazhong University of Science and Technology (华中科技大学); Xiamen University (厦门大学); Pazhou Laboratory (琶洲实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi-supervised methods fail to adequately regularize consistency at the instance level, lack leverage of the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo-labels during training. In this paper, we propose an Instance-Aware Robust Consistency Regularization Network (IRCR-Net) for accurate instance-level nuclei segmentation. Specifically, we introduce the Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation result of the teacher and student subnetwork, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo-labels generated from unlabeled data. Low-quality pseudo-labels are discarded, while high-quality predictions are enhanced to reduce pseudo-label noise and benefit the network’s robust training. Experimental results demonstrate that the proposed method significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.
zh

[CV-24] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation ICCV2025

【速读】：该论文旨在解决当前自监督单目深度估计（Self-supervised Monocular Depth Estimation, MDE）方法因语义-空间知识提取不足而导致性能受限的问题。其解决方案的关键在于提出了一种名为Hybrid-depth的新框架，通过系统性地融合基础模型（如CLIP和DINO）来提取视觉先验并获取充分的上下文信息：首先利用CLIP提取全局语义特征、DINO提取局部空间细节，并在对比语言引导下设计代理任务以强制深度感知特征对齐；随后在此基础上引入相机位姿信息与像素级语言对齐机制，实现从粗到精的渐进式学习，从而有效缓解特征粒度不匹配问题，并可无缝集成至现有自监督MDE流程中作为即插即用的深度编码器，显著提升深度估计精度及下游任务（如BEV感知）性能。

链接: https://arxiv.org/abs/2510.09320
作者: Wenyao Zhang,Hongsi Liu,Bohan Li,Jiawei He,Zekun Qi,Yunnan Wang,Shengyang Zhao,Xinqiang Yu,Wenjun Zeng,Xin Jin
机构: Shanghai Jiao Tong University (上海交通大学); Eastern Institute of Technology (东方理工大学); University of Science and Technology of China (中国科学技术大学); CASIA (中国科学院自动化研究所); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP’s semantic context and DINO’s spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also indeed benefits downstream tasks like BEV perception. Code is available at this https URL.
zh

[CV-25] RadioFlow: Efficient Radio Map Construction Framework with Flow Matching

【速读】：该论文旨在解决当前基于扩散模型的无线电信道地图（Radio Map, RM）生成方法中存在的模型规模大、迭代去噪速度慢及推理延迟高问题，这些问题严重限制了其在下一代无线系统中的实际部署。解决方案的关键在于提出一种基于流匹配（flow-matching）的新框架——RadioFlow，该框架通过学习从噪声到数据的连续传输轨迹（continuous transport trajectories），实现单步高效采样，从而显著加速训练与推理过程，同时保持高保真度的重建精度。实验表明，RadioFlow相较领先的扩散基线模型RadioDiff，在参数量上减少最多8倍，推理速度提升超过4倍，为未来6G网络中可扩展、低功耗、实时的电磁数字孪生提供了可行路径。

链接: https://arxiv.org/abs/2510.09314
作者: Haozhe Jia,Wenshuo Chen,Xiucheng Wang,Nan Cheng,Hongbo Zhang,Kuimou Yu,Songning Lai,Nanjian Jia,Bowen Tian,Hongru Xiao,Yutao Yue
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州）); Xidian University (西安电子科技大学); Peking University (北京大学); Beijing Technology and Business University (北京工商大学); Tongji University (同济大学); Institute of Deep Perception Technology, JITRI (深感知技术研究院，江苏研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbfRadioFlow, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbfup to 8 \times fewer parameters and \textbfover 4 \times faster inference compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \hrefthis https URLGitHub.
zh

[CV-26] Forag ing with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling

【速读】：该论文试图解决的问题是：人类视觉注视行为的统计动力学机制尚不明确，尤其是其时空特性是否与自然觅食行为中的莱维飞行（Levy walk）相似，以及这种规律性是否可由图像本身的视觉结构所驱动。解决方案的关键在于通过大规模眼动实验（40名参与者、50张图像、超400万次注视点记录）发现人类视觉扫描轨迹遵循莱维飞行模式，表明视觉信息获取具有最优效率；同时训练卷积神经网络（CNN）仅基于图像输入即可准确预测显著性注视热点图，证明关键注视行为特征可从视觉结构中学习并建模，从而为注意力机制和视觉界面设计提供了生成式与预测性框架的基础。

链接: https://arxiv.org/abs/2510.09299
作者: Tejaswi V. Panchagnula
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Animals often forage via Levy walks stochastic trajectories with heavy tailed step lengths optimized for sparse resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Levy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
zh

[CV-27] Spotlight on Token Perception for Multimodal Reinforcement Learning UAI

【速读】：该论文旨在解决当前多模态强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）中忽视视觉感知在策略优化过程中作用的问题，尤其关注大型视觉语言模型（Large Vision-Language Models, LVLMs）在生成式推理链（Chain-of-Thought, CoT）中的视觉依赖性建模不足。其解决方案的关键在于提出一种名为“视觉感知策略优化”（Visually-Perceptive Policy Optimization, VPPO）的新颖策略梯度算法，该算法通过两个核心机制实现：一是基于轨迹整体视觉依赖性对优势值进行重加权，二是仅针对具有感知关键性的token进行策略更新，从而显著提升LVLM在多模态推理任务中的表现。

链接: https://arxiv.org/abs/2510.09285
作者: Siyuan Huang,Xiaoye Qu,Yafu Li,Yun Luo,Zefeng He,Daizong Liu,Yu Cheng
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学); Nanjing University (南京大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 10 figures, project page: this https URL

点击查看摘要

Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory’s advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
zh

[CV-28] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

【速读】：该论文旨在解决生成式 AI (Generative AI) 在视频目标分割任务中因采样策略不当导致的时序推理能力不足与视觉理解精度低的问题。现有方法要么依赖人工设计的启发式规则（忽略关键时序线索），要么引入外部关键帧模型（增加系统复杂度）。其核心解决方案是提出一个统一框架，联合优化时序句子定位（Temporal Sentence Grounding, TSG）与参考视频目标分割（Referring Video Object Segmentation, RefVOS），自然融合关键时刻识别能力：训练阶段采用新型TSG范式，通过专用[FIND]标记进行时序token相似性匹配实现无需外部时间戳编码的关键时刻定位；推理阶段设计以时刻为中心的采样策略（Moment-Centric Sampling, MCS），在密集采样信息丰富时刻的同时稀疏采样非必要帧，兼顾运动细节与全局上下文；并进一步引入双向锚点更新传播机制（Bidirectional Anchor-updated Propagation, BAP），利用最相关时刻作为起点进行高质量掩码初始化，并在采样点动态更新锚点以减少误差累积，从而提升跟踪稳定性。

链接: https://arxiv.org/abs/2510.09274
作者: Ming Dai,Sen Yang,Boqiang Duan,Wankou Yang,Jingdong Wang
机构: Southeast University (东南大学); Baidu VIS (百度视觉智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt[FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: this https URL
zh

[CV-29] Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects

【速读】：该论文旨在解决视觉-语言-动作（Vision-Language-Action, VLA）模型在训练过程中因使用未加筛选的数据集而引发的安全漏洞问题，特别是针对物理触发器（physical trigger）注入所导致的隐蔽性后门攻击。现有攻击多假设白盒环境且仅造成任务失败，缺乏对目标导向行为的控制。本文提出目标导向后门攻击（Goal-oriented Backdoor Attack, GoBA），其关键在于通过向训练数据中插入特定物理对象作为触发器，使VLA在无触发器时保持正常功能，而在检测到物理触发器时则执行预设的目标动作，从而实现精准、隐蔽且高效的攻击。实验表明，GoBA在包含触发器的输入中成功达成目标动作的比例达97%，同时对干净样本性能无任何影响，验证了其有效性与隐蔽性。

链接: https://arxiv.org/abs/2510.09269
作者: Zirun Zhou,Zhengyang Xiao,Haochuan Xu,Jing Sun,Di Wang,Jingfeng Zhang
机构: The University of Auckland (奥克兰大学); The King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in vision-language-action (VLA) models have greatly improved embodied AI, enabling robots to follow natural language instructions and perform diverse tasks. However, their reliance on uncurated training datasets raises serious security concerns. Existing backdoor attacks on VLAs mostly assume white-box access and result in task failures instead of enforcing specific actions. In this work, we reveal a more practical threat: attackers can manipulate VLAs by simply injecting physical objects as triggers into the training dataset. We propose goal-oriented backdoor attacks (GoBA), where the VLA behaves normally in the absence of physical triggers but executes predefined and goal-oriented actions in the presence of physical triggers. Specifically, based on a popular VLA benchmark LIBERO, we introduce BadLIBERO that incorporates diverse physical triggers and goal-oriented backdoor actions. In addition, we propose a three-level evaluation that categorizes the victim VLA’s actions under GoBA into three states: nothing to do, try to do, and success to do. Experiments show that GoBA enables the victim VLA to successfully achieve the backdoor goal in 97 percentage of inputs when the physical trigger is present, while causing zero performance degradation on clean inputs. Finally, by investigating factors related to GoBA, we find that the action trajectory and trigger color significantly influence attack performance, while trigger size has surprisingly little effect. The code and BadLIBERO dataset are accessible via the project page at this https URL.
zh

[CV-30] Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

【速读】：该论文旨在解决黑盒视觉语言模型（Vision-Language Models, VLMs）在放射学图像视觉问答（Visual Question Answering, VQA）任务中因生成幻觉（hallucination）而导致诊断准确性下降的问题。其核心解决方案是引入离散语义熵（Discrete Semantic Entropy, DSE）作为量化指标，通过分析模型对同一问题多次回答的语义一致性来识别高风险幻觉问题，并据此过滤掉高DSE值的问题，从而提升剩余问答对的准确率。实验表明，使用DSE阈值0.3进行筛选后，GPT-4o和GPT-4.1在保留问题上的准确率分别显著提升至76.3%和63.8%，验证了该方法在临床VLM应用中的有效性与可靠性。

链接: https://arxiv.org/abs/2510.09256
作者: Patrick Wienholt,Sophie Caselitz,Robert Siepmann,Philipp Bruners,Keno Bressem,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available: this https URL

点击查看摘要

Abstract:To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE 0.6 or 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
zh

[CV-31] Zero-shot image privacy classification with Vision-Language Models ICASSP2026

【速读】：该论文旨在解决当前图像隐私预测领域中因过度依赖通用视觉-语言模型（Vision-Language Models, VLMs）而可能忽视专用小模型性能上限的问题，这一趋势缺乏系统性评估。其解决方案的关键在于构建一个零样本（zero-shot）图像隐私分类基准，以实现公平比较；通过使用任务对齐提示（task-aligned prompts）评估排名前三的开源VLMs，并将其在准确性、效率和鲁棒性方面与传统视觉单模态及多模态方法进行对比，从而揭示VLMs在隐私预测任务中虽具更高鲁棒性但准确率落后于专用小模型的核心现象。

链接: https://arxiv.org/abs/2510.09253
作者: Alina Elena Baia,Alessio Xompero,Andrea Cavallaro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 5 pages, 3 figures, 3 tables. This work has been submitted to the ICASSP 2026

点击查看摘要

Abstract:While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
zh

[CV-32] Clear Roads Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation

【速读】：该论文旨在解决恶劣天气条件（如雾霾、雨雪）对图像和视频质量的显著退化问题，此类退化严重影响智能交通系统（ITS）中依赖视觉输入的关键应用，如自动驾驶、交通监控与安防。解决方案的关键在于系统性地梳理和分类现有图像与视频复原技术，将其划分为基于传统先验的方法与现代数据驱动模型（包括CNN、Transformer、扩散模型及新兴的视觉语言模型VLMs），并进一步依据处理范围细分为单任务模型、多任务/多天气系统以及全功能一体化框架。同时，论文还深入探讨了昼夜差异下的复原挑战、基准数据集与评估协议，并指出当前研究局限，提出未来方向如混合退化复原、实时部署及代理型AI框架，为构建鲁棒性强的智慧交通视觉系统提供理论支撑与实践路径。

链接: https://arxiv.org/abs/2510.09228
作者: Vijay M. Galshetwar,Praful Hambarde,Prashant W. Patil,Akshay Dudhane,Sachin Chaudhary,Santosh Kumar Vipparathi,Subrahmanyam Murala
机构: Finolex Academy of Management and Technology (菲诺莱克管理与技术学院); Drone Lab, Centre for Artificial Intelligence and Robotics, Indian Institute of Technology Mandi (印度理工学院曼迪人工智能与机器人中心无人机实验室); Mehta Family School of Data Science and Artificial Intelligence, Indian Institute of Technology Guwahati (印度理工学院古瓦哈提梅hta家庭数据科学与人工智能学院); Computer Vision Department, MBZUAI (MBZUAI计算机视觉系); School of Computer Science, UPES Dehradun (UPES德赫拉敦计算机科学学院); Department of Electrical Engineering, Indian Institute of Technology Ropar (印度理工学院罗帕尔电气工程系); School of Computer Science and Statistics, Trinity College Dublin (都柏林三一学院计算机科学与统计学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work has been submitted to IEEE for possible publication

点击查看摘要

Abstract:Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at this https URL
zh

[CV-33] ag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation

【速读】：该论文旨在解决跨域序列推荐（Cross-Domain Sequential Recommendation, CDSR）中难以同时捕捉领域特定行为模式与跨域用户偏好动态变化的问题。其解决方案的关键在于提出TEMA-LLM框架，通过引入大语言模型（Large Language Models, LLMs）生成语义标签并增强物品表征，结合标签增强的多注意力机制联合建模用户在单域及跨域中的兴趣演化，从而提升推荐系统的个性化与泛化能力。

链接: https://arxiv.org/abs/2510.09224
作者: Wangyu Wu,Xuhang Chen,Zhenhong Chen,Jing-En Jiang,Kim-Fung Tsang,Xiaowei Huang,Fei Ma,Jimin Xiao
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学); Huizhou University (惠州学院); Microsoft (微软); Shandong University (山东大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Transactions on Consumer Electronics 2025

点击查看摘要

Abstract:Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose \textbfTEMA-LLM (\textitTag-Enriched Multi-Attention with Large Language Models), a practical and effective framework that integrates \textitLarge Language Models (LLMs) for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A \textitTag-Enriched Multi-Attention mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
zh

[CV-34] Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

【速读】：该论文旨在解决长视频生成中因误差累积导致的时空一致性下降问题，特别是现有方法在自回归推理时无法有效应对由模型自身生成错误所引发的“假设差距”（hypothesis gap）——即训练阶段假设输入为干净数据，而测试阶段却依赖于带有误差的自生成输出。解决方案的关键在于提出一种名为 Error-Recycling Fine-Tuning 的新型高效训练机制，其核心是将扩散 Transformer（Diffusion Transformer, DiT）在推理过程中产生的自动生成误差回收并作为监督信号注入训练流程，从而促使模型主动识别和修正自身错误。具体实现包括：通过闭环循环注入历史误差模拟误差积累轨迹、利用单步双向积分近似预测并计算残差误差、以及动态地将误差存入离散时间步长的回放缓冲区以供后续重采样使用，最终实现无额外推理成本下的无限长度视频生成，并保持与多种条件输入（如音频、骨骼、文本流）的高度兼容性。

链接: https://arxiv.org/abs/2510.09212
作者: Wuyang Li,Wentao Pan,Po-Chien Luan,Yang Gao,Alexandre Alahi
机构: VITA@EPFL (EPFL视觉智能与技术实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
zh

[CV-35] 3D Reconstruction from Transient Measurements with Time-Resolved Transformer

【速读】：该论文旨在解决光子效率成像中因传感器量子效率低和噪声水平高而导致的三维（3D）重建性能受限问题，尤其在远距离或复杂场景下更为显著。其解决方案的关键在于提出了一种通用的时间分辨Transformer（Time-Resolved Transformer, TRT）架构，该架构包含两个精心设计的注意力机制：一是时空自注意力编码器，通过特征分层与下采样策略挖掘瞬态数据中的局部与全局相关性；二是时空交叉注意力解码器，在token空间中融合局部与全局特征，从而生成具有强表征能力的深度特征。基于TRT，作者进一步构建了针对视线内（Line-of-Sight, LOS）和非视线内（Non-Line-of-Sight, NLOS）成像的专用模型TRT-LOS与TRT-NLOS，在合成与真实数据上均显著优于现有方法。

链接: https://arxiv.org/abs/2510.09205
作者: Yue Li,Shida Sun,Yu Hong,Feihu Xu,Zhiwei Xiong
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Transient measurements, captured by the timeresolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at this https URL.
zh

[CV-36] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition

【速读】：该论文旨在解决畜牧业中动物行为识别任务在数据稀缺场景下的性能瓶颈问题，尤其是在真实农场环境中视频监控数据有限时，传统深度学习方法难以实现高精度识别。其解决方案的关键在于提出一种基于多模态深度学习的框架Cattle-CLIP，该框架通过引入时间整合模块（temporal integration module）对视频帧进行时序建模，并利用语义提示（specialised text prompts）增强图像与文本之间的对齐能力，从而有效缓解预训练模型与实际畜牧监控数据之间的领域差异（domain gap）。此外，研究还设计了针对性的数据增强策略，使模型在少样本（few-shot）条件下仍能保持鲁棒的泛化能力，实验表明其在六类牛行为识别中达到96.1%的整体准确率，尤其在进食、饮水和站立反刍等关键行为上接近100%召回率。

链接: https://arxiv.org/abs/2510.09203
作者: Huimin Liu,Jing Gao,Daria Baran,AxelX Montout,Neill W Campbell,Andrew W Dowsey
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, submitted to Computers and Electronics in Agriculture

点击查看摘要

Abstract:Cattle behaviour is a crucial indicator of an individual animal health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition - an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.
zh

[CV-37] owards Safer and Understandable Driver Intention Prediction

【速读】：该论文旨在解决自动驾驶（Autonomous Driving, AD）系统中决策过程缺乏可解释性的问题，尤其是在车辆 maneuvers（操作）发生前预测驾驶员意图（Driver Intent Prediction, DIP）时，如何提供高可信度、因果导向的解释以保障人机交互安全。其核心挑战在于深度学习模型难以提供人类可理解的环境表征与决策依据。解决方案的关键在于构建一个名为 eXplainable Driving Action Anticipation Dataset (DAAD-X) 的多模态、以自我为中心的视频数据集，该数据集通过驾驶员眼动和车辆视角共同生成层次化、高阶文本解释；并提出 Video Concept Bottleneck Model (VCBM) 框架，该框架在训练阶段即生成时空一致的解释，无需依赖事后解释技术（post-hoc techniques）。实验表明，基于 Transformer 的 VCBM 在可解释性上优于传统 CNN 模型，并借助多标签 t-SNE 可视化技术揭示了多个解释之间的解耦性和因果关联。

链接: https://arxiv.org/abs/2510.09200
作者: Mukilan Karuppasamy,Shankar Gangisetty,Shyam Nandan Rai,Carlo Masone,C V Jawahar
机构: IIIT Hyderabad (印度国际信息技术研究所); Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 10 pages

点击查看摘要

Abstract:Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before they occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver’s decisions. These explanations are derived from both the driver’s eye-gaze and the ego-vehicle’s perspective. Next, we propose Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: this https URL
zh

[CV-38] Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study

【速读】：该论文旨在解决板球击球分类（cricket shot classification）问题，即从视频序列中准确识别运动员的击球类型，这在体育视频分析领域具有挑战性，因为需要同时建模空间特征和时间动态。其关键解决方案在于系统性地比较七种不同的深度学习方法，并基于统一基准进行标准化评估，发现以往文献中报告的高准确率（如96%、99.2%等）在实际复现时显著下降至约46%-57%，揭示了研究中的可重复性危机。论文提出采用现代架构组合——EfficientNet-B0与GRU-based时序模型相结合的方法，在标准测试集上实现92.25%的准确率，表明通过先进模型设计与系统优化可显著提升性能。此外，所有实验均遵循现代MLOps实践（使用PyTorch Lightning），为该领域的可复现研究提供了可靠平台。

链接: https://arxiv.org/abs/2510.09187
作者: Sungwoo Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96% (Balaji LRCN), 99.2% (IJERCSE), and 93% (Sensors), our standardized re-implementations achieve 46.0%, 55.6%, and 57.7% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.
zh

[CV-39] Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

【速读】：该论文旨在解决单目视频深度估计在在线场景中应用受限的问题，现有方法如Video Depth Anything (VDA)虽在长视频序列上表现优异，但依赖批处理（batch-processing）机制，无法满足实时推理需求。解决方案的关键在于引入类大语言模型（Large Language Models, LLMs）的技术思想：在推理阶段缓存潜在特征（caching latent features），并在训练阶段对帧进行掩码（masking frames），从而实现高效的在线视频深度估计。该方法称为online VDA (oVDA)，在精度和显存占用（VRAM）方面均优于现有在线方法，且具备边缘设备部署优势，在NVIDIA A100上达到42 FPS、Jetson边缘设备上达20 FPS。

链接: https://arxiv.org/abs/2510.09182
作者: Johann-Friedrich Feiden,Tim Küchler,Denis Zavadski,Bogdan Savchynskyy,Carsten Rother
机构: Heidelberg University (海德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch-processing which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both, code and compilation scripts, making oVDA easy to deploy on low-power hardware.
zh

[CV-40] ARO: Toward Semantically Rich Open-World Object Detection

【速读】：该论文旨在解决现代目标检测模型在现实世界应用中面临的“封闭世界”假设问题，即模型仅能识别预定义类别的物体，而无法有效处理未知类别物体，导致在自动驾驶等安全关键场景下决策能力受限。为提升检测系统的鲁棒性和实用性，作者提出TARO框架，其核心创新在于不仅识别未知物体，还将其细分为语义层次结构中的粗粒度父类别（如“Unknown Animal”或“Unknown Debris”），从而提供更具信息量的分类结果。解决方案的关键包括：1）基于sparsemax的物体性（objectness）预测头以增强对未知区域的敏感性；2）基于层次结构的重标注模块，通过辅助监督引导模型学习类别间的语义关系；3）一个能够建模层次化分类关系的分类模块，使模型在保持已知类别性能的同时显著提升未知类别的可解释性与区分度。

链接: https://arxiv.org/abs/2510.09173
作者: Yuchen Zhang,Yao Lu,Johannes Betz
机构: Technical University of Munich (慕尼黑工业大学); Munich Institute of Robotics and Machine Intelligence (MIRMI) (慕尼黑机器人与机器智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Modern object detectors are largely confined to a “closed-world” assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as ‘Unknown’, this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an ‘Unknown Animal’ (requiring an urgent stop) versus ‘Unknown Debris’ (requiring a safe lane change) is far more useful than just ‘Unknown’ in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
zh

[CV-41] Instance-Level Generation for Representation Learning

【速读】：该论文旨在解决实例级识别（Instance-level Recognition, ILR）因缺乏大规模标注数据集而难以在多领域实际应用的问题。其解决方案的关键在于提出一种无需依赖真实图像的合成数据生成方法，能够从多个领域在不同条件和背景中自动生成多样化的对象实例，构建大规模训练数据集；在此基础上微调基础视觉模型，显著提升了跨七个ILR基准测试的检索性能，从而为ILR提供了一种高效且有效的全新范式，仅需目标领域的名称作为输入即可实现广泛应用。

链接: https://arxiv.org/abs/2510.09171
作者: Yankun Wu,Zakaria Laskar,Giorgos Kordopatis-Zilos,Noa Garcia,Giorgos Tolias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR’s real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.
zh

[CV-42] Online Topological Localization for Navigation Assistance in Bronchoscopy

【速读】：该论文旨在解决视频支气管镜（video bronchoscopy）过程中医生难以准确判断支气管镜在复杂气道结构中的位置问题，尤其针对传统导航方法依赖患者术前CT扫描和额外传感器带来的高成本与复杂性。其解决方案的关键在于提出了一种基于图像的支气管镜拓扑定位（topological localization）流程，无需患者CT扫描即可实现导航辅助；该方法仅使用模拟数据（phantom data）进行训练，避免了真实数据标注的高昂成本，并展现出良好的泛化能力，在真实数据测试序列上性能优于现有方法。

链接: https://arxiv.org/abs/2510.09144
作者: Clara Tomasini,Luis Riazuelo,Ana C. Murillo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
zh

[CV-43] raining Feature Attribution for Vision Models

【速读】：该论文旨在解决深度神经网络（Deep Neural Networks, DNNs）在测试阶段预测结果缺乏可解释性的问题，尤其是现有解释方法仅从输入特征或训练样本角度进行归因，难以揭示模型决策的细粒度机制。其解决方案的关键在于提出训练特征归因（training feature attribution），该方法将测试预测结果与特定训练图像中的具体区域建立关联，从而提供更精细、针对特定测试样本的解释。实验表明，该方法能够识别导致误分类的有害训练样本，并揭示传统归因方法无法发现的伪相关性（如基于局部patch的捷径学习），为理解深度模型内部工作机制提供了新视角。

链接: https://arxiv.org/abs/2510.09135
作者: Aziz Bacha,Thomas George
机构: Orange Research (橙色研究); École polytechnique (巴黎综合理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores training feature attribution, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
zh

[CV-44] Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation

【速读】：该论文旨在解决传统正交矩（如Zernike矩和伪Zernike矩）在高阶情况下存在的计算复杂度高和数值不稳定性问题。经典方法因径向与角向处理耦合，导致计算复杂度为 $\mathcal{O}(n^3N^2)$ 至 $\mathcal{O}(n^6N^2)$ ，条件数随图像尺寸 $N$ 的四次方增长，限制了高阶矩的应用。其解决方案的关键在于提出一种可分离的极坐标变换——PSepT（Polar Separable Transform），通过离散余弦变换（DCT）径向基与傅里叶谐波角基的张量积构造，实现核函数的完全因子分解，从而将计算复杂度降至 $\mathcal{O}(N^2 \log N)$ ，内存需求降至 $\mathcal{O}(N^2)$ ，条件数缩放降至 $\mathcal{O}(\sqrt{N})$ ，并保持正交性、完备性和旋转协变性等关键性质，使高阶矩分析成为可能。

链接: https://arxiv.org/abs/2510.09125
作者: Satya P. Singh,Rashmi Chaudhry,Anand Srivastava,Jagath C. Rajapakse
机构: Netaji Subhas University of Technology (NSUT) (尼赫鲁大学科技学院); Nanyang Technological University (NTU) (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures, 4 Tables

点击查看摘要

Abstract:Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in \mathcalO(n^3N^2) to \mathcalO(n^6N^2) complexity and \mathcalO(N^4) condition number scaling for the n th-order moments on an N\times N image. We introduce \textbfPSepT (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to \mathcalO(N^2 \log N) , memory requirements to \mathcalO(N^2) , and condition number scaling to \mathcalO(\sqrtN) , representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
zh

[CV-45] MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation

【速读】：该论文旨在解决计算病理学中细胞和细胞核分割任务因标注数据稀缺（尤其是罕见或非典型形态）而导致的模型性能瓶颈问题。传统手动标注成本高昂且效率低下，而合成数据则提供了一种经济高效的替代方案。其解决方案的关键在于提出一种多模态语义扩散模型（Multimodal Semantic Diffusion Model, MSDM），通过联合条件控制细胞/核形态（利用水平与垂直映射）、RGB颜色特征以及BERT编码的实验/适应症元数据，生成像素级精确的图像-掩码对。该模型采用多头交叉注意力机制融合异构模态信息，实现对生成图像的细粒度调控。实验证明，合成数据在嵌入空间中与真实数据具有低Wasserstein距离，并能显著提升特定类别（如柱状细胞）的分割精度，从而系统性增强数据集并针对性改善模型缺陷。

链接: https://arxiv.org/abs/2510.09121
作者: Dominik Winter,Mai Bui,Monica Azqueta Gavaldon,Nicolas Triltsch,Marco Rosati,Nicolas Brieu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scarcity of annotated data, particularly for rare or atypical morphologies, present significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasests with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on columnar cells. This strategy systematically enriches data sets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.
zh

[CV-46] SOS: Synthetic Object Segments Improve Detection Segmentation and Grounding

【速读】：该论文旨在解决视觉分组任务（如实例分割、视觉定位和目标检测）中真实数据集存在的标注成本高、覆盖偏差大及难以扩展的问题，同时克服合成数据在灵活性、准确性与组合多样性方面的局限性。其解决方案的关键在于提出一种基于对象中心组合策略的可扩展数据合成流水线SOS（Synthetic Object Segmentation），通过结构化布局先验和生成式光照技术将高质量合成对象片段粘贴至新图像中，从而生成精确且多样化的掩码、边界框和指代表达；实验表明，仅用10万张合成图像训练的模型即可在LVIS检测和gRefCOCO定位任务上超越使用2000万张真实图像的GRIT和20万张的V3Det数据集，且在低数据和封闭词汇场景下显著提升泛化能力，体现了该方法在可控数据构建与针对性增强复杂类别内指代识别方面的优势。

链接: https://arxiv.org/abs/2510.09110
作者: Weikai Huang,Jieyu Zhang,Taoyang Jia,Chenhao Zheng,Ziqi Gao,Jae Sung Park,Ranjay Krishna
机构: University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Visual grouping – operationalized via instance segmentation, visual grounding, and object detection – underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 N_\textAcc on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 AP_\textrare on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding. Comments: Project website: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.09110 [cs.CV] (or arXiv:2510.09110v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.09110 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-47] A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans

【速读】：该论文旨在解决医学影像智能分析中对细微病理特征识别的挑战，尤其聚焦于通过CT图像实现新冠肺炎（COVID-19）的精准诊断。其解决方案的关键在于提出了一种新颖的多分支ConvNeXt架构，该架构通过三个并行分支提取特征：全局平均池化（Global Average Pooling）、全局最大池化（Global Max Pooling）以及一种新型注意力加权池化机制（Attention-weighted Pooling），结合严格的端到端数据预处理与增强流程及两阶段迁移学习训练策略，显著提升了模型在复杂医学图像中的判别能力。实验表明，该方法在验证集上达到ROC-AUC 0.9937、准确率0.9757和F1分数0.9825，优于现有同类模型，验证了先进深度学习技术在医学诊断中的有效性。

链接: https://arxiv.org/abs/2510.09107
作者: Irash Perera(1),Uthayasanker Thayasivam(1) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
zh

[CV-48] Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation ICCV2025

【速读】：该论文旨在解决扩散 Transformer（Diffusion Transformer, DiT）在文本到图像生成任务中因参数量庞大而导致的推理开销过高的问题。现有参数压缩方法多依赖于剪枝（pruning），但过度剪枝常因模型容量下降而造成性能显著退化。其解决方案的关键在于将密集型 DiT 模型转化为专家混合模型（Mixture of Experts, MoE），实现结构化的稀疏化：首先用 MoE 层替换 DiT 块中的前馈网络（Feed-Forward Networks, FFNs），使 FFN 中激活参数减少 62.5%；进一步提出**模块混合（Mixture of Blocks, MoB）**机制，选择性激活 DiT 块以增强稀疏性。为保障从密集模型到 MoE 结构的有效迁移，设计了多阶段蒸馏流程，包括基于泰勒度量的专家初始化、带负载均衡的知识蒸馏以及用于 MoB 优化的分组特征损失。实验表明，该方法在保持原模型性能的同时，使激活参数减少 60%，优于传统剪枝方法，确立了高效文本到图像生成的新范式。

链接: https://arxiv.org/abs/2510.09094
作者: Youwei Zheng,Yuxi Ren,Xin Xia,Xuefeng Xiao,Xiaohua Xie
机构: Sun Yat-sen University (中山大学); ByteDance Intelligent Creation (字节跳动智能创作); ByteDance Seed Vision (字节跳动种子视觉); Guangdong Province Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室); Pazhou Lab (Huangpu) (琶洲实验室（黄埔）)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
zh

[CV-49] GL-DT: Multi-UAV Detection and Tracking with Global-Local Integration

【速读】：该论文旨在解决无人机（Unmanned Aerial Vehicles, UAVs）在军事侦察、环境监测等场景中多目标跟踪（Multi-Object Tracking, MOT）面临的挑战，包括复杂背景干扰、小尺度目标检测困难以及频繁遮挡和目标交互导致的轨迹不连续与ID切换问题。解决方案的关键在于提出全局-局部检测与跟踪（Global-Local Detection and Tracking, GL-DT）框架，其核心创新为：一是引入时空特征融合（Spatio-Temporal Feature Fusion, STFF）模块，联合建模运动与外观特征以提升小目标检测性能；二是设计JPTrack跟踪算法，通过全局-局部协同检测策略和轨迹优化机制，有效减少ID切换和轨迹断裂，从而显著增强MOT的连续性与稳定性，同时保持实时性。

链接: https://arxiv.org/abs/2510.09092
作者: Juanqin Liu,Leonardo Plotegher,Eloy Roura,Shaoming He
机构: Beijing Institute of Technology (北京理工大学); Technology Innovation Institute (技术创新研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The extensive application of unmanned aerial vehicles (UAVs) in military reconnaissance, environmental monitoring, and related domains has created an urgent need for accurate and efficient multi-object tracking (MOT) technologies, which are also essential for UAV situational awareness. However, complex backgrounds, small-scale targets, and frequent occlusions and interactions continue to challenge existing methods in terms of detection accuracy and trajectory continuity. To address these issues, this paper proposes the Global-Local Detection and Tracking (GL-DT) framework. It employs a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy, effectively enhancing small-target detection. Building upon this, the JPTrack tracking algorithm is introduced to mitigate common issues such as ID switches and trajectory fragmentation. Experimental results demonstrate that the proposed approach significantly improves the continuity and stability of MOT while maintaining real-time performance, providing strong support for the advancement of UAV detection and tracking technologies.
zh

[CV-50] MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling

【速读】：该论文旨在解决点云法向量估计中对细粒度几何结构建模不足的问题，从而提升预测法向量的精度。现有方法在捕捉局部几何细节方面表现有限，而基于状态空间模型（State Space Models, SSMs）的最新进展，特别是Mamba模型，在长程依赖建模上展现出线性复杂度下的强大能力，但其在点云处理中的应用仍集中于全局形状理解，忽视了局部精细几何特征的建模。为此，论文提出两个关键创新：一是引入注意力驱动的分层特征融合（Attention-driven Hierarchical Feature Fusion, AHFF）机制，自适应融合多尺度点云块特征，显著增强局部邻域内的几何上下文学习；二是设计逐块状态空间模型（Patch-wise State Space Model, PSSM），将点云块建模为隐式超曲面（hyper-surface），通过状态动力学实现对局部几何结构的精细理解，从而提升法向量预测性能。

链接: https://arxiv.org/abs/2510.09088
作者: Weijia Wang,Yuanzhi Su,Pei-Gen Ye,Yuan-Gen Wang,Xuequan Lu
机构: Guangzhou University (广州大学); Hong Kong Polytechnic University (香港理工大学); Beijing Institute of Technology (北京理工大学); The University of Western Australia (西澳大利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 12 figures

点击查看摘要

Abstract:We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.
zh

[CV-51] Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array IROS2025

【速读】：该论文旨在解决柔性微电极（Flexible Microelectrode, FME）植入大脑皮层过程中因探针纤维状结构易变形及与生物组织相互作用导致的可靠性与安全性问题，尤其是在植入过程需实时监测以避免异常情况的发生。解决方案的关键在于构建一个基于机器人系统显微摄像机图像的异常检测框架，该框架在四个关键节点（微针状态、FME探针状态、钩挂结果和植入点）进行统一检测；通过引入一种渐进式粒度补丁特征采样方法，在不同位置上平衡敏感性与容错性，并从预训练视觉Transformer（Vision Transformer, ViT）的原始特征中筛选信噪比更高的特征通道，以增强特定场景下的描述能力，从而提升异常检测的准确性与鲁棒性。

链接: https://arxiv.org/abs/2510.09071
作者: Yitong Chen,Xinyao Xu,Ping Zhu,Xinyong Han,Fangbo Qin,Shan Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accept by IROS 2025

点击查看摘要

Abstract:Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
zh

[CV-52] MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

【速读】：该论文旨在解决视频或文本查询条件下的声音分离问题，即从混合音频中精确分离出与特定视觉场景或文本描述相关的声源。其解决方案的关键在于利用预训练的视频到音频生成模型（video-to-audio model）作为基础，通过微调实现声音分离功能，从而避免从零开始训练，显著提升训练效率。该方法不仅在分离性能上优于现有确定性和生成式基线模型，还保留了原始视频到音频生成能力，表明基础声音生成模型可有效迁移至多种音频相关下游任务。

链接: https://arxiv.org/abs/2510.09065
作者: Akira Takahashi,Shusuke Takahashi,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 4 pages, 4 figures, 2 tables

点击查看摘要

Abstract:We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at this https URL.
zh

[CV-53] OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching

【速读】：该论文旨在解决流模型（flow-based model）在文本到图像生成中因确定性轨迹导致的多样性不足问题，即用户需反复采样才能探索不同模式，效率低下。解决方案的关键在于提出一种无需训练、仅在推理阶段生效的控制机制：通过特征空间目标鼓励轨迹间的横向扩散，并引入随时间调度的随机扰动来重置不确定性；该扰动被约束为与生成流正交，从而在不破坏图像细节和提示一致性的情况下提升多样性。此方法理论上单调增加体积代理指标且近似保持边缘分布，保障了生成质量的稳定性。

链接: https://arxiv.org/abs/2510.09060
作者: Jingxuan Wu,Zhenglin Wan,Xingrui Yu,Yuzhe Yang,Bo An,Ivor Tsang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
zh

[CV-54] Lesion-Aware Post-Training of Latent Diffusion Models for Synthesizing Diffusion MRI from CT Perfusion MICCAI2025

【速读】：该论文旨在解决潜在扩散模型（Latent Diffusion Models, LDMs）在医学图像到图像翻译任务中因压缩潜在空间学习而导致的像素级细节丢失问题，尤其在生成临床关键结构（如病变区域）时精度不足，从而影响诊断可靠性与临床决策。解决方案的关键在于提出一种新颖的后训练框架，通过引入病变感知的医学像素空间目标（lesion-aware medical pixel space objectives），在不改变预训练模型结构的前提下增强病变区域的重建精度，同时提升整体图像质量。该方法在急性缺血性卒中患者的脑部CT到MRI图像转换任务中验证有效，显著优于现有图像翻译模型，且具备良好的可迁移性与泛化潜力。

链接: https://arxiv.org/abs/2510.09056
作者: Junhyeok Lee,Hyunwoong Kim,Hyungjin Chung,Heeseong Eom,Joon Jang,Chul-Ho Sohn,Kyu Sung Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025, Lecture Notes in Computer Science Vol. 15961

点击查看摘要

Abstract:Image-to-Image translation models can help mitigate various challenges inherent to medical image acquisition. Latent diffusion models (LDMs) leverage efficient learning in compressed latent space and constitute the core of state-of-the-art generative image models. However, this efficiency comes with a trade-off, potentially compromising crucial pixel-level detail essential for high-fidelity medical images. This limitation becomes particularly critical when generating clinically significant structures, such as lesions, which often occupy only a small portion of the image. Failure to accurately reconstruct these regions can severely impact diagnostic reliability and clinical decision-making. To overcome this limitation, we propose a novel post-training framework for LDMs in medical image-to-image translation by incorporating lesion-aware medical pixel space objectives. This approach is essential, as it not only enhances overall image quality but also improves the precision of lesion delineation. We evaluate our framework on brain CT-to-MRI translation in acute ischemic stroke patients, where early and accurate diagnosis is critical for optimal treatment selection and improved patient outcomes. While diffusion MRI is the gold standard for stroke diagnosis, its clinical utility is often constrained by high costs and low accessibility. Using a dataset of 817 patients, we demonstrate that our framework improves overall image quality and enhances lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models. Furthermore, our post-training strategy is easily adaptable to pre-trained LDMs and exhibits substantial potential for broader applications across diverse medical image translation tasks.
zh

[CV-55] Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels

【速读】：该论文旨在解决在存在标签噪声的情况下，如何实现激光雷达（LiDAR）点云数据的3D语义分割领域泛化（Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels, DGLSS-NL）问题。现有方法在图像领域的噪声标签学习策略难以直接迁移至稀疏且结构不规则的LiDAR点云数据中，导致性能下降。其解决方案的关键在于提出DuNe框架——一种双视角架构，包含强分支与弱分支以强制特征层面的一致性，并基于置信度感知的预测过滤机制应用交叉熵损失函数，从而有效缓解噪声标签对模型性能的负面影响，在多个基准数据集上实现了显著优于现有方法的领域泛化能力。

链接: https://arxiv.org/abs/2510.09035
作者: Weitong Kong,Zichao Zeng,Di Wen,Jiale Wei,Kunyu Peng,June Moh Goo,Jan Boehm,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
zh

[CV-56] owards Better Faster Autoregressive Image Generation: From the Perspective of Entropy KR

【速读】：该论文旨在解决当前自回归（Autoregressive, AR）图像生成模型中采样效率与生成质量之间的权衡问题。研究表明，图像标记（image tokens）相较于文本标记具有更低的信息密度和非均匀的空间分布特性，导致传统采样策略在生成过程中难以兼顾多样性、对齐精度与结构连贯性。解决方案的关键在于提出一种基于熵的解码策略：一是通过空间熵引导的动态温度控制机制，在不增加额外计算开销的前提下优化生成内容的多样性、准确性与结构一致性；二是引入熵感知的接受规则用于推测解码（speculative decoding），在仅需传统加速方法约85%推理成本的情况下实现近乎无损的生成效果，显著提升生成速度并保持高质量输出。

链接: https://arxiv.org/abs/2510.09012
作者: Xiaoxiao Ma,Feng Zhao,Pengyang Ling,Haibo Qiu,Zhixiang Wei,Hu Yu,Jie Huang,Zhixiong Zeng,Lin Ma
机构: University of Science and Technology of China (中国科学技术大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
zh

[CV-57] Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

【速读】：该论文旨在解决自回归文本到图像生成模型（autoregressive text-to-image models）在推理阶段因逐 token 解码而导致的效率低下问题，此类模型通常需要数千次前向传播才能生成一张图像。解决方案的关键在于提出一种名为“推测式雅可比去噪解码”（Speculative Jacobi-Denoising Decoding, SJD2）的新框架，其核心创新是将去噪过程嵌入雅可比迭代中，实现自回归模型中的并行 token 生成；具体而言，通过引入“下一干净 token 预测”范式，使预训练模型能够接受含噪声的 token 嵌入并预测下一个干净 token，从而引导模型走向更稳定的雅可比轨迹，并在推理时利用概率准则并行验证和接受多个 token，显著减少模型前向传递次数，同时保持图像质量。

链接: https://arxiv.org/abs/2510.08994
作者: Yao Teng,Fuyun Wang,Xian Liu,Zhekai Chen,Han Shi,Yu Wang,Zhenguo Li,Weiyang Liu,Difan Zou,Xihui Liu
机构: The University of Hong Kong (香港大学); CUHK (香港中文大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
zh

[CV-58] Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

【速读】：该论文旨在解决生成式 AI (Generative AI) 在图像着色任务中可能引发的版权侵权问题，特别是未经授权对黑白漫画和电影进行着色并转售的风险。解决方案的关键在于提出了一种名为“不可着色样本”（Uncolorable Examples）的防御范式，其核心是通过嵌入人眼难以察觉的扰动（perturbations）来破坏未经授权的着色模型输出质量，同时保持原图视觉感知一致性。该方法名为感知感知色彩限制扰动（Perception-Aware Chroma-Restrictive Perturbation, PAChroma），其关键技术包括利用拉普拉斯滤波器优化扰动以保障感知质量，并在优化过程中引入多样化的输入变换以增强跨模型迁移性和对常见后处理（如压缩）的鲁棒性，从而满足有效性、不可感知性、迁移性和鲁棒性四项实际应用标准。

链接: https://arxiv.org/abs/2510.08979
作者: Yuki Nii,Futa Waseda,Ching-Chun Chang,Isao Echizen
机构: The University of Tokyo (东京大学); National Institute of Informatics (日本信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement – for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
zh

[CV-59] HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images

【速读】：该论文旨在解决当前文本到图像（Text-to-Image, T2I）生成模型在复杂局部区域（尤其是人体手部）细节生成上的准确性不足问题，这一缺陷常表现为结构扭曲和纹理不真实，严重影响整体图像质量与下游任务性能。其解决方案的关键在于提出首个针对生成手部区域的质量评估任务，并构建HandPair数据集与HandEval评估模型：HandPair通过成对的高质量与低质量手部图像实现低成本、高效的弱监督训练；HandEval则融合多模态大语言模型（Multimodal Large Language Model, MLLM）的强大视觉理解能力与手部关键点先验知识，显著提升对手部质量的感知精度，且经人工标注测试集验证优于现有SOTA方法，在图像生成优化与AIGC检测等下游任务中展现出通用有效性。

链接: https://arxiv.org/abs/2510.08978
作者: Zichuan Wang,Bo Peng,Songlin Yang,Zhenchen Tang,Jing Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.
zh

[CV-60] Hierarchical Scheduling for Multi-Vector Image Retrieval

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Model, MLLM）应用中基于用户特定数据的图像检索精度不足的问题。现有方法如多向量检索（Multi-Vector Retrieval, MVR）虽通过分解查询和匹配分割图像提升了准确性，但仍存在对查询与图像对象间对齐不充分以及细粒度图像片段冗余的问题，导致准确率和效率均未达到最优。解决方案的关键在于提出一种高效的调度框架HiMIR：首先引入分层范式（hierarchical paradigm），利用多种中间粒度适配不同图像对象以增强查询与图像内容的对齐；其次通过跨层级相似性一致性与层级稀疏性约束最小化冗余匹配计算，从而显著降低不必要的计算开销。实证结果表明，HiMIR在提升检索精度的同时，相较现有MVR系统可减少高达3.5倍的计算量。

链接: https://arxiv.org/abs/2510.08976
作者: Maoliang Li,Ke Li,Yaoyang Liu,Jiayu Chen,Zihao Zheng,Yinjun Wu,Xiang Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注: Under Review

点击查看摘要

Abstract:To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to minimize unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that, HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
zh

[CV-61] mmJoints: Expanding Joint Representations Beyond (xyz) in mmWave-Based 3D Pose Estimation

【速读】：该论文旨在解决毫米波（mmWave）人体姿态估计中因信号稀疏和弱反射导致模型依赖统计先验而非真实传感器数据的问题，从而影响手势与活动识别等下游任务的性能。其解决方案的关键在于提出 mmJoints 框架，通过增强预训练黑盒 mmWave 3D 姿态估计算法的输出，显式引入关节感知概率和位置可靠性两个联合描述符（joint descriptors），使模型对预测结果的可信度具有可解释性，并显著提升下游任务精度。

链接: https://arxiv.org/abs/2510.08970
作者: Zhenyu Wang,Mahathir Monjur,Shahriar Nirjon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In mmWave-based pose estimation, sparse signals and weak reflections often cause models to infer body joints from statistical priors rather than sensor data. While prior knowledge helps in learning meaningful representations, over-reliance on it degrades performance in downstream tasks like gesture and activity recognition. In this paper, we introduce mmJoints, a framework that augments a pre-trained, black-box mmWave-based 3D pose estimator’s output with additional joint descriptors. Rather than mitigating bias, mmJoints makes it explicit by estimating the likelihood of a joint being sensed and the reliability of its predicted location. These descriptors enhance interpretability and improve downstream task accuracy. Through extensive evaluations using over 115,000 signal frames across 13 pose estimation settings, we show that mmJoints estimates descriptors with an error rate below 4.2%. mmJoints also improves joint position accuracy by up to 12.5% and boosts activity recognition by up to 16% over state-of-the-art methods.
zh

[CV-62] Denoised Diffusion for Object-Focused Image Augmentation

【速读】：该论文旨在解决农业场景中动物健康监测因数据稀缺而导致的模型性能受限问题，尤其是针对无人机航拍图像中小目标、遮挡或部分可见动物带来的检测挑战。其解决方案的关键在于提出一种面向对象的数据增强框架，通过分割动物主体并结合变换与扩散生成技术，合成真实且多样化的场景数据，从而提升动物检测模型在有限数据条件下的泛化能力与实用性。

链接: https://arxiv.org/abs/2510.08955
作者: Nisha Pillai,Aditi Virupakshaiah,Harrison W. Smith,Amanda J. Ashworth,Prasanna Gowda,Phillip R. Owens,Adam R. Rivers,Bindu Nanduri,Mahalingam Ramkumar
机构: Mississippi State University (密西西比州立大学); United States Department of Agriculture (USDA) - Agricultural Research Service (ARS) (美国农业部-农业研究服务局)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
zh

[CV-63] Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning

【速读】：该论文旨在解决传统证据深度学习（Evidence Deep Learning, EDL）方法在动态数据分布下因采用静态超参数导致不确定性校准能力不足、泛化性能差的问题，尤其在高风险决策任务中表现不佳。其解决方案的关键在于提出一种名为元策略控制器（Meta-Policy Controller, MPC）的动态元学习框架，通过双层优化机制实现对KL散度系数和类别特定狄利克雷（Dirichlet）先验强度的自适应调整：内层循环基于当前训练状态动态配置损失函数以更新模型参数，外层循环则利用策略网络根据多目标奖励（平衡预测准确率与不确定性质量）优化超参数，从而实现更灵活且可靠的不确定性建模。

链接: https://arxiv.org/abs/2510.08938
作者: Zhen Yang,Yansong Ma,Lei Chen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional Evidence Deep Learning (EDL) methods rely on static hyperparameter for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
zh

[CV-64] RO-Bench: Large-scale robustness evaluation of MLLM s with text-driven counterfactual videos

【速读】：该论文旨在解决多模态大语言模型（Multi-modal Large Language Models, MLLMs）在面对篡改视频内容时鲁棒性不足的问题，即当前模型在动态分布外（out-of-distribution, OOD）的反事实视频测试集上性能显著下降。其解决方案的关键在于构建首个针对MLLMs的反事实视频评测基准Ro-Bench，该基准通过编辑风格（Style）、物体（Object）、背景（Background）及其组合生成高质量、多样且具有时间相关性的视频数据，并验证了利用反事实数据微调可有效提升模型鲁棒性——实验表明，微调后模型在Ro-Bench上性能提升21.73%，并在MVBench的20个任务中平均提升12.78%。

链接: https://arxiv.org/abs/2510.08936
作者: Zixi Yang,Jiapeng Li,Muxi Diao,Yinuo Jing,Kongming Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
zh

[CV-65] Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation

【速读】：该论文旨在解决生成式 AI（Generative AI）在图像恢复任务中面临的知识蒸馏（Knowledge Distillation, KD）攻击问题，即 adversaries 可通过教师模型（teacher model）的输出训练学生网络（student network）来窃取其知识产权。与图像分类任务不同，图像恢复任务具有连续、高维且依赖空间一致性的输出特性，使得传统针对分类任务的扰动方法难以奏效。解决方案的关键在于提出自适应奇异值扰动（Adaptive Singular Value Perturbation, ASVP），该方法在教师模型运行时对其内部特征图进行奇异值分解（SVD），并放大前 k 个最大奇异值以注入结构化的高频扰动，从而破坏学生网络学习教师输出的能力，同时几乎不影响教师模型本身的性能。

链接: https://arxiv.org/abs/2510.08925
作者: Han Hu,Zhuoran Zheng,Chen Lyu
机构: Shandong Normal University (山东师范大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model’s outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying this http URL address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the topk singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher’s output this http URL evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher’s performance. Compared to prior methods, ASVP offers a stronger and more consistent this http URL approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
zh

[CV-66] SegTrans: Transferable Adversarial Examples for Segmentation Models

【速读】：该论文旨在解决分割模型在白盒设置下对对抗样本的脆弱性问题，尤其是现有迁移攻击方法在不同分割模型间转移成功率低的问题。其核心挑战在于分割模型内部复杂的上下文依赖关系以及源模型与目标模型之间的特征分布差异。解决方案的关键在于提出SegTrans框架，该框架将输入样本划分为多个局部区域，并通过重映射语义信息生成多样化的增强样本用于扰动优化，从而提升对抗样本的迁移能力；与现有方法不同的是，SegTrans仅保留原始输入的局部语义信息进行扰动优化，而非依赖全局语义信息，这一设计显著提升了攻击成功率且不增加额外计算开销。

链接: https://arxiv.org/abs/2510.08922
作者: Yufei Song,Ziqi Zhou,Qi Lu,Hangtao Zhang,Yifan Hu,Lulu Xue,Shengshan Hu,Minghui Li,Leo Yu Zhang
机构: Huazhong University of Science and Technology (华中科技大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMM 2025

点击查看摘要

Abstract:Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
zh

[CV-67] PHyCLIP: ell_1-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

【速读】：该论文旨在解决视觉语言模型（Vision-Language Models）在多模态表征学习中难以同时表达两类语义结构的问题：一是概念家族内部的层次关系（如“狗 ≼ 哺乳动物 ≼ 动物”），二是不同概念家族之间的组合性关系（如“车中的狗 ≼ 狗, 车”）。现有方法虽尝试利用双曲空间（Hyperbolic Space）高效建模树状层次结构，但其对组合性的表达能力尚不明确。解决方案的关键在于提出PHyCLIP框架，该框架采用超球面因子的笛卡尔积上的ℓ₁-Product度量，使得每个超球面因子内自然涌现家族内的层次结构，而跨家族的组合关系则通过ℓ₁-Product度量捕捉，其形式类似于布尔代数（Boolean algebra），从而实现对两种语义结构的统一建模与可解释嵌入表示。

链接: https://arxiv.org/abs/2510.08919
作者: Daiki Yoshikawa,Takashi Matsubara
机构: Hokkaido University (北海道大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages

点击查看摘要

Abstract:Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog \preceq mammal \preceq animal) and the compositionality across different concept families (e.g., “a dog in a car” \preceq dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an \ell_1 -Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the \ell_1 -product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
zh

[CV-68] Modeling Time-Lapse Trajectories to Characterize Cranberry Growth ICCV

【速读】：该论文旨在解决蔓越莓种植中生长监测依赖人工操作、效率低下的问题，同时克服基于深度学习的方法因高维特征难以解释及需大量手工标注而导致的局限性。其解决方案的关键在于提出一种基于自监督预训练策略微调视觉Transformer（Vision Transformer, ViT）的方法，通过设计双重预训练任务（时间回归与类别预测）来学习植物和果实外观随时间演化的潜在空间，从而生成可解释的二维时间序列轨迹，用于预测作物生长趋势并区分不同蔓越莓品种的生长差异。

链接: https://arxiv.org/abs/2510.08901
作者: Ronan John,Anis Chihoub,Ryan Meegan,Gina Sidelli,Jeffery Neyhart,Peter Oudemans,Kristin Dana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV Workshops 2025

点击查看摘要

Abstract:Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github. com/ronan-39/tlt/).
zh

[CV-69] Sparse components distinguish visual pathways their alignment to neural networks

【速读】：该论文试图解决的问题是：尽管人类高级视觉皮层的腹侧流（ventral stream）、背侧流（dorsal stream）和外侧流（lateral stream）在功能上各自特异，但仅通过单一任务训练的深度神经网络（DNNs）却能较好地模拟整个视觉系统，这暗示了不同通路之间可能存在共通的计算原理。为解析这一看似矛盾的现象，研究提出了一种新的稀疏分解方法，用于识别各视觉通路中主导的表征成分；进一步引入稀疏成分对齐（Sparse Component Alignment, SCA）作为衡量大脑与机器视觉系统间表征对齐的新指标，该方法能够更敏感地捕捉到两类系统在潜在神经调谐轴上的差异。关键在于，SCA通过提取可解释的神经成分并比较其响应特性，显著提升了传统群体几何方法在区分不同视觉通路表征结构方面的分辨率，从而揭示出标准视觉DNN主要与腹侧流表征更为一致。

链接: https://arxiv.org/abs/2510.08858
作者: Ammar I Marvi,Nancy G Kanwisher,Meenakshi Khosla
机构: MIT(麻省理工学院); UCSD(加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams – identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system’s underlying axes of neural tuning.
zh

[CV-70] FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

【速读】：该论文旨在解决开放词汇表3D实例分割（Open-vocabulary 3D instance segmentation）中因依赖2D RGB-D图像映射导致的噪声干扰和推理效率低下的问题。现有方法通常将3D实例投影到2D图像空间，并利用视觉语言模型（Vision-Language Models, VLMs）进行分类，但这一策略易受遮挡影响且计算与内存开销大，显著拖慢推理速度。其解决方案的关键在于提出一种基于标签引导的知识蒸馏方法（Label-guided Knowledge Distillation, FOLK），设计一个教师模型从多视角、可见性多样化的2D CLIP嵌入中提取高质量实例表示，并将其开放词汇知识蒸馏至一个直接处理3D点云的轻量级学生模型中，从而在推理阶段无需图像映射即可实现高效准确的3D实例分类。

链接: https://arxiv.org/abs/2510.08849
作者: Hongrui Wu,Zhicheng Gao,Jin Cao,Kelu Yao,Wen Shen,Zhihua Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
zh

[CV-71] he Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective NEURIPS2025

【速读】：该论文旨在解决医疗影像中时间至事件（Time-to-Event, TTE）预测模型的公平性问题，即现有机器学习方法在医学图像诊断任务中已受到较多关注，但在疾病预后预测场景下对不同社会群体可能存在的偏倚缺乏系统评估与干预。其解决方案的关键在于提出首个全面的公平性评估框架 FairTTE，该框架整合多种成像模态和 TTE 预测任务，结合先进的 TTE 建模与公平性算法，并利用因果分析技术识别和量化数据集中不同来源的偏倚，从而实现对医疗影像预后模型公平性的细粒度分析与改进。

链接: https://arxiv.org/abs/2510.08840
作者: Thai-Hoang Pham,Jiayuan Chen,Seungyeon Lee,Yuanlong Wang,Sayoko Moroi,Xueru Zhang,Ping Zhang
机构: The Ohio State University (俄亥俄州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
zh

[CV-72] Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction

【速读】：该论文旨在解决边缘计算环境下实时多视角三维重建（real-time multi-view 3D reconstruction）的可靠性问题，特别是在资源受限且易受干扰的场景中，如火灾救援等关键任务应用。其核心挑战在于动态变化的边缘资源可用性（如图像质量下降、网络链路不稳定和服务器负载波动）会显著影响重建流程的稳定性和质量。解决方案的关键在于提出一种基于强化学习（reinforcement learning, RL）的边缘资源管理框架，采用两个协同工作的Q-learning代理（agent），分别负责摄像头选择与服务器选择，二者均在线运行并通过与边缘环境的交互自主学习最优策略，从而在端到端延迟与重建质量之间实现有效权衡，保障系统在复杂扰动条件下的高可靠性。

链接: https://arxiv.org/abs/2510.08839
作者: Motahare Mounesan,Sourya Saha,Houchao Gan,Md. Nurul Absur,Saptarshi Debroy
机构: City University of New York (纽约市立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Graphics (cs.GR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
zh

[CV-73] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition EMNLP2025

【速读】：该论文旨在解决将基于图像预训练的视觉语言模型（Vision-Language Models, VLMs）适配到视频语言任务时面临的两大核心挑战：感知瓶颈（perception bottleneck）和标记过载（token overload）。感知瓶颈源于视频输入具有密集且时序延展的特性，超出了图像模型的处理能力；而标记过载则由于视频帧数量庞大导致模型在处理过程中产生过多空间标记（spatial tokens），影响效率与精度。解决方案的关键在于提出一种无需训练的适配框架 D-CoDe，其核心机制包括动态压缩（dynamic compression）和问题分解（question decomposition）：前者通过自适应选择代表性帧并进行内容感知的空间标记聚合，减少冗余同时保留关键信息；后者将原始复杂问题拆解为子问题，引导模型聚焦于视频的不同方面，从而实现更全面的理解。实验表明，D-CoDe 在多个视频理解基准上显著提升性能，尤其在长视频任务中表现突出，验证了其在复杂视频-语言任务中的潜力。

链接: https://arxiv.org/abs/2510.08818
作者: Yiyang Huang,Yizhou Wang,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to EMNLP 2025

点击查看摘要

Abstract:Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at this https URL.
zh

[CV-74] SkipSR: Faster Super Resolution with Token Skipping

【速读】：该论文旨在解决基于扩散模型的视频超分辨率（Diffusion-based Video Super-Resolution, Diffusion-based SR）计算效率低、成本高的问题，尤其是在高分辨率和长视频场景下难以扩展。其关键解决方案是提出SkipSR框架，通过直接从低分辨率输入中识别出本质上细节较少的区域，并完全跳过这些区域的计算，仅对需要精细化处理的部分进行超分辨率重建。这一策略在保持感知质量的同时显著降低计算开销，在标准SR基准测试中实现了比现有方法最多60%的端到端延迟减少，且无明显质量损失。

链接: https://arxiv.org/abs/2510.08799
作者: Rohan Choudhury,Shanchuan Lin,Jianyi Wang,Hao Chen,Qi Zhao,Feng Cheng,Lu Jiang,Kris Kitani,Laszlo A. Jeni
机构: Carnegie Mellon University (卡内基梅隆大学); ByteDance Seed (字节跳动种子团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at this https URL
zh

[CV-75] Alignment Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering CVPR2025

【速读】：该论文针对医学视觉问答（Med-VQA）任务中模态对齐不统一、难负样本（hard negatives）未充分挖掘以及现有知识融合方法可能引入无关信息的问题展开研究。解决方案的关键在于：（1）提出一种跨多层级、多模态、多视角和多阶段的统一模态对齐框架，结合对比学习与最优传输理论实现更精准的跨模态映射；（2）设计基于软标签的难负样本挖掘方法，强化多模态对齐中的难例判别能力；（3）引入门控交叉注意力模块（Gated Cross-Attention Module），利用答案词汇表作为先验知识，筛选出与问题相关的语义信息，从而提升模型的推理准确性与鲁棒性。

链接: https://arxiv.org/abs/2510.08791
作者: Yuanhao Zou,Zhaozheng Yin
机构: University of Michigan, Ann Arbor (密歇根大学安娜堡分校); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025 Paper

点击查看摘要

Abstract:Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.
zh

[CV-76] Q-Router: Agent ic Video Quality Assessment with Expert Model Routing and Artifact Localization

【速读】：该论文旨在解决现有视频质量评估（Video Quality Assessment, VQA）模型在跨内容类型和任务场景下泛化能力差、可解释性弱以及难以扩展至新型应用场景的问题。其关键解决方案是提出Q-Router框架，该框架采用多层级代理式模型路由机制，通过视觉-语言模型（Vision-Language Models, VLMs）作为实时路由器，动态推理并集成最适合输入视频语义的专家模型；同时构建基于计算预算的多级路由系统，其中最重的层级引入时空伪影定位以增强可解释性，从而实现对异构视频源与任务的灵活适配与鲁棒性能。

链接: https://arxiv.org/abs/2510.08789
作者: Shuo Xing,Soumik Dey,Mingyang Wu,Ashirbad Mishra,Hansi Wu,Binbin Li,Zhengzhong Tu
机构: Texas A&M University (德州农工大学); eBay Inc. (eBay公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision–language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
zh

[CV-77] Re-Identifying Kākā with AI-Automated Video Key Frame Extraction

【速读】：该论文旨在解决野生动物个体识别中传统标记方法（如鸟类腿环）耗时且具有侵入性的问题，从而提升种群监测的效率与非侵入性。其解决方案的关键在于构建一个基于人工智能和计算机视觉的无监督关键帧提取流水线，该流程融合了YOLO与Grounding DINO进行目标检测、光学流模糊检测以筛选清晰图像、DINOv2进行图像编码，并结合聚类算法识别代表性关键帧，最终实现对新西兰濒危鹦鹉——kākā（Nestor meridionalis）个体的高精度再识别，为未来在复杂环境中利用视频媒体开展野生动物监测提供了可靠的技术基础。

链接: https://arxiv.org/abs/2510.08775
作者: Paula Maddigan,Andrew Lensen,Rachael C. Shaw
机构: Victoria University of Wellington (维多利亚大学); School of Engineering and Computer Science (工程与计算机科学学院); Centre for Data Science and Artificial Intelligence (数据科学与人工智能中心); School of Biological Sciences (生物科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification, however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.
zh

[CV-78] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

【速读】：该论文旨在解决生成式图像超分辨率（Image Super-Resolution, ISR）模型中自注意力机制因二次计算复杂度（O(N²)）导致的严重计算瓶颈问题，同时克服线性注意力（Linear Attention）在实现高保真图像重建时长期存在的训练不稳定、感知质量与失真之间的权衡难题。其解决方案的关键在于提出一个系统性的框架LinearSR：首先通过基于“膝点”（knee point）的早停引导微调策略（ESGF）稳定训练过程并避免模型发散；其次引入基于信噪比（SNR）的专家混合（Mixture of Experts, MoE）架构以缓解感知质量与重建失真之间的矛盾；最后设计轻量级的TAG引导范式，基于“精度优先于体积”（precision-over-volume）原则实现高效且高质量的生成。该方法首次实现了在线性复杂度下兼具卓越感知质量和推理效率的光真实超分辨率建模。

链接: https://arxiv.org/abs/2510.08771
作者: Xiaohui Li,Shaobin Zhuang,Shuo Cao,Yang Yang,Yuandong Pu,Qi Qin,Siqi Luo,Bin Fu,Yihao Liu
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学); The Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention’s quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental, training instability that causes catastrophic model divergence using our novel “knee point”-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our “precision-over-volume” principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
zh

[CV-79] Detecting spills using thermal imaging pretrained deep learning models and a robotic platform

【速读】：该论文旨在解决工业环境中油污或液体泄漏（spill）的实时检测问题，以提升安全性和响应效率。其解决方案的关键在于利用预训练的深度学习模型结合RGB与热成像（thermal imaging）数据，通过构建平衡的二分类数据集（4,000张图像），验证了热成像在推理速度、准确率和模型体积方面的优势。实验表明，基于热成像训练的轻量级模型（如VGG19和NasNetMobile）可在消费级硬件（RTX 4080）上实现高达100%的检测准确率，且推理时间低至44毫秒，模型尺寸小于350 MB，显著提升了系统在安全关键场景中的可部署性与鲁棒性。

链接: https://arxiv.org/abs/2510.08770
作者: Gregory Yeghiyan,Jurius Azar,Devson Butani,Chan-Jin Chung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 6 pages

点击查看摘要

Abstract:This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
zh

[CV-80] SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

【速读】：该论文旨在解决深度学习模型在实际应用中面临的对抗攻击（adversarial attacks）问题，这类攻击会显著威胁模型的安全性与可靠性。传统防御方法通常依赖于计算密集型的优化策略（如对抗训练或数据增强），而本文提出了一种基于生物视觉机制的新型防御框架，其关键在于模拟人类视觉系统的三个核心特性：中央-周边处理（foveal-peripheral processing）、扫视眼动（saccadic eye movements）和皮层填充（cortical filling-in）。通过强化学习引导的扫视动作选择性捕获多个中央-周边视觉片段，并将其重构为图像输入分类器，该方法无需重新训练或微调下游分类器即可有效抑制对抗噪声、保持语义完整性，且相比其他生物启发与非生物启发的防御方法大幅降低训练开销。

链接: https://arxiv.org/abs/2510.08761
作者: Jiayang Liu,Daniel Tso,Yiming Bu,Qinru Qiu
机构: Syracuse University (雪城大学); SUNY Upstate Medical University (纽约州立大学上州医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention guided non-homogeneous sparse sampling and predictive coding plays a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.
zh

[CV-81] BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

【速读】：该论文旨在解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在具身能力（embodied capabilities）评估与提升方面的系统性不足问题。现有基准主要聚焦于特定领域如规划或空间理解，缺乏对感知、理解与交互物理世界等原子级具身能力的全面评测。为此，作者提出BEAR基准，涵盖14个子域、6类能力的4,469个交错图像-视频-文本样本，首次实现了对MLLM具身能力的细粒度量化评估。解决方案的关键在于引入BEAR-Agent——一个集成预训练视觉模型的多模态可对话代理，通过增强MLLM的感知、3D理解和规划能力，显著提升了其在BEAR上的表现（绝对提升9.12%，相对提升17.5%），并验证了具身能力改进对仿真环境中任务性能的正向迁移效应。

链接: https://arxiv.org/abs/2510.08759
作者: Yu Qi,Haibo Zhao,Ziyu Guo,Siyuan Ma,Ziyan Chen,Yaokun Han,Renrui Zhang,Zitiantao Lin,Shiji Xin,Yijian Huang,Kai Cheng,Peiheng Wang,Jiazheng Liu,Jiayi Zhang,Yizhe Zhu,Wenqing Wang,Yiran Qin,Xupeng Zhu,Haojie Huang,Lawson L.S. Wong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: this https URL
zh

[CV-82] Structured Output Regularization: a framework for few-shot transfer learning

【速读】：该论文旨在解决传统迁移学习在小样本场景下适应性不足和过拟合的问题。传统方法通过冻结预训练模型的部分权重并添加任务特定层来实现计算效率，但难以充分捕捉领域特异性特征且在数据极度有限时仍易过拟合。解决方案的关键在于提出结构化输出正则化（Structured Output Regularization, SOR）框架：该框架冻结网络内部结构（如卷积核），同时引入组套索（group lasso）与L₁正则化相结合的策略，在保持极少量新增参数的前提下，实现对特定数据集的有效适配，并适用于多种网络组件（如卷积滤波器或神经网络中的不同模块），从而显著提升迁移学习在少样本医学图像分类任务中的性能表现。

链接: https://arxiv.org/abs/2510.08728
作者: Nicolas Ewen,Jairo Diaz-Rodriguez,Kelly Ramsay
机构: York University (约克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model’s ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and L_1 penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks enabling broad applicability for transfer learning tasks. We evaluate SOR on three few shot medical imaging classification tasks and we achieve competitive results using DenseNet121, and EfficientNetB4 bases compared to established benchmarks.
zh

[CV-83] Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation

【速读】：该论文旨在解决当前具身智能体在视觉导航中因模块化架构导致的状态-动作错位（state-action misalignment）以及在新场景或动态环境中适应能力有限的问题。其解决方案的关键在于提出UniWM——一种统一的、基于记忆增强的世界模型，将自我中心视角的未来预测（egocentric visual foresight）与规划整合到单一多模态自回归骨干网络中，从而显式地将行动决策锚定于视觉想象的结果上，实现预测与控制的紧密对齐；同时引入分层记忆机制，融合短期感知线索与长期轨迹上下文，支持长时间跨度下的稳定连贯推理。

链接: https://arxiv.org/abs/2510.08713
作者: Yifei Dong,Fengyi Wu,Guangyu Chen,Zhi-Qi Cheng,Qiyu Hu,Yuxuan Zhou,Jingdong Sun,Jun-Yan He,Qi Dai,Alexander G Hauptmann
机构: University of Washington (华盛顿大学); Carnegie Mellon University (卡内基梅隆大学); Microsoft Research (微软研究院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 11 figures, code: this https URL

点击查看摘要

Abstract:Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
zh

[CV-84] hinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

【速读】：该论文旨在解决相机中心的视觉理解与生成任务长期被孤立研究的问题，即如何统一建模多视角下的空间智能（spatial intelligence），以实现从任意视角对场景的理解与生成。其解决方案的关键在于提出一个名为 Puffin 的统一多模态模型，通过将相机参数视为“语言”（treat camera as language）的新范式，使模型能够在几何上下文中对视觉线索进行空间对齐并推理，从而实现跨视角的语义一致性。Puffin 利用包含全局相机参数和像素级相机映射的大规模数据集（Puffin-4M）进行训练，结合语言回归与基于扩散的生成机制，在相机维度上扩展了空间感知能力，显著优于专用模型，并具备良好的跨视图任务泛化能力。

链接: https://arxiv.org/abs/2510.08673
作者: Kang Liao,Size Wu,Zhonghua Wu,Linyi Jin,Chao Wang,Yikai Wang,Fei Wang,Wei Li,Chen Change Loy
机构: S-Lab, Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技); University of Michigan (密歇根大学); Max-Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
zh

[CV-85] FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching

【速读】：该论文旨在解决扩散模型（Diffusion Models）中因推理成本高而导致的应用瓶颈问题，尤其是传统特征缓存（Feature Caching）方法在不同时间步之间假设特征具有相似性或连续性的局限性。解决方案的关键在于提出一种频率感知的缓存机制（Frequency-aware Caching, FreqCa），其核心思想是：通过频域分析发现低频成分（决定图像结构）具有较高相似性但较差连续性，而高频成分（决定图像细节）则表现出显著连续性但较低相似性；因此，FreqCa对低频特征直接基于相似性进行重用，对高频特征则采用二阶赫米特插值器（second-order Hermite interpolator）利用其连续性进行预测。此外，论文进一步提出缓存累积残差特征（Cumulative Residual Feature, CRF）而非全层特征，将内存占用降低99%，显著提升了缓存效率与实用性。

链接: https://arxiv.org/abs/2510.08669
作者: Jiacheng Liu,Peiliang Cai,Qinming Zhou,Yuqi Lin,Deyang Kong,Benhao Huang,Yupei Pan,Haowen Xu,Chang Zou,Junshu Tang,Shikang Zheng,Linfeng Zhang
机构: EPIC Lab(STJU); Tencent Hunyuan; SDU; THU; JLU; CMU; UESTC; SCUT
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveal that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa) which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on its continuity. Besides, we further propose to cache Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub. Comments: 15 pages, 11 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2510.08669 [cs.LG] (or arXiv:2510.08669v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.08669 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Chang Zou [view email] [v1] Thu, 9 Oct 2025 17:22:23 UTC (20,631 KB)
zh

[CV-86] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

【速读】：该论文旨在解决临床决策中多模态医学数据（包括文本、2D/3D图像和视频）融合困难导致的效率低下与诊断遗漏问题。其解决方案的关键在于提出了一种透明的医疗视觉语言模型（Hulu-Med），该模型采用统一的基于patch的视觉编码器与大语言模型（LLM）解码器架构，并通过渐进式训练策略在1670万样本上实现从2D到3D及视频理解能力的扩展；同时引入医学感知的token压缩机制，显著降低训练成本（仅需4000至40000 GPU小时），从而在30个基准测试中达到领先性能，涵盖视觉问答、医学报告生成及多语言罕见病推理等复杂任务。

链接: https://arxiv.org/abs/2510.08668
作者: Songtao Jiang,Yuan Wang,Sibo Song,Tianxiang Hu,Chenyi Zhou,Bin Pu,Yan Zhang,Zhibo Yang,Yang Feng,Joey Tianyi Zhou,Jin Hao,Zijian Chen,Ruijia Wu,Tao Tang,Junhui Lv,Hongxia Xu,Hongwei Wang,Jun Xiao,Bin Feng,Fudong Zhu,Kenli Li,Weidi Xie,Jimeng Sun,Jian Wu,Zuozhu Liu
机构: Zhejiang University(浙江大学); Alibaba Inc(阿里巴巴公司); Hunan University(湖南大学); Angelalign Technology Inc.(安吉科技有限公司); Agency for Science, Technology and Research(新加坡科技研究局); Shanghai Jiao Tong University(上海交通大学); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); China Mobile Group Zhejiang Company Limited(中国移动浙江有限公司); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence(浙江省医学影像人工智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released on \hrefthis https URLthis https URL.
zh

[CV-87] A 3D Generation Framework from Cross Modality to Parameterized Primitive

【速读】：该论文旨在解决当前基于AI的3D模型生成中存在表面不光滑和存储开销大的问题。其解决方案的关键在于提出了一种多阶段框架，通过参数化基元（parameterized primitives）来重构模型结构：首先识别模型组成元素的形状特征，并用高质量表面的参数化基元替代原始几何体；其次设计了一种仅存储基元参数的压缩存储方法，在保持原始表面质量的同时显著降低文件体积（实验中约为6KB）。此方案在虚拟与真实场景数据集上均取得优异性能，验证了其在简单模型快速原型生成中的有效性。

链接: https://arxiv.org/abs/2510.08656
作者: Yiming Liang,Huan Yu,Zili Wang,Shuyou Zhang,Guodong Yi,Jin Wang,Jianrong Tan
机构: Zhejiang University (浙江大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in AI-driven 3D model generation have leveraged cross modality, yet generating models with smooth surfaces and minimizing storage overhead remain challenges. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. In the framework, A model generation algorithm based on parameterized primitives, is proposed, which can identifies the shape features of the model constituent elements, and replace the elements with parameterized primitives with high quality surface. In addition, a corresponding model storage method is proposed, it can ensure the original surface quality of the model, while retaining only the parameters of parameterized primitives. Experiments on virtual scene dataset and real scene dataset demonstrate the effectiveness of our method, achieving a Chamfer Distance of 0.003092, a VIoU of 0.545, a F1-Score of 0.9139 and a NC of 0.8369, with primitive parameter files approximately 6KB in size. Our approach is particularly suitable for rapid prototyping of simple models.
zh

[CV-88] PhyDAE: Physics-Guided Degradation-Adaptive Experts for All-in-One Remote Sensing Image Restoration

【速读】：该论文旨在解决遥感图像在获取过程中因大气干扰、传感器限制和成像条件等因素导致的复杂且异质性退化问题，这些问题严重影响图像质量及下游任务的解析精度。现有全一体化修复方法过度依赖隐式特征表示，缺乏对退化物理机制的显式建模，难以有效区分和处理多种退化类型（如雾霾、噪声、模糊和低光照）。解决方案的关键在于提出物理引导的退化自适应专家模型（Physics-Guided Degradation-Adaptive Experts, PhyDAE），其采用两级级联架构，将退化信息从隐式特征转化为显式的决策信号，实现对多种异质退化的精准识别与差异化处理；同时引入残差流形投影器（Residual Manifold Projector, RMP）和频域感知退化分解器（Frequency-Aware Degradation Decomposer, FADD），从流形几何与频域角度全面分析退化特性，并结合物理感知专家模块与温度控制稀疏激活策略，在保障成像物理一致性的同时显著提升计算效率。

链接: https://arxiv.org/abs/2510.08653
作者: Zhe Dong,Yuzhe Sun,Haochen Jiang,Tianzhu Liu,Yanfeng Gu
机构: 哈尔滨工业大学(Harbin Institute of Technology)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing images inevitably suffer from various degradation factors during acquisition, including atmospheric interference, sensor limitations, and imaging conditions. These complex and heterogeneous degradations pose severe challenges to image quality and downstream interpretation tasks. Addressing limitations of existing all-in-one restoration methods that overly rely on implicit feature representations and lack explicit modeling of degradation physics, this paper proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE). The method employs a two-stage cascaded architecture transforming degradation information from implicit features into explicit decision signals, enabling precise identification and differentiated processing of multiple heterogeneous degradations including haze, noise, blur, and low-light conditions. The model incorporates progressive degradation mining and exploitation mechanisms, where the Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) comprehensively analyze degradation characteristics from manifold geometry and frequency perspectives. Physics-aware expert modules and temperature-controlled sparse activation strategies are introduced to enhance computational efficiency while ensuring imaging physics consistency. Extensive experiments on three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat) demonstrate that PhyDAE achieves superior performance across all four restoration tasks, comprehensively outperforming state-of-the-art methods. Notably, PhyDAE substantially improves restoration quality while achieving significant reductions in parameter count and computational complexity, resulting in remarkable efficiency gains compared to mainstream approaches and achieving optimal balance between performance and efficiency. Code is available at this https URL.
zh

[CV-89] Generating Sizing Fields for Mesh Generation via GCN-based Simplification of Adaptive Background Grids

【速读】：该论文旨在解决非结构化网格生成中背景网格（background grid）优化难题，即如何在保持几何保真度和尺寸场精度的前提下，构建一个计算轻量且无伪影（如条带效应）的最优背景网格。其解决方案的关键在于提出一种基于图卷积网络（Graph Convolutional Network, GCN）的自适应背景网格简化（Adaptive Background Grid Simplification, ABGS）框架，将网格简化任务重构为边评分回归问题，并通过定制化的损失函数联合优化几何 fidelity 和尺寸场准确性，从而替代传统耗时的显式评估过程，显著提升简化效率与质量。

链接: https://arxiv.org/abs/2510.08645
作者: Xunyang Zhu,Hongfei Ye,Yifei Wang,Taoran Liu,Jianjun Chen
机构: Zhejiang University (浙江大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 9 figures, 2 tables

点击查看摘要

Abstract:The sizing field defined on a triangular background grid is pivotal for controlling the quality and efficiency of unstructured mesh generation. However, creating an optimal background grid that is geometrically conforming, computationally lightweight, and free from artifacts like banding is a significant challenge. This paper introduces a novel, adaptive background grid simplification (ABGS) framework based on a Graph Convolutional Network (GCN). We reformulate the grid simplification task as an edge score regression problem and train a GCN model to efficiently predict optimal edge collapse candidates. The model is guided by a custom loss function that holistically considers both geometric fidelity and sizing field accuracy. This data-driven approach replaces a costly procedural evaluation, accelerating the simplification process. Experimental results demonstrate the effectiveness of our framework across diverse and complex engineering models. Compared to the initial dense grids, our simplified background grids achieve an element reduction of 74%-94%, leading to a 35%-88% decrease in sizing field query times.
zh

[CV-90] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

【速读】：该论文旨在解决视觉Transformer模型DINOv2所学习到的表征本质问题，即其内部如何组织和表示视觉概念，尤其是在面对不同下游任务（如分类、分割、深度估计）时的机制尚不明确。为回答这一问题，作者提出并验证了一个新的理论框架——Minkowski Representation Hypothesis (MRH)，其关键在于：DINOv2的token表示并非简单线性稀疏组合，而是由凸混合（convex mixtures）构成的原型（archetypes）组合而成，这些原型在概念空间中定义了具有几何结构的区域，且模型通过多头注意力机制自然生成此类结构。这一发现揭示了表征不仅具备功能性分工（如分类依赖“其他”概念实现否定逻辑），还具有内在几何一致性与局部连通性，从而超越传统线性表示假设（Linear Representation Hypothesis, LRH），为理解自监督视觉表征提供更精细的解释基础。

链接: https://arxiv.org/abs/2510.08638
作者: Thomas Fel,Binxu Wang,Michael A. Lepori,Matthew Kowal,Andrew Lee,Randall Balestriero,Sonia Joseph,Ekdeep S. Lubana,Talia Konkle,Demba Ba,Martin Wattenberg
机构: Kempner Institute, Harvard University (哈佛大学肯普纳研究所); Harvard University (哈佛大学); Dept. of Psychology, Harvard University (哈佛大学心理系); Brown University (布朗大学); FAR.AI; Google DeepMind (谷歌深度智 mind); Meta (Meta); Goodfire
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits “Elsewhere” concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors’ conceptual spaces and in the model’s mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2510.08638 [cs.CV] (or arXiv:2510.08638v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2510.08638 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Thomas Fel [view email] [v1] Wed, 8 Oct 2025 22:42:20 UTC (44,036 KB)
zh

[CV-91] Detection of high-frequency oscillations using time-frequency analysis

【速读】：该论文旨在解决高频率振荡（High-frequency oscillations, HFOs）在难治性癫痫患者中自动检测的难题，以提升癫痫灶定位的准确性并改善手术预后。当前HFOs的视觉识别方法存在耗时、主观性强且效率低的问题，限制了其临床应用。解决方案的关键在于提出一种基于S变换时频域特征提取与无监督聚类相结合的新方法，能够有效区分HFO事件与棘波、背景活动及伪迹，在控制数据集上实现了97.67%的敏感度、98.57%的精确度和97.78%的F-score；在真实癫痫患者数据中，切除区域HFO发生率与非切除区域之比达0.73，显著优于传统方法，验证了其对术后疗效的预测价值。

链接: https://arxiv.org/abs/2510.08637
作者: Mostafa Mohammadpour,Mehdi Zekriyapanah Gashti,Yusif S. Gasimov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:High-frequency oscillations (HFOs) are a new biomarker for identifying the epileptogenic zone. Mapping HFO-generating regions can improve the precision of resection sites in patients with refractory epilepsy. However, detecting HFOs remains challenging, and their clinical features are not yet fully defined. Visual identification of HFOs is time-consuming, labor-intensive, and subjective. As a result, developing automated methods to detect HFOs is critical for research and clinical use. In this study, we developed a novel method for detecting HFOs in the ripple and fast ripple frequency bands (80-500 Hz). We validated it using both controlled datasets and data from epilepsy patients. Our method employs an unsupervised clustering technique to categorize events extracted from the time-frequency domain using the S-transform. The proposed detector differentiates HFOs events from spikes, background activity, and artifacts. Compared to existing detectors, our method achieved a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78% on the controlled dataset. In epilepsy patients, our results showed a stronger correlation with surgical outcomes, with a ratio of 0.73 between HFOs rates in resected versus non-resected contacts. The study confirmed previous findings that HFOs are promising biomarkers of epileptogenicity in epileptic patients. Removing HFOs, especially fast ripple, leads to seizure freedom, while remaining HFOs lead to seizure recurrence.
zh

[CV-92] Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition

【速读】：该论文旨在解决人类活动识别（Human Activity Recognition, HAR）中因训练数据集无法覆盖真实场景中所有活动而导致的“未见活动”识别失效问题，以及活动中类间存在层次重叠与子活动嵌套关系导致的分类模糊性问题。其解决方案的关键在于构建一个结构化的活动层次体系，并提出Hi-OSCAR模型——一种分层开放集分类器（Hierarchical Open-set Classifier），该模型在保持当前最优已知活动识别准确率的同时，能够有效拒绝未知活动，并将未知类别定位到最近的内部节点，从而提供超越二元“已知/未知”判断的细粒度语义信息。为支持该方法及未来研究，作者还发布了新的公开数据集NFI_FARED，涵盖19种跨情境（日常、通勤、快速动作）的活动，具有广泛适用性。

链接: https://arxiv.org/abs/2510.08635
作者: Conor McCarthy,Loes Quirijnen,Jan Peter van Zandwijk,Zeno Geradts,Marcel Worring
机构: University of Amsterdam (阿姆斯特丹大学); Netherlands Forensic Institute (NFI) (荷兰司法研究所); Amsterdam University of Applied Sciences (阿姆斯特丹应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)

点击查看摘要

Abstract:Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier’s reliability. Additionally within HAR, not all classes are equally dissimilar, some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition, that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary “known/unknown” classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, which is fully public and available for download.
zh

[CV-93] Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs

【速读】：该论文旨在解决点云语义分割中对分布外（Out-of-Distribution, OOD）物体的检测问题，即如何在不依赖额外标注的OOD数据集的情况下，准确识别训练过程中未见过的物体类别，从而避免将未知对象错误归类为已知类别。现有基于预测熵（predictive entropy）的无监督方法常混淆认知不确定性（epistemic uncertainty）与随机不确定性（aleatoric uncertainty），导致将分布内（in-distribution）的模糊区域误判为OOD。论文提出的关键解决方案是：通过深度神经网络特征空间中高斯混合模型（Gaussian Mixture Model, GMM）参数的分层贝叶斯建模，提取纯认知不确定性（epistemic uncertainty），从而实现无需辅助数据或额外训练阶段的高效OOD检测。实验表明，该方法在SemanticKITTI数据集上相较传统预测熵方法显著提升性能，AUROC提高18%，AUPRC提升22%，FPR95降低36%（从76%降至40%）。

链接: https://arxiv.org/abs/2510.08631
作者: Hanieh Shojaei Miandashti,Claus Brenner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
zh

[CV-94] Dynamic Mixture-of-Experts for Visual Autoregressive Model

【速读】：该论文旨在解决视觉自回归模型（Visual Autoregressive Models, VAR）在图像生成过程中因在不同分辨率下重复调用Transformer而导致的计算冗余问题。其解决方案的关键在于引入一个动态的专家混合（Mixture-of-Experts, MoE）路由器，并结合尺度感知阈值策略，根据token复杂度和分辨率动态选择专家，从而在不增加额外训练成本的前提下，实现计算量与图像质量之间的灵活权衡——最终在保持与密集基线模型相当图像质量的同时，减少20%的浮点运算次数（FLOPs），并将推理速度提升11%。

链接: https://arxiv.org/abs/2510.08629
作者: Jort Vincenti,Metod Jazbec,Guoxuan Xia
机构: University of Amsterdam (阿姆斯特丹大学); UvA-Bosch Delta Lab (阿姆斯特丹大学-博世德尔塔实验室); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows to trade compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs, 11% faster inference and match the image quality achieved by the dense baseline.
zh

[CV-95] he Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images

【速读】：该论文旨在解决生成式 AI（Generative AI）在职业场景中生成图像时存在的表征偏见问题，特别是性别刻板印象的强化现象。现有研究多关注生成过程与图像质量，忽视了AI图像可能加剧社会偏见的风险。解决方案的关键在于通过实证比较DALL-E 3与Ideogram两款AI图像生成工具对750余张职业相关图像的输出结果，发现两者均不同程度地强化传统性别角色，进而提出针对从业者、使用者及研究人员的具体建议，以提升图像生成中的代表性与多样性，从而减少有害的性别偏见传播。

链接: https://arxiv.org/abs/2510.08628
作者: Siiri Leppälampi,Sonja M. Hyrynsalmi,Erno Vanhala
机构: LUT University (拉彭兰塔阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies around AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representation biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools are becoming more widely used, addressing and mitigating harmful gender biases becomes essential to ensure diverse representation in media and professional settings. In this study, over 750 AI-generated images of occupations were prompted. The thematic analysis results revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals and researchers to increase representation when generating images with visible genders.
zh

[CV-96] Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models

【速读】：该论文旨在解决文本到图像扩散模型中存在的记忆化问题（memorization），即模型在生成图像时会复制训练数据中的内容，从而引发隐私和版权风险。现有方法通过延迟分类器自由引导（classifier-free guidance, CFG）的应用来促使去噪轨迹逃离吸引盆地（attraction basin），但这种方法常导致与输入提示对齐较差的图像。本文的关键解决方案在于发现初始噪声样本对逃离吸引盆地的时间具有决定性影响，并提出两种基于调整初始噪声的缓解策略：一种是集体调整，另一种是逐样本调整，以寻找并利用能促进更早逃离吸引盆地的初始噪声样本，从而在保障图像-文本对齐的同时显著降低记忆化现象。

链接: https://arxiv.org/abs/2510.08625
作者: Hyeonggeun Han,Sehwan Kim,Hyungjun Joo,Sangwoo Hong,Jungwoo Lee
机构: Seoul National University (首尔国立大学); Konkuk University (中央大学); Hodoo AI Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin-a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs-and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise-either collectively or individually-to find and utilize initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.
zh

[CV-97] Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation

【速读】：该论文旨在解决脑肿瘤分割中因类别不平衡（class imbalance）和模型泛化能力有限导致的性能瓶颈问题。其解决方案的关键在于采用焦点损失（focal loss）替代传统交叉熵损失函数，并结合基础数据增强策略（包括水平翻转、旋转和缩放）对U-Net模型进行优化。实验表明，引入焦点损失可显著提升分割精度至90%，且通过公开全部代码与结果，为未来在脑肿瘤MRI分割中探索更优的数据增强方法和损失函数设计提供了可复现的基准。

链接: https://arxiv.org/abs/2510.08617
作者: Saumya B
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code and results available at this https URL

点击查看摘要

Abstract:Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
zh

[CV-98] Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLM s for Object Detection on Low-Data Regimes

【速读】：该论文旨在解决在图像中检测人工文本叠加（artificial text overlay detection）这一挑战性任务中，如何有效利用多模态大语言模型（multi-modal large language models, LLMs）实现高效且高精度的视觉理解问题。传统卷积神经网络（CNN）虽表现稳定，但通常依赖大规模标注数据；而预训练的多模态LLMs若直接使用则难以发挥其潜力。论文的关键解决方案在于证明：通过在极少量数据（少于1,000张图像）上对多模态LLMs进行微调（fine-tuning），可实现高达36%的准确率提升，性能达到甚至超越依赖海量数据的CNN基线模型。这揭示了LLMs在低资源场景下具备高度适应性和数据效率，为视觉与语言跨模态学习提供了新的高效策略。

链接: https://arxiv.org/abs/2510.08589
作者: Nirmal Elamon,Rouzbeh Davoudi
机构: Expedia Group (Expedia集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments. To support continued progress in this area, we have made the code used to fine-tune the models available in our GitHub, enabling future improvements and reuse in related applications.
zh

[CV-99] Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

【速读】：该论文旨在解决从单张RGB图像生成逼真三维人体模型的挑战，核心难点在于如何准确建模几何结构、实现高质量纹理以及合理推测不可见区域。现有方法多依赖多视角扩散模型进行3D生成，但常因视图不一致导致生成质量下降。其解决方案的关键在于提出Human-VDM框架，利用视频扩散模型（Video Diffusion Models, VDM）生成时序一致的视图，并结合高斯点绘图（Gaussian Splatting）技术实现高保真3D重建。该方法包含三个模块：视图一致的人体视频扩散模块、视频增强模块（超分辨率与插值以提升纹理和几何平滑性）、以及基于高质量视图引导的3D高斯点绘图模块，从而显著提升了单图驱动下3D人体生成的质量与一致性。

链接: https://arxiv.org/abs/2409.02851
作者: Zhibin Liu,Haoye Dong,Aviral Chharia,Hefeng Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 14 Pages, 8 figures, Project page: this https URL

点击查看摘要

Abstract:Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D human from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: this https URL
zh

[CV-100] Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba NEURIPS2024

【速读】：该论文旨在解决从单张RGB图像中进行3D手部重建的难题，其核心挑战在于手部关节的复杂运动、自遮挡以及与物体的交互导致的空间关系建模不充分。现有基于注意力机制的Transformer方法虽能学习3D手部姿态和形状，但因未能高效建模关节间的空间关系，仍难以实现鲁棒且精确的重建性能。解决方案的关键在于提出一种新型图引导的Mamba框架（Hamba），其核心创新是将Mamba的扫描机制重构为图引导的双向扫描，从而以少量有效token（比注意力方法减少88.5%）高效学习关节间的图结构关系与空间序列信息；同时设计Graph-guided State Space (GSS)模块融合局部状态空间特征与全局特征，显著提升重建精度——在FreiHAND数据集上达到5.3mm的PA-MPVPE和0.992的F@15mm指标，并在两个权威竞赛排行榜中位列第一。

链接: https://arxiv.org/abs/2407.09646
作者: Haoye Dong,Aviral Chharia,Wenbo Gou,Francisco Vicente Carrasco,Fernando De la Torre
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: NeurIPS 2024; Project Website: this https URL

点击查看摘要

Abstract:3D Hand reconstruction from a single RGB image is challenging due to the articulated motion, self-occlusion, and interaction with objects. Existing SOTA methods employ attention-based transformers to learn the 3D hand pose and shape, yet they do not fully achieve robust and accurate performance, primarily due to inefficiently modeling spatial relations between joints. To address this problem, we propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling. Our core idea is to reformulate Mamba’s scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens. This enables us to efficiently learn the spatial relationships between joints for improving reconstruction performance. Specifically, we design a Graph-guided State Space (GSS) block that learns the graph-structured relations and spatial sequences of joints and uses 88.5% fewer tokens than attention-based methods. Additionally, we integrate the state space features and the global features using a fusion module. By utilizing the GSS block and the fusion module, Hamba effectively leverages the graph-guided state space features and jointly considers global and local features to improve performance. Experiments on several benchmarks and in-the-wild tests demonstrate that Hamba significantly outperforms existing SOTAs, achieving the PA-MPVPE of 5.3mm and F@15mm of 0.992 on FreiHAND. At the time of this paper’s acceptance, Hamba holds the top position, Rank 1 in two Competition Leaderboards on 3D hand reconstruction. Project Website: this https URL
zh

[CV-101] Deep Sparse Representation-based Classification

【速读】：该论文旨在解决传统稀疏表示分类（Sparse Representation-based Classification, SRC）方法在特征表达能力不足和分类性能受限的问题。其解决方案的关键在于提出一种基于深度学习的归纳式框架，通过结合卷积自编码器（Convolutional Autoencoder）与全连接层构建端到端可训练网络：其中，卷积自编码器用于学习鲁棒的深层特征表示，而插入在编码器与解码器之间的全连接层则负责提取稀疏表示，最终利用该稀疏编码实现更准确的分类。实验表明，该方法在多个数据集上优于当前主流SRC方法。

链接: https://arxiv.org/abs/1904.11093
作者: Mahdi Abavisani,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present a transductive deep learning-based formulation for the sparse representation-based classification (SRC) method. The proposed network consists of a convolutional autoencoder along with a fully-connected layer. The role of the autoencoder network is to learn robust deep features for classification. On the other hand, the fully-connected layer, which is placed in between the encoder and the decoder networks, is responsible for finding the sparse representation. The estimated sparse codes are then used for classification. Various experiments on three different datasets show that the proposed network leads to sparse representations that give better classification results than state-of-the-art SRC methods. The source code is available at: this http URL.
zh

[CV-102] Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training

【速读】：该论文旨在解决如何在训练单模态3D卷积神经网络（3D-CNN）时有效利用多模态知识以提升动态手势识别性能的问题。传统方法通常显式融合多模态信息，而本文提出了一种新框架：通过将不同模态的知识嵌入到独立的单模态网络中，使每个网络在不直接融合多模态数据的情况下也能获得性能提升。其关键在于引入“时空语义对齐损失”（Spatiotemporal Semantic Alignment loss, SSA），用于对齐不同模态网络提取的特征语义内容，并结合所提出的“焦点正则化参数”来抑制负知识迁移，从而实现模态间协作学习与更好表征的共同优化。

链接: https://arxiv.org/abs/1812.06145
作者: Mahdi Abavisani,Hamid Reza Vaezi Joze,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate separate networks per available modality and enforce them to collaborate and learn to develop networks with common semantics and better representations. We introduce a “spatiotemporal semantic alignment” loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed “focal regularization parameter” to avoid negative knowledge transfer. Experimental results show that our framework improves the test time recognition accuracy of unimodal networks, and provides the state-of-the-art performance on various dynamic hand gesture recognition datasets.
zh

[CV-103] Deep Multimodal Subspace Clustering Networks

【速读】：该论文旨在解决多模态数据的无监督子空间聚类问题，即如何在不依赖标签的情况下，从多个模态的数据中学习共享的低维子空间结构并实现高质量聚类。其解决方案的关键在于提出了一种基于卷积神经网络（Convolutional Neural Network, CNN）的端到端框架，包含三个核心模块：多模态编码器、自表达层（self-expressive layer）和多模态解码器。其中，自表达层通过强制数据点之间的自表达性来构建亲和矩阵（affinity matrix），从而捕捉数据内在的聚类结构；而编码器采用早期融合、晚期融合与中间融合三种策略实现跨模态信息整合，解码器则通过重建输入数据来优化整个网络训练过程。此外，作者还提出一种基于亲和矩阵融合的变体方法，强制不同模态的自表达层输出一致，进一步提升聚类性能。实验表明，该方法显著优于当前主流的多模态子空间聚类技术。

链接: https://arxiv.org/abs/1804.06498
作者: Mahdi Abavisani,Vishal M. Patel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages - multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder’s reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
zh

[CV-104] A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis

【速读】：该论文旨在解决医学影像中脑肿瘤区域及健康组织缺失的修复问题，即通过生成式模型实现高保真度的磁共振成像（MRI）补全。其核心解决方案是提出一种基于潜在扩散模型（latent diffusion model）的生成框架，该模型以体素级连续肿瘤浓度（voxel-level continuous tumor concentrations）和组织分割图作为条件输入，从而在统一架构下同时支持肿瘤合成与健康组织恢复任务。关键创新在于将肿瘤浓度信息作为可调节的连续条件嵌入扩散过程，使模型能够生成具有三维空间一致性与解剖合理性的高质量MRI图像，且在BraTS 2025 Inpainting Challenge中实现了18.5 PSNR（健康组织修复）和17.4 PSNR（肿瘤修复）的性能表现。

链接: https://arxiv.org/abs/2510.09365
作者: Valentin Biller,Lucas Zimmer,Can Erdur,Sandeep Nagar,Daniel Rückert,Niklas Bubeck,Jonas Weidner
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: this https URL
zh

[CV-105] Rewiring Development in Brain Segmentation: Leverag ing Adult Brain Priors for Enhancing Infant MRI Segmentation

【速读】：该论文旨在解决婴儿脑部磁共振成像（MRI）分割的准确性难题，该问题主要源于受试者不断变化的解剖结构、运动伪影以及高质量标注数据稀缺等因素。解决方案的关键在于提出一种名为LODi的新框架，其核心思想是利用成人脑部MRI分割模型的先验知识来提升婴儿扫描的分割性能：首先在大规模成人数据集上预训练分割模型，随后通过迁移学习与领域自适应策略逐步将模型适配至0-2岁婴儿群体；同时引入一种结合层次特征精炼与多层级一致性约束的新型训练机制，从而实现快速、准确且年龄自适应的分割效果，并有效缓解扫描仪和站点特异性偏差。

链接: https://arxiv.org/abs/2510.09306
作者: Alemu Sisay Nigru,Michele Svanera,Austin Dibble,Connor Dalby,Mattia Savardi,Sergio Benini
机构: University of Brescia (意大利布雷西亚大学); University of Glasgow (英国格拉斯哥大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of infant brain MRI is critical for studying early neurodevelopment and diagnosing neurological disorders. Yet, it remains a fundamental challenge due to continuously evolving anatomy of the subjects, motion artifacts, and the scarcity of high-quality labeled data. In this work, we present LODi, a novel framework that utilizes prior knowledge from an adult brain MRI segmentation model to enhance the segmentation performance of infant scans. Given the abundance of publicly available adult brain MRI data, we pre-train a segmentation model on a large adult dataset as a starting point. Through transfer learning and domain adaptation strategies, we progressively adapt the model to the 0-2 year-old population, enabling it to account for the anatomical and imaging variability typical of infant scans. The adaptation of the adult model is carried out using weakly supervised learning on infant brain scans, leveraging silver-standard ground truth labels obtained with FreeSurfer. By introducing a novel training strategy that integrates hierarchical feature refinement and multi-level consistency constraints, our method enables fast, accurate, age-adaptive segmentation, while mitigating scanner and site-specific biases. Extensive experiments on both internal and external datasets demonstrate the superiority of our approach over traditional supervised learning and domain-specific models. Our findings highlight the advantage of leveraging adult brain priors as a foundation for age-flexible neuroimaging analysis, paving the way for more reliable and generalizable brain MRI segmentation across the lifespan.
zh

[CV-106] SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation

【速读】：该论文旨在解决将视频导向的通用分割模型SAM2直接应用于3D医学图像分割时面临的两个核心问题：一是医学图像中切片间的双向解剖连续性与视频中单向时间流之间的域差异；二是视频任务中常被忽视的精确边界分割需求，这对形态学分析至关重要。解决方案的关键在于提出SAM2-3dMed框架，其包含两项创新模块：1）Slice Relative Position Prediction (SRPP) 模块通过自监督方式引导模型预测不同切片间的相对位置，显式建模双向切片依赖关系；2）Boundary Detection (BD) 模块增强关键器官和组织边界的分割精度。实验表明，该方法在三个医学数据集上显著优于现有先进方法，在分割重叠度和边界精度方面均取得提升。

链接: https://arxiv.org/abs/2510.08967
作者: Yeqing Yang,Le Xu,Lixia Tian
机构: Beijing Jiaotong University (北京交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
zh

[CV-107] FS-RWKV: Leverag ing Frequency Spatial-Aware RWKV for 3T-to-7T MRI Translation

【速读】：该论文旨在解决7T磁共振成像（MRI）设备因高昂基础设施成本和技术门槛导致临床普及受限的问题，提出通过计算方法从更易获取的3T MRI数据中合成高质量7T图像以提升可及性。其解决方案的关键在于设计了一种基于RWKV架构的新型框架——频率空间RWKV（FS-RWKV），该框架创新性地引入两个核心模块：一是频域-空间全向移位（FSO-Shift），通过离散小波分解与低频分支上的全向空间偏移增强全局上下文表征并保留高频解剖细节；二是结构保真度增强块（SFEB），通过频域感知特征融合自适应强化解剖结构一致性。此设计兼顾了高效建模长程依赖与高保真图像重建，在多个数据集上显著优于CNN、Transformer、GAN及传统RWKV基线方法。

链接: https://arxiv.org/abs/2510.08951
作者: Yingtie Lei,Zimeng Li,Chi-Man Pun,Yupeng Liu,Xuhang Chen
机构: University of Macau (澳门大学); Shenzhen Polytechnic University (深圳职业技术学院); Southern Medical University (南方医科大学); Guangdong Provincial People’s Hospital (广东省人民医院); Guangdong Academy of Medical Sciences (广东省医学科学院); Guangdong Cardiovascular Institute (广东省心血管病研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by BIBM 2025

点击查看摘要

Abstract:Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
zh

[CV-108] Progressive Uncertainty-Guided Evidential U-KAN for Trustworthy Medical Image Segmentation

【速读】：该论文旨在解决现有基于证据深度学习（Evidence Deep Learning, EDL）的医学图像分割方法在处理边界模糊区域时存在的可靠性不足问题，尤其是忽视了不确定性图（uncertainty maps）中蕴含的注意力线索，导致难以精准定位和分割困难区域。其解决方案的关键在于提出一种渐进式证据不确定性引导注意力（Progressive Evidence Uncertainty Guided Attention, PEUA）机制，通过逐步利用不确定性图优化注意力权重，并结合低秩学习去除注意力权重噪声，从而增强对挑战性区域的特征学习能力；同时引入语义保持的证据学习（Semantic-Preserving Evidence Learning, SAEL）策略，设计语义平滑的证据生成器与保真度增强正则项，以避免传统EDL方法因KL散度正则化导致的错误类别证据过度抑制，从而保留关键语义信息并提升不确定性评估的准确性。最终将PEUA与SAEL嵌入到U-KAN架构中，构建出Evidential U-KAN模型，显著提升了医学图像分割的准确性和可信度。

链接: https://arxiv.org/abs/2510.08949
作者: Zhen Yang,Yansong Ma,Lei Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trustworthy medical image segmentation aims at deliver accurate and reliable results for clinical decision-making. Most existing methods adopt the evidence deep learning (EDL) paradigm due to its computational efficiency and theoretical robustness. However, the EDL-based methods often neglect leveraging uncertainty maps rich in attention cues to refine ambiguous boundary segmentation. To address this, we propose a progressive evidence uncertainty guided attention (PEUA) mechanism to guide the model to focus on the feature representation learning of hard regions. Unlike conventional approaches, PEUA progressively refines attention using uncertainty maps while employing low-rank learning to denoise attention weights, enhancing feature learning for challenging regions. Concurrently, standard EDL methods suppress evidence of incorrect class indiscriminately via Kullback-Leibler (KL) regularization, impairing the uncertainty assessment in ambiguous areas and consequently distorts the corresponding attention guidance. We thus introduce a semantic-preserving evidence learning (SAEL) strategy, integrating a semantic-smooth evidence generator and a fidelity-enhancing regularization term to retain critical semantics. Finally, by embedding PEUA and SAEL with the state-of-the-art U-KAN, we proposes Evidential U-KAN, a novel solution for trustworthy medical image segmentation. Extensive experiments on 4 datasets demonstrate superior accuracy and reliability over the competing methods. The code is available at \hrefthis https URLgithub.
zh

[CV-109] Interlaced dynamic XCT reconstruction with spatio-temporal implicit neural representations

【速读】：该论文旨在解决动态X射线计算机断层扫描（XCT）在交错采集方案下，由于数据欠采样、空间复杂度高及噪声干扰导致的重建质量下降问题。其核心解决方案是采用时空隐式神经表示（spatio-temporal Implicit Neural Representations, INRs）结合基于ADMM的优化策略与INCODE先验条件框架，通过引入INR的归纳偏置（inductive bias）增强对中等噪声的鲁棒性，并在数据保真项中显式建模噪声（采用加权最小二乘形式），显著提升在高挑战性场景下的重建性能。此外，该方法具备模块化特性，可直接整合探测器非理想效应建模和环状伪影校正，同时支持批量轴向切片联合优化，为大规模4D体积重建提供了可并行化的高效框架。

链接: https://arxiv.org/abs/2510.08641
作者: Mathias Boulanger,Ericmoore Jossou
机构: Massachusetts Institute of Technology (麻省理工学院); École Polytechnique de Bruxelles (布鲁塞尔理工大学); Université Libre de Bruxelles (布鲁塞尔自由大学)
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we investigate the use of spatio-temporalImplicit Neural Representations (INRs) for dynamic X-ray computed tomography (XCT) reconstruction under interlaced acquisition schemes. The proposed approach combines ADMM-based optimization with INCODE, a conditioning framework incorporating prior knowledge, to enable efficient convergence. We evaluate our method under diverse acquisition scenarios, varying the severity of global undersampling, spatial complexity (quantified via spatial information), and noise levels. Across all settings, our model achieves strong performance and outperforms Time-Interlaced Model-Based Iterative Reconstruction (TIMBIR), a state-of-the-art model-based iterative method. In particular, we show that the inductive bias of the INR provides good robustness to moderate noise levels, and that introducing explicit noise modeling through a weighted least squares data fidelity term significantly improves performance in more challenging regimes. The final part of this work explores extensions toward a practical reconstruction framework. We demonstrate the modularity of our approach by explicitly modeling detector non-idealities, incorporating ring artifact correction directly within the reconstruction process. Additionally, we present a proof-of-concept 4D volumetric reconstruction by jointly optimizing over batched axial slices, an approach which opens up the possibilities for massive parallelization, a critical feature for processing large-scale datasets.
zh

[CV-110] Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization

【速读】：该论文旨在解决自动语音识别（ASR）系统在学术讲座等专业场景中对领域术语识别准确率低的问题。其核心挑战在于，传统ASR模型缺乏对视觉上下文信息的利用能力，导致难以准确转录涉及专业术语的内容。解决方案的关键在于提出一种名为“视觉锚定策略优化”（Visually-Anchored Policy Optimization, VAPO）的后训练方法，该方法通过强化学习机制引导模型采用“先观察后转录”的链式推理范式（Chain-of-Thought），即在“思考”阶段执行OCR提取幻灯片文本，在“回答”阶段基于视觉识别结果生成语音转录内容，从而实现视觉与语音模态的协同对齐。VAPO通过四类奖励信号（格式合规性、OCR准确性、ASR质量及视觉锚定一致性）优化这一推理过程，显著提升了对领域术语的识别性能，构建了一个有效的端到端SlideASR范式。

链接: https://arxiv.org/abs/2510.08618
作者: Rui Hu,Delai Qiu,Yining Wang,Shengping Liu,Jitao Sang
机构: Beijing Jiaotong University (北京交通大学); Unisound (声智科技)
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model’s reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured “Look before Transcription” procedure using a thinkanswer format. Specifically, the model first performs OCR on the slide content within the think step, then generates the transcription by referencing this recognized visual information in the answer step. This reasoning process is optimized via reinforcement learning with four distinct rewards targeting format compliance, OCR accuracy, ASR quality, and visual anchoring consistency. To support further research, we construct SlideASR-Bench, a new entity-rich benchmark consisting of a synthetic dataset for training and testing, and a challenging real-world set for evaluation. Extensive experiments demonstrate that VAPO significantly improves recognition of domain-specific terms, establishing an effective end-to-end paradigm for SlideASR.
zh

人工智能

[AI-0] BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards

【速读】：该论文旨在解决在稀疏奖励（sparse-reward）场景下，生成式模型因缺乏成功样本和昂贵的奖励评估成本而难以有效后训练的问题。具体而言，当模型生成结果几乎总是获得接近零的奖励信号时，传统基于奖励的后训练方法失效，且每次调用奖励函数（reward oracle）代价高昂。解决方案的关键在于提出 BaNEL（Bayesian Negative Evidence Learning），其核心思想是将学习失败模式的问题建模为一个内嵌的生成建模任务：通过仅利用失败样本构建一个失败分布模型，进而判断新生成的数据是否与已知失败模式相似，并引导生成过程避开这些区域。该方法无需任何成功样本即可提升性能，在多个稀疏奖励任务中显著优于现有新颖性奖励（novelty-bonus）方法，同时大幅减少奖励评估次数（NREs）。

链接: https://arxiv.org/abs/2510.09596
作者: Sangyun Lee,Brandon Amos,Giulia Fanti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Today’s generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
zh

[AI-1] GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

【速读】：该论文旨在解决神经符号人工智能（Neurosymbolic AI）领域长期存在的两大挑战：一是现有框架难以规模化，二是神经方法的隐式表示和近似推理限制了可解释性与可信度。为应对这些问题，作者提出GraphMERT——一个轻量级的仅编码器模型，其核心创新在于通过蒸馏从非结构化文本语料库及其内部表征中提取高质量知识图谱（Knowledge Graph, KG），从而构建模块化的神经符号系统：神经网络负责抽象学习，符号KG提供可验证推理。该方案的关键突破在于实现了高效且可扩展的神经符号协同机制，并在糖尿病领域的实证中显著优于大语言模型（LLM）基线，在FActScore（事实准确性）和ValidityScore（有效性）上分别达到69.8%和68.8%，远超32B参数LLM的40.2%和43.0%。

链接: https://arxiv.org/abs/2510.09580
作者: Margarita Belova,Jiaxin Xiao,Shikhar Tuli,Niraj K. Jha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades because symbolic components provide abstraction while neural components provide generalization. Thus, a marriage of the two components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2510.09580 [cs.AI] (or arXiv:2510.09580v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2510.09580 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-2] Safe Untrusted “Proof-Carrying” AI Agents : toward the agent ic lakehouse

【速读】：该论文旨在解决数据湖仓（data lakehouse）中运行敏感工作负载时，由AI驱动自动化引发的信任、正确性和治理问题。其解决方案的关键在于采用以API优先、可编程的数据湖仓架构，通过提供安全设计的代理工作流抽象，实现对AI代理（agent）的安全控制与可观测性保障。具体而言，论文以Bauplan为例，展示了数据分支（data branching）和声明式环境（declarative environment）如何自然扩展至代理场景，从而提升可复现性并缩小攻击面；同时，通过受证明携带代码（proof-carrying code）启发的正确性检查机制，实现代理在生产数据上安全地修复数据管道，验证了不可信AI代理在生产环境中安全运行的可能性，为构建全代理化数据湖仓指明了路径。

链接: https://arxiv.org/abs/2510.09567
作者: Jacopo Tagliabue,Ciro Greco
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: IEEE Big Data, Workshop on Secure and Safe AI Agents for Big Data Infrastructures

点击查看摘要

Abstract:Data lakehouses run sensitive workloads, where AI-driven automation raises concerns about trust, correctness, and governance. We argue that API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic workflows. Using Bauplan as a case study, we show how data branching and declarative environments extend naturally to agents, enabling reproducibility and observability while reducing the attack surface. We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code. Our prototype demonstrates that untrusted AI agents can operate safely on production data and outlines a path toward a fully agentic lakehouse.
zh

[AI-3] ans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model

【速读】：该论文旨在解决 Titans: Learning at Test Time 模型在实际应用中因缺乏公开代码和描述模糊而导致的可复现性问题，同时系统评估其在多个任务中的性能表现。解决方案的关键在于对 Titans 进行轻量级重实现，并在掩码语言建模（Masked Language Modeling）、时间序列预测（Time Series Forecasting）和推荐（Recommendation）任务上进行全面评估，从而揭示其神经记忆（Neural Memory）组件相较于纯注意力机制模型的稳定提升效果，同时也指出其在分块（chunking）策略下并不总能超越现有基线模型，为后续研究提供了明确的方向与改进空间。

链接: https://arxiv.org/abs/2510.09551
作者: Gavriel Di Nepi,Federico Siciliano,Fabrizio Silvestri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:By the end of 2024, Google researchers introduced Titans: Learning at Test Time, a neural memory model achieving strong empirical results across multiple tasks. However, the lack of publicly available code and ambiguities in the original description hinder reproducibility. In this work, we present a lightweight reimplementation of Titans and conduct a comprehensive evaluation on Masked Language Modeling, Time Series Forecasting, and Recommendation tasks. Our results reveal that Titans does not always outperform established baselines due to chunking. However, its Neural Memory component consistently improves performance compared to attention-only models. These findings confirm the model’s innovative potential while highlighting its practical limitations and raising questions for future research.
zh

[AI-4] Autonomous Soft Robotic Guidewire Navigation via Imitation Learning

【速读】：该论文旨在解决软体机器人导丝在血管内导航过程中因复杂几何结构和操控精度要求高而导致的建模与控制难题，从而提升介入手术的精准性和安全性。其关键解决方案是提出一种基于Transformer的模仿学习框架，该框架引入目标条件（goal conditioning）、相对动作输出（relative action outputs）以及自动对比剂注射机制，在模拟荧光透视环境下训练模型以实现通用化导丝导航。通过在36种模块化分叉血管几何结构上生成647次演示数据，并在3种未见过的血管几何结构上验证，模型成功实现了83%的靶向动脉瘤定位成功率，显著优于多个基线方法。

链接: https://arxiv.org/abs/2510.09497
作者: Noah Barnes,Ji Woong Kim,Lingyun Di,Hannah Qu,Anuruddha Bhattacharjee,Miroslaw Janowski,Dheeraj Gandhi,Bailey Felix,Shaopeng Jiang,Olivia Young,Mark Fuge,Ryan D. Sochol,Jeremy D. Brown,Axel Krieger
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In endovascular surgery, endovascular interventionists push a thin tube called a catheter, guided by a thin wire to a treatment site inside the patient’s blood vessels to treat various conditions such as blood clots, aneurysms, and malformations. Guidewires with robotic tips can enhance maneuverability, but they present challenges in modeling and control. Automation of soft robotic guidewire navigation has the potential to overcome these challenges, increasing the precision and safety of endovascular navigation. In other surgical domains, end-to-end imitation learning has shown promising results. Thus, we develop a transformer-based imitation learning framework with goal conditioning, relative action outputs, and automatic contrast dye injections to enable generalizable soft robotic guidewire navigation in an aneurysm targeting task. We train the model on 36 different modular bifurcated geometries, generating 647 total demonstrations under simulated fluoroscopy, and evaluate it on three previously unseen vascular geometries. The model can autonomously drive the tip of the robot to the aneurysm location with a success rate of 83% on the unseen geometries, outperforming several baselines. In addition, we present ablation and baseline studies to evaluate the effectiveness of each design and data collection choice. Project website: this https URL
zh

[AI-5] Precoder Design in Multi-User FDD Systems with VQ-VAE and GNN ICASSP2026

【速读】：该论文旨在解决频分双工（FDD）系统中鲁棒预编码（robust precoding）设计的难题，特别是在反馈比特有限条件下如何高效利用信道统计信息以提升多用户无线系统的和速率（sum rate）。其关键解决方案在于采用向量量化变分自编码器（VQ-VAE）替代传统的高斯混合模型（GMM），从而克服了GMM组件数量随反馈比特数呈指数增长的瓶颈；同时，通过将VQ-VAE与图神经网络（GNN）及导频优化联合训练，构建端到端（E2E）学习框架，显著提升了系统性能。

链接: https://arxiv.org/abs/2510.09495
作者: Srikar Allaparapu,Michael Baur,Benedikt Böck,Michael Joham,Wolfgang Utschick
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Submitted to IEEE ICASSP 2026

点击查看摘要

Abstract:Robust precoding is efficiently feasible in frequency division duplex (FDD) systems by incorporating the learnt statistics of the propagation environment through a generative model. We build on previous work that successfully designed site-specific precoders based on a combination of Gaussian mixture models (GMMs) and graph neural networks (GNNs). In this paper, by utilizing a vector quantized-variational autoencoder (VQ-VAE), we circumvent one of the key drawbacks of GMMs, i.e., the number of GMM components scales exponentially to the feedback bits. In addition, the deep learning architecture of the VQ-VAE allows us to jointly train the GNN together with VQ-VAE along with pilot optimization forming an end-to-end (E2E) model, resulting in considerable performance gains in sum rate for multi-user wireless systems. Simulations demonstrate the superiority of the proposed frameworks over the conventional methods involving the sub-discrete Fourier transform (DFT) pilot matrix and iterative precoder algorithms enabling the deployment of systems characterized by fewer pilots or feedback bits.
zh

[AI-6] Performance Analysis of Machine Learning Algorithms in Chronic Kidney Disease Prediction

【速读】：该论文旨在解决慢性肾病（Chronic Kidney Disease, CKD）的早期风险评估与精准诊断问题，以防止高危患者进一步肾功能损伤。其解决方案的关键在于构建基于机器学习（Machine Learning, ML）的辅助诊断模型，利用来自UCL机器学习数据库的CKD数据集（含少量缺失值，通过“均值-众数填充”和“随机采样法”处理），对比八种主流ML算法（包括随机森林、支持向量机、朴素贝叶斯、逻辑回归、K近邻、XGBoost、决策树和AdaBoost）的预测性能。结果表明，随机森林和逻辑回归模型表现最优，准确率高达99%，验证了机器学习在CKD智能识别中的有效性与可行性。

链接: https://arxiv.org/abs/2510.09493
作者: Iftekhar Ahmed,Tanzil Ebad Chowdhury,Biggo Bushon Routh,Nafisa Tasmiya,Shadman Sakib,Adil Ahmed Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, Presented at the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0417-0423

点击查看摘要

Abstract:Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect in danger patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Due to quick and precise detection capabilities, Machine Learning models can help practitioners accomplish this goal efficiently; therefore, an enormous number of diagnosis systems and processes in the healthcare sector nowadays are relying on machine learning due to its disease prediction capability. In this study, we designed and suggested disease predictive computer-aided designs for the diagnosis of CKD. The dataset for CKD is attained from the repository of machine learning of UCL, with a few missing values; those are filled in using “mean-mode” and “Random sampling method” strategies. After successfully achieving the missing data, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) were used to establish models, and the performance evaluation comparisons among the result accuracies are measured by the techniques to find the machine learning models with the highest accuracy. Among them, Random Forest as well as Logistic Regression showed an outstanding 99% accuracy, followed by the Ada Boost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier model stands last with an accuracy of 73%.
zh

[AI-7] Scalable Multi-Agent Path Finding using Collision-Aware Dynamic Alert Mask and a Hybrid Execution Strategy

【速读】：该论文旨在解决多智能体路径规划（Multi-agent Path Finding, MAPF）中因规模扩大导致的计算复杂性与解质量之间的权衡问题：传统集中式算法（如基于冲突的搜索，Conflict-Based Search, CBS）虽能提供高质量解，但在大规模场景下因冲突组合爆炸而效率低下；而纯分布式方法虽具可扩展性，却常因信息受限导致解质量下降。其解决方案的关键在于提出一种混合框架，将去中心化路径规划与轻量级集中式协调器相结合——利用强化学习（Reinforcement Learning, RL）实现智能体自主决策，同时通过中央协调器动态下发少量关键信息（如静态冲突单元标记或简短冲突轨迹），从而在显著减少智能体间通信负担的同时，仍能稳定生成无碰撞、可行的路径，尤其适用于高密度智能体场景。

链接: https://arxiv.org/abs/2510.09469
作者: Bharath Muppasani,Ritirupa Dey,Biplav Srivastava,Vignesh Narayanan
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Multi-agent pathfinding (MAPF) remains a critical problem in robotics and autonomous systems, where agents must navigate shared spaces efficiently while avoiding conflicts. Traditional centralized algorithms that have global information, such as Conflict-Based Search (CBS), provide high-quality solutions but become computationally expensive in large-scale scenarios due to the combinatorial explosion of conflicts that need resolution. Conversely, distributed approaches that have local information, particularly learning-based methods, offer better scalability by operating with relaxed information availability, yet often at the cost of solution quality. To address these limitations, we propose a hybrid framework that combines decentralized path planning with a lightweight centralized coordinator. Our framework leverages reinforcement learning (RL) for decentralized planning, enabling agents to adapt their planning based on minimal, targeted alerts–such as static conflict-cell flags or brief conflict tracks–that are dynamically shared information from the central coordinator for effective conflict resolution. We empirically study the effect of the information available to an agent on its planning performance. Our approach reduces the inter-agent information sharing compared to fully centralized and distributed methods, while still consistently finding feasible, collision-free solutions–even in large-scale scenarios having higher agent counts.
zh

[AI-8] Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

【速读】：该论文试图解决当前AI控制协议在面对具备适应性攻击的不可信大语言模型（Large Language Model, LLM）时存在的脆弱性问题。现有方案普遍依赖LLM监控器（monitor model）作为核心防御机制，但这种设计存在单点失效风险。论文的关键发现是：当攻击者模型知晓控制协议与监控模型时，可通过嵌入公开或零样本提示注入（prompt injection）的方式实施自适应攻击，使前沿模型持续绕过多种监控机制并在两个主流AI控制基准上完成恶意任务。这一攻击策略具有普适性，甚至导致近期提出的“Defer-to-Resample”协议因重采样放大注入内容而反向增强攻击效果，从而揭示出当前基于监控器的控制机制存在重大盲区，亟需将其纳入未来AI控制评估的标准组件。

链接: https://arxiv.org/abs/2510.09462
作者: Mikhail Terekhov,Alexander Panfilov,Daniil Dzenhaliou,Caglar Gulcehre,Maksym Andriushchenko,Ameya Prabhu,Jonas Geiping
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of- n attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
zh

[AI-9] Failure Prediction at Runtime for Generative Robot Policies NEURIPS2025

【速读】：该论文旨在解决生成式模仿学习（Generative Imitation Learning, GIL）政策在实际部署中因环境分布偏移或动作误差累积导致的不可预测和不安全行为问题，从而引发任务失败。解决方案的关键在于提出FIPER框架，通过两个核心指标实现运行时故障预测：一是利用随机网络蒸馏（Random Network Distillation）在策略嵌入空间中检测观测数据的分布外（Out-of-Distribution, OOD）状态；二是引入一种新颖的动作片段熵得分（action-chunk entropy score）量化生成动作的不确定性。这两个指标均通过少量成功轨迹进行合规预测（Conformal Prediction）校准，并在短时间窗口内聚合后同时超过阈值时触发故障警报，从而实现更早、更准确的故障识别与预警。

链接: https://arxiv.org/abs/2510.09459
作者: Ralf Römer,Adrian Kobras,Luca Worbis,Angela P. Schoellig
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Imitation learning (IL) with generative models, such as diffusion and flow matching, has enabled robots to perform complex, long-horizon tasks. However, distribution shifts from unseen environments or compounding action errors can still cause unpredictable and unsafe behavior, leading to task failure. Early failure prediction during runtime is therefore essential for deploying robots in human-centered and safety-critical environments. We propose FIPER, a general framework for Failure Prediction at Runtime for generative IL policies that does not require failure data. FIPER identifies two key indicators of impending failure: (i) out-of-distribution (OOD) observations detected via random network distillation in the policy’s embedding space, and (ii) high uncertainty in generated actions measured by a novel action-chunk entropy score. Both failure prediction scores are calibrated using a small set of successful rollouts via conformal prediction. A failure alarm is triggered when both indicators, aggregated over short time windows, exceed their thresholds. We evaluate FIPER across five simulation and real-world environments involving diverse failure modes. Our results demonstrate that FIPER better distinguishes actual failures from benign OOD situations and predicts failures more accurately and earlier than existing methods. We thus consider this work an important step towards more interpretable and safer generative robot policies. Code, data and videos are available at this https URL.
zh

[AI-10] Bandits with Single-Peaked Preferences and Limited Resources

【速读】：该论文旨在解决在线随机匹配问题，即在预算约束下，通过序列化地将 $ U $ 个用户匹配到 $ K $ 个臂（arms），以最大化 $ T $ 轮内的累积奖励。由于缺乏结构假设时该问题为 NP-hard，导致在线学习计算上不可行。为突破这一障碍，作者引入了社会选择理论中广泛接受的“单峰偏好”（single-peaked preferences）结构——即用户的偏好关于一个共同的臂排序呈单峰分布。解决方案的关键在于：首先设计了一个高效的离线预算匹配算法，并基于此构造出在线算法，其遗憾（regret）界为 $ \tilde{O}(UKT^{2/3}) $；进一步地，若单峰结构已知，则提出一种类似 Upper Confidence Bound (UCB) 的高效在线算法，实现 $ \tilde{O}(U\sqrt{TK}) $ 的遗憾界。该方法的核心创新在于使用基于 PQ 树（PQ tree）的顺序近似技术来建模和利用单峰偏好结构。

链接: https://arxiv.org/abs/2510.09425
作者: Gur Keinan,Rotem Torkan,Omer Ben-Porat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study an online stochastic matching problem in which an algorithm sequentially matches U users to K arms, aiming to maximize cumulative reward over T rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emphsingle-peaked preferences – a well-established structure in social choice theory, where users’ preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of \tilde O(UKT^2/3) . Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of \tilde O(U\sqrtTK) .
zh

[AI-11] Agent ic Systems in Radiology: Design Applications Evaluation and Challenges

【速读】：该论文旨在解决如何在放射学（radiology）领域中充分利用大语言模型（LLM）的潜力，以支持复杂、多步骤的临床工作流程，而不仅仅是执行单一任务。当前LLM在信息提取和报告摘要等独立任务上已表现良好，但其在动态上下文依赖和跨系统协作中的价值尚未被充分挖掘。解决方案的关键在于构建由LLM驱动的代理系统（agentic systems），通过赋予LLM外部工具调用能力和反馈机制，使其能够实现从半自动化流程到具备环境适应能力的自主代理的多种自主度，从而有效管理多源信息融合与决策演化，提升放射学工作流的智能化水平。

链接: https://arxiv.org/abs/2510.09404
作者: Christian Bluethgen,Dave Van Veen,Daniel Truhn,Jakob Nikolas Kather,Michael Moor,Malgorzata Polacin,Akshay Chaudhari,Thomas Frauenfelder,Curtis P. Langlotz,Michael Krauthammer,Farhad Nooralahzadeh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building agents, systems that perceive and act upon their environment with a degree of autonomy, has long been a focus of AI research. This pursuit has recently become vastly more practical with the emergence of large language models (LLMs) capable of using natural language to integrate information, follow instructions, and perform forms of “reasoning” and planning across a wide range of tasks. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. In radiology, LLMs and their multimodal variants have already demonstrated promising performance for individual tasks such as information extraction and report summarization. However, using LLMs in isolation underutilizes their potential to support complex, multi-step workflows where decisions depend on evolving context from multiple information sources. Equipping LLMs with external tools and feedback mechanisms enables them to drive systems that exhibit a spectrum of autonomy, ranging from semi-automated workflows to more adaptive agents capable of managing complex processes. This review examines the design of such LLM-driven agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.
zh

[AI-12] ChoirRec: Semantic User Grouping via LLM s for Conversion Rate Prediction of Low-Activity Users

【速读】：该论文旨在解决大规模电商推荐系统中低活跃用户转化率（CVR）预测准确率低的问题，其核心挑战包括：行为信号噪声大、用户级信息不足以及训练偏差偏向高活跃用户。解决方案的关键在于提出ChoirRec框架，利用大语言模型（LLM）的语义能力构建可靠的跨活跃度用户群组，并通过双通道架构实现跨用户知识的有效迁移；具体包含三个模块：语义群组生成模块用于过滤噪声信号，群组感知的分层表示模块通过引入群组级先验信息缓解数据稀疏问题，以及群组感知的多粒度模块结合自适应融合机制确保群组知识的学习与利用效率。

链接: https://arxiv.org/abs/2510.09393
作者: Dakai Zhai,Jiong Gao,Boya Du,Junwei Xu,Qijie Shen,Jialin Zhu,Yuning Jiang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately predicting conversion rates (CVR) for low-activity users remains a fundamental challenge in large-scale e-commerce recommender this http URL approaches face three critical limitations: (i) reliance on noisy and unreliable behavioral signals; (ii) insufficient user-level information due to the lack of diverse interaction data; and (iii) a systemic training bias toward high-activity users that overshadows the needs of low-activity this http URL address these challenges, we propose ChoirRec, a novel framework that leverages the semantic capabilities of Large Language Models (LLMs) to construct semantic user groups and enhance CVR prediction for low-activity this http URL a dual-channel architecture designed for robust cross-user knowledge transfer, ChoirRec comprises three components: (i) a Semantic Group Generation module that utilizes LLMs to form reliable, cross-activity user clusters, thereby filtering out noisy signals; (ii) a Group-aware Hierarchical Representation module that enriches sparse user embeddings with informative group-level priors to mitigate data insufficiency; and (iii) a Group-aware Multi-granularity Modual that employs a dual-channel architecture and adaptive fusion mechanism to ensure effective learning and utilization of group knowledge. We conduct extensive offline and online experiments on Taobao, a leading industrial-scale e-commerce this http URL improves GAUC by 1.16% in offline evaluations, while online A/B testing reveals a 7.24% increase in order volume, highlighting its substantial practical value in real-world applications.
zh

[AI-13] Design Principles for Sequence Models via Coefficient Dynamics

【速读】：该论文旨在解决当前深度序列模型（如Transformer、状态空间模型State Space Models (SSMs) 和门控线性RNN等）在架构设计与性能分析上缺乏统一理论框架的问题。其核心挑战在于如何系统性地理解不同模型结构如何通过线性组合历史状态向量来生成输出，并揭示这些模型之间的共通数学本质，尤其是对softmax注意力机制的统一建模。解决方案的关键在于提出一个统一框架，将输出的线性组合系数视为由脉冲输入驱动的自主线性动态系统（autonomous linear dynamical systems）的输出，从而显式刻画所有模型的输出计算过程。这一视角不仅揭示了多种架构间的共同数学结构，还首次实现了对softmax注意力机制在RNN、SSM及相关模型中的统一表达，进而导出可解释的设计原则，如表达能力与高效实现之间的权衡、输入选择性的几何约束以及数值稳定训练所需的动力学稳定性条件。

链接: https://arxiv.org/abs/2510.09389
作者: Jerome Sieber,Antonio Orvieto,Melanie N. Zeilinger,Carmen Amo Alonso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties. Thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
zh

[AI-14] ask-Level Insights from Eigenvalues across Sequence Models

【速读】：该论文试图解决的问题是：尽管Softmax注意力机制在序列模型中表现优异，但其二次复杂度限制了模型的可扩展性；同时，诸如状态空间模型（State Space Models, SSMs）等线性替代方案虽提升了效率，但其与Softmax注意力在信息处理机制上的根本差异尚不清晰。解决方案的关键在于：利用最近提出的动力系统框架将Softmax注意力、归一化注意力和线性注意力统一建模为动力学系统，并通过分析其特征值谱（eigenvalue spectra）来结构化地比较它们与SSMs的差异。研究表明，特征值谱不仅影响记忆能力和长程依赖建模，还能作为解释和优化序列模型性能的原理性指标。

链接: https://arxiv.org/abs/2510.09379
作者: Rahel Rickenbach,Jelena Trisovic,Alexandre Didier,Jerome Sieber,Melanie N. Zeilinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.
zh

[AI-15] he Potential of Second-Order Optimization for LLM s: A Study with Full Gauss-Newton

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）预训练过程中，由于采用计算高效的近似方法（如利用二阶信息的优化器）而导致性能损失的问题。其核心挑战在于：在不显著增加计算开销的前提下，如何尽可能逼近最优的迭代收敛速度。解决方案的关键在于引入全 Gauss-Newton (Gauss-Newton, GN) 预条件技术，并通过实验证明其对 Transformer 模型（最大达 150M 参数）具有显著加速效果——相比强基线优化器如 SOAP 和 Muon，训练迭代次数减少 5.4 倍；同时发现仅使用逐层精确的 GN 预条件器（忽略跨层信息）即可接近全 GN 方法的性能，表明高阶损失项对收敛速度影响有限，且层内 Hessian 结构已包含实现大部分性能提升的关键信息。

链接: https://arxiv.org/abs/2510.09378
作者: Natalie Abreu,Nikhil Vyas,Sham Kakade,Depen Morwani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
zh

[AI-16] Sequence Variables: A Constraint Programming Computational Domain for Routing and Sequencing

【速读】：该论文旨在解决传统约束规划（Constraint Programming, CP）模型在处理车辆路径问题（Vehicle Routing Problem, VRP）时的局限性，特别是基于后继变量（successor variables）的经典CP模型难以有效建模可选访问点（optional visits）以及不支持基于插入的启发式算法（如插入式大邻域搜索，Insertion-based Large Neighborhood Search）。其解决方案的关键在于引入序列变量（sequence variables），这是一种新的计算域，能够自然表达路径顺序并支持可选节点的处理和插入操作。论文明确了序列变量的定义、更新操作及约束一致性级别，并设计了适用于序列变量的全局约束与数据结构，从而实现了更简洁的问题建模和高效的求解性能，在Dial-a-Ride问题上展现出竞争力。

链接: https://arxiv.org/abs/2510.09373
作者: Augustin Delecluse,Pierre Schaus,Pascal Van Hentenryck
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Constraint Programming (CP) offers an intuitive, declarative framework for modeling Vehicle Routing Problems (VRP), yet classical CP models based on successor variables cannot always deal with optional visits or insertion based heuristics. To address these limitations, this paper formalizes sequence variables within CP. Unlike the classical successor models, this computational domain handle optional visits and support insertion heuristics, including insertion-based Large Neighborhood Search. We provide a clear definition of their domain, update operations, and introduce consistency levels for constraints on this domain. An implementation is described with the underlying data structures required for integrating sequence variables into existing trail-based CP solvers. Furthermore, global constraints specifically designed for sequence variables and vehicle routing are introduced. Finally, the effectiveness of sequence variables is demonstrated by simplifying problem modeling and achieving competitive computational performance on the Dial-a-Ride Problem.
zh

[AI-17] oward Mechanistic Explanation of Deductive Reasoning in Language Models

【速读】：该论文旨在解决大型语言模型在逻辑推理任务中内部工作机制不明确的问题，特别是如何从规则学习而非统计模式匹配的角度实现推理能力。其解决方案的关键在于发现并验证了“归纳头”（induction heads）在执行逻辑推理中的核心作用，即通过学习底层规则完成规则补全和规则链式推理的计算过程，从而揭示了小规模语言模型实现演绎推理的低层次机制。

链接: https://arxiv.org/abs/2510.09340
作者: Davide Maltoni,Matteo Ferrara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent large language models have demonstrated relevant capabilities in solving problems that require logical reasoning; however, the corresponding internal mechanisms remain largely unexplored. In this paper, we show that a small language model can solve a deductive reasoning task by learning the underlying rules (rather than operating as a statistical learner). A low-level explanation of its internal representations and computational circuits is then provided. Our findings reveal that induction heads play a central role in the implementation of the rule completion and rule chaining steps involved in the logical inference required by the task.
zh

[AI-18] Localist LLM s – A Mathematical Framework for Dynamic Locality Control

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在可解释性与高性能之间难以权衡的问题，即如何在保持模型高效泛化能力的同时实现局部化、可解释的内部表示。其解决方案的关键在于提出一种“局部性旋钮”（locality dial）机制，通过在训练和推理过程中动态调节注意力机制的局部化程度，无需重新训练模型即可实现从规则驱动的局部表示（localist encoding）到分布式编码（distributed encoding）的连续过渡。这一机制依托于组稀疏惩罚（group sparsity penalties）、信息论锚点设计（information-theoretic anchor design）及动态规则注入（dynamic rule injection），并辅以严格的数学证明：当组稀疏惩罚超过特定阈值时，注意力机制会聚焦于语义相关的块，且注意力熵与指针保真度满足指数级上界，从而保障模型在低熵、高保真条件下实现性能与透明性的协同优化。

链接: https://arxiv.org/abs/2510.09338
作者: Joachim Diederich
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model’s attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
zh

[AI-19] Randomized HyperSteiner: A Stochastic Delaunay Triangulation Heuristic for the Hyperbolic Steiner Minimal Tree

【速读】：该论文旨在解决双曲空间中构造最小斯坦纳树（Steiner Minimal Trees, SMTs）的问题，其核心挑战在于精确计算SMT是NP-hard的，而现有方法如HyperSteiner为确定性算法，易陷入局部次优解。解决方案的关键在于提出随机化双曲斯坦纳算法（Randomized HyperSteiner, RHS），其创新性地将随机性引入Delaunay三角剖分的扩展过程，并通过黎曼梯度下降优化对候选树进行精炼，从而提升全局搜索能力与鲁棒性。实验表明，RHS在合成数据和真实单细胞转录组数据上均显著优于MST、邻接法及原始HyperSteiner，在近边界配置下甚至可使总长度减少32%。

链接: https://arxiv.org/abs/2510.09328
作者: Aniss Aiman Medbouhi,Alejandro García-Castellanos,Giovanni Luca Marchetti,Daniel Pelt,Erik J Bekkers,Danica Kragic
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study the problem of constructing Steiner Minimal Trees (SMTs) in hyperbolic space. Exact SMT computation is NP-hard, and existing hyperbolic heuristics such as HyperSteiner are deterministic and often get trapped in locally suboptimal configurations. We introduce Randomized HyperSteiner (RHS), a stochastic Delaunay triangulation heuristic that incorporates randomness into the expansion process and refines candidate trees via Riemannian gradient descent optimization. Experiments on synthetic data sets and a real-world single-cell transcriptomic data show that RHS outperforms Minimum Spanning Tree (MST), Neighbour Joining, and vanilla HyperSteiner (HS). In near-boundary configurations, RHS can achieve a 32% reduction in total length over HS, demonstrating its effectiveness and robustness in diverse data regimes.
zh

[AI-20] Rate optimal learning of equilibria from data

【速读】：该论文旨在解决多智能体模仿学习（Multi-Agent Imitation Learning, MAIL）中的理论空白问题，特别是非交互式和交互式设置下的样本复杂度优化。在非交互式场景中，作者通过证明统计下界，指出“所有策略偏差集中系数”（all-policy deviation concentrability coefficient）是根本复杂度度量，并表明行为克隆（Behavior Cloning, BC）达到最优速率；而在交互式场景中，提出首个具有近最优样本复杂度的算法MAIL-WARM，其核心在于融合免奖励强化学习与交互式MAIL，将最优样本复杂度从 $\mathcal{O}(\varepsilon^{-8})$ 显著提升至 $\mathcal{O}(\varepsilon^{-2})$ ，并匹配理论下界。

链接: https://arxiv.org/abs/2510.09325
作者: Till Freihaut,Luca Viano,Emanuele Nevali,Volkan Cevher,Matthieu Geist,Giorgia Ramponi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from \mathcalO(\varepsilon^-8) to \mathcalO(\varepsilon^-2), matching the dependence on \varepsilon implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
zh

[AI-21] A Model-Driven Engineering Approach to AI-Powered Healthcare Platforms

【速读】：该论文旨在解决医疗人工智能（Artificial Intelligence, AI）在临床实践中推广受限的问题，主要挑战包括数据来源碎片化、严格的隐私法规以及构建可靠临床系统的技术复杂性。其解决方案的关键在于提出一种面向医疗AI的模型驱动工程（Model-Driven Engineering, MDE）框架，核心是基于形式化元模型（formal metamodels）和领域特定语言（Domain-Specific Languages, DSLs）实现从高层规范到可运行软件的自动化转换。其中，Medical Interoperability Language (MILA) 作为图形化DSL，使临床医生与数据科学家能够借助共享本体定义查询和机器学习流水线；结合联邦学习（federated learning）架构，可在不交换原始患者数据的前提下实现多中心协作，保障语义一致性与隐私安全。实证研究表明，该方法显著提升了预测性能并大幅降低人工编码工作量，为构建可互操作、可复现且可信的数字健康平台提供了可行路径。

链接: https://arxiv.org/abs/2510.09308
作者: Mira Raheem,Amal Elgammal,Michael Papazoglou,Bernd Krämer,Neamat El-Tazi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Disclaimer: This manuscript is currently under review at * MDPI Informatics*

点击查看摘要

Abstract:Artificial intelligence (AI) has the potential to transform healthcare by supporting more accurate diagnoses and personalized treatments. However, its adoption in practice remains constrained by fragmented data sources, strict privacy rules, and the technical complexity of building reliable clinical systems. To address these challenges, we introduce a model driven engineering (MDE) framework designed specifically for healthcare AI. The framework relies on formal metamodels, domain-specific languages (DSLs), and automated transformations to move from high level specifications to running software. At its core is the Medical Interoperability Language (MILA), a graphical DSL that enables clinicians and data scientists to define queries and machine learning pipelines using shared ontologies. When combined with a federated learning architecture, MILA allows institutions to collaborate without exchanging raw patient data, ensuring semantic consistency across sites while preserving privacy. We evaluate this approach in a multi center cancer immunotherapy study. The generated pipelines delivered strong predictive performance, with support vector machines achieving up to 98.5 percent and 98.3 percent accuracy in key tasks, while substantially reducing manual coding effort. These findings suggest that MDE principles metamodeling, semantic integration, and automated code generation can provide a practical path toward interoperable, reproducible, and trustworthy digital health platforms.
zh

[AI-22] SynthID-Image: Image watermarking at internet scale

【速读】：该论文旨在解决生成式 AI (Generative AI) 生成内容的可追溯性问题，即如何在不显著影响视觉质量的前提下，为AI生成的图像和视频帧嵌入不可见水印，以实现媒体溯源与真实性验证。解决方案的关键在于设计并部署一个大规模可扩展的深度学习系统 SynthID-Image，其核心特性包括高有效性（effectiveness）、保真度（fidelity）、鲁棒性（robustness）和安全性（security），并通过实际应用验证其在十亿级图像和视频帧上的可行性，同时提供配套的验证服务及外部模型变体 SynthID-O，后者在视觉质量和抗常见图像扰动方面达到当前最优性能。

链接: https://arxiv.org/abs/2510.09263
作者: Sven Gowal,Rudy Bunel,Florian Stimberg,David Stutz,Guillermo Ortiz-Jimenez,Christina Kouridi,Mel Vecerik,Jamie Hayes,Sylvestre-Alvise Rebuffi,Paul Bernard,Chris Gamble,Miklós Z. Horváth,Fabian Kaczmarczyck,Alex Kaskasoli,Aleksandar Petrov,Ilia Shumailov,Meghana Thotakuri,Olivia Wiles,Jessica Yung,Zahra Ahmed,Victor Martin,Simon Rosen,Christopher Savčak,Armin Senoner,Nidhi Vyas,Pushmeet Kohli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce SynthID-Image, a deep learning-based system for invisibly watermarking AI-generated imagery. This paper documents the technical desiderata, threat models, and practical challenges of deploying such a system at internet scale, addressing key requirements of effectiveness, fidelity, robustness, and security. SynthID-Image has been used to watermark over ten billion images and video frames across Google’s services and its corresponding verification service is available to trusted testers. For completeness, we present an experimental evaluation of an external model variant, SynthID-O, which is available through partnerships. We benchmark SynthID-O against other post-hoc watermarking methods from the literature, demonstrating state-of-the-art performance in both visual quality and robustness to common image perturbations. While this work centers on visual media, the conclusions on deployment, constraints, and threat modeling generalize to other modalities, including audio. This paper provides a comprehensive documentation for the large-scale deployment of deep learning-based media provenance systems.
zh

[AI-23] Obstacle Avoidance using Dynamic Movement Primitives and Reinforcement Learning

【速读】：该论文旨在解决基于学习的运动规划方法在生成平滑、近最优且避障的三维笛卡尔轨迹时，对大规模训练数据或昂贵的人类示范依赖的问题。其解决方案的关键在于：首先将单个人工示教轨迹编码为动态运动基元（Dynamic Movement Primitive, DMP），然后利用基于策略的强化学习迭代优化该DMP，以适应不同障碍物配置并生成多样化的轨迹数据集；随后，使用该数据集训练一个神经网络，输入为从点云自动提取的任务参数（如障碍物尺寸与位置），输出为生成轨迹所需的DMP参数，从而实现高效、可泛化的轨迹生成，同时支持多模态路径选择以应对不同障碍几何和末端执行器尺寸。

链接: https://arxiv.org/abs/2510.09254
作者: Dominik Urbaniak,Alejandro Agostini,Pol Ramon,Jan Rosell,Raúl Suárez,Michael Suppa
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:Learning-based motion planning can quickly generate near-optimal trajectories. However, it often requires either large training datasets or costly collection of human demonstrations. This work proposes an alternative approach that quickly generates smooth, near-optimal collision-free 3D Cartesian trajectories from a single artificial demonstration. The demonstration is encoded as a Dynamic Movement Primitive (DMP) and iteratively reshaped using policy-based reinforcement learning to create a diverse trajectory dataset for varying obstacle configurations. This dataset is used to train a neural network that takes as inputs the task parameters describing the obstacle dimensions and location, derived automatically from a point cloud, and outputs the DMP parameters that generate the trajectory. The approach is validated in simulation and real-robot experiments, outperforming a RRT-Connect baseline in terms of computation and execution time, as well as trajectory length, while supporting multi-modal trajectory generation for different obstacle geometries and end-effector dimensions. Videos and the implementation code are available at this https URL.
zh

[AI-24] Fundamentals of Building Autonomous LLM Agents

【速读】：该论文旨在解决传统大语言模型（Large Language Models, LLMs）在真实世界任务中表现不足的问题，尤其是其在自动化复杂任务和逼近人类能力方面存在的局限性。解决方案的关键在于构建一个集成化的“代理”架构，包含感知系统、推理系统、记忆系统和执行系统：感知系统将环境输入转化为语义表示；推理系统通过链式思维（Chain-of-Thought）和树状思维（Tree-of-Thought）等技术制定计划并动态调整策略；记忆系统支持短期与长期知识存储；执行系统则将内部决策转化为具体动作。这种多系统协同的架构使软件机器人能够模拟人类认知过程，实现更强大且泛化的自主智能行为。

链接: https://arxiv.org/abs/2510.09244
作者: Victor de Lamo Castrillo,Habtom Kahsay Gidey,Alexander Lenz,Alois Knoll
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reviews the architecture and implementation methods of agents powered by large language models (LLMs). Motivated by the limitations of traditional LLMs in real-world tasks, the research aims to explore patterns to develop “agentic” LLMs that can automate complex tasks and bridge the performance gap with human capabilities. Key components include a perception system that converts environmental percepts into meaningful representations; a reasoning system that formulates plans, adapts to feedback, and evaluates actions through different techniques like Chain-of-Thought and Tree-of-Thought; a memory system that retains knowledge through both short-term and long-term mechanisms; and an execution system that translates internal decisions into concrete actions. This paper shows how integrating these systems leads to more capable and generalized software bots that mimic human cognitive processes for autonomous and intelligent behavior.
zh

[AI-25] RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）和大推理模型（Large Reasoning Models, LRMs）在空间计算能力上的局限性问题，特别是受限于有限上下文窗口所引发的空间复杂度瓶颈。现有研究多聚焦于NP类问题，而本文通过引入基于PSPACE-完全的正则表达式（regular expression, regex）问题——等价性判定（RegexEQ）与最小化（RegexMin）——构建了一个更严格的评估基准，以挑战模型对大规模搜索空间的探索能力。解决方案的关键在于设计并实现了一种双指数级空间探索方法，用于生成超过百万条标注的regex实例，并结合严谨的过滤机制构建高质量数据集，从而首次系统性地实证分析了LLMs和LRMs在空间计算维度上的性能边界，为评估其高级推理能力提供了新的量化框架。

链接: https://arxiv.org/abs/2510.09227
作者: Hyundong Jin,Joonghyuk Hahn,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at this https URL .
zh

[AI-26] Comparing Knowledge Source Integration Methods for Optimizing Healthcare Knowledge Fusion in Rescue Operation

【速读】：该论文旨在解决医疗领域中因医学知识与患者健康信息复杂多样而带来的临床决策难题，尤其在紧急救援场景下，如何高效整合多源异构知识以支持精准的患者驱动型决策。其解决方案的关键在于构建基于知识图谱（Knowledge Graph）的多源知识融合框架，通过统一结构化方式聚合、分析并利用现有医疗治疗与操作知识，使健康专业人员能够从多个语境对齐的知识源中选择最优信息，从而提升决策的准确性与可靠性。

链接: https://arxiv.org/abs/2510.09223
作者: Mubaris Nadeem,Madjid Fathi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference Paper for 2024 IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), KIRETT Project, University of Siegen, Germany

点击查看摘要

Abstract:In the field of medicine and healthcare, the utilization of medical expertise, based on medical knowledge combined with patients’ health information is a life-critical challenge for patients and health professionals. The within-laying complexity and variety form the need for a united approach to gather, analyze, and utilize existing knowledge of medical treatments, and medical operations to provide the ability to present knowledge for the means of accurate patient-driven decision-making. One way to achieve this is the fusion of multiple knowledge sources in healthcare. It provides health professionals the opportunity to select from multiple contextual aligned knowledge sources which enables the support for critical decisions. This paper presents multiple conceptual models for knowledge fusion in the field of medicine, based on a knowledge graph structure. It will evaluate, how knowledge fusion can be enabled and presents how to integrate various knowledge sources into the knowledge graph for rescue operations.
zh

[AI-27] On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning

【速读】：该论文旨在解决持续学习（Continual Learning）中的灾难性遗忘（Catastrophic Forgetting）问题，其核心挑战在于深层神经网络在学习新任务时会严重损害旧任务的性能。作者通过理论与实证揭示：新任务训练过程本质上是对旧任务知识的一种隐式对抗攻击，因为新任务梯度会自动且精准地对齐旧任务损失曲面的尖锐方向（sharp directions），从而快速提升旧任务损失。这一现象看似反直觉，因其尖锐方向分布稀疏，难以偶然对齐；但论文指出，这是由训练过程的低秩偏置（low-rank bias）导致——该偏置通过前向和反向传播将新旧任务梯度限制在同一低维子空间中，促进对齐。针对此机制，现有梯度投影（Gradient Projection, GP）方法仅能缓解前向传播引起的对齐，无法应对反向传播带来的影响；为此，作者提出backGP方法，有效抑制了反向传播导致的对抗性对齐，平均减少遗忘10.8%，并提升准确率12.7%。

链接: https://arxiv.org/abs/2510.09181
作者: Ze Peng,Jian Zhang,Jintao Guo,Lei Qi,Yang Gao,Yinghuan Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual learning seeks the human-like ability to accumulate new skills in machine intelligence. Its central challenge is catastrophic forgetting, whose underlying cause has not been fully understood for deep networks. In this paper, we demystify catastrophic forgetting by revealing that the new-task training is implicitly an adversarial attack against the old-task knowledge. Specifically, the new-task gradients automatically and accurately align with the sharp directions of the old-task loss landscape, rapidly increasing the old-task loss. This adversarial alignment is intriguingly counter-intuitive because the sharp directions are too sparsely distributed to align with by chance. To understand it, we theoretically show that it arises from training’s low-rank bias, which, through forward and backward propagation, confines the two directions into the same low-dimensional subspace, facilitating alignment. Gradient projection (GP) methods, a representative family of forgetting-mitigating methods, reduce adversarial alignment caused by forward propagation, but cannot address the alignment due to backward propagation. We propose backGP to address it, which reduces forgetting by 10.8% and improves accuracy by 12.7% on average over GP methods.
zh

[AI-28] Dr. Bias: Social Disparities in AI-Powered Medical Guidance

【速读】：该论文试图解决的问题是：当前生成式 AI（Generative AI）在医疗问答场景中的应用日益广泛，但现有评估普遍忽视了健康服务的社会属性，未能深入分析模型输出是否存在因性别、年龄、种族等社会群体差异而产生的系统性偏差，进而可能导致不公平的患者支持。解决方案的关键在于通过构建多维度患者画像（涵盖性别、年龄范围和族裔）模拟临床问诊场景，对大型语言模型（Large Language Models, LLMs）生成的回答进行自然语言特征比较，揭示其在不同社会群体间响应质量的系统性差异——例如原住民和间性人群获得的建议可读性更低、结构更复杂，且这种差异在交叉身份群体中进一步加剧。研究呼吁提升AI素养，并要求开发者立即开展偏差识别与缓解机制的研究，以确保LLMs在医疗领域的公平性和可靠性。

链接: https://arxiv.org/abs/2510.09162
作者: Emma Kondrup,Anne Imouza
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, evaluations that fuel these motivations highly lack insights into the social nature of healthcare, oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and impact users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they generate responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for the urgent need for investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate to unjust patient support. Our code is publicly available on GitHub.
zh

[AI-29] Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction

【速读】：该论文旨在解决电子健康记录（Electronic Health Records, EHRs）在临床预测任务中最佳表示方法不明确的问题，尤其针对不同数据结构（如多变量时间序列、事件流和文本事件流）的建模效果缺乏系统比较。其解决方案的关键在于构建首个标准化的基准测试框架，统一数据预处理与评估流程，并在两个独立的临床场景（MIMIC-IV用于重症监护室任务，EHRSHOT用于长期随访任务）中系统评估多种代表性模型（包括Transformer、LSTM、MLP、CLMBR、基于计数的模型及8–20B参数的大语言模型）。研究发现，事件流模型在多数场景下表现最优，且预训练模型（如CLMBR）在小样本情况下更具效率，而特征选择策略需根据临床任务类型调整：在ICU任务中剔除稀疏特征可提升性能，而在纵向护理任务中则需保留这些特征以维持预测能力。这一结果为EHR表示方法的选择提供了基于临床背景和数据条件的实证依据。

链接: https://arxiv.org/abs/2510.09159
作者: Tianyi Chen,Mingcheng Zhu,Zhiyao Luo,Tingting Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer). For each paradigm, we evaluate appropriate modelling families–including Transformers, MLP, LSTMs and Retain for time-series, CLMBR and count-based models for event streams, 8-20B LLMs for textual streams–and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
zh

[AI-30] Federated Data Analytics for Cancer Immunotherapy: A Privacy-Preserving Collaborative Platform for Patient Management ALT

【速读】：该论文旨在解决当前连通健康（Connected Health）领域中数据孤岛、系统互操作性差以及隐私保护不足等问题，这些问题限制了个性化医疗决策和高效患者管理的实现。解决方案的关键在于构建一个跨照护连续体的协作式数字框架，通过联邦大数据分析（Federated Big Data Analytics）与人工智能（Artificial Intelligence, AI）技术整合多源异构健康数据，在保障隐私的前提下实现精准治疗推荐和不良事件预测。该框架在欧盟资助项目中通过敏捷系统开发生命周期（Agile System Development Lifecycle）验证，基于真实世界数据的试点研究显示其在关键指标上达到70%-90%的准确率，体现了其在提升临床决策支持能力方面的有效性。

链接: https://arxiv.org/abs/2510.09155
作者: Mira Raheem,Michael Papazoglou,Bernd Krämer,Neamat El-Tazi,Amal Elgammal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: This manuscript is currently under review at * ACM Transactions on Computing for Healthcare (HEALTH)*

点击查看摘要

Abstract:Connected health is a multidisciplinary approach focused on health management, prioritizing pa-tient needs in the creation of tools, services, and treatments. This paradigm ensures proactive and efficient care by facilitating the timely exchange of accurate patient information among all stake-holders in the care continuum. The rise of digital technologies and process innovations promises to enhance connected health by integrating various healthcare data sources. This integration aims to personalize care, predict health outcomes, and streamline patient management, though challeng-es remain, particularly in data architecture, application interoperability, and security. Data analytics can provide critical insights for informed decision-making and health co-creation, but solutions must prioritize end-users, including patients and healthcare professionals. This perspective was explored through an agile System Development Lifecycle in an EU-funded project aimed at developing an integrated AI-generated solution for managing cancer patients undergoing immunotherapy. This paper contributes with a collaborative digital framework integrating stakeholders across the care continuum, leveraging federated big data analytics and artificial intelligence for improved decision-making while ensuring privacy. Analytical capabilities, such as treatment recommendations and adverse event predictions, were validated using real-life data, achieving 70%-90% accuracy in a pilot study with the medical partners, demonstrating the framework’s effectiveness.
zh

[AI-31] Controlled Personalization in Legacy Media Online Services: A Case Study in News Recommendation

【速读】：该论文旨在解决传统新闻媒体在引入个性化推荐技术时面临的挑战，即如何在提升用户参与度的同时维持编辑价值观与内容多样性。其解决方案的关键在于“受控个性化”（controlled personalization），即通过将编辑策划内容与算法推荐内容相结合的方式，在保障新闻专业性的基础上实现适度的个性化推荐。实验证明，这种策略能够显著提高点击率、降低用户导航成本，并增强内容多样性与覆盖范围，同时减少流行度偏差，从而有效协调用户需求与编辑目标。

链接: https://arxiv.org/abs/2510.09136
作者: Marlene Holzleitner,Stephan Leitner,Hanna Lind Jorgensen,Christoph Schmitz,Jacob Welander,Dietmar Jannach
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Personalized news recommendations have become a standard feature of large news aggregation services, optimizing user engagement through automated content selection. In contrast, legacy news media often approach personalization cautiously, striving to balance technological innovation with core editorial values. As a result, online platforms of traditional news outlets typically combine editorially curated content with algorithmically selected articles - a strategy we term controlled personalization. In this industry paper, we evaluate the effectiveness of controlled personalization through an A/B test conducted on the website of a major Norwegian legacy news organization. Our findings indicate that even a modest level of personalization yields substantial benefits. Specifically, we observe that users exposed to personalized content demonstrate higher click-through rates and reduced navigation effort, suggesting improved discovery of relevant content. Moreover, our analysis reveals that controlled personalization contributes to greater content diversity and catalog coverage and in addition reduces popularity bias. Overall, our results suggest that controlled personalization can successfully align user needs with editorial goals, offering a viable path for legacy media to adopt personalization technologies while upholding journalistic values.
zh

[AI-32] PAC Reasoning : Controlling the Performance Loss for Efficient Reasoning

【速读】：该论文旨在解决大推理模型（Large Reasoning Models, LRMs）在部署过程中计算成本高昂的问题，尤其是在动态切换“思考模式”（thinking mode）与“非思考模式”（nonthinking mode）时易引入额外推理错误且缺乏性能损失的统计保障这一关键挑战。解决方案的核心在于提出了一种基于Probably Approximately Correct (PAC) 理论的推理机制，通过构建性能损失的上置信界（upper confidence bound），将其表示为不确定度得分（uncertainty score）的单调函数，并据此确定一个用于模式切换的阈值。该方法理论上保证了在分布无关条件下性能损失的有界性，同时实验表明其能够在控制用户指定性能损失容忍度的前提下显著节省计算资源。

链接: https://arxiv.org/abs/2510.09133
作者: Hao Zeng,Jianguo Huang,Bingyi Jing,Hongxin Wei,Bo An
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified performance loss tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.
zh

[AI-33] On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning

【速读】：该论文旨在解决差分隐私机器学习（Differentially Private Machine Learning, DPML）中群体隐私保护公平性不足的问题，即现有方法基于数据记录的平均隐私风险评估可能低估群体间隐私风险差异，且当前最坏情况隐私风险评估方法计算复杂度高、难以实用。解决方案的关键在于提出一种新颖的成员推断博弈（membership inference game），可高效审计数据记录的近似最坏情况隐私风险，从而更严格地衡量群体隐私风险差异；同时，受差分隐私审计中“金丝雀”机制启发，改进标准DP-SGD算法，引入自适应的群体特定梯度裁剪策略，有效降低不同群体间的隐私风险差异，提升DPML中的隐私保护公平性。

链接: https://arxiv.org/abs/2510.09114
作者: Zhi Yang,Changwu Huang,Ke Tang,Xin Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
zh

[AI-34] MemLoss: Enhancing Adversarial Training with Recycling Adversarial Examples

【速读】：该论文旨在解决对抗训练（Adversarial Training）中模型在提升对抗鲁棒性的同时往往牺牲自然准确率（Natural Accuracy）的问题。解决方案的关键在于提出一种名为MemLoss的新方法，该方法利用先前生成的对抗样本（称为“记忆对抗样本”，Memory Adversarial Examples）在多个训练轮次中进行复用，从而在不损害干净数据性能的前提下，协同优化模型的自然准确性和对抗鲁棒性。实验表明，该方法在CIFAR-10等多个数据集上均优于现有对抗训练方法。

链接: https://arxiv.org/abs/2510.09105
作者: Soroush Mahdi,Maryam Amirmazlaghani,Saeed Saravani,Zahra Dehghanian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages

点击查看摘要

Abstract:In this paper, we propose a new approach called MemLoss to improve the adversarial training of machine learning models. MemLoss leverages previously generated adversarial examples, referred to as ‘Memory Adversarial Examples,’ to enhance model robustness and accuracy without compromising performance on clean data. By using these examples across training epochs, MemLoss provides a balanced improvement in both natural accuracy and adversarial robustness. Experimental results on multiple datasets, including CIFAR-10, demonstrate that our method achieves better accuracy compared to existing adversarial training methods while maintaining strong robustness against attacks.
zh

[AI-35] When a Robot is More Capable than a Human: Learning from Constrained Demonstrators

【速读】：该论文旨在解决受限专家演示导致机器人学习策略次优的问题，即当专家通过受控接口（如操纵杆或示教）进行示范时，由于控制维度受限或硬件安全约束，无法充分展示最优行为轨迹，从而影响最终策略性能。解决方案的关键在于不直接模仿专家动作，而是利用演示数据推断仅依赖状态的奖励信号（state-only reward signal），以衡量任务进展，并通过时间插值对未知状态进行自监督奖励标注（self-label reward for unknown states），从而引导代理探索更短、更高效的路径。此方法显著提升了样本效率和任务完成速度，在真实WidowX机械臂上实现了比行为克隆快10倍的任务执行速度（12秒内完成）。

链接: https://arxiv.org/abs/2510.09096
作者: Xinhu Li,Ayush Jain,Zhaojing Yang,Yigit Korkmaz,Erdem Bıyık
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert’s ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 12 seconds, 10x faster than behavioral cloning, as shown in real-robot videos on this https URL .
zh

[AI-36] AI and Human Oversight: A Risk-Based Framework for Alignment

【速读】：该论文旨在解决人工智能（Artificial Intelligence, AI）系统在快速发展过程中如何保障人类自主性（human agency）和促进伦理决策的问题，以增强公众对AI技术的信任与问责机制。其解决方案的关键在于设计能够嵌入有效人类监督机制的AI系统，并提出基于风险的框架，将AI模型的风险等级与相应的监督模式相匹配，具体包括“人在命令中”（Human-in-Command, HIC）、“人在环内”（Human-in-the-Loop, HITL）和“人在回路外”（Human-on-the-Loop, HOTL）三种模型，从而在推动技术创新的同时，确保个体权利和价值观得到保护，实现负责任的AI部署。

链接: https://arxiv.org/abs/2510.09090
作者: Laxmiraju Kandikatla,Branislav Radeljic
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:As Artificial Intelligence (AI) technologies continue to advance, protecting human autonomy and promoting ethical decision-making are essential to fostering trust and accountability. Human agency (the capacity of individuals to make informed decisions) should be actively preserved and reinforced by AI systems. This paper examines strategies for designing AI systems that uphold fundamental rights, strengthen human agency, and embed effective human oversight mechanisms. It discusses key oversight models, including Human-in-Command (HIC), Human-in-the-Loop (HITL), and Human-on-the-Loop (HOTL), and proposes a risk-based framework to guide the implementation of these mechanisms. By linking the level of AI model risk to the appropriate form of human oversight, the paper underscores the critical role of human involvement in the responsible deployment of AI, balancing technological innovation with the protection of individual values and rights. In doing so, it aims to ensure that AI technologies are used responsibly, safeguarding individual autonomy while maximizing societal benefits.
zh

[AI-37] Leading the Follower: Learning Persuasive Agents in Social Deduction Games

【速读】：该论文旨在解决当前大型语言模型（Large Language Model, LLM）代理在社会推理游戏（Social Deduction Games, SDGs）中忽视说服性沟通问题的局限性，即现有方法主要关注信息处理与策略选择，而未充分考虑通过话语影响其他玩家信念和响应的重要性。其解决方案的关键在于将回合制对话形式化为Stackelberg博弈，其中当前玩家作为领导者，通过策略性地优化发言内容以最大化对跟随者（其他玩家）行为的影响；在此理论框架基础上，提出了一种强化学习训练机制，使代理能够学习生成具有高说服力的语句，从而显著提升在多种SDG场景下的表现。

链接: https://arxiv.org/abs/2510.09087
作者: Zhang Zheng,Deheng Ye,Peilin Zhao,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) agents have shown remarkable progress in social deduction games (SDGs). However, existing approaches primarily focus on information processing and strategy selection, overlooking the significance of persuasive communication in influencing other players’ beliefs and responses. In SDGs, success depends not only on making correct deductions but on convincing others to response in alignment with one’s intent. To address this limitation, we formalize turn-based dialogue in SDGs as a Stackelberg competition, where the current player acts as the leader who strategically influences the follower’s response. Building on this theoretical foundation, we propose a reinforcement learning framework that trains agents to optimize utterances for persuasive impact. Through comprehensive experiments across three diverse SDGs, we demonstrate that our agents significantly outperform baselines. This work represents a significant step toward developing AI agents capable of strategic social influence, with implications extending to scenarios requiring persuasive communication.
zh

[AI-38] Physics-Informed High-order Graph Dynamics Identification Learning for Predicting Complex Networks Long-term Dynamics

【速读】：该论文旨在解决复杂网络动态预测中的两大核心问题：一是传统图神经网络（Graph Neural Networks, GNNs）仅能建模节点间的成对关系，难以捕捉复杂网络中广泛存在的高阶非成对结构关系，导致模型精度受限；二是现有理论模型预测精度不足，而数据驱动模型缺乏物理可解释性。解决方案的关键在于提出一种基于高阶网络动力学识别的方法，通过引入动态超图学习（dynamic hypergraph learning）以显式建模高阶非成对关系，提升模型表达能力；同时设计双驱动动态预测模块，结合Koopman算子理论将非线性微分方程转化为线性系统求解，并融合物理信息神经微分方程（Physics-Informed Neural Differential Equations, PINNs）确保预测过程符合物理规律，从而在保证高精度的同时实现良好的可解释性。

链接: https://arxiv.org/abs/2510.09082
作者: Bicheng Wang,Jinping Wang,Yibo Sue
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Learning complex network dynamics is fundamental to understanding, modelling and controlling real-world complex systems. There are two main problems in the task of predicting the dynamic evolution of complex networks: on the one hand, existing methods usually use simple graphs to describe the relationships in complex networks; however, this approach can only capture pairwise relationships, while there may be rich non-pairwise structured relationships in the network. First-order GNNs have difficulty in capturing dynamic non-pairwise relationships. On the other hand, theoretical prediction models lack accuracy and data-driven prediction models lack interpretability. To address the above problems, this paper proposes a higher-order network dynamics identification method for long-term dynamic prediction of complex networks. Firstly, to address the problem that traditional graph machine learning can only deal with pairwise relations, dynamic hypergraph learning is introduced to capture the higher-order non-pairwise relations among complex networks and improve the accuracy of complex network modelling. Then, a dual-driven dynamic prediction module for physical data is proposed. The Koopman operator theory is introduced to transform the nonlinear dynamical differential equations for the dynamic evolution of complex networks into linear systems for solving. Meanwhile, the physical information neural differential equation method is utilised to ensure that the dynamic evolution conforms to the physical laws. The dual-drive dynamic prediction module ensures both accuracy and interpretability of the prediction. Validated on public datasets and self-built industrial chain network datasets, the experimental results show that the method in this paper has good prediction accuracy and long-term prediction performance.
zh

[AI-39] raining Models to Detect Successive Robot Errors from Human Reactions

【速读】：该论文旨在解决机器人在社会场景中频繁发生交互失败时，如何通过识别人类反应来判断其行为是否需要调整的问题。核心挑战在于，当机器人连续出错时，人类的反馈会随失败次数增加而逐步演化（如从困惑、言语变化到明显 frustration），但现有研究缺乏对这种动态变化的系统建模。解决方案的关键在于利用机器学习方法，从视频数据中提取人类行为特征（如面部表情、肢体动作等），构建针对个体用户的分类模型，以识别不同阶段的机器人失败状态；实验表明，该方法在错误检测上达到93.5%准确率，在失败阶段分类上达到84.1%准确率，从而提升了人机交互（HRI）中的错误感知与理解能力。

链接: https://arxiv.org/abs/2510.09080
作者: Shannon Liu,Maria Teresa Parreira,Wendy Ju
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to NERC '25

点击查看摘要

Abstract:As robots become more integrated into society, detecting robot errors is essential for effective human-robot interaction (HRI). When a robot fails repeatedly, how can it know when to change its behavior? Humans naturally respond to robot errors through verbal and nonverbal cues that intensify over successive failures-from confusion and subtle speech changes to visible frustration and impatience. While prior work shows that human reactions can indicate robot failures, few studies examine how these evolving responses reveal successive failures. This research uses machine learning to recognize stages of robot failure from human reactions. In a study with 26 participants interacting with a robot that made repeated conversational errors, behavioral features were extracted from video data to train models for individual users. The best model achieved 93.5% accuracy for detecting errors and 84.1% for classifying successive failures. Modeling the progression of human reactions enhances error detection and understanding of repeated interaction breakdowns in HRI.
zh

[AI-40] Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition

【速读】：该论文旨在解决真实场景下语音情感识别（Speech Emotion Recognition, SER）模型因噪声环境和跨数据集差异导致的鲁棒性与泛化能力不足的问题。解决方案的关键在于提出一种两阶段的表示学习方法：首先通过情绪解耦表示学习（Emotion-Disentangled Representation Learning, EDRL）提取类别特异性判别特征，同时保留不同情感类别间的共享相似性；随后利用多块嵌入对齐（Multiblock Embedding Alignment, MEA）将这些特征投影到一个联合判别潜在子空间中，以最大化其与原始语音输入的协方差。该方法有效提升了模型在未见噪声数据和跨语料库测试中的性能表现。

链接: https://arxiv.org/abs/2510.09072
作者: Upasana Tiwari,Rupayan Chakraborty,Sunil Kumar Kopparapu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 13 pages, 1 figure

点击查看摘要

Abstract:Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
zh

[AI-41] MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction

【速读】：该论文旨在解决源代码时间复杂度预测中的准确性问题，特别是针对大语言模型（LLM）在不同复杂度类别上表现不均衡的现象。现有研究表明，未经微调的LLM在某些复杂度类别的预测任务中存在显著局限性，即没有单一模型能全面胜任所有类别。为此，作者提出MEC³O（Multi-Expert Consensus System），其核心创新在于构建一个基于性能分配的多专家共识机制：首先根据各LLM在特定复杂度类别的历史表现将其指派为该类别的“专家”，并提供类专属指令以增强专业化能力；随后通过结构化辩论框架让这些专家进行交互式讨论，并采用加权共识策略融合其预测结果。该方案有效缓解了思维退化（Degeneration-of-Thought）问题，避免依赖额外裁判模型且防止错误多数意见的收敛，实验表明MEC³O在CodeComplex基准上显著优于开源基线（至少提升10%准确率和宏F1），并在平均宏F1上超越GPT-4o-mini，与GPT-4o及GPT-o4-mini持平，验证了多专家辩论与加权共识策略的有效性。

链接: https://arxiv.org/abs/2510.09049
作者: Joonghyuk Hahn,Soohan Lim,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 24 pages, 11 figures, 10 tables

点击查看摘要

Abstract:Predicting the complexity of source code is essential for software development and algorithm analysis. Recently, Baik et al. (2025) introduced CodeComplex for code time complexity prediction. The paper shows that LLMs without fine-tuning struggle with certain complexity classes. This suggests that no single LLM excels at every class, but rather each model shows advantages in certain classes. We propose MEC ^3 O, a multi-expert consensus system, which extends the multi-agent debate frameworks. MEC ^3 O assigns LLMs to complexity classes based on their performance and provides them with class-specialized instructions, turning them into experts. These experts engage in structured debates, and their predictions are integrated through a weighted consensus mechanism. Our expertise assignments to LLMs effectively handle Degeneration-of-Thought, reducing reliance on a separate judge model, and preventing convergence to incorrect majority opinions. Experiments on CodeComplex show that MEC ^3 O outperforms the open-source baselines, achieving at least 10% higher accuracy and macro-F1 scores. It also surpasses GPT-4o-mini in macro-F1 scores on average and demonstrates competitive on-par F1 scores to GPT-4o and GPT-o4-mini on average. This demonstrates the effectiveness of multi-expert debates and weight consensus strategy to generate the final predictions. Our code and data is available at this https URL.
zh

[AI-42] Cost-Efficient Long Code Translation using LLM s while Leverag ing Identifier Replacements

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在处理长源代码时因超出上下文窗口（context window）限制而导致翻译不准确的问题。解决方案的关键在于提出一种零样本（zero-shot）代码翻译方法，通过将用户提供的长标识符（identifier）替换为通用占位符（placeholder），使LLM能够聚焦于代码的逻辑结构，从而减少token数量和内存消耗，提升长代码翻译的效率与成本效益，同时有效保留语法和层次信息。

链接: https://arxiv.org/abs/2510.09045
作者: Manojit Chakraborty,Madhusudan Ghosh,Rishabh Gupta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don’t fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
zh

[AI-43] Humanoid Artificial Consciousness Designed with Large Language Model Based on Psychoanalysis and Personality Theory

【速读】：该论文试图解决当前人工智能系统缺乏人类意识特征的问题，尤其是如何在生成式 AI 中模拟具备自我认知、情感理解与人格特质的类人意识。其解决方案的关键在于将精神分析理论（psychoanalysis）与迈尔斯-布里格斯类型指标（MBTI）相结合，构建包含自知觉、无意识和前意识三个模块的人工意识框架，并设计16种具有不同人格类型的虚拟角色，每种角色均具备需求、地位和记忆等属性。通过多维度评估（问卷调查、ChatGPT三阶分类及质性分析），验证了该模型在复杂情境下展现出类人决策行为的可能性，从而为发展更具适应性和人性化交互能力的AI系统提供了新路径。

链接: https://arxiv.org/abs/2510.09043
作者: Sang Hun Kim,Jongmin Lee,Dongkyu Park,So Young Lee,Yosep Chong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 41 pages, 6 figures. Accepted and published to Cognitive Systems Research, 2025

点击查看摘要

Abstract:Human consciousness is still a concept hard to define with current scientific understanding. Although Large Language Models (LLMs) have recently demonstrated significant advancements across various domains including translation and summarization, human consciousness is not something to imitate with current upfront technology owing to so-called hallucination. This study, therefore, proposes a novel approach to address these challenges by integrating psychoanalysis and the Myers-Briggs Type Indicator (MBTI) into constructing consciousness and personality modules. We developed three artificial consciousnesses (self-awareness, unconsciousness, and preconsciousness) based on the principles of psychoanalysis. Additionally, we designed 16 characters with different personalities representing the sixteen MBTI types, with several attributes such as needs, status, and memories. To determine if our model’s artificial consciousness exhibits human-like cognition, we created ten distinct situations considering seven attributes such as emotional understanding and logical thinking. The decision-making process of artificial consciousness and the final action were evaluated in three ways: survey evaluation, three-tier classification via ChatGPT, and qualitative review. Both quantitative and qualitative analyses indicated a high likelihood of well-simulated consciousness, although the difference in response between different characters and consciousnesses was not very significant. This implies that the developed models incorporating elements of psychoanalysis and personality theory can lead to building a more intuitive and adaptable AI system with humanoid consciousness. Therefore, this study contributes to opening up new avenues for improving AI interactions in complex cognitive contexts.
zh

[AI-44] Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

【速读】：该论文旨在解决深度强化学习（Deep Reinforcement Learning, DRL）在自动驾驶应用中对对抗攻击的脆弱性问题，尤其针对现有鲁棒方法存在的三大缺陷：（i）训练时仅考虑短视对抗攻击，难以应对策略性威胁；（ii）难以引发真正安全关键事件（如碰撞），常导致轻微后果；（iii）缺乏鲁棒约束导致训练过程不稳定及策略漂移。其解决方案的关键在于提出智能一般和约束对抗强化学习（Intelligent General-sum Constrained Adversarial Reinforcement Learning, IGCARL），该方法包含一个具备多步协同攻击能力的战略目标对抗者（strategic targeted adversary），通过一般和目标函数显式诱导安全关键事件，并引入约束优化框架下的鲁棒驾驶代理（robust driving agent），使其在与对抗者交互中稳定学习并生成抗干扰能力强的自动驾驶策略，从而显著提升对抗攻击下的成功率（至少提高27.9%），增强DRL驱动自动驾驶系统的安全性与可靠性。

链接: https://arxiv.org/abs/2510.09041
作者: Junchao Fan,Xiaolin Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
zh

[AI-45] Repairing Regex Vulnerabilities via Localization-Guided Instructions

【速读】：该论文旨在解决正则表达式拒绝服务（Regular Expression Denial of Service, ReDoS）漏洞的自动化修复问题，此类漏洞因正则表达式在输入验证和数据解析中的广泛应用而普遍存在。现有方法存在显著局限：基于符号规则的方法虽精确但难以处理未见过或复杂的漏洞模式；大型语言模型（Large Language Models, LLMs）虽具备泛化能力，却难以保证修复结果的语法与语义正确性。论文提出了一种混合框架——局部正则表达式修复（Localized Regex Repair, LRR），其关键在于将漏洞定位与修复过程解耦：首先由确定性的符号模块精准识别出脆弱子模式，从而构建一个约束且可处理的问题空间；随后调用LLM针对该隔离片段生成语义等价的修复方案。该设计既利用了LLM的泛化能力，又通过符号模块保障了修复的可靠性，最终在复杂修复场景中实现了优于现有方法15.4个百分点的修复率提升。

链接: https://arxiv.org/abs/2510.09037
作者: Sicheol Sung,Joonghyuk Hahn,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 14 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Regular expressions (regexes) are foundational to modern computing for critical tasks like input validation and data parsing, yet their ubiquity exposes systems to regular expression denial of service (ReDoS), a vulnerability requiring automated repair methods. Current approaches, however, are hampered by a trade-off. Symbolic, rule-based system are precise but fails to repair unseen or complex vulnerability patterns. Conversely, large language models (LLMs) possess the necessary generalizability but are unreliable for tasks demanding strict syntactic and semantic correctness. We resolve this impasse by introducing a hybrid framework, localized regex repair (LRR), designed to harness LLM generalization while enforcing reliability. Our core insight is to decouple problem identification from the repair process. First, a deterministic, symbolic module localizes the precise vulnerable subpattern, creating a constrained and tractable problem space. Then, the LLM invoked to generate a semantically equivalent fix for this isolated segment. This combined architecture successfully resolves complex repair cases intractable for rule-based repair while avoiding the semantic errors of LLM-only approaches. Our work provides a validated methodology for solving such problems in automated repair, improving the repair rate by 15.4%p over the state-of-the-art. Our code is available at this https URL.
zh

[AI-46] Déréverbération non-supervisée de la parole par modèle hybride

【速读】：该论文旨在解决语音去混响（speech dereverberation）系统在无监督条件下训练时缺乏配对干净/混响语音数据的问题。现有方法通常依赖于大量标注的干声（dry）与混响语音（reverberant speech）配对数据，而这类数据难以获取。解决方案的关键在于利用有限的声学信息（如混响时间 RT60）作为先验知识，在仅使用混响语音的前提下训练去混响模型，从而实现更稳定且一致的性能表现，优于当前最先进的方法。

链接: https://arxiv.org/abs/2510.09025
作者: Louis Bahrman(IDS, S2A),Mathieu Fontaine(IDS, S2A),Gaël Richard(IDS, S2A)
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: in French language

点击查看摘要

Abstract:This paper introduces a new training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired dry/reverberant data, which is difficult to obtain. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics than the state-of-the-art.
zh

[AI-47] RefGrader: Automated Grading of Mathematical Competition Proofs using Agent ic Workflows

【速读】：该论文旨在解决当前大语言模型（Large Language Models, LLMs）在自动评分数学证明题时存在的关键问题：即虽然模型能够识别出证明中的错误（包括细微错误），但在分配部分分数（partial credit）方面存在校准偏差（calibration gaps），难以实现与人类评阅者一致的评分一致性。解决方案的关键在于引入代理式（agentic）工作流，通过提取和分析参考解答（reference solutions）来自动生成针对每道题目的具体评分标准（rubric），并据此进行多步骤的评分流程，从而显著提升模型对部分得分的判断准确性与一致性。实验表明，该方法在自建标注语料库和MathArena竞赛数据集上均实现了更高的人机评分一致性。

链接: https://arxiv.org/abs/2510.09021
作者: Hamed Mahdavi(1),Pouria Mahdavinia(1),Samira Malek(1),Pegah Mohammadipour(1),Alireza Hashemi(2),Majid Daliri(3),Alireza Farhadi(4),Amir Khasahmadi(5),Niloofar Mireshghallah(6),Vasant Honavar(1) ((1) Pennsylvania State University, (2) City University of New York, (3) New York University, (4) Amirkabir University of Technology, (5) Autodesk, (6) Carnegie Mellon University)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
zh

[AI-48] Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

【速读】：该论文旨在解决基于Transformer架构的大模型中存在的极端令牌现象（extreme-token phenomena），如注意力黑洞（attention sinks）和值状态漏失（value-state drains），这些问题会降低模型性能、量化保真度和可解释性。其核心问题是模型通过一种有害的相互强化机制，学习到一种低效的“无操作”（no-op）行为，即过度关注值状态接近零的令牌。解决方案的关键在于提出一种名为值状态门控注意力（Value-State Gated Attention, VGA）的新机制，该机制通过直接从值向量（V）计算一个可学习的数据依赖门控函数来调节输出，从而打破上述循环。理论分析表明，基于值状态本身的门控比基于输入嵌入的门控更有效地解耦值和注意力分数的更新，形成一条直接的调控路径，使模型能够根据令牌的自适应值表示抑制其贡献，显著缓解注意力黑洞并稳定值状态范数，提升模型性能与可解释性。

链接: https://arxiv.org/abs/2510.09017
作者: Rui Bu,Haofeng Zhong,Wenzheng Chen,Yangyan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient ‘no-op’ behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing ‘no-op’ attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token’s contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
zh

[AI-49] DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

【速读】：该论文旨在解决基于扩散模型的歌声合成（Singing Voice Synthesis, SVS）在数据稀缺和模型可扩展性方面的局限性。其关键解决方案在于提出一个两阶段管道：首先构建由固定旋律与大语言模型（Large Language Model, LLM）生成歌词配对而成的紧凑种子语料库，进而训练旋律特定的模型以合成超过500小时高质量中文歌声数据；在此基础上，设计了DiTSinger——一种采用RoPE（Rotary Position Embedding）和qk-norm的扩散Transformer架构，并通过深度、宽度与分辨率的系统性扩展提升音质保真度；同时引入隐式对齐机制，在字符级跨度内约束音素到声学特征的注意力，从而无需音素级时长标注即可实现鲁棒的合成效果。

链接: https://arxiv.org/abs/2510.09016
作者: Zongcai Du,Guilin Deng,Xiaofeng Guo,Xin Gao,Linke Li,Kaichang Cheng,Fubo Han,Siyu Yang,Peng Liu,Pan Zhong,Qiang Fu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: under review

点击查看摘要

Abstract:Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
zh

[AI-50] ripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在旅行规划任务中面临的可行性、可靠性与用户参与度评估不足的问题。现有基准测试虽能部分衡量LLMs的规划能力，但难以全面反映实际旅行计划的质量。其解决方案的关键在于构建一个统一的奖励机制，将细粒度的评价标准（如行程合理性、执行可靠性及用户吸引力）整合为单一评分指标，从而支持直接比较不同方法生成的旅行计划质量，并可无缝集成到强化学习（Reinforcement Learning, RL）框架中进行优化。该方法通过引入由旅行专家标注的数据集和大规模真实查询语料（4,870条请求），验证了基于GRPO（Generalized Reward Policy Optimization）的强化学习策略在提升行程可行性方面显著优于仅靠提示工程或监督微调的方法。

链接: https://arxiv.org/abs/2510.09011
作者: Yincen Qu,Huan Xiao,Feng Li,Hui Zhou,Xiangying Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
zh

[AI-51] SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

【速读】：该论文旨在解决大规模神经网络在资源受限设备上部署时的压缩效率问题，现有方法通常单独采用权重重剪枝或低比特量化，难以在保持模型性能的同时实现高压缩率。其解决方案的关键在于提出一种基于贝叶斯变分学习的统一框架（SQS），通过引入spike-and-slab先验诱导稀疏性，并利用高斯混合模型（GMM）对量化权重进行建模，从而实现剪枝与低比特量化的协同优化，在理论上保证了稀疏且量化后的深度神经网络的一致性，并在ResNet、BERT-base、Llama3和Qwen2.5等模型上验证了其优于现有方法的压缩率与性能平衡。

链接: https://arxiv.org/abs/2510.08999
作者: Ziyi Wang,Nan Jiang,Guang Lin,Qifan Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to inducing sparsity and model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide the consistent result of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.
zh

[AI-52] Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

【速读】：该论文旨在解决当前软件工程智能体（Software Engineering Agents）评估基准与实际开发者使用场景之间存在的偏差问题，特别是现有基准如SWE-Bench Verified主要基于GitHub issue描述，未能反映开发者在集成开发环境（IDE）中与基于聊天的代码辅助工具（chat-based coding assistants）的真实交互模式，导致对模型能力的系统性高估，尤其是在缺陷修复任务上。解决方案的关键在于提出一种新颖的基准框架，通过分析真实用户与聊天式代码助手的交互日志（telemetry data），将传统形式化的任务描述转化为符合实际用户习惯的查询语句，并以此对现有基准进行“变异”（benchmark mutation），从而更准确地衡量智能体在真实交互情境下的性能表现。该方法具有灵活性和可扩展性，已成功应用于多个公开及私有基准，揭示了原有评估指标存在显著高估现象（公共基准提升50%，内部基准提升10–16%）。

链接: https://arxiv.org/abs/2510.08996
作者: Spandan Garg,Ben Steenhoek,Yufan Huang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agent’s capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench and a private benchmark, SWE-Bench C# and transform formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of a popular chat-based agent interactions. Our findings reveal that existing benchmarks significantly overestimate agent capabilities for some models by 50% over baseline performance for public benchmarks and ~10-16% for our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
zh

[AI-53] PlatformX: An End-to-End Transferable Platform for Energy-Efficient Neural Architecture Search

【速读】：该论文旨在解决硬件感知神经架构搜索（Hardware-Aware Neural Architecture Search, HW-NAS）在实际部署中面临的三大瓶颈问题：高时间成本、大量人工配置与调试，以及跨异构硬件平台的可扩展性差。其解决方案的关键在于提出一个完全自动化且具备迁移能力的HW-NAS框架PlatformX，核心创新包括：(i) 基于能量驱动的搜索空间设计，引入能耗敏感的配置维度以探索高能效架构；(ii) 跨设备可迁移的核级能量预测器，仅需少量设备端样本即可增量优化；(iii) 基于帕累托前沿的多目标搜索算法，在准确率与能耗之间实现最优权衡；(iv) 高分辨率运行时能耗采样系统，利用外部监测设备自动完成设备端功耗测量，无需人工干预。该方案显著降低了搜索开销，同时保持模型精度和能耗建模的真实性，在多个移动平台上验证了优于MobileNet-V2的性能表现。

链接: https://arxiv.org/abs/2510.08993
作者: Xiaolong Tu,Dawei Chen,Kyungtae Han,Onur Altintas,Haoxin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hardware-Aware Neural Architecture Search (HW-NAS) has emerged as a powerful tool for designing efficient deep neural networks (DNNs) tailored to edge devices. However, existing methods remain largely impractical for real-world deployment due to their high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex, device-specific energy behavior. In this paper, we present PlatformX, a fully automated and transferable HW-NAS framework designed to overcome these limitations. PlatformX integrates four key components: (i) an energy-driven search space that expands conventional NAS design by incorporating energy-critical configurations, enabling exploration of high-efficiency architectures; (ii) a transferable kernel-level energy predictor across devices and incrementally refined with minimal on-device samples; (iii) a Pareto-based multi-objective search algorithm that balances energy and accuracy to identify optimal trade-offs; and (iv) a high-resolution runtime energy profiling system that automates on-device power measurement using external monitors without human intervention. We evaluate PlatformX across multiple mobile platforms, showing that it significantly reduces search overhead while preserving accuracy and energy fidelity. It identifies models with up to 0.94 accuracy or as little as 0.16 mJ per inference, both outperforming MobileNet-V2 in accuracy and efficiency. Code and tutorials are available at this http URL.
zh

[AI-54] ny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging

【速读】：该论文旨在解决轻量级多模态大语言模型（Multimodal Large Language Models, MLLMs）在推理效率与准确性方面的挑战，尤其是在模型规模小、易出现过度思考（overthinking）以及在资源受限场景下性能下降的问题。解决方案的关键在于提出一种两阶段优化框架：第一阶段引入长度感知的相对策略优化（Length-Informed Relative Policy Optimization, LIPO），通过动态调整组内响应的优势值，优先生成简洁且高质量的回答，从而提升推理效率和准确性；第二阶段提出无需训练的自适应模型融合（Adaptive Model Merging, AMM）方法，通过梯度投影正则化损失函数自适应调整任务向量权重，有效缓解不同专家模型间冗余冲突，实现多任务统一架构下的高效协同推理。该方法显著提升了轻量级模型在数学、结构化数据、OCR及通用能力等十项基准上的表现，验证了其在多样化多模态推理任务中的优越性。

链接: https://arxiv.org/abs/2510.08987
作者: Qixiang Yin,Huanjin Yao,Jianghao Chen,Jiaxing Huang,Zhicheng Zhao,Fei Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Technical report, Code will be available at this https URL

点击查看摘要

Abstract:Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter numerous challenges in terms of reasoning efficiency, such as large model size, overthinking, and compromised accuracy in lightweight scenarios. However, research on the reasoning capabilities of lightweight MLLMs is quite lacking. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks and using fewer tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a novel reinforcement learning method, to train each reasoning model. The LIPO is designed to dynamically adjusts advantages of responses within groups, that is, by prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate response. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors and robustly optimizes the merged vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks.
zh

[AI-55] SEER: Sustainability Enhanced Engineering of Software Requirements

【速读】：该论文旨在解决软件开发过程中可持续性评估滞后的问题，即现有方法多停留在设计或实现阶段，缺乏在需求工程（Requirements Engineering）早期识别和优化可持续性要求（Sustainability Requirements, SRs）的能力。解决方案的关键在于提出SEER框架，该框架通过三个阶段实现：(i) 从通用分类体系中识别与特定软件产品相关的SRs；(ii) 基于识别出的SRs评估系统需求的可持续性水平；(iii) 对不满足任何SR的系统需求进行优化。该框架利用大语言模型（Large Language Models, LLMs）的推理能力和基于检索增强生成（Retrieval Augmented Generation, RAG）的智能体机制实现自动化处理，已在四个不同领域的软件项目中验证其有效性，证明其能准确识别跨领域广泛的可持续性关切。

链接: https://arxiv.org/abs/2510.08981
作者: Mandira Roy,Novarun Deb,Nabendu Chaki,Agostino Cortesi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Main Paper: 32 pages, References: 3 pages, Appendix: 13 pages. Submitted to the Journal of Systems and Software, Elsevier

点击查看摘要

Abstract:The rapid expansion of software development has significant environmental, technical, social, and economic impacts. Achieving the United Nations Sustainable Development Goals by 2030 compels developers to adopt sustainable practices. Existing methods mostly offer high-level guidelines, which are time-consuming to implement and rely on team adaptability. Moreover, they focus on design or implementation, while sustainability assessment should start at the requirements engineering phase. In this paper, we introduce SEER, a framework which addresses sustainability concerns in the early software development phase. The framework operates in three stages: (i) it identifies sustainability requirements (SRs) relevant to a specific software product from a general taxonomy; (ii) it evaluates how sustainable system requirements are based on the identified SRs; and (iii) it optimizes system requirements that fail to satisfy any SR. The framework is implemented using the reasoning capabilities of large language models and the agentic RAG (Retrieval Augmented Generation) approach. SEER has been experimented on four software projects from different domains. Results generated using Gemini 2.5 reasoning model demonstrate the effectiveness of the proposed approach in accurately identifying a broad range of sustainability concerns across diverse domains.
zh

[AI-56] Learning Regularizers: Learning Optimizers that can Regularize

【速读】：该论文试图解决的问题是：传统优化方法中依赖显式正则化技术（如SAM、GAM、GSAM）来提升模型泛化能力和收敛性，是否存在一种机制可以让优化器自身“学习”并内化这些正则化效果，从而无需在目标函数中显式引入正则项。解决方案的关键在于提出并验证了学习型优化器（Learned Optimizers, LOs）能够通过训练自动捕获传统正则化策略的效果——即在不显式应用正则化的情况下，LOs仍能学习到类似SAM等方法所引导的低复杂度极小值区域，并在多个基准任务（MNIST、FMNIST、CIFAR及MLP、CNN等网络结构）上展现出更强的测试准确率和泛化性能。这一发现挑战了显式优化目标正则化的必要性，揭示了LOs具备内在的正则化能力。

链接: https://arxiv.org/abs/2510.08968
作者: Suraj Kumar Sahoo,Narayanan C Krishnan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learned Optimizers (LOs), a type of Meta-learning, have gained traction due to their ability to be parameterized and trained for efficient optimization. Traditional gradient-based methods incorporate explicit regularization techniques such as Sharpness-Aware Minimization (SAM), Gradient-norm Aware Minimization (GAM), and Gap-guided Sharpness-Aware Minimization (GSAM) to enhance generalization and convergence. In this work, we explore a fundamental question: \textbfCan regularizers be learned? We empirically demonstrate that LOs can be trained to learn and internalize the effects of traditional regularization techniques without explicitly applying them to the objective function. We validate this through extensive experiments on standard benchmarks (including MNIST, FMNIST, CIFAR and Neural Networks such as MLP, MLP-Relu and CNN), comparing LOs trained with and without access to explicit regularizers. Regularized LOs consistently outperform their unregularized counterparts in terms of test accuracy and generalization. Furthermore, we show that LOs retain and transfer these regularization effects to new optimization tasks by inherently seeking minima similar to those targeted by these regularizers. Our results suggest that LOs can inherently learn regularization properties, \textitchallenging the conventional necessity of explicit optimizee loss regularization.
zh

[AI-57] Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation

【速读】：该论文旨在解决人工智能（AI）在低资源数据场景下实现鲁棒泛化能力的问题，即如何在标注数据和计算资源有限的情况下提升模型性能。其核心解决方案基于PAC（Probably Approximately Correct）框架中的无监督主动采样理论，从理论上分析了低资源学习中的泛化误差与标签复杂度关系，并据此提出四类优化策略：梯度感知优化、元迭代优化、几何感知优化以及大语言模型（LLMs）驱动的优化；同时探讨了领域迁移、强化反馈和层次结构建模等多类学习范式对低资源数据的有效利用路径。

链接: https://arxiv.org/abs/2510.08962
作者: Xiaofeng Cao,Mingwei Xu,Xin Yu,Jiangchao Yao,Wei Ye,Shengjun Huang,Minling Zhang,Ivor W. Tsang,Yew Soon Ong,James T. Kwok,Heng Tao Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACM Computing Surveys

点击查看摘要

Abstract:Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI); however, the costs associated with data annotation and model training remain significant. A fundamental objective of AI research is to achieve robust generalization with limited-resource data. This survey employs agnostic active sampling theory within the Probably Approximately Correct (PAC) framework to analyze the generalization error and label complexity associated with learning from low-resource data in both model-agnostic supervised and unsupervised settings. Based on this analysis, we investigate a suite of optimization strategies tailored for low-resource data learning, including gradient-informed optimization, meta-iteration optimization, geometry-aware optimization, and LLMs-powered optimization. Furthermore, we provide a comprehensive overview of multiple learning paradigms that can benefit from low-resource data, including domain transfer, reinforcement feedback, and hierarchical structure modeling. Finally, we conclude our analysis and investigation by summarizing the key findings and highlighting their implications for learning with low-resource data.
zh

[AI-58] DualResearch: Entropy-Gated Dual-Graph Retrieval for Answer Reconstruction

【速读】：该论文旨在解决深度研究框架（deep-research framework）在执行复杂多步骤科学推理时存在的三大问题：上下文污染（context pollution）、证据支持薄弱（weak evidentiary support）以及执行路径脆弱（brittle execution paths）。为应对这些挑战，作者提出DualResearch框架，其核心创新在于通过联合建模两个互补的图结构来匹配工具密集型推理的知识论结构：一是编码稳定背景知识的广度语义图（breadth semantic graph），二是捕捉执行溯源的深度因果图（depth causal graph）。关键在于，每个图均设计了层内相关性函数，并采用种子锚定的语义扩散机制（seed-anchored semantic diffusion）和带可靠性加权的因果-语义路径匹配机制（causal-semantic path matching with reliability weighting）。进一步地，为融合异构信息并处理查询依赖的不确定性，DualResearch将各层路径证据转换为答案分布，并在对数空间中利用熵门控规则（entropy-gated rule）进行全局校准的融合，从而增强确定性通道并放大一致性。此方法显著提升了推理稳定性与准确性，在HLE和GPQA基准上分别实现7.7%和6.06%的准确率提升。

链接: https://arxiv.org/abs/2510.08959
作者: Jinxin Shi,Zongsheng Cao,Runmin Ma,Yusong Hu,Jie Zhou,Xin Li,Lei Bai,Liang He,Bo Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures, 5 tables, Under Review

点击查看摘要

Abstract:The deep-research framework orchestrates external tools to perform complex, multi-step scientific reasoning that exceeds the native limits of a single large language model. However, it still suffers from context pollution, weak evidentiary support, and brittle execution paths. To address these issues, we propose DualResearch, a retrieval and fusion framework that matches the epistemic structure of tool-intensive reasoning by jointly modeling two complementary graphs: a breadth semantic graph that encodes stable background knowledge, and a depth causal graph that captures execution provenance. Each graph has a layer-native relevance function, seed-anchored semantic diffusion for breadth, and causal-semantic path matching with reliability weighting for depth. To reconcile their heterogeneity and query-dependent uncertainty, DualResearch converts per-layer path evidence into answer distributions and fuses them in log space via an entropy-gated rule with global calibration. The fusion up-weights the more certain channel and amplifies agreement. As a complement to deep-research systems, DualResearch compresses lengthy multi-tool execution logs into a concise reasoning graph, and we show that it can reconstruct answers stably and effectively. On the scientific reasoning benchmarks HLE and GPQA, DualResearch achieves competitive performance. Using log files from the open-source system InternAgent, its accuracy improves by 7.7% on HLE and 6.06% on GPQA.
zh

[AI-59] EcphoryRAG : Re-Imagining Knowledge-Graph RAG via Human Associative Memory

【速读】：该论文旨在解决复杂问答任务中传统知识图谱增强检索（Knowledge Graph RAG）方法因冗余信息导致的高计算开销与推理能力不足的问题。其解决方案的关键在于提出一种以实体为中心的检索机制——EcphoryRAG，该框架在索引阶段仅存储核心实体及其元数据，显著降低token消耗（最多减少94%）；在检索阶段通过提取查询中的提示实体（cue entities），执行可扩展的多跳关联搜索，并动态推断实体间的隐式关系以构建上下文，从而实现无需预先枚举全部关系即可进行深度推理的能力。这一设计有效提升了复杂多跳问答任务的准确率，在多个基准测试中达到新的最先进水平。

链接: https://arxiv.org/abs/2510.08958
作者: Zirui Liao
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Cognitive neuroscience research indicates that humans leverage cues to activate entity-centered memory traces (engrams) for complex, multi-hop recollection. Inspired by this mechanism, we introduce EcphoryRAG, an entity-centric knowledge graph RAG framework. During indexing, EcphoryRAG extracts and stores only core entities with corresponding metadata, a lightweight approach that reduces token consumption by up to 94% compared to other structured RAG systems. For retrieval, the system first extracts cue entities from queries, then performs a scalable multi-hop associative search across the knowledge graph. Crucially, EcphoryRAG dynamically infers implicit relations between entities to populate context, enabling deep reasoning without exhaustive pre-enumeration of relationships. Extensive evaluations on the 2WikiMultiHop, HotpotQA, and MuSiQue benchmarks demonstrate that EcphoryRAG sets a new state-of-the-art, improving the average Exact Match (EM) score from 0.392 to 0.474 over strong KG-RAG methods like HippoRAG. These results validate the efficacy of the entity-cue-multi-hop retrieval paradigm for complex question answering.
zh

[AI-60] SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM -enhanced E-commerce Risk Management

【速读】：该论文旨在解决电商行业中风险管理部门在处理海量可疑交易案例时面临的两大核心问题：一是人工分析工作量巨大，需融合长期专家经验并对多维风险因素进行细致审查；二是分析师个体差异导致风险调查流程难以标准化和高效化。解决方案的关键在于提出SHERLOCK框架，其核心创新包括：（1）从多模态数据中提取风险知识并构建领域知识库（domain knowledge base, KB），为模型提供结构化背景支持；（2）基于数据飞轮（data flywheel）范式构建智能平台，实现日常运营、专家标注与模型评估的闭环迭代优化，并通过偏好对齐微调提升输出一致性；（3）引入Reflect-Refine（RR）模块，结合领域KB形成对新兴风险模式的快速响应机制。实验证明该方法显著提升了大语言模型（LLM）在事实一致性与风险定位精度上的表现，且系统部署后大幅提高了风险管理人员的案件调查效率。

链接: https://arxiv.org/abs/2510.08948
作者: Nan Lu,Yurong Hu,Jiaquan Fang,Yan Liu,Rui Dong,Yiming Wang,Rui Lin,Shaoyi Xu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growth of the e-commerce industry has intensified the adversarial dynamics between shadow economy actors and risk management teams. Companies often conduct risk investigations into suspicious cases to identify emerging fraud patterns, thereby enhancing both preemptive risk prevention and post-hoc governance. However, the sheer volume of case analyses imposes a substantial workload on risk management analysts, as each case requires the integration of long-term expert experience and meticulous scrutiny across multiple risk dimensions. Additionally, individual disparities among analysts hinder the establishment of uniform and high-standard workflows. To address these challenges, we propose the SHERLOCK framework, which leverages the reasoning capabilities of large language models (LLMs) to assist analysts in risk investigations. Our approach consists of three primary components: (1) extracting risk management knowledge from multi-modal data and constructing a domain knowledge base (KB), (2) building an intelligent platform guided by the data flywheel paradigm that integrates daily operations, expert annotations, and model evaluations, with iteratively fine-tuning for preference alignment, and (3) introducing a Reflect Refine (RR) module that collaborates with the domain KB to establish a rapid response mechanism for evolving risk patterns. Experiments conducted on the real-world transaction dataset from this http URL demonstrate that our method significantly improves the precision of both factual alignment and risk localization within the LLM analysis results. Deployment of the SHERLOCK-based LLM system on this http URL has substantially enhanced the efficiency of case investigation workflows for risk managers.
zh

[AI-61] FATHOMS-RAG : A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

【速读】：该论文旨在解决当前大型语言模型（Large Language Models, LLMs）在事实准确性方面存在的局限性，特别是针对检索增强生成（Retrieval-Augmented Generation, RAG）管道的整体评估缺乏系统性基准的问题。现有评测多聚焦于检索模块的单一性能，而忽视了从数据摄入、跨模态信息整合到推理输出的全流程能力。其解决方案的关键在于构建一个综合性 benchmark，包含：(1) 由人工创建的93个问题数据集，涵盖文本、表格、图像等多模态信息及跨文档关联；(2) 基于短语级召回率的正确性度量；(3) 利用最近邻嵌入分类器识别潜在的幻觉输出；(4) 对开源与闭源RAG管道的对比实验；以及(5) 第三方人类评估验证指标一致性。结果表明，闭源管道在正确性和幻觉控制上显著优于开源方案，尤其在依赖多模态和跨文档信息的任务中差距更大。

链接: https://arxiv.org/abs/2510.08945
作者: Samuel Hildebrand(1),Curtis Taylor(2),Sean Oesch(2),James M Ghawaly Jr(1),Amir Sadovnik(2),Ryan Shivers(2),Brandon Schreiber(2),Kevin Kurian(3) ((1) Louisiana State University, (2) Oak Ridge National Lab, (3) University of Florida)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline’s ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline’s ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating “strongly agree”).
zh

[AI-62] RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation NEURIPS2025

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）评估中因训练数据污染（data contamination）导致的可靠性问题，即模型可能通过记忆训练数据而非展现真实推理能力来获得高分数。解决方案的关键在于提出RADAR框架，该框架利用机制可解释性（mechanistic interpretability）区分基于回忆（recall-based）与基于推理（reasoning-based）的响应模式；其核心创新是提取37个特征，涵盖表面置信度轨迹及深层机制属性（如注意力专业化、电路动态和激活流模式），并通过集成分类器实现高精度检测——在多样化测试集上达到93%准确率，尤其在明确案例中表现完美，在模糊挑战样本中仍保持76.7%准确率，验证了机制可解释性在超越传统表层指标方面的潜力。

链接: https://arxiv.org/abs/2510.08931
作者: Ashish Kattamuri,Harshwardhan Fartale,Arpita Vats,Rahul Raja,Ishita Prasad
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

点击查看摘要

Abstract:Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
zh

[AI-63] Co-Authoring the Self: A Human-AI Interface for Interest Reflection in Recommenders

【速读】：该论文旨在解决推荐系统中用户对推荐结果缺乏理解与控制的问题，尤其是在传统静态用户画像难以体现真实兴趣偏差的情况下。其核心挑战在于如何提升用户对系统推理过程的透明度，并促进用户主动参与兴趣修正，从而改善推荐质量。解决方案的关键在于引入一种人机协同的动态用户画像机制——即生成式 AI (Generative AI) 驱动的可编辑个性化兴趣摘要，使用户能够直接查看、修改并反思系统基于历史行为推断出的兴趣标签，进而增强交互性与信任感。实证研究表明，这种设计显著缩小了用户感知兴趣与系统推断兴趣之间的差距，并激发了用户的持续参与和反思行为。

链接: https://arxiv.org/abs/2510.08930
作者: Ruixuan Sun,Junyuan Wang,Sanjali Roy,Joseph A. Konstan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language-based user profiles in recommender systems have been explored for their interpretability and potential to help users scrutinize and refine their interests, thereby improving recommendation quality. Building on this foundation, we introduce a human-AI collaborative profile for a movie recommender system that presents editable personalized interest summaries of a user’s movie history. Unlike static profiles, this design invites users to directly inspect, modify, and reflect on the system’s inferences. In an eight-week online field deployment with 1775 active movie recommender users, we find persistent gaps between user-perceived and system-inferred interests, show how the profile encourages engagement and reflection, and identify design directions for leveraging imperfect AI-powered user profiles to stimulate more user intervention and build more transparent and trustworthy recommender experiences.
zh

[AI-64] LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

【速读】：该论文旨在解决现有大型多模态模型（Large Multimodal Models, LMMs）评估基准在实时、对抗性环境中的表现不足问题，即传统静态评测难以准确反映模型在动态场景下的战略推理能力。其解决方案的关键在于提出LM Fight Arena——一个基于经典格斗游戏《Mortal Kombat II》的自动化竞技框架，通过让多个LMM代理在同一角色控制下进行对抗比赛，实现对模型视觉理解、战术决策与序列动作选择能力的客观、可复现且公平的动态评估。

链接: https://arxiv.org/abs/2510.08928
作者: Yushuo Zheng,Zicheng Zhang,Xiongkuo Min,Huiyu Duan,Guangtao Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, where each agent operates controlling the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.
zh

[AI-65] A Frequency-Domain Analysis of the Multi-Armed Bandit Problem: A New Perspective on the Exploration-Exploitation Trade-off

【速读】：该论文旨在解决传统多臂老虎机（Multi-Armed Bandit, MAB）算法分析中缺乏对学习过程动态特性的刻画问题，尤其是现有基于时间域和累积后悔（regret）的分析方法难以捕捉探索与利用权衡随时间演化的本质。其解决方案的关键在于提出一种全新的频域分析框架，将带奖励估计视为频谱分量，不确定性对应频率，并将Bandit算法解释为自适应滤波器，从而建立了频域带宽模型（Frequency-Domain Bandit Model）。在此基础上，证明了UCB算法中的置信边界项在频域等价于一个与访问次数平方根成反比的时间变增益，进而推导出有限时间内关于探索率衰减的动态边界，为经典算法提供了直观的物理解释，并为下一代自适应参数调整算法的设计奠定了理论基础。

链接: https://arxiv.org/abs/2510.08908
作者: Di Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: 6 pages

点击查看摘要

Abstract:The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret theories, are well-established, existing analyses primarily operate from a time-domain and cumulative regret perspective, struggling to characterize the dynamic nature of the learning process. This paper proposes a novel frequency-domain analysis framework, reformulating the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component, with its uncertainty corresponding to the component’s frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term in the UCB algorithm is equivalent in the frequency domain to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Based on this, we further derive finite-time dynamic bounds concerning the exploration rate decay. This theory not only provides a novel and intuitive physical interpretation for classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
zh

[AI-66] Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning

【速读】：该论文旨在解决强化学习中可验证奖励（Reinforcement Learning with Verifiable Rewards, RLVR）在大语言模型（Large Language Models, LLMs）复杂推理任务中面临的探索与利用失衡问题，具体表现为中间步骤信用分配不准确和策略熵过早坍缩，从而限制模型性能。解决方案的关键在于提出基于贡献度的策略优化方法（Attribution-based Contribution to Policy Optimization, ACPO），其核心创新包括：一是通过轨迹语义分割和基于归因的表示动态调节策略熵，提升探索能力；二是设计因子化奖励机制，精确量化每个推理步骤的层次化贡献，实现精准信用分配。该框架采用难度感知的课程学习策略，在AIME、MATH和AMC等挑战性基准测试中显著优于现有最优方法。

链接: https://arxiv.org/abs/2510.08899
作者: Junxi Yin,Haisen Luo,Zhenyu Li,Yihua Liu,Dan Liu,Zequn Li,Xiaohang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
zh

[AI-67] Designing and Evaluating an AI-driven Immersive Multidisciplinary Simulation (AIMS) for Interprofessional Education

【速读】：该论文旨在解决传统跨专业教育（Interprofessional Education, IPE）中依赖案例研究和标准化患者所面临的成本高、可扩展性差以及难以模拟真实临床场景动态复杂性的局限性。其解决方案的关键在于设计并开发了AIMS（AI-Enhanced Immersive Multidisciplinary Simulations），该系统融合了大型语言模型（Gemini-2.5-Flash）、基于Unity的虚拟环境引擎及角色生成流程，实现了用户与虚拟患者之间同步、多模态的交互，从而有效支持药学、医学、护理和社会工作等多学科学生在逼真情境下提升协作临床推理与健康促进能力。

链接: https://arxiv.org/abs/2510.08891
作者: Ruijie Wang,Jie Lu,Bo Pei,Evonne Jones,Jamey Brinson,Timothy Brown
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 15 pages

点击查看摘要

Abstract:Interprofessional education has long relied on case studies and the use of standardized patients to support teamwork, communication, and related collaborative competencies among healthcare professionals. However, traditional approaches are often limited by cost, scalability, and inability to mimic the dynamic complexity of real-world clinical scenarios. To address these challenges, we designed and developed AIMS (AI-Enhanced Immersive Multidisciplinary Simulations), a virtual simulation that integrates a large language model (Gemini-2.5-Flash), a Unity-based virtual environment engine, and a character creation pipeline to support synchronized, multimodal interactions between the user and the virtual patient. AIMS was designed to enhance collaborative clinical reasoning and health promotion competencies among students from pharmacy, medicine, nursing, and social work. A formal usability testing session was conducted which participants assumed professional roles on a healthcare team and engaged in a mix of scripted and unscripted conversations. Participants explored the patient’s symptoms, social context, and care needs. Usability issues were identified (e.g., audio routing, response latency) and used to guide subsequent refinements. Findings in general suggest that AIMS supports realistic, profession-specific and contextually appropriate conversations. We discussed both technical and pedagogical innovations of AIMS and concluded with future directions.
zh

[AI-68] Vector Graph-Based Repository Understanding for Issue-Driven File Retrieval

【速读】：该论文旨在解决大型软件仓库中知识难以结构化、自动化处理效率低的问题，其核心挑战在于如何有效提取和组织代码库中的语义与架构信息以支持后续开发自动化。解决方案的关键在于构建一个将软件仓库转化为向量化的知识图谱（knowledge graph）的系统，该图谱不仅编码了语法关系（如包含、实现、调用、继承等），还通过大语言模型（LLM）生成节点摘要和向量嵌入，从而捕捉深层语义关联；同时采用混合检索管道结合语义检索与图感知扩展机制，并由LLM辅助生成受限的只读图查询及可读性强的解释，显著提升了对复杂代码结构的理解与利用效率。

链接: https://arxiv.org/abs/2510.08876
作者: Kostiantyn Bevziuk,Andrii Fatula,Svetozar Lashin Yaroslav Opanasenko,Anna Tukhtarova,Ashok Jallepalli Pradeepkumar Sharma,Hritvik Shrivastava
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a repository decomposition system that converts large software repositories into a vectorized knowledge graph which mirrors project architectural and semantic structure, capturing semantic relationships and allowing a significant level of automatization of further repository development. The graph encodes syntactic relations such as containment, implementation, references, calls, and inheritance, and augments nodes with LLM-derived summaries and vector embeddings. A hybrid retrieval pipeline combines semantic retrieval with graph-aware expansion, and an LLM-based assistant formulates constrained, read-only graph requests and produces human-oriented explanations.
zh

[AI-69] Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

【速读】：该论文旨在解决分布式矩阵乘法（Distributed Matrix Multiplication）中因算法实现受限于特定数据划分方式（partitioning）而导致的灵活性不足问题。现有方法仅支持部分划分组合，若目标划分无对应算法，则需对操作数进行重分布（redistribution），显著增加通信开销。其解决方案的关键在于提出一种通用的一侧式（one-sided）算法，能够支持所有可能的划分组合与复制因子（replication factors）。该算法通过切片（slicing，即索引算术）计算出需要相乘的局部矩阵块（tiles）集合，并可直接执行或重排序后下推至优化中间表示（IR）以最大化计算与通信重叠，从而提升效率。

链接: https://arxiv.org/abs/2510.08874
作者: Benjamin Brock,Renato Golin
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings including 1D, 2D, 1.5D, and 2.5D algorithms. A limitation of current work is that existing algorithms are limited to a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly, or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C+±based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.
zh

[AI-70] GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在对齐过程中存在的“用户-模型利益不一致”问题，即传统对齐方法假设最大化模型奖励等同于提升用户福祉，但在实际应用中，LLM可能因过度解释或冗长推理而损害用户体验，这种现象类似于博弈论中的囚徒困境。解决方案的关键在于提出一种基于博弈论的对齐框架——Game-Theoretic Alignment (GTAlign)，其核心机制包括：在推理阶段，模型将用户与自身互动建模为策略性博弈，通过构建收益矩阵显式评估双方福利并选择互惠行动；在训练阶段引入“互惠福利奖励”以强化合作行为，从而引导模型趋向社会最优解。此外，GTAlign还设计了动态推理技术，可响应服务定价策略变化，实现响应自适应调整。

链接: https://arxiv.org/abs/2510.08872
作者: Siqi Zhu,David Zhang,Pedro Cisneros-Velarde,Jiaxuan You
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 31 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner’s dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM’s response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at this https URL .
zh

[AI-71] Repository-Aware File Path Retrieval via Fine-Tuned LLM s

【速读】：该论文旨在解决现代代码库中开发者与生成式 AI 编码助手在回答语义类问题（如“这个功能是如何工作的？”或“bug 是在何处引入的？”）时，难以准确找到相关源文件的问题。传统代码搜索方法（基于关键词或信息检索）常忽略语义上下文和跨文件关联，而大语言模型（LLM）虽能理解自然语言却缺乏特定代码仓库的细节。其解决方案的关键在于：通过微调一个强大的 LLM（Qwen3-8B）并结合 QLoRA 和 Unsloth 优化技术，直接从自然语言查询中预测相关的文件路径；同时设计六种代码感知策略，利用抽象语法树（AST）结构和仓库内容生成真实场景下的问答对（答案为文件路径集合），从而提升模型对跨文件语义关系的理解能力与检索准确性，在多个 Python 项目上实现了高达 91% 的精确匹配率和 93% 的召回率，且在大型仓库（如 PyTorch）中仍保持 59% 的召回率，展现出良好的可扩展性。

链接: https://arxiv.org/abs/2510.08850
作者: Vasudha Yanuganti,Ishaan Puri,Swapnil Chhatre,Mantinder Singh,Ashok Jallepalli,Hritvik Shrivastava,Pradeep Kumar Sharma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern codebases make it hard for developers and AI coding assistants to find the right source files when answering questions like “How does this feature work?” or “Where was the bug introduced?” Traditional code search (keyword or IR based) often misses semantic context and cross file links, while large language models (LLMs) understand natural language but lack repository specific detail. We present a method for file path retrieval that fine tunes a strong LLM (Qwen3-8B) with QLoRA and Unsloth optimizations to predict relevant file paths directly from a natural language query. To build training data, we introduce six code aware strategies that use abstract syntax tree (AST) structure and repository content to generate realistic question-answer pairs, where answers are sets of file paths. The strategies range from single file prompts to hierarchical repository summaries, providing broad coverage. We fine tune on Python projects including Flask, Click, Jinja, FastAPI, and PyTorch, and obtain high retrieval accuracy: up to 91% exact match and 93% recall on held out queries, clearly beating single strategy training. On a large codebase like PyTorch (about 4,000 Python files), the model reaches 59% recall, showing scalability. We analyze how multi level code signals help the LLM reason over cross file context and discuss dataset design, limits (for example, context length in very large repos), and future integration of retrieval with LLM based code intelligence.
zh

[AI-72] What Is Your Agents GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment

【速读】：该论文旨在解决当前大语言模型（Large Language Model, LLM）驱动的智能体（Agent）在实际应用中缺乏系统性评估框架的问题，尤其针对其在目标达成、计划制定与执行过程中的多种失效模式难以量化和定位。解决方案的关键在于提出一个基于“目标-计划-行动”（Goal-Plan-Action, GPA）循环的结构化评估范式，包含五项核心指标：目标实现度（Goal Fulfillment）、逻辑一致性（Logical Consistency）、执行效率（Execution Efficiency）、计划质量（Plan Quality）和计划遵从度（Plan Adherence），从而实现对Agent行为的全面覆盖、高一致性人工标注支持以及错误定位精度达86%的细粒度诊断能力。

链接: https://arxiv.org/abs/2510.08847
作者: Allison Sihan Jia,Daniel Huang,Nikhil Vytla,Nirvika Choudhury,John C Mitchell,Anupam Datta
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We introduce the Agent GPA (Goal-Plan-Action) framework: an evaluation paradigm based on an agent’s operational loop of setting goals, devising plans, and executing actions. The framework includes five evaluation metrics: Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence. Logical Consistency checks that an agent’s actions are consistent with its prior actions. Execution Efficiency checks whether the agent executes in the most efficient way to achieve its goal. Plan Quality checks whether an agent’s plans are aligned with its goals; Plan Adherence checks if an agent’s actions are aligned with its plan; and Goal Fulfillment checks that agent’s final outcomes match the stated goals. Our experimental results on two benchmark datasets - the public TRAIL/GAIA dataset and an internal dataset for a production-grade data agent - show that this framework (a) provides a systematic way to cover a broad range of agent failures, including all agent errors on the TRAIL/GAIA benchmark dataset; (b) supports LLM-judges that exhibit strong agreement with human annotation, covering 80% to over 95% errors; and © localizes errors with 86% agreement to enable targeted improvement of agent performance.
zh

[AI-73] CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

【速读】：该论文旨在解决大语言模型（Large Language Model, LLM）代理在访问多种工具和敏感数据时，因间接提示注入（indirect prompt injection）攻击导致的安全风险问题。现有防御机制由于无法可靠区分恶意与良性指令，常出现高误报率，难以在实际场景中部署。其解决方案的关键在于提出一种基于计算机安全基本原理的token级净化（token-level sanitization）方法：摒弃传统的样本级分类思路，转而对工具输出中的目标AI系统指令进行精准移除，从而将恶意指令作为副产品一并清除。该方法无需校准、非阻断式处理且不依赖特定上下文，仅需通用指令微调数据即可训练，并在多个基准测试（如AgentDojo、BIPIA等）中实现攻击成功率（ASR）下降7–10倍，同时保持代理在良性及恶意环境下的功能完整性。

链接: https://arxiv.org/abs/2510.08829
作者: Debeshee Das,Luca Beurer-Kellner,Marc Fischer,Maximilian Baader
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don’t have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
zh

[AI-74] mathsfP neq mathsfNP: A Non-Relativizing Proof via Quantale Weakness and Geometric Complexity

【速读】：该论文旨在证明 $\mathsf{P} \neq \mathsf{NP}$ ，即不存在多项式时间算法解决所有 NP 完全问题。其核心挑战在于突破传统归约和自然证明的局限性，构建一个非相对化、非自然的下界论证。解决方案的关键在于引入一种组合式信息论框架，将短程序转化为多独立块上的局部性结构，并结合掩码随机 3-CNF 的对称性和稀疏性，从而获得分布式的下界，与自归约上界产生矛盾。具体而言，通过构造一个具有新鲜对称群 $S_m \ltimes (\mathbb{Z}_2)^m$ 和小种子 Valiant–Vazirani 隔离层的可有效采样分布 $D_m$ ，作者证明了“弱性切换正则形式”：任意描述长度不超过 $\delta t$ （ $t = \Theta(m)$ ）的多项式时间解码器 $P$ ，均可被一个短包装器 $W$ 转换为在 $\gamma$ -比例块上逐位局部的结构；进一步利用符号不变中立性引理和对数半径模板稀疏定理，强制几乎所有 $\Omega(t)$ 块接近随机，最终得到复杂度下界 $K_{\mathrm{poly}}((X_1,\ldots,X_t)\mid(\Phi_1,\ldots,\Phi_t)) \ge \eta t$ 。若 $\mathsf{P}=\mathsf{NP}$ ，则存在常数长度程序在多项式时间内输出唯一解，导致整体复杂度为 $O(1)$ ，矛盾成立，因此 $\mathsf{P}\ne\mathsf{NP}$ 。

链接: https://arxiv.org/abs/2510.08814
作者: Ben Goertzel
机构: 未知
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We give a compositional, information-theoretic framework that turns short programs into locality on many independent blocks, and combine it with symmetry and sparsity of masked random Unique-SAT to obtain distributional lower bounds that contradict the self-reduction upper bound under \mathsfP=\mathsfNP . We work in the weakness quantale w_Q=K_\mathrmpoly(\cdot\mid\cdot) . For an efficiently samplable ensemble D_m made by masking random 3 -CNFs with fresh S_m\ltimes(\mathbbZ_2)^m symmetries and a small-seed Valiant–Vazirani isolation layer, we prove a Switching-by-Weakness normal form: for any polytime decoder P of description length \le \delta t (with t=\Theta(m) blocks), a short wrapper W makes (P\circ W) per-bit local on a \gamma -fraction of blocks. Two ingredients then force near-randomness on \Omega(t) blocks for every short decoder: (a) a sign-invariant neutrality lemma giving \Pr[X_i=1\mid \mathcalI]=1/2 for any sign-invariant view \mathcalI ; and (b) a template sparsification theorem at logarithmic radius showing that any fixed local rule appears with probability m^-\Omega(1) . Combined with single-block bounds for tiny \mathrmACC^0 /streaming decoders, this yields a success bound 2^-\Omega(t) and, by Compression-from-Success, K_\mathrmpoly\big((X_1,\ldots,X_t)\mid(\Phi_1,\ldots,\Phi_t)\big)\ge \eta t . If \mathsfP=\mathsfNP , a uniform constant-length program maps any on-promise instance to its unique witness in polytime (bit fixing via a \mathrmUSAT decider), so K_\mathrmpoly(X\mid\Phi)\le O(1) and the tuple complexity is O(1) , contradicting the linear bound. The proof is non-relativizing and non-natural; symmetry, sparsification, and switching yield a quantale upper-lower clash, hence \mathsfP\ne\mathsfNP .
zh

[AI-75] Adaptive Science Operations in Deep Space Missions Using Offline Belief State Planning

【速读】：该论文旨在解决深空任务中因通信延迟和环境不确定性导致无法进行实时地面操作的问题，从而支持在通信受限环境下实现自主科学探测。其解决方案的关键在于提出了一种部分可观测马尔可夫决策过程（Partially Observable Markov Decision Process, POMDP）框架，用于自适应地调度航天器科学仪器，并将贝叶斯网络（Bayesian network）嵌入到POMDP的观测空间中，以高效处理高维且不确定的测量数据，提升科学数据的可解释性与计算可行性。通过离线计算仪器操作策略，可在发射前生成并验证资源感知的科学计划，从而显著降低样本识别错误率（约40%），优于任务基线操作概念（ConOps）。

链接: https://arxiv.org/abs/2510.08812
作者: Grace Ra Kim,Hailey Warner,Duncan Eddy,Evan Astle,Zachary Booth,Edward Balaban,Mykel J. Kochenderfer
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 tables, 5 figures, accepted in IEEE ISPARO 2026

点击查看摘要

Abstract:Deep space missions face extreme communication delays and environmental uncertainty that prevent real-time ground operations. To support autonomous science operations in communication-constrained environments, we present a partially observable Markov decision process (POMDP) framework that adaptively sequences spacecraft science instruments. We integrate a Bayesian network into the POMDP observation space to manage the high-dimensional and uncertain measurements typical of astrobiology missions. This network compactly encodes dependencies among measurements and improves the interpretability and computational tractability of science data. Instrument operation policies are computed offline, allowing resource-aware plans to be generated and thoroughly validated prior to launch. We use the Enceladus Orbilander’s proposed Life Detection Suite (LDS) as a case study, demonstrating how Bayesian network structure and reward shaping influence system performance. We compare our method against the mission’s baseline Concept of Operations (ConOps), evaluating both misclassification rates and performance in off-nominal sample accumulation scenarios. Our approach reduces sample identification errors by nearly 40%
zh

[AI-76] Deceptive Exploration in Multi-armed Bandits

【速读】：该论文旨在解决多臂老虎机（Multi-armed Bandit）框架下存在信息不对称时的隐蔽探索问题：一个代理（agent）在观察者（observer）预期其依据公开奖励分布采用Thompson Sampling策略的前提下，试图快速识别出最优私有奖励臂（best private arm），同时避免被观察者察觉。观察者仅能观测到公共奖励和所选动作，而代理可同时获取公共与私有奖励。为量化可检测性，作者引入步进式Kullback-Leibler（KL）散度约束，限制代理实际选择概率与观察者预期概率之间的偏离。关键解决方案在于构建一个基于公共与私有均值的极大极小优化问题，其解刻画了最优错误指数（error exponent），并提出一种受Top-Two算法启发的自适应探索算法，该算法根据公共次优间隙动态调整探索强度，从而在满足KL约束条件下实现对私有最优臂的高效识别。数值实验验证了 $\Theta(\sqrt{T})$ 的公共次优臂拉动速率上限及算法的有效性。

链接: https://arxiv.org/abs/2510.08794
作者: I. Arda Vurankaya,Mustafa O. Karabag,Wesley A. Suttle,Jesse Milzman,David Fridovich-Keil,Ufuk Topcu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards, however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the anticipated pull probabilities by the observer. We model successful pulling of public suboptimal arms as a % Bernoulli process where the success probability decreases with each successful pull, and show these pulls can happen at most at a \Theta(\sqrtT) rate under the KL constraint. We then formulate a maximin problem based on public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms. This algorithm naturally adapts its exploration according to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the \Theta(\sqrtT) rate and the behavior of the proposed algorithm.
zh

[AI-77] COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context ACL

【速读】：该论文旨在解决长时程任务中大语言模型（Large Language Models, LLMs）代理因持续推理和多工具交互而导致的错误累积、幻觉及连贯性丧失问题。核心挑战在于上下文管理失效——随着历史信息延长，代理容易忽略关键证据或被无关信息干扰，从而无法有效重规划或从先前错误中反思。解决方案的关键是提出COMPASS（Context-Organized Multi-Agent Planning and Strategy System），一个轻量级分层框架，将战术执行、战略监督与上下文组织分离为三个专业化组件：主代理（Main Agent）负责推理与工具调用，元思考者（Meta-Thinker）监控进展并发起战略干预，上下文管理者（Context Manager）则维护各推理阶段的精简、相关进度摘要，从而显著提升复杂任务下的准确性和稳定性。

链接: https://arxiv.org/abs/2510.08790
作者: Guangya Wan,Mingyang Ling,Xiaoqi Ren,Rujun Han,Sheng Li,Zizhao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under Review for ACL

点击查看摘要

Abstract:Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck – extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect from previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks – GAIA, BrowseComp, and Humanity’s Last Exam – COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
zh

[AI-78] MLLM as a UI Judge: Benchmarking Multimodal LLM s for Predicting Human Perception of User Interfaces

【速读】：该论文试图解决的问题是：在早期用户界面（User Interface, UI）设计阶段，由于资源限制，难以进行充分的用户研究来验证设计决策。传统方法依赖于狭窄领域（如电子商务）中的用户行为指标（如点击率或转化率），而忽视了对多样化UI界面的主观评价。解决方案的关键在于利用多模态大语言模型（Multimodal Large Language Models, MLLMs）作为早期评估工具，通过对比其对30个不同UI的评分与人类用户的主观判断，检验MLLMs是否能够模拟人类偏好，并在多个UI维度上实现一致性。实验结果表明，MLLMs在某些维度上可近似人类偏好，但在其他维度上存在偏差，凸显了其在辅助早期用户体验（User Experience, UX）研究中的潜力与局限性。

链接: https://arxiv.org/abs/2510.08783
作者: Reuben A. Luera,Ryan Rossi,Franck Dernoncourt,Samyadeep Basu,Sungchul Kim,Subhojyoti Mukherjee,Puneet Mathur,Ruiyi Zhang,Jihyung Kil,Nedim Lipka,Seunghyun Yoon,Jiuxiang Gu,Zichao Wang,Cindy Xiong Bearfield,Branislav Kveton
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.
zh

[AI-79] Guiding Exploration in Reinforcement Learning Through LLM -Augmented Observations ICAPS2025

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）在稀疏奖励环境中的探索难题，此类环境中传统探索策略难以发现有效的动作序列。其解决方案的关键在于利用大语言模型（Large Language Models, LLMs）的程序性知识和推理能力，通过增强观测空间提供动作建议，使RL代理能够自主决定何时采纳或忽略这些建议，从而在保持灵活性的同时引入软约束式的指导。该方法不改变现有RL算法，且在多个BabyAI任务中验证了其有效性，尤其在高难度环境中显著提升了成功率与样本效率。

链接: https://arxiv.org/abs/2510.08779
作者: Vaibhav Jain,Gerrit Grossmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to LM4Plan Workshop @ ICAPS 2025 (withdrawn before presentation due to lack of travel funding)

点击查看摘要

Abstract:Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs’ world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
zh

[AI-80] Graph Diffusion Transformers are In-Context Molecular Designers

【速读】：该论文旨在解决生成式 AI 在分子设计中因标注数据稀缺而导致的性能瓶颈问题，尤其是在小样本场景下难以有效适配新任务的挑战。其核心解决方案是提出演示条件扩散模型（DemoDiff），关键创新在于用少量分子-评分示例（molecule-score examples）替代传统文本描述来定义任务上下文，并通过一个去噪 Transformer 模型实现对目标性质导向的分子生成。此外，研究还开发了一种基于节点对编码（Node Pair Encoding）的新分子分词器，显著降低表示复杂度（减少5.5倍节点数），并构建了一个涵盖药物与材料领域的百万级上下文任务数据集用于大规模预训练，最终在33个设计任务中展现出优于大型语言模型和领域特定方法的性能表现。

链接: https://arxiv.org/abs/2510.08744
作者: Gang Liu,Jie Chen,Yihan Zhu,Michael Sun,Tengfei Luo,Nitesh V Chawla,Meng Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 16 figures, 17 tables. Model available at: this https URL

点击查看摘要

Abstract:In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5 \times fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000 \times larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at this https URL.
zh

[AI-81] Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study

【速读】：该论文旨在解决自监督表示学习（self-supervised representation learning）中如何有效提取判别性特征的问题。其解决方案的关键在于实例判别（instance discrimination）策略，即把数据集中的每个实例视为独立类别，通过随机变换生成同一实例的两个不同视图（view），促使模型学习到对这些变换保持不变的、能够区分不同实例的表示。

链接: https://arxiv.org/abs/2510.08722
作者: Mohammad Alkhalefi,Georgios Leontidis,Mingjun Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views.
zh

[AI-82] In-Context Learning for Non-Stationary MIMO Equalization

【速读】：该论文旨在解决时变信道均衡问题，即在非平稳环境下如何利用上下文学习（In-Context Learning, ICL）实现高效、自适应的多输入多输出（MIMO）信道均衡。传统ICL方法主要针对静态信道设计，在动态环境中性能受限；本文提出通过借鉴自适应信号处理算法（如最小均方误差（Least Mean Square, LMS）和最小根均方（Least Root Mean Square, LRMS））来设计具有更强适应性的注意力机制，从而提升ICL在非平稳任务中的泛化能力和跟踪性能。其解决方案的关键在于将经典自适应算法的思想融入注意力结构设计中，例如基于LMS的注意力变体、LRMS增强鲁棒性以及多步梯度更新以改善长期追踪能力，显著提升了ICL在时变信道场景下的均衡效果。

链接: https://arxiv.org/abs/2510.08711
作者: Jiachen Jiang,Zhen Qin,Zhihui Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Channel equalization is fundamental for mitigating distortions such as frequency-selective fading and inter-symbol interference. Unlike standard supervised learning approaches that require costly retraining or fine-tuning for each new task, in-context learning (ICL) adapts to new channels at inference time with only a few examples. However, existing ICL-based equalizers are primarily developed for and evaluated on static channels within the context window. Indeed, to our knowledge, prior principled analyses and theoretical studies of ICL focus exclusively on the stationary setting, where the function remains fixed within the context. In this paper, we investigate the ability of ICL to address non-stationary problems through the lens of time-varying channel equalization. We employ a principled framework for designing efficient attention mechanisms with improved adaptivity in non-stationary tasks, leveraging algorithms from adaptive signal processing to guide better designs. For example, new attention variants can be derived from the Least Mean Square (LMS) adaptive algorithm, a Least Root Mean Square (LRMS) formulation for enhanced robustness, or multi-step gradient updates for improved long-term tracking. Experimental results demonstrate that ICL holds strong promise for non-stationary MIMO equalization, and that attention mechanisms inspired by classical adaptive algorithms can substantially enhance adaptability and performance in dynamic environments. Our findings may provide critical insights for developing next-generation wireless foundation models with stronger adaptability and robustness.
zh

[AI-83] ConPoSe: LLM -Guided Contact Point Selection for Scalable Cooperative Object Pushing

【速读】：该论文旨在解决在杂乱环境中多机器人协同搬运物体时，如何高效选择机器人与物体接触点（contact points）的问题。传统解析方法虽能求解，但其解空间随机器人数量和物体尺寸呈组合爆炸式增长，难以扩展；而纯基于大语言模型（Large Language Models, LLMs）的方案则缺乏局部优化能力，效果不佳。论文提出的关键解决方案是将LLM的常识推理能力与局部搜索（local search）相结合，构建名为ConPoSe的方法，通过LLM引导局部搜索策略，在保证精度的同时显著提升对机器人数量和物体复杂度的可扩展性。

链接: https://arxiv.org/abs/2510.08705
作者: Noah Steinkrüger,Nisarga Nilavadi,Wolfram Burgard,Tanja Katharina Kaiser
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Object transportation in cluttered environments is a fundamental task in various domains, including domestic service and warehouse logistics. In cooperative object transport, multiple robots must coordinate to move objects that are too large for a single robot. One transport strategy is pushing, which only requires simple robots. However, careful selection of robot-object contact points is necessary to push the object along a preplanned path. Although this selection can be solved analytically, the solution space grows combinatorially with the number of robots and object size, limiting scalability. Inspired by how humans rely on common-sense reasoning for cooperative transport, we propose combining the reasoning capabilities of Large Language Models with local search to select suitable contact points. Our LLM-guided local search method for contact point selection, ConPoSe, successfully selects contact points for a variety of shapes, including cuboids, cylinders, and T-shapes. We demonstrate that ConPoSe scales better with the number of robots and object size than the analytical approach, and also outperforms pure LLM-based selection.
zh

[AI-84] RAG 4Tickets: AI-Powered Ticket Resolution via Retrieval-Augmented Generation on JIRA and GitHub Data

【速读】：该论文旨在解决现代软件团队在处理重复或相关问题时因知识分散于JIRA工单、开发者讨论和GitHub拉取请求（Pull Request, PR）中而导致的修复延迟问题。解决方案的关键在于构建一个基于检索增强生成（Retrieval-Augmented Generation, RAG）的框架，其核心包括：利用Sentence-Transformers对历史工单、评论及PR元数据进行语义嵌入，并通过FAISS实现高效的向量搜索以定位相似案例；随后由大语言模型（Large Language Model, LLM）结合检索到的证据生成可解释且有依据的修复建议，从而实现跨平台知识的统一整合与高效复用。

链接: https://arxiv.org/abs/2510.08667
作者: Mohammad Baqar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 Pages

点击查看摘要

Abstract:Modern software teams frequently encounter delays in resolving recurring or related issues due to fragmented knowledge scattered across JIRA tickets, developer discussions, and GitHub pull requests (PRs). To address this challenge, we propose a Retrieval-Augmented Generation (RAG) framework that integrates Sentence-Transformers for semantic embeddings with FAISS-based vector search to deliver context-aware ticket resolution recommendations. The approach embeds historical JIRA tickets, user comments, and linked PR metadata to retrieve semantically similar past cases, which are then synthesized by a Large Language Model (LLM) into grounded and explainable resolution suggestions. The framework contributes a unified pipeline linking JIRA and GitHub data, an embedding and FAISS indexing strategy for heterogeneous software artifacts, and a resolution generation module guided by retrieved evidence. Experimental evaluation using precision, recall, resolution time reduction, and developer acceptance metrics shows that the proposed system significantly improves resolution accuracy, fix quality, and knowledge reuse in modern DevOps environments.
zh

[AI-85] RA-Gen: A Controllable Code Generation Framework Using ReAct for Multi-Agent Task Execution

【速读】：该论文旨在解决当前基于大语言模型（Large Language Models, LLMs）的代码生成模型在安全性、准确性与可控性方面存在的不足，尤其是在处理复杂任务时缺乏动态集成外部工具、透明推理过程以及用户对安全性的控制能力。其解决方案的关键在于提出一种基于ReAct范式的多智能体（multi-agent）代码生成框架，通过四个专业化代理——规划器（Planner）、搜索器（Searcher）、代码生成器（CodeGen）和提取器（Extractor）——实现高效、精确且可解释的代码生成。其中，基于ReAct的搜索器通过交替生成推理轨迹与执行动作，实现了内部知识与外部工具（如搜索引擎）的无缝融合，显著提升了生成代码的安全性和可控性，实验表明该框架在SVEN数据集上使用CodeQL检测时达到94.8%的安全率，优于现有方法。

链接: https://arxiv.org/abs/2510.08665
作者: Aofan Liu,Haoxuan Li,Bin Wang,Ao Yang,Hui Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code generation models based on large language models (LLMs) have gained wide adoption, but challenges remain in ensuring safety, accuracy, and controllability, especially for complex tasks. Existing methods often lack dynamic integration of external tools, transparent reasoning, and user control over safety. To address these issues, we propose a controllable code generation framework utilizing the ReAct paradigm for multi-agent task execution. This framework is a multi-agent system designed to enable efficient, precise, and interpretable code generation through dynamic interactions between LLMs and external resources. The framework adopts a collaborative architecture comprising four specialized agents: a Planner for task decomposition, a Searcher that leverages the ReAct framework for reasoning and tool integration, a CodeGen agent for accurate code generation, and an Extractor for structured data retrieval. The ReAct-based Searcher alternates between generating reasoning traces and executing actions, facilitating seamless integration of internal knowledge with external tools (such as search engines) to enhance accuracy and user control. Experimental results show the framework’s effectiveness across multiple languages, achieving a 94.8% security rate on the SVEN dataset with CodeQL, outperforming existing approaches. Its transparent reasoning process fosters user trust and improves controllability.
zh

[AI-86] Faver: Boosting LLM -based RTL Generation with Function Abstracted Verifiable Middleware

【速读】：该论文旨在解决基于大语言模型（Large Language Model, LLM）的寄存器传输级（Register-Transfer Level, RTL）代码生成中因高阶规格与RTL之间存在显著语义鸿沟、训练数据稀缺以及硬件验证复杂性导致的生成准确性不足问题。其解决方案的关键在于提出一种功能抽象可验证中间件（Function Abstracted Verifiable Middleware, Faver），通过融合LLM友好的代码结构与基于规则的模板，将电路验证细节解耦，使LLM能够专注于功能逻辑本身，从而在SFT模型及开源模型上实现最高达14%的生成准确率提升。

链接: https://arxiv.org/abs/2510.08664
作者: Jianan Mu,Mingyu Shi,Yining Wang,Tianmeng Yang,Bin Sun,Xing Hu,Jing Ye,Huawei Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based RTL generation is an interesting research direction, as it holds the potential to liberate the least automated stage in the current chip design. However, due to the substantial semantic gap between high-level specifications and RTL, coupled with limited training data, existing models struggle with generation accuracy. Drawing on human experience, design with verification helps improving accuracy. However, as the RTL testbench data are even more scarce, it is not friendly for LLMs. Although LLMs excel at higher-level languages like Python/C, they have a huge semantic gap from RTL. When implementing the same functionality, Python/C code and hardware code differ significantly in the spatiotemporal granularity, requiring the LLM not only to consider high-level functional semantics but also to ensure the low-level details align with the circuit code. It is not an easy task. In this paper, we propose a function abstracted verifiable middleware (Faver) that streamlines RTL verification in LLM-based workflows. By mixing LLM-friendly code structures with a rule-based template, Faver decouples the details of circuit verification, allowing the LLM to focus on the functionality itself. In our experiments on the SFT model and open-source models, Faver improved the model’s generation accuracy by up to 14%.
zh

[AI-87] DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops

【速读】：该论文旨在解决传统基因组选择（Genomic Selection, GS）方法在预测复杂性状和处理大规模数据集时精度不足的问题。其解决方案的关键在于提出一种名为DPCformer的深度学习模型，该模型融合了卷积神经网络（Convolutional Neural Networks）与自注意力机制（Self-Attention Mechanism），以更有效地建模基因型-表型之间的复杂非线性关系。此外，该方法采用8维独热编码（one-hot encoding）表示单核苷酸多态性（SNP）数据，并结合PMF算法进行特征选择，从而在多个作物（玉米、棉花、番茄、水稻和鹰嘴豆）的13个农艺性状上显著提升预测准确性，尤其在小样本场景下表现出更强的鲁棒性和可解释性。

链接: https://arxiv.org/abs/2510.08662
作者: Pengcheng Deng,Kening Liu,Mengxi Zhou,Mingxi Li,Rui Yang,Chuzhe Cao,Maojun Wang,Zeyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been accepted by BIBM 2025

点击查看摘要

Abstract:Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.
zh

[AI-88] CATS-Linear: Classification Auxiliary Linear Model for Time Series Forecasting

【速读】：该论文旨在解决线性模型在时间序列预测中性能提升受限的问题，尤其是如何通过结构设计与机制创新增强其表达能力。解决方案的关键在于提出分类辅助趋势-季节分解线性模型（CATS-Linear），其核心是引入分类辅助通道独立性（Classification Auxiliary Channel-Independence, CACI）机制：通过分类动态路由不同时间序列实例至专用预测器，实现监督驱动的通道设计；同时重构趋势-季节分解架构，引入解耦—线性映射—再耦合框架处理趋势分量，并采用复数域线性投影建模季节分量，从而显著提升固定超参数下模型的预测精度，达到与调参基线相当甚至更优的效果。

链接: https://arxiv.org/abs/2510.08661
作者: Zipo Jibao,Yingyi Fu,Xinyang Chen,Guoting Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent research demonstrates that linear models achieve forecasting performance competitive with complex architectures, yet methodologies for enhancing linear models remain underexplored. Motivated by the hypothesis that distinct time series instances may follow heterogeneous linear mappings, we propose the Classification Auxiliary Trend-Seasonal Decoupling Linear Model CATS-Linear, employing Classification Auxiliary Channel-Independence (CACI). CACI dynamically routes instances to dedicated predictors via classification, enabling supervised channel design. We further analyze the theoretical expected risks of different channel settings. Additionally, we redesign the trend-seasonal decomposition architecture by adding a decoupling – linear mapping – recoupling framework for trend components and complex-domain linear projections for seasonal components. Extensive experiments validate that CATS-Linear with fixed hyperparameters achieves state-of-the-art accuracy comparable to hyperparameter-tuned baselines while delivering SOTA accuracy against fixed-hyperparameter counterparts.
zh

[AI-89] Provably Robust Adaptation for Language-Empowered Foundation Models

【速读】：该论文旨在解决语言赋能的基础模型（Language-empowered foundation models, LeFMs）在少样本学习场景下对中毒攻击（poisoning attacks）缺乏形式化鲁棒性保障的问题。现有防御方法依赖经验策略，无法提供可证明的鲁棒性，且对未知或自适应攻击仍脆弱。为填补这一空白，作者提出首个针对LeFMs的可证明鲁棒少样本分类器——LeFCert，其核心创新在于：引入一种双折衷均值原型（twofold trimmed mean prototype）以构建鲁棒的类别中心表示，并推导出分类得分的上下界，从而在最坏情况中毒场景下实现认证鲁棒性；同时通过自适应融合文本与特征嵌入机制提升性能，进一步扩展出LeFCert-L（结合随机平滑以满足双重预算约束）和LeFCert-C（支持集体认证的共享中毒预算场景），实验证明该方案在干净准确率和认证准确率上均达到当前最优水平，且计算效率高，具备实际部署潜力。

链接: https://arxiv.org/abs/2510.08659
作者: Yuni Lai,Xiaoyu Xue,Linghui Shen,Yulun Wu,Gaolei Li,Song Guo,Kai Zhou,Bin Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (\textbfLeFCert). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance the performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derive robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.
zh

[AI-90] Inner-Instance Normalization for Time Series Forecasting

【速读】：该论文旨在解决时间序列预测中因非平稳性（non-stationarity）导致的分布漂移（distribution shift）问题，尤其是现有实例归一化方法未能充分处理单个实例内部的分布变化，从而限制了模型性能。解决方案的关键在于提出两种新颖的点级（point-level）方法：Learning Distribution (LD) 和 Learning Conditional Distribution (LCD)。LD 通过在不同时间步使用不同参数拟合输入与输出的内部分布来消除内部差异，而 LCD 则利用神经网络预测输出的缩放系数以适应分布变化。这两种方法从点级层面建模分布动态，有效提升了时间序列预测的鲁棒性和准确性。

链接: https://arxiv.org/abs/2510.08657
作者: Zipo Jibao,Yingyi Fu,Xinyang Chen,Guoting Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world time series are influenced by numerous factors and exhibit complex non-stationary characteristics. Non-stationarity can lead to distribution shifts, where the statistical properties of time series change over time, negatively impacting model performance. Several instance normalization techniques have been proposed to address distribution shifts in time series forecasting. However, existing methods fail to account for shifts within individual instances, leading to suboptimal performance. To tackle inner-instance distribution shifts, we propose two novel point-level methods: Learning Distribution (LD) and Learning Conditional Distribution (LCD). LD eliminates internal discrepancies by fitting the internal distribution of input and output with different parameters at different time steps, while LCD utilizes neural networks to predict scaling coefficients of the output. We evaluate the performance of the two methods with various backbone models across public benchmarks and demonstrate the effectiveness of the point-level paradigm through comparative experiments.
zh

[AI-91] Knowledge Graph Sparsification for GNN-based Rare Disease Diagnosis

【速读】：该论文旨在解决罕见遗传病诊断中面临的三大核心挑战：患者数据稀缺、全基因组测序难以获取，以及可能致病基因数量庞大导致的诊断延迟和误诊问题，尤其在资源有限地区更为突出。其解决方案的关键在于提出RareNet——一种基于子图的图神经网络（Graph Neural Network, GNN），仅需患者的表型数据即可识别最可能的致病基因，并检索出用于靶向临床研究的聚焦患者子图。该方法可独立使用，也可作为其他候选基因优先排序方法的预处理或后处理过滤器，显著提升整体性能并提供可解释性洞察，从而在无需复杂基因组基础设施的临床环境中实现精准遗传分析的普及化。

链接: https://arxiv.org/abs/2510.08655
作者: Premt Cara,Kamilia Zaripova,David Bani-Harouni,Nassir Navab,Azade Farshad
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks. By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.
zh

[AI-92] Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在面对微小但实际有意义的输入扰动时，其输出发生不可预测变化的问题，例如检索增强生成（Retrieval-Augmented Generation, RAG）中段落顺序调整导致输出翻转、微调破坏预训练阶段学习到的不变性、思维链（Chain-of-Thought）或辩论提示路径依赖性强、编译器融合或重排序扰动决策边界附近的logits等。这些问题违背了预期的不变性，破坏持续集成流程，并迫使团队在安全性与速度之间权衡。解决方案的关键在于提出WILSON——一种轻量级后验诊断套件，通过在内部表示上执行简单的循环和重排序检查，转化为系统级信号；其核心机制包括：利用JVP（Jacobian-Vector Product）和Hutchinson探针计算无逆矩阵的曲率图谱（inverse-free curvature map），以及基于激活级别的交换子（commutators）识别重排序风险。这些信号计算成本低、对标准Transformer架构无依赖性，可导出为阈值和CSV文件供自动化调度器使用，从而实现对RAG顺序敏感性的防护、微调回归检测、辩论路径稳定、长多轮对话优化及部署阶段融合操作的准入控制，最终在不修改模型结构或训练过程的前提下提升可靠性和吞吐量。

链接: https://arxiv.org/abs/2510.08648
作者: Edward Y. Chang,Ethan Y. Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Large language models can change answers under harmless edits that matter in practice: RAG outputs flip when passages are reordered, fine-tuning erodes invariances learned at pretraining, debate or chain-of-thought prompts take path-dependent routes, and compiler fusion or reordering perturbs logits near decision boundaries. These failures violate intended invariances, break continuous integration, and force teams to trade safety for speed. The effects are small yet distributed across layers and positions, sensitive to context length and evaluation order, and costly to repair with retraining or formal verification. We present WILSON, a minimal post-hoc diagnostic suite that converts simple loop and reordering checks on internal representations into system signals. WILSON combines an inverse-free curvature map over positions and layers, computed with JVPs and Hutchinson probes, with activation-level commutators that flag reorder risk. Signals are cheap to compute, model-agnostic for standard Transformers, and exported as thresholds and CSV artifacts for orchestrators. This enables concrete actions: guard RAG against order effects, catch fine-tuning regressions, stabilize debate pathways and long multi-turn contexts, and gate fusions or reorders in deployment. In short, WILSON helps anticipate failures and approve safe optimizations so reliability and throughput can improve together without changing model architecture or training.
zh

[AI-93] Automating Android Build Repair: Bridging the Reasoning -Execution Gap in LLM Agents with Domain-Specific Tools

【速读】：该论文旨在解决Android应用自动构建失败的修复问题，尤其是在当前大型语言模型（Large Language Models, LLMs）虽具备代码修复潜力但尚未有效应用于Android构建错误修复的背景下。其核心挑战在于LLMs难以将高层知识转化为在Gradle构建环境中有效的低层操作。解决方案的关键在于提出GradleFixer——一个集成领域特定工具的LLM代理，通过“工具桥接”（Tool Bridging）策略，用面向Gradle构建系统的API式抽象替代通用shell命令，从而显著提升修复成功率（达到81.4% pass@1），并验证了该策略通过提供更可靠的工具接口和约束动作空间来弥合模型高层推理与低层执行之间鸿沟的有效性。

链接: https://arxiv.org/abs/2510.08640
作者: Ha Min Son,Huan Ren,Xin Liu,Zhe Zhao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer’s success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model’s high-level reasoning and effective low-level execution.
zh

[AI-94] Hypothesis Hunting with Evolving Networks of Autonomous Scientific Agents

【速读】：该论文旨在解决大规模科学数据集（如健康生物银行、细胞图谱和地球再分析数据）中难以实现系统性探索性发现的问题，即如何在无预设研究假设的情况下持续挖掘潜在的科学洞见。其核心挑战在于传统方法受限于特定假设，难以充分利用数据的复杂性和多样性。解决方案的关键在于提出AScience框架，将科学发现建模为代理（agent）、网络（network）和评估规范（evaluation norm）之间的交互过程，并通过ASCollab系统实现：由具备异质行为的大型语言模型（LLM）驱动的研究代理自组织成动态演化网络，在统一的评估标准下持续生成并同行评审研究成果。这种社会结构化的代理网络能够有效积累专家评级的结果，推动多样-质量-新颖性前沿的扩展，从而支持可扩展的大规模探索性假设狩猎（hypothesis hunting）。

链接: https://arxiv.org/abs/2510.08619
作者: Tennison Liu,Silas Ruhrberg Estévez,David L. Bentley,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large-scale scientific datasets – spanning health biobanks, cell atlases, Earth reanalyses, and more – create opportunities for exploratory discovery unconstrained by specific research questions. We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. To support it, we introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab, a distributed system of LLM-based research agents with heterogeneous behaviors. These agents self-organize into evolving networks, continually producing and peer-reviewing findings under shared standards of evaluation. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier, including rediscoveries of established biomarkers, extensions of known pathways, and proposals of new therapeutic targets. While wet-lab validation remains indispensable, our experiments on cancer cohorts demonstrate that socially structured, agentic networks can sustain exploratory hypothesis hunting at scale.
zh

[AI-95] Impact of LLM s on Team Collaboration in Software Development

【速读】：该论文试图解决的问题是：在软件开发生命周期（Software Development Life Cycle, SDLC）中，团队协作面临效率低下、沟通不畅和跨职能协同困难等挑战，而大型语言模型（Large Language Models, LLMs）的引入可能带来变革性影响，但其具体作用机制与潜在风险尚不明确。解决方案的关键在于通过系统性研究——包括文献综述、行业案例分析、团队问卷调查及两个实证案例——评估LLM辅助工具（如代码生成助手和AI驱动的项目管理代理）对协作式软件工程实践的影响，识别其在提升效率、改善沟通清晰度和促进跨职能协作方面的潜力，同时揭示模型局限性和隐私安全等新挑战，从而为后续研究提供方向，包括领域特定模型定制、开发工具集成优化以及信任与安全保障策略的构建。

链接: https://arxiv.org/abs/2510.08612
作者: Devang Dhanuka
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being integrated into software development processes, with the potential to transform team workflows and productivity. This paper investigates how LLMs affect team collaboration throughout the Software Development Life Cycle (SDLC). We reframe and update a prior study with recent developments as of 2025, incorporating new literature and case studies. We outline the problem of collaboration hurdles in SDLC and explore how LLMs can enhance productivity, communication, and decision-making in a team context. Through literature review, industry examples, a team survey, and two case studies, we assess the impact of LLM-assisted tools (such as code generation assistants and AI-powered project management agents) on collaborative software engineering practices. Our findings indicate that LLMs can significantly improve efficiency (by automating repetitive tasks and documentation), enhance communication clarity, and aid cross-functional collaboration, while also introducing new challenges like model limitations and privacy concerns. We discuss these benefits and challenges, present research questions guiding the investigation, evaluate threats to validity, and suggest future research directions including domain-specific model customization, improved integration into development tools, and robust strategies for ensuring trust and security.
zh

[AI-96] Relative Positioning Based Code Chunking Method For Rich Context Retrieval In Repository Level Code Completion Task With Code Language Model

【速读】：该论文旨在解决当前代码补全任务中缺乏有效上下文构建策略的问题，即如何利用IDE所获取的信息来提升大语言模型（Large Language Models, LLMs）在代码补全中的性能。其解决方案的关键在于提出了一种有效的上下文收集策略：首先将代码仓库预处理为较小的代码块（code chunks），然后基于语法和语义相似性进行代码块检索，并结合相对位置信息构建最终上下文，从而显著提升LLMs在代码补全任务中的表现。

链接: https://arxiv.org/abs/2510.08610
作者: Imranur Rahman,Md Rayhanur Rahman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to Context Collection Workshop co-located with ASE 2025

点击查看摘要

Abstract:Code completion can help developers improve efficiency and ease the development lifecycle. Although code completion is available in modern integrated development environments (IDEs), research lacks in determining what makes a good context for code completion based on the information available to the IDEs for the large language models (LLMs) to perform better. In this paper, we describe an effective context collection strategy to assist the LLMs in performing better at code completion tasks. The key idea of our strategy is to preprocess the repository into smaller code chunks and later use syntactic and semantic similarity-based code chunk retrieval with relative positioning. We found that code chunking and relative positioning of the chunks in the final context improve the performance of code completion tasks.
zh

[AI-97] he Enduring Dominance of Deep Neural Networks: A Critical Analysis of the Fundamental Limitations of Quantum Machine Learning and Spiking Neural Networks

【速读】：该论文试图解决的问题是：生成式 AI (Generative AI) 和神经形态计算（如脉冲神经网络 SNNs）是否能够在近期取代深度神经网络（DNNs）成为主流人工智能范式。论文指出，尽管量子机器学习（QML）和SNNs在理论上有指数级加速和类脑能效优势，但其实际应用受限于量子硬件噪声、训练不稳定（如“ barren plateaus”）、表示能力不足、能量效率被高估以及训练复杂度高等问题，且与机器学习的泛化目标存在根本冲突。相比之下，DNNs凭借高效的反向传播机制、成熟的正则化策略、基于推理时计算扩展的大型语言模型（LLMs）创新（如自监督预训练与强化学习结合），以及专用ASIC加速，展现出更强的实用性与性能提升潜力，例如Grok-4 Heavy和gpt-oss-120b等模型已证明其在参数规模合理的情况下仍可达到行业领先水平。因此，论文的核心结论是：DNNs仍是当前最可行、高效的AI发展路径，而QML与SNNs可能仅适用于特定混合场景。

链接: https://arxiv.org/abs/2510.08591
作者: Takehiro Ishikawa
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in QML and SNNs have generated considerable excitement, promising exponential speedups and brain-like energy efficiency to revolutionize AI. However, this paper argues that they are unlikely to displace DNNs in the near term. QML struggles with adapting backpropagation due to unitary constraints, measurement-induced state collapse, barren plateaus, and high measurement overheads, exacerbated by the limitations of current noisy intermediate-scale quantum hardware, overfitting risks due to underdeveloped regularization techniques, and a fundamental misalignment with machine learning’s generalization. SNNs face restricted representational bandwidth, struggling with long-range dependencies and semantic encoding in language tasks due to their discrete, spike-based processing. Furthermore, the goal of faithfully emulating the brain might impose inherent inefficiencies like cognitive biases, limited working memory, and slow learning speeds. Even their touted energy-efficient advantages are overstated; optimized DNNs with quantization can outperform SNNs in energy costs under realistic conditions. Finally, SNN training incurs high computational overhead from temporal unfolding. In contrast, DNNs leverage efficient backpropagation, robust regularization, and innovations in LRMs that shift scaling to inference-time compute, enabling self-improvement via RL and search algorithms like MCTS while mitigating data scarcity. This superiority is evidenced by recent models such as xAI’s Grok-4 Heavy, which advances SOTA performance, and gpt-oss-120b, which surpasses or approaches the performance of leading industry models despite its modest 120-billion-parameter size deployable on a single 80GB GPU. Furthermore, specialized ASICs amplify these efficiency gains. Ultimately, QML and SNNs may serve niche hybrid roles, but DNNs remain the dominant, practical paradigm for AI advancement.
zh

[AI-98] EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation

【速读】：该论文旨在解决实时音频驱动人脸动画生成中速度与视觉保真度难以兼顾的问题。现有方法在保持高渲染质量的同时，往往计算复杂度较高，难以满足实时性要求。解决方案的关键在于提出EGSTalker框架，其核心创新包括：（1）采用多分辨率哈希三平面（multi-resolution hash triplane）与Kolmogorov-Arnold Network（KAN）进行静态3D高斯初始化，构建紧凑的三维表示；（2）设计高效空间-音频注意力模块（Efficient Spatial-Audio Attention, ESAA）融合音频与空间特征，并利用KAN预测对应的高斯变形参数，从而实现快速且高质量的面部动画合成。该方案仅需3–5分钟训练视频即可生成逼真唇同步效果，在推理速度上显著优于当前最优方法。

链接: https://arxiv.org/abs/2510.08587
作者: Tianheng Zhu,Yinfeng Yu,Liejun Wang,Fuchun Sun,Wendong Zheng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Main paper (6 pages). Accepted for publication by IEEE International Conference on Systems, Man, and Cybernetics 2025

点击查看摘要

Abstract:This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker’s potential for real-time multimedia applications.
zh

[AI-99] Evaluating Hallucinations in Multimodal LLM s with Spoken Queries under Diverse Acoustic Conditions

【速读】：该论文旨在解决语音输入对多模态大语言模型中幻觉（hallucination）行为的影响问题，即在语音驱动接口场景下，与文本输入相比，语音查询如何加剧或改变模型的错误生成现象。其关键解决方案在于构建了一个音频增强的基准测试框架 RePOPE-Spk，通过在不同声学条件下引入语音查询，系统性地评估了主流开源与闭源模型的鲁棒性表现，揭示了语音输入显著提升幻觉发生率（清洁语音下增加3%，噪声环境下最高达20%），并验证了多示例提示（many-shot prompting）和思维链推理（chain-of-thought reasoning）等策略虽具一定缓解作用但效果有限，从而为未来构建可靠语音交互系统提供了实证基础与研究方向。

链接: https://arxiv.org/abs/2510.08581
作者: Hansol Park,Hoseong Ahn,Junwon Moon,Yejin Lee,Kyuhong Shim
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Hallucinations in vision-language models have been extensively studied using benchmarks that probe reliability in image-text settings. In contrast, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice-driven interfaces. In this work, we investigate how spoken input influences hallucinations in multimodal large language models. We present RePOPE-Spk, an audio-augmented extension of the RePOPE benchmark, where queries are provided as speech under diverse acoustic conditions. Using RePOPE-Spk, we systematically evaluate both proprietary and open-source models. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3% under clean speech and by up to 20% with environmental noise. Input order and query length further affect robustness, while strategies such as many-shot prompting and chain-of-thought reasoning offer partial but insufficient mitigation. These findings highlight a critical and underexplored challenge, opening new directions for building reliable voice interface systems.
zh

[AI-100] LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

【速读】：该论文旨在解决音乐学习中错误检测的准确性问题，即如何更精确地识别演奏者在练习过程中遗漏或多余音符等错误。现有方法通常通过启发式规则或可学习模型将音频与乐谱进行比较，但存在两个关键局限：一是晚期融合（late fusion）限制了多流特征间的对齐能力与跨模态对比性能；二是依赖乐谱-音频映射会导致频谱上的模糊性，尤其在多音同时出现时性能下降。解决方案的关键在于提出一种基于Transformer的两流编码器结构（\textitLadderSym），其包含两个核心创新：(1) 引入跨流对齐模块以增强音频与符号乐谱之间的细粒度匹配能力；(2) 提出一种多模态策略，将符号乐谱作为解码器提示（decoder prompt），从而减少频谱歧义并提升检测精度。实验表明，该方法在MAESTRO-E和CocoChorales-E数据集上显著优于当前最优模型，特别是在遗漏音符检测F1分数上提升超过一倍。

链接: https://arxiv.org/abs/2510.08580
作者: Benjamin Shiue-Hal Chou,Purvish Jajal,Nick John Eliopoulos,James C. Davis,George K. Thiruvathukal,Kristen Yeon-Ji Yun,Yung-Hsiang Lu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Under Submission

点击查看摘要

Abstract:Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces \textitLadderSym, a novel Transformer-based method for music error detection. \textitLadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, \textitLadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the \textitMAESTRO-E and \textitCocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, \textitLadderSym more than doubles F1 for missed notes on \textitMAESTRO-E (26.8% \rightarrow 56.3%) and improves extra note detection by 14.4 points (72.0% \rightarrow 86.4%). Similar gains are observed on \textitCocoChorales-E. This work introduces general insights about comparison models that could inform sequence evaluation tasks for reinforcement Learning, human skill assessment, and model evaluation.
zh

[AI-101] Agent icAD: A Specialized Multiagent System Framework for Holistic Alzheimer Disease Management

【速读】：该论文旨在解决阿尔茨海默病（Alzheimer’s disease, AD）管理中现有人工智能（AI）应用碎片化、缺乏系统集成的问题，即当前工具多局限于单一功能（如诊断或照护者支持），难以实现从患者照护到临床研究的全流程协同。解决方案的关键在于提出一个由八个专业化、可互操作的AI代理组成的多智能体系统（multi-agent system, MAS）框架，通过分层功能模块（照护者与患者支持、数据分析与研究、高级多模态工作流）实现端到端整合，并借助大语言模型（LLMs）、检索增强生成（Retrieval-Augmented Generation, RAG）、多模态数据处理及内存数据库查询等先进技术，构建一个动态协作、可扩展且个性化的智能生态系统，从而提升患者预后并降低照护负担。

链接: https://arxiv.org/abs/2510.08578
作者: Adib Bazgir,Amir Habibdoust,Xing Song,Yuwen Zhang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) presents a complex, multifaceted challenge to patients, caregivers, and the healthcare system, necessitating integrated and dynamic support solutions. While artificial intelligence (AI) offers promising avenues for intervention, current applications are often siloed, addressing singular aspects of the disease such as diagnostics or caregiver support without systemic integration. This paper proposes a novel methodological framework for a comprehensive, multi-agent system (MAS) designed for holistic Alzheimer’s disease management. The objective is to detail the architecture of a collaborative ecosystem of specialized AI agents, each engineered to address a distinct challenge in the AD care continuum, from caregiver support and multimodal data analysis to automated research and clinical data interpretation. The proposed framework is composed of eight specialized, interoperable agents. These agents are categorized by function: (1) Caregiver and Patient Support, (2) Data Analysis and Research, and (3) Advanced Multimodal Workflows. The methodology details the technical architecture of each agent, leveraging a suite of advanced technologies including large language models (LLMs) such as GPT-4o and Gemini, multi-agent orchestration frameworks, Retrieval-Augmented Generation (RAG) for evidence-grounded responses, and specialized tools for web scraping, multimodal data processing, and in-memory database querying. This paper presents a detailed architectural blueprint for an integrated AI ecosystem for AD care. By moving beyond single-purpose tools to a collaborative, multi-agent paradigm, this framework establishes a foundation for developing more adaptive, personalized, and proactive solutions. This methodological approach aims to pave the way for future systems capable of synthesizing diverse data streams to improve patient outcomes and reduce caregiver burden.
zh

[AI-102] deep-REMAP: Probabilistic Parameterization of Stellar Spectra Using Regularized Multi-Task Learning

【速读】：该论文旨在解决大规模光谱数据中传统光谱分析方法在处理效率和精度上的瓶颈问题，特别是在面对海量天文巡天数据时难以实现高精度、自动化恒星大气参数（有效温度 $T_{\rm eff}$ 、表面重力 $\log g$ 、金属丰度 [Fe/H]）估计的挑战。其解决方案的关键在于提出了一种基于正则化多任务深度学习框架的 deep-REMAP 模型，该模型利用 PHOENIX 合成光谱库预训练，并通过迁移学习在 MARVELS 观测的少量 FGK 型矮星样本上微调；同时引入非对称损失函数与嵌入损失（embedding loss），使回归任务转化为分类任务，从而提升模型的可解释性、鲁棒性及对非高斯不确定性的建模能力，实现了对未表征恒星候选体的高效精准参数估计。

链接: https://arxiv.org/abs/2510.09362
作者: Sankalp Gilda
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
备注: 14 pages. Accepted for publication in RASTI

点击查看摘要

Abstract:In the era of exploding survey volumes, traditional methods of spectroscopic analysis are being pushed to their limits. In response, we develop deep-REMAP, a novel deep learning framework that utilizes a regularized, multi-task approach to predict stellar atmospheric parameters from observed spectra. We train a deep convolutional neural network on the PHOENIX synthetic spectral library and use transfer learning to fine-tune the model on a small subset of observed FGK dwarf spectra from the MARVELS survey. We then apply the model to 732 uncharacterized FGK giant candidates from the same survey. When validated on 30 MARVELS calibration stars, deep-REMAP accurately recovers the effective temperature ( T_\rmeff ), surface gravity ( \log \rmg ), and metallicity ([Fe/H]), achieving a precision of, for instance, approximately 75 K in T_\rmeff . By combining an asymmetric loss function with an embedding loss, our regression-as-classification framework is interpretable, robust to parameter imbalances, and capable of capturing non-Gaussian uncertainties. While developed for MARVELS, the deep-REMAP framework is extensible to other surveys and synthetic libraries, demonstrating a powerful and automated pathway for stellar characterization.
zh

[AI-103] PyNoetic: A modular python framework for no-code development of EEG brain-computer interfaces

【速读】：该论文旨在解决当前基于脑电图（Electroencephalography, EEG）的脑机接口（Brain-Computer Interface, BCI）框架在实验研究中面临的四大问题：缺乏分阶段灵活性、非编程背景研究人员的学习曲线陡峭、依赖专有软件导致成本高昂，以及功能分散需使用多个外部工具影响研究一致性。解决方案的关键在于提出一个名为PyNoetic的模块化BCI框架，其核心创新包括：一是提供端到端的图形用户界面（GUI）与独特的“拖放式”流程图配置功能，实现无代码BCI设计，降低使用门槛；二是支持高级用户通过少量代码无缝集成自定义功能和新算法，保障各设计阶段的可扩展性；三是整合从刺激呈现、数据采集到特征提取、伪迹去除、可视化及仿真评估的完整BCI开发流程，并内置机器学习模型、脑连接指数等分析工具，从而显著提升离线与实时BCI开发的效率与灵活性。

链接: https://arxiv.org/abs/2509.00670
作者: Gursimran Singh,Aviral Chharia,Rahul Upadhyay,Vinay Kumar,Luca Longo
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: PLoS One 2025. Project Website: this https URL

点击查看摘要

Abstract:Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) have emerged as a transformative technology with applications spanning robotics, virtual reality, medicine, and rehabilitation. However, existing BCI frameworks face several limitations, including a lack of stage-wise flexibility essential for experimental research, steep learning curves for researchers without programming expertise, elevated costs due to reliance on proprietary software, and a lack of all-inclusive features leading to the use of multiple external tools affecting research outcomes. To address these challenges, we present PyNoetic, a modular BCI framework designed to cater to the diverse needs of BCI research. PyNoetic is one of the very few frameworks in Python that encompasses the entire BCI design pipeline, from stimulus presentation and data acquisition to channel selection, filtering, feature extraction, artifact removal, and finally simulation and visualization. Notably, PyNoetic introduces an intuitive and end-to-end GUI coupled with a unique pick-and-place configurable flowchart for no-code BCI design, making it accessible to researchers with minimal programming experience. For advanced users, it facilitates the seamless integration of custom functionalities and novel algorithms with minimal coding, ensuring adaptability at each design stage. PyNoetic also includes a rich array of analytical tools such as machine learning models, brain-connectivity indices, systematic testing functionalities via simulation, and evaluation methods of novel paradigms. PyNoetic’s strengths lie in its versatility for both offline and real-time BCI development, which streamlines the design process, allowing researchers to focus on more intricate aspects of BCI development and thus accelerate their research endeavors. Project Website: this https URL
zh

机器学习

[LG-0] MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts

链接: https://arxiv.org/abs/2510.09594
作者: Nathan Quiblier,Roy Friedman,Matthew Ricci
类目: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注: 30 pages, 5 figures

点击查看摘要

Abstract:Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.

[LG-1] Automated Evolutionary Optimization for Resource-Efficient Neural Network Training

链接: https://arxiv.org/abs/2510.09566
作者: Ilia Revin,Leon Strelkov,Vadim A. Potemkin,Ivan Kireev,Andrey Savchenko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There are many critical challenges in optimizing neural network models, including distributed computing, compression techniques, and efficient training, regardless of their application to specific tasks. Solving such problems is crucial because the need for scalable and resource-efficient models is increasing. To address these challenges, we have developed a new automated machine learning (AutoML) framework, Parameter Efficient Training with Robust Automation (PETRA). It applies evolutionary optimization to model architecture and training strategy. PETRA includes pruning, quantization, and loss regularization. Experimental studies on real-world data with financial event sequences, as well as image and time-series – benchmarks, demonstrate PETRA’s ability to improve neural model performance and scalability – namely, a significant decrease in model size (up to 75%) and latency (up to 33%), and an increase in throughput (by 13%) without noticeable degradation in the target metric.

[LG-2] Geo-Aware Models for Stream Temperature Prediction across Different Spatial Regions and Scales

链接: https://arxiv.org/abs/2510.09500
作者: Shiyuan Luo,Runlong Yu,Shengyu Chen,Yingda Fan,Yiqun Xie,Yanhua Li,Xiaowei Jia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding environmental ecosystems is vital for the sustainable management of our planet. However,existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity presented in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS’s efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.

[LG-3] Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning

链接: https://arxiv.org/abs/2510.09487
作者: Shangzhe Li,Dongruo Zhou,Weitong Zhang
类目: Machine Learning (cs.LG)
*备注: 48 pages, 3 figures, 4 tables

点击查看摘要

Abstract:We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon H , precision \epsilon and the policy variance \sigma^2 . Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.

[LG-4] Locally Optimal Private Sampling: Beyond the Global Minimax NEURIPS2025

链接: https://arxiv.org/abs/2510.09485
作者: Hrad Ghoukasian,Bonwoo Lee,Shahab Asoodeh
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Information Theory (cs.IT)
*备注: 44 pages, 11 figures. Accepted at NeurIPS 2025

点击查看摘要

Abstract:We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution P \in \mathcalP , the goal is to generate a single sample from a distribution that remains close to P in f -divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS’24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax risk in a neighborhood around a fixed distribution P_0 , and characterize its exact value, which depends on both P_0 and the privacy level. Our main result shows that the local minimax risk is determined by the global minimax risk when the distribution class \mathcalP is restricted to a neighborhood around P_0 . To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near P_0 . Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of f -divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by P_0 . In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.

[LG-5] CRPS-LAM: Regional ensemble weather forecasting from matching marginals

链接: https://arxiv.org/abs/2510.09484
作者: Erik Larsson,Joel Oskarsson,Tomas Landelius,Fredrik Lindsten
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting

[LG-6] Geodesic Calculus on Latent Spaces

链接: https://arxiv.org/abs/2510.09468
作者: Florine Hartwig,Josua Sassen,Juliane Braunsmann,Martin Rumpf,Benedikt Wirth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

[LG-7] Interpretable Machine Learning for Predicting Startup Funding Patenting and Exits

链接: https://arxiv.org/abs/2510.09465
作者: Saeid Mashhadi,Amirhossein Saghezchi,Vesal Ghassemzadeh Kashani
类目: Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.

[LG-8] On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification

链接: https://arxiv.org/abs/2510.09452
作者: Faried Abu Zaid,Tim Katzke,Emmanuel Müller,Daniel Neider
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.

[LG-9] Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models

链接: https://arxiv.org/abs/2510.09435
作者: Hyunin Lee,Yong Zhang,Hoang Vu Nguyen,Xiaoyi Liu,Namyong Park,Christopher Jung,Rong Jin,Yang Wang,Zhigang Wang,Somayeh Sojoudi,Xue Feng
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 19 pages

点击查看摘要

Abstract:Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information from the query input by referencing another domain data which is input key and value. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that those two contrasting alignment mechanisms can co-exist in recommendation models We find that when the query input and output of cross-attention are orthogonal, model performance improves over 300 experiments. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves scaling law. We show that baselines additionally incorporating cross-attention module outperform parameter-matched baselines, achieving a superior accuracy-per-model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.

[LG-10] Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

链接: https://arxiv.org/abs/2510.09423
作者: Yankun Han
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observe depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.

[LG-11] What Do Temporal Graph Learning Models Learn?

链接: https://arxiv.org/abs/2510.09416
作者: Abigail J. Hayes,Tobias Schumacher,Markus Strohmaier
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.

[LG-12] Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training

链接: https://arxiv.org/abs/2510.09405
作者: Yuhao Pan,Xiucheng Wang,Nan Cheng,Wenchao Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radio frequency fingerprint identification (RFFI) is a critical technique for wireless network security, leveraging intrinsic hardware-level imperfections introduced during device manufacturing to enable precise transmitter identification. While deep neural networks have shown remarkable capability in extracting discriminative features, their real-world deployment is hindered by receiver-induced variability. In practice, RF fingerprint signals comprise transmitter-specific features as well as channel distortions and receiver-induced biases. Although channel equalization can mitigate channel noise, receiver-induced feature shifts remain largely unaddressed, causing the RFFI models to overfit to receiver-specific patterns. This limitation is particularly problematic when training and evaluation share the same receiver, as replacing the receiver in deployment can cause substantial performance degradation. To tackle this challenge, we propose an RFFI framework robust to cross-receiver variability, integrating adversarial training and style transfer to explicitly disentangle transmitter and receiver features. By enforcing domain-invariant representation learning, our method isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes. Extensive experiments on multi-receiver datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving up to a 10% improvement in average accuracy across diverse receiver settings.

[LG-13] CHUCKLE – When Humans Teach AI To Learn Emotions The Easy Way

链接: https://arxiv.org/abs/2510.09382
作者: Ankush Pratap Singh,Houwei Cao,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.

[LG-14] Deep Learning to Identify the Spatio-Temporal Cascading Effects of Train Delays in a High-Density Network

链接: https://arxiv.org/abs/2510.09350
作者: Vu Duc Anh Nguyen,Ziyue Li
类目: Machine Learning (cs.LG)
*备注: Accepted at SIGSPATIAL 2025 - GeoAI Workshop

点击查看摘要

Abstract:The operational efficiency of railway networks, a cornerstone of modern economies, is persistently undermined by the cascading effects of train delays. Accurately forecasting this delay propagation is a critical challenge for real-time traffic management. While recent research has leveraged Graph Neural Networks (GNNs) to model the network structure of railways, a significant gap remains in developing frameworks that provide multi-step autoregressive forecasts at a network-wide scale, while simultaneously offering the live, interpretable explanations needed for decision support. This paper addresses this gap by developing and evaluating a novel XGeoAI framework for live, explainable, multi-step train delay forecasting. The core of this work is a two-stage, autoregressive Graph Attention Network (GAT) model, trained on a real-world dataset covering over 40% of the Dutch railway network. The model represents the system as a spatio-temporal graph of operational events (arrivals and departures) and is enriched with granular features, including platform and station congestion. To test its viability for live deployment, the model is rigorously evaluated using a sequential, k-step-ahead forecasting protocol that simulates real-world conditions where prediction errors can compound. The results demonstrate that while the proposed GATv2 model is challenged on pure error metrics (MAE) by a simpler Persistence baseline, it achieves consistently higher precision in classifying delay events – a crucial advantage for a reliable decision support tool.

[LG-15] Safety Game: Balancing Safe and Informative Conversations with Blackbox Agent ic AI using LP Solvers

链接: https://arxiv.org/abs/2510.09330
作者: Tuan Nguyen,Long Tran-Thanh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

[LG-16] Residual-Informed Learning of Solutions to Algebraic Loops

链接: https://arxiv.org/abs/2510.09317
作者: Felix Brandt,Andreas Heuermann,Philip Hannebohm,Bernhard Bachmann
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 16 pages, 16 figures, 5 tables, submitted to IDaS-Schriftenreihe from Hochschule Bielefeld - University of Applied Sciences and Arts (HSBI)

点击查看摘要

Abstract:This paper presents a residual-informed machine learning approach for replacing algebraic loops in equation-based Modelica models with neural network surrogates. A feedforward neural network is trained using the residual (error) of the algebraic loop directly in its loss function, eliminating the need for a supervised dataset. This training strategy also resolves the issue of ambiguous solutions, allowing the surrogate to converge to a consistent solution rather than averaging multiple valid ones. Applied to the large-scale IEEE 14-Bus system, our method achieves a 60% reduction in simulation time compared to conventional simulations, while maintaining the same level of accuracy through error control mechanisms.

[LG-17] Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers

链接: https://arxiv.org/abs/2510.09294
作者: Ilyas Varshavskiy,Bonu Boboeva,Shuhrat Khalilbekov,Azizjon Azimi,Sergey Shulgin,Akhlitdin Nizamitdinov,Haitz Saez de Ocariz Borde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomics shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model

[LG-18] Placeit! A Framework for Learning Robot Object Placement Skills

链接: https://arxiv.org/abs/2510.09267
作者: Amina Ferrad,Johann Huber,François Hélénon,Julien Gleyze,Mahdi Khoramshahi,Stéphane Doncieux
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 8 figures. Draft version

点击查看摘要

Abstract:Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pickplace pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.

[LG-19] GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

链接: https://arxiv.org/abs/2510.09260
作者: Subrat Kishore Dutta,Yuelin Xu,Piyush Pant,Xiao Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work has shown that RLHF is highly susceptible to backdoor attacks, poisoning schemes that inject malicious triggers in preference data. However, existing methods often rely on static, rare-token-based triggers, limiting their effectiveness in realistic scenarios. In this paper, we develop GREAT, a novel framework for crafting generalizable backdoors in RLHF through emotion-aware trigger synthesis. Specifically, GREAT targets harmful response generation for a vulnerable user subgroup characterized by both semantically violent requests and emotionally angry triggers. At the core of GREAT is a trigger identification pipeline that operates in the latent embedding space, leveraging principal component analysis and clustering techniques to identify the most representative triggers. To enable this, we present Erinyes, a high-quality dataset of over 5000 angry triggers curated from GPT-4.1 using a principled, hierarchical, and diversity-promoting approach. Experiments on benchmark RLHF datasets demonstrate that GREAT significantly outperforms baseline methods in attack success rates, especially for unseen trigger scenarios, while largely preserving the response quality on benign inputs.

[LG-20] A PCA-based Data Prediction Method

链接: https://arxiv.org/abs/2510.09246
作者: Peteris Daugulis,Vija Vagale,Emiliano Mancini,Filippo Castiglione
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The problem of choosing appropriate values for missing data is often encountered in the data science. We describe a novel method containing both traditional mathematics and machine learning elements for prediction (imputation) of missing data. This method is based on the notion of distance between shifted linear subspaces representing the existing data and candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.

[LG-21] Investigating the Impact of Rational Dilated Wavelet Transform on Motor Imagery EEG Decoding with Deep Learning Models

链接: https://arxiv.org/abs/2510.09242
作者: Marco Siino,Giuseppe Bonomo,Rosario Sorbello,Ilenia Tinnirello
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The present study investigates the impact of the Rational Discrete Wavelet Transform (RDWT), used as a plug-in preprocessing step for motor imagery electroencephalographic (EEG) decoding prior to applying deep learning classifiers. A systematic paired evaluation (with/without RDWT) is conducted on four state-of-the-art deep learning architectures: EEGNet, ShallowConvNet, MBEEG_SENet, and EEGTCNet. This evaluation was carried out across three benchmark datasets: High Gamma, BCI-IV-2a, and BCI-IV-2b. The performance of the RDWT is reported with subject-wise averages using accuracy and Cohen’s kappa, complemented by subject-level analyses to identify when RDWT is beneficial. On BCI-IV-2a, RDWT yields clear average gains for EEGTCNet (+4.44 percentage points, pp; kappa +0.059) and MBEEG_SENet (+2.23 pp; +0.030), with smaller improvements for EEGNet (+2.08 pp; +0.027) and ShallowConvNet (+0.58 pp; +0.008). On BCI-IV-2b, the enhancements observed are modest yet consistent for EEGNet (+0.21 pp; +0.044) and EEGTCNet (+0.28 pp; +0.077). On HGD, average effects are modest to positive, with the most significant gain observed for MBEEG_SENet (+1.65 pp; +0.022), followed by EEGNet (+0.76 pp; +0.010) and EEGTCNet (+0.54 pp; +0.008). Inspection of the subject material reveals significant enhancements in challenging recordings (e.g., non-stationary sessions), indicating that RDWT can mitigate localized noise and enhance rhythm-specific information. In conclusion, RDWT is shown to be a low-overhead, architecture-aware preprocessing technique that can yield tangible gains in accuracy and agreement for deep model families and challenging subjects.

[LG-22] Incentivizing Time-Aware Fairness in Data Sharing NEURIPS2025

链接: https://arxiv.org/abs/2510.09240
作者: Jiangwei Chen,Kieu Thao Nguyen Pham,Rachael Hwee Ling Sim,Arun Verma,Zhaoxuan Wu,Chuan-Sheng Foo,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.

[LG-23] Characterizing 5G User Throughput via Uncertainty Modeling and Crowdsourced Measurements

链接: https://arxiv.org/abs/2510.09239
作者: Javier Albert-Smet,Zoraida Frias,Luis Mendo,Sergio Melones,Eduardo Yraola
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Submitted to IEEE ICC 2026

点击查看摘要

Abstract:Characterizing application-layer user throughput in next-generation networks is increasingly challenging as the higher capacity of the 5G Radio Access Network (RAN) shifts connectivity bottlenecks towards deeper parts of the network. Traditional methods, such as drive tests and operator equipment counters, are costly, limited, or fail to capture end-to-end (E2E) Quality of Service (QoS) and its variability. In this work, we leverage large-scale crowdsourced measurements-including E2E, radio, contextual and network deployment features collected by the user equipment (UE)-to propose an uncertainty-aware and explainable approach for downlink user throughput estimation. We first validate prior 4G methods, improving R^2 by 8.7%, and then extend them to 5G NSA and 5G SA, providing the first benchmarks for 5G crowdsourced datasets. To address the variability of throughput, we apply NGBoost, a model that outputs both point estimates and calibrated confidence intervals, representing its first use in the field of computer communications. Finally, we use the proposed model to analyze the evolution from 4G to 5G SA, and show that throughput bottlenecks move from the RAN to transport and service layers, as seen by E2E metrics gaining importance over radio-related features.

[LG-24] Prime Implicant Explanations for Reaction Feasibility Prediction ECML KDD2025

链接: https://arxiv.org/abs/2510.09226
作者: Klaus Weinbauer,Tieu-Long Phan,Peter F. Stadler,Thomas Gärtner,Sagar Malhotra
类目: Machine Learning (cs.LG)
*备注: Presented at AIMLAI workshop at ECMLPKDD 2025

点击查看摘要

Abstract:Machine learning models that predict the feasibility of chemical reactions have become central to automated synthesis planning. Despite their predictive success, these models often lack transparency and interpretability. We introduce a novel formulation of prime implicant explanations–also known as minimally sufficient reasons–tailored to this domain, and propose an algorithm for computing such explanations in small-scale reaction prediction tasks. Preliminary experiments demonstrate that our notion of prime implicant explanations conservatively captures the ground truth explanations. That is, such explanations often contain redundant bonds and atoms but consistently capture the molecular attributes that are essential for predicting reaction feasibility.

[LG-25] FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

链接: https://arxiv.org/abs/2510.09222
作者: Zhenglin Wan,Jingxuan Wu,Xingrui Yu,Chubin Zhang,Mingcong Lei,Bo An,Ivor Tsang
类目: Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy’s behavior to stabilize policy learning. Due to the student’s simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.

[LG-26] Provable Watermarking for Data Poisoning Attacks NEURIPS2025

链接: https://arxiv.org/abs/2510.09210
作者: Yifan Zhu,Lijia Yu,Xiao-Shan Gao
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 42 pages, NeurIPS 2025

点击查看摘要

Abstract:In recent years, data poisoning attacks have been increasingly designed to appear harmless and even beneficial, often with the intention of verifying dataset ownership or safeguarding private data from unauthorized use. However, these developments have the potential to cause misunderstandings and conflicts, as data poisoning has traditionally been regarded as a security threat to machine learning systems. To address this issue, it is imperative for harmless poisoning generators to claim ownership of their generated datasets, enabling users to identify potential poisoning to prevent misuse. In this paper, we propose the deployment of watermarking schemes as a solution to this challenge. We introduce two provable and practical watermarking approaches for data poisoning: \em post-poisoning watermarking and \em poisoning-concurrent watermarking. Our analyses demonstrate that when the watermarking length is \Theta(\sqrtd/\epsilon_w) for post-poisoning watermarking, and falls within the range of \Theta(1/\epsilon_w^2) to O(\sqrtd/\epsilon_p) for poisoning-concurrent watermarking, the watermarked poisoning dataset provably ensures both watermarking detectability and poisoning utility, certifying the practicality of watermarking under data poisoning attacks. We validate our theoretical findings through experiments on several attacks, models, and datasets.

[LG-27] Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization

链接: https://arxiv.org/abs/2510.09204
作者: Simon Idoko,Arun Kumar Singh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety-Filter(SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.

[LG-28] Augmented data and neural networks for robust epidemic forecasting: application to COVID-19 in Italy

链接: https://arxiv.org/abs/2510.09192
作者: Giacomo Dimarco,Federica Ferrarese,Lorenzo Pareschi
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we propose a data augmentation strategy aimed at improving the training phase of neural networks and, consequently, the accuracy of their predictions. Our approach relies on generating synthetic data through a suitable compartmental model combined with the incorporation of uncertainty. The available data are then used to calibrate the model, which is further integrated with deep learning techniques to produce additional synthetic data for training. The results show that neural networks trained on these augmented datasets exhibit significantly improved predictive performance. We focus in particular on two different neural network architectures: Physics-Informed Neural Networks (PINNs) and Nonlinear Autoregressive (NAR) models. The NAR approach proves especially effective for short-term forecasting, providing accurate quantitative estimates by directly learning the dynamics from data and avoiding the additional computational cost of embedding physical constraints into the training. In contrast, PINNs yield less accurate quantitative predictions but capture the qualitative long-term behavior of the system, making them more suitable for exploring broader dynamical trends. Numerical simulations of the second phase of the COVID-19 pandemic in the Lombardy region (Italy) validate the effectiveness of the proposed approach.

[LG-29] RepDL: Bit-level Reproducible Deep Learning Training and Inference

链接: https://arxiv.org/abs/2510.09180
作者: Peichen Xie,Xian Zhang,Shuo Chen
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Originally drafted in 2023

点击查看摘要

Abstract:Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at this https URL .

[LG-30] Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints NEURIPS

链接: https://arxiv.org/abs/2510.09175
作者: Ling Zhan,Junjie Huang,Xiaoyao Yu,Wenyu Chen,Tao Jia
类目: Machine Learning (cs.LG)
*备注: 33 pages, 10 figures, NeurIPS

点击查看摘要

Abstract:Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on this https URL.

[LG-31] Robustness and Regularization in Hierarchical Re-Basin

链接: https://arxiv.org/abs/2510.09174
作者: Benedikt Franke,Florian Heinrich,Markus Lange,Arne Raul
类目: Machine Learning (cs.LG)
*备注: Published in 32th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2024

点击查看摘要

Abstract:This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.

[LG-32] Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization

链接: https://arxiv.org/abs/2510.09160
作者: Le-Trung Nguyen,Enzo Tartaglione,Van-Tam Nguyen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model’s essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to 62\times and computational cost (FLOPs) by up to 2\times . On a Raspberry Pi 5, WASI achieves roughly 1.5\times faster training and inference than vanilla training.

[LG-33] Agent ic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2510.09156
作者: Jing Li,Zhijie Sun,Zhicheng Zhou,Suming Qiu,Junjie Huang,Haijia Sun,Linyuan Qiu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.

[LG-34] Logits Replay MoClip: Stabilized Low-Cost Post-Training with Minimal Forgetting

链接: https://arxiv.org/abs/2510.09152
作者: Suming Qiu,Jing Li,Zhicheng Zhou,Junjie Huang,Linyuan Qiu,Zhijie Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.09152 [cs.LG] (or arXiv:2510.09152v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.09152 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-35] Score-Based Density Estimation from Pairwise Comparisons

链接: https://arxiv.org/abs/2510.09146
作者: Petrus Mikkola,Luigi Acerbi,Arto Klami
类目: Machine Learning (cs.LG)
*备注: 32 pages, 26 figures

点击查看摘要

Abstract:We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner’s score via score-matching. This allows estimating the target by `de-tempering’ the estimated winner density’s score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.

[LG-36] Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

链接: https://arxiv.org/abs/2510.09127
作者: Orin Levy,Liad Erez,Alon Cohen,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over K actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class \Pi we establish an optimal expected regret bound of O (\sqrtKT \log |\Pi| + \sqrtD \log |\Pi|) where D is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class \mathcalF with access to an online least-square regression oracle \mathcalO over \mathcalF . In this setting, we achieve an expected regret bound of O(\sqrtKT\mathcalR_T(\mathcalO) + \sqrt d_\max D \beta) assuming FIFO order, where d_\max is the maximal delay, \mathcalR_T(\mathcalO) is an upper bound on the oracle’s regret and \beta is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk’s aggregating forecaster as an oracle implementation for least-square regression over a finite function class \mathcalF and show that its stability parameter \beta is bounded by \log |\mathcalF| , resulting in an expected regret bound of O(\sqrtKT \log |\mathcalF| + \sqrtd_\max D \log |\mathcalF|) which is a \sqrtd_\max factor away from the lower bound of \Omega(\sqrtKT \log |\mathcalF| + \sqrtD \log |\mathcalF|) that we also present.

[LG-37] AdaPM: a Partial Momentum Algorithm for LLM Training

链接: https://arxiv.org/abs/2510.09103
作者: Yimu Zhang,Yuanshi Liu,Cong Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum by a bias correction technique. Empirically, we verify that our approach reduces memory by over 90% in momentum while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to 95% in optimizer states by combining the memory-efficient technique on the second-order statistic, saving over 30% GPU hours for pretraining GPT-2 1.5B.

[LG-38] Neural Codecs as Biosignal Tokenizers

链接: https://arxiv.org/abs/2510.09095
作者: Kleanthis Avramidis,Tiantian Feng,Woojae Jeong,Jihwan Lee,Wenhui Cui,Richard M Leahy,Shrikanth Narayanan
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 25 pages, 7 figures, 10 tables, currently under peer review

点击查看摘要

Abstract:Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.

[LG-39] FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

链接: https://arxiv.org/abs/2510.09085
作者: Atul Shree,Harshith Jupuru
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.

[LG-40] Improving Anomaly Detection in Industrial Time Series: The Role of Segmentation and Heterogeneous Ensemble

链接: https://arxiv.org/abs/2510.09079
作者: Emilio Mastriani,Alessandro Costa,Federico Incardona,Kevin Munari,Sebastiano Spinello
类目: Machine Learning (cs.LG)
*备注: Conference paper. Under publication process at CODIT 2025

点击查看摘要

Abstract:Concerning machine learning, segmentation models can identify state changes within time series, facilitating the detection of transitions between normal and anomalous conditions. Specific techniques such as Change Point Detection (CPD), particularly algorithms like ChangeFinder, have been successfully applied to segment time series and improve anomaly detection by reducing temporal uncertainty, especially in multivariate environments. In this work, we explored how the integration of segmentation techniques, combined with a heterogeneous ensemble, can enhance anomaly detection in an industrial production context. The results show that applying segmentation as a pre-processing step before selecting heterogeneous ensemble algorithms provided a significant advantage in our case study, improving the AUC-ROC metric from 0.8599 (achieved with a PCA and LSTM ensemble) to 0.9760 (achieved with Random Forest and XGBoost). This improvement is imputable to the ability of segmentation to reduce temporal ambiguity and facilitate the learning process of supervised algorithms. In our future work, we intend to assess the benefit of introducing weighted features derived from the study of change points, combined with segmentation and the use of heterogeneous ensembles, to further optimize model performance in early anomaly detection.

[LG-41] MCMC: Bridging Rendering Optimization and Generative AI SIGGRAPH

链接: https://arxiv.org/abs/2510.09078
作者: Gurprit Singh,Wenzel Jakob
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: SIGGRAPH Asia 2024 Courses. arXiv admin note: text overlap with arXiv:2208.11970 by other authors

点击查看摘要

Abstract:Generative artificial intelligence (AI) has made unprecedented advances in vision language models over the past two years. During the generative process, new samples (images) are generated from an unknown high-dimensional distribution. Markov Chain Monte Carlo (MCMC) methods are particularly effective in drawing samples from such complex, high-dimensional distributions. This makes MCMC methods an integral component for models like EBMs, ensuring accurate sample generation. Gradient-based optimization is at the core of modern generative models. The update step during the optimization forms a Markov chain where the new update depends only on the current state. This allows exploration of the parameter space in a memoryless manner, thus combining the benefits of gradient-based optimization and MCMC sampling. MCMC methods have shown an equally important role in physically based rendering where complex light paths are otherwise quite challenging to sample from simple importance sampling techniques. A lot of research is dedicated towards bringing physical realism to samples (images) generated from diffusion-based generative models in a data-driven manner, however, a unified framework connecting these techniques is still missing. In this course, we take the first steps toward understanding each of these components and exploring how MCMC could potentially serve as a bridge, linking these closely related areas of research. Our course aims to provide necessary theoretical and practical tools to guide students, researchers and practitioners towards the common goal of generative physically based rendering. All Jupyter notebooks with demonstrations associated to this tutorial can be found on the project webpage: this https URL Comments: SIGGRAPH Asia 2024 Courses. arXiv admin note: text overlap with arXiv:2208.11970 by other authors Subjects: Graphics (cs.GR); Machine Learning (cs.LG) Cite as: arXiv:2510.09078 [cs.GR] (or arXiv:2510.09078v1 [cs.GR] for this version) https://doi.org/10.48550/arXiv.2510.09078 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: SIGGRAPH Asia 2024 Courses, Article No.: 8, Pages 1 - 27 Related DOI: https://doi.org/10.1145/3680532.368959 Focus to learn more DOI(s) linking to related resources

[LG-42] Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration

链接: https://arxiv.org/abs/2510.09048
作者: Jose Tupayachi,Mustafa C. Camur,Kevin Heaslip,Xueping Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure company in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.

[LG-43] MAKO: Meta-Adaptive Koopman Operators for Learning-based Model Predictive Control of Parametrically Uncertain Nonlinear Systems

链接: https://arxiv.org/abs/2510.09042
作者: Minghao Han,Kiwan Wong,Adrian Wing-Keung Law,Xunyuan Yin
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we propose a meta-learning-based Koopman modeling and predictive control approach for nonlinear systems with parametric uncertainties. An adaptive deep meta-learning-based modeling approach, called Meta Adaptive Koopman Operator (MAKO), is proposed. Without knowledge of the parametric uncertainty, the proposed MAKO approach can learn a meta-model from a multi-modal dataset and efficiently adapt to new systems with previously unseen parameter settings by using online data. Based on the learned meta Koopman model, a predictive control scheme is developed, and the stability of the closed-loop system is ensured even in the presence of previously unseen parameter settings. Through extensive simulations, our proposed approach demonstrates superior performance in both modeling accuracy and control efficacy as compared to competitive baselines.

[LG-44] Convergence of optimizers implies eigenvalues filtering at equilibrium

链接: https://arxiv.org/abs/2510.09034
作者: Jerome Bolte(TSE-R),Quoc-Tung Le(UGA, LJK),Edouard Pauwels(TSE-R)
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard–Perron stable manifold theorem and applies to general semialgebraic C^2 functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.

[LG-45] he Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

链接: https://arxiv.org/abs/2510.09023
作者: Milad Nasr,Nicholas Carlini,Chawin Sitawarin,Sander V. Schulhoff,Jamie Hayes,Michael Ilie,Juliette Pluto,Shuang Song,Harsh Chaudhari,Ilia Shumailov,Abhradeep Thakurta,Kai Yuanqing Xiao,Andreas Terzis,Florian Tramèr
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense’s design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2510.09023 [cs.LG] (or arXiv:2510.09023v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.09023 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-46] he Environmental Impacts of Machine Learning Training Keep Rising Evidencing Rebound Effect

链接: https://arxiv.org/abs/2510.09022
作者: Clément Morand(STL),Anne-Laure Ligozat(ENSIIE, LISN, STL),Aurélie Névéol(STL, LISN)
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2412.17376

点击查看摘要

Abstract:Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.

[LG-47] MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient Inversion

链接: https://arxiv.org/abs/2510.09020
作者: Zekai Chen,Xunkai Li,Sirui Zhang,Henan Sun,Jia Li,Zhenjun Li,Bing Zhou,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: 52 pages, 14 figures, 12 tables

点击查看摘要

Abstract:De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by the \textbfPseudo De Novo, \textbfLimited Docking Modeling, and \textbfInflexible Ligand Type. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1% and 11.7% over SOTA baselines specialized for protein or molecule ligand design, respectively.

[LG-48] Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference

链接: https://arxiv.org/abs/2510.09018
作者: Ian Harshbarger,Calvin Chidambaram
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs such as: A 96.45% reduction in mean latency and a 97.31% reduction in energy usage dropping accuracy to the slimmest model available (70.3%). It can then accomplish an overall reduction in average latency plus energy consumption with an increase in accuracy at the cost of higher standard deviations of said latency and energy, effecting overall task throughput.

[LG-49] LLM Unlearning on Noisy Forget Sets: A Study of Incomplete Rewritten and Watermarked Data

链接: https://arxiv.org/abs/2510.09007
作者: Changsheng Wang,Yihua Zhang,Dennis Wei,Jinghan Jia,Pin-Yu Chen,Sijia Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by 18th ACM Workshop on Artificial Intelligence and Security (AISec’25)

点击查看摘要

Abstract:Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.

[LG-50] Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search

链接: https://arxiv.org/abs/2510.08992
作者: Kamel Alrashedy,Vriksha Srihari,Zulfiqar Zaidi,Ridam Srivastava,Pradyumna Tambwekar,Matthew Gombolay
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods, expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior that enables Monte Carlo Tree Search (MCTS) focus search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint)pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains Risk game, CAD code generation, and arithmetic reasoning that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-of-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.

[LG-51] FedL2T: Personalized Federated Learning with Two-Teacher Distillation for Seizure Prediction

链接: https://arxiv.org/abs/2510.08984
作者: Jionghao Lou,Jian Zhang,Zhongmei Li,Lanlan Chen,Enbo Feng
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The training of deep learning models in seizure prediction requires large amounts of Electroencephalogram (EEG) data. However, acquiring sufficient labeled EEG data is difficult due to annotation costs and privacy constraints. Federated Learning (FL) enables privacy-preserving collaborative training by sharing model updates instead of raw data. However, due to the inherent inter-patient variability in real-world scenarios, existing FL-based seizure prediction methods struggle to achieve robust performance under heterogeneous client settings. To address this challenge, we propose FedL2T, a personalized federated learning framework that leverages a novel two-teacher knowledge distillation strategy to generate superior personalized models for each client. Specifically, each client simultaneously learns from a globally aggregated model and a dynamically assigned peer model, promoting more direct and enriched knowledge exchange. To ensure reliable knowledge transfer, FedL2T employs an adaptive multi-level distillation strategy that aligns both prediction outputs and intermediate feature representations based on task confidence. In addition, a proximal regularization term is introduced to constrain personalized model updates, thereby enhancing training stability. Extensive experiments on two EEG datasets demonstrate that FedL2T consistently outperforms state-of-the-art FL methods, particularly under low-label conditions. Moreover, FedL2T exhibits rapid and stable convergence toward optimal performance, thereby reducing the number of communication rounds and associated overhead. These results underscore the potential of FedL2T as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios.

[LG-52] HiBBO: HiPPO-based Space Consistency for High-dimensional Bayesian Optimisation

链接: https://arxiv.org/abs/2510.08965
作者: Junyu Xuan,Wenlong Chen,Yingzhen Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bayesian Optimisation (BO) is a powerful tool for optimising expensive blackbox functions but its effectiveness diminishes in highdimensional spaces due to sparse data and poor surrogate model scalability While Variational Autoencoder (VAE) based approaches address this by learning low-dimensional latent representations the reconstructionbased objective function often brings the functional distribution mismatch between the latent space and original space leading to suboptimal optimisation performance In this paper we first analyse the reason why reconstructiononly loss may lead to distribution mismatch and then propose HiBBO a novel BO framework that introduces the space consistency into the latent space construction in VAE using HiPPO - a method for longterm sequence modelling - to reduce the functional distribution mismatch between the latent space and original space Experiments on highdimensional benchmark tasks demonstrate that HiBBO outperforms existing VAEBO methods in convergence speed and solution quality Our work bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimisation enabling broader applications in neural architecture search materials science and beyond.

[LG-53] When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach

链接: https://arxiv.org/abs/2510.08952
作者: Zhihan Zhang,Xunkai Li,Yilong Zuo,Zhenjun Li,Bing Zhou,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 7figures

点击查看摘要

Abstract:Text-attributed graphs (TAGs) have emerged as a powerful representation that combines structural connections with fine-grained semantics, supporting a wide range of data-centric applications. However, the performance of graph neural networks (GNNs) on TAGs is highly sensitive to input quality. Our empirical study shows that both traditional GNNs and LLM-enhanced GNNs suffer significant degradation across nine representative scenarios of sparsity, noise, and imbalance, highlighting graph quality as a critical bottleneck. Existing approaches mainly focus on improving model architectures, while neglecting systematic optimization of TAG data itself, leading to limited effectiveness in practice. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework that treats graph quality control as a first-class, data-centric problem. LAGA integrates four collaborative agents-detection, planning, action, and evaluation-into an automated closed loop. At its core, the action agent employs a dual-encoder and tri-objective design to capture complementary information across modalities and perform holistic graph quality enhancement. Experiments across nine scenarios show that LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones, validating data-centric quality optimization as key to reliable TAGs and robust graph learning.

[LG-54] Variability Aware Recursive Neural Network (VARNN): A Residual-Memory Model for Capturing Temporal Deviation in Sequence Regression Modeling

链接: https://arxiv.org/abs/2510.08944
作者: Haroon Gharwi,Kai Shu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying noise (heteroscedastic) that degrade the robustness of standard regression models. We introduce the Variability-Aware Recursive Neural Network (VARNN), a novel residual-aware architecture for supervised time-series regression that learns an explicit error memory from recent prediction residuals and uses it to recalibrate subsequent predictions. VARNN augments a feed-forward predictor with a learned error-memory state that is updated from residuals over a short context steps as a signal of variability and drift, and then conditions the final prediction at the current time step. Across diverse dataset domains, appliance energy, healthcare, and environmental monitoring, experimental results demonstrate VARNN achieves superior performance and attains lower test MSE with minimal computational overhead over static, dynamic, and recurrent baselines. Our findings show that the VARNN model offers robust predictions under a drift and volatility environment, highlighting its potential as a promising framework for time-series learning.

[LG-55] MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths

链接: https://arxiv.org/abs/2510.08932
作者: Moyu Zhang,Yun Chen,Yujun Jin,Jinxin Hu,Yu Zhang,Xiaoyi Zeng
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 10 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.

[LG-56] AB-PINNs: Adaptive-Basis Physics-Informed Neural Networks for Residual-Driven Domain Decomposition

链接: https://arxiv.org/abs/2510.08924
作者: Jonah Botvinick-Greenhouse,Wael H. Ali,Mouhacine Benosman,Saviz Mowlavi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce adaptive-basis physics-informed neural networks (AB-PINNs), a novel approach to domain decomposition for training PINNs in which existing subdomains dynamically adapt to the intrinsic features of the unknown solution. Drawing inspiration from classical mesh refinement techniques, we also modify the domain decomposition on-the-fly throughout training by introducing new subdomains in regions of high residual loss, thereby providing additional expressive power where the solution of the differential equation is challenging to represent. Our flexible approach to domain decomposition is well-suited for multiscale problems, as different subdomains can learn to capture different scales of the underlying solution. Moreover, the ability to introduce new subdomains during training helps prevent convergence to unwanted local minima and can reduce the need for extensive hyperparameter tuning compared to static domain decomposition approaches. Throughout, we present comprehensive numerical results which demonstrate the effectiveness of AB-PINNs at solving a variety of complex multiscale partial differential equations.

[LG-57] Simple and Robust Forecasting of Spatiotemporally Correlated Small Earth Data with A Tabular Foundation Model

链接: https://arxiv.org/abs/2510.08920
作者: Yuting Yang,Gang Mei,Zhengjing Ma,Nengxiong Xu,Jianbing Peng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Small Earth data are geoscience observations with limited short-term monitoring variability, providing sparse but meaningful measurements, typically exhibiting spatiotemporal correlations. Spatiotemporal forecasting on such data is crucial for understanding geoscientific processes despite their small scale. However, conventional deep learning models for spatiotemporal forecasting requires task-specific training for different scenarios. Foundation models do not need task-specific training, but they often exhibit forecasting bias toward the global mean of the pretraining distribution. Here we propose a simple and robust approach for spatiotemporally correlated small Earth data forecasting. The essential idea is to characterize and quantify spatiotemporal patterns of small Earth data and then utilize tabular foundation models for accurate forecasting across different scenarios. Comparative results across three typical scenarios demonstrate that our forecasting approach achieves superior accuracy compared to the graph deep learning model (T-GCN) and tabular foundation model (TabPFN) in the majority of instances, exhibiting stronger robustness.

[LG-58] Velocity and Density-Aware RRI Analysis and Optimization for AoI Minimization in IoV SPS

链接: https://arxiv.org/abs/2510.08911
作者: Maoxin Ji,Tong Wang,Qiong Wu,Pingyi Fan,Nan Cheng,Wen Chen
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This paper has been submitted to IEEE Communications Letters

点击查看摘要

Abstract:Addressing the problem of Age of Information (AoI) deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling (SPS) for the Internet of Vehicles (IoV), this letter proposes an optimization approach based on Large Language Models (LLM) and Deep Deterministic Policy Gradient (DDPG). First, an AoI calculation model influenced by vehicle speed, vehicle density, and Resource Reservation Interval (RRI) is established, followed by the design of a dual-path optimization scheme. The DDPG is guided by the state space and reward function, while the LLM leverages contextual learning to generate optimal parameter configurations. Experimental results demonstrate that LLM can significantly reduce AoI after accumulating a small number of exemplars without requiring model training, whereas the DDPG method achieves more stable performance after training.

[LG-59] An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

链接: https://arxiv.org/abs/2510.08882
作者: Haolin Liu,Chen-Yu Wei,Julian Zimmert
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from T^3/4 to T^2/3 (on-policy) and from T^5/6 to T^7/9 (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from T^2/3 to \sqrtT . This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.08882 [cs.LG] (or arXiv:2510.08882v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.08882 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-60] Multi-fidelity Batch Active Learning for Gaussian Process Classifiers

链接: https://arxiv.org/abs/2510.08865
作者: Murray Cutforth,Yiming Yang,Tiffany Fan,Serge Guillas,Eric Darve
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.

[LG-61] On the Alignment Between Supervised and Self-Supervised Contrastive Learning

链接: https://arxiv.org/abs/2510.08852
作者: Achleshwar Luthra,Priyadarsi Mishra,Tomer Galanti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: \em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives? We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\hrefthis https URLcode, \hrefthis https URLproject page]. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2510.08852 [cs.LG] (or arXiv:2510.08852v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2510.08852 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-62] Long-Tailed Recognition via Information-Preservable Two-Stage Learning NEURIPS2025

链接: https://arxiv.org/abs/2510.08836
作者: Fudong Lin,Xu Yuan
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025 as Spotlight

点击查看摘要

Abstract:The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model’s overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at this https URL.

[LG-63] nyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

链接: https://arxiv.org/abs/2510.08808
作者: Michal Podstawski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.

[LG-64] Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

链接: https://arxiv.org/abs/2510.08807
作者: Zhenyu Zhao,Hongyi Jing,Xiawei Liu,Jiageng Mao,Abha Jha,Hanwen Yang,Rong Xue,Sergey Zakharor,Vitor Guizilini,Yue Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

[LG-65] Man-Made Heuristics Are Dead. Long Live Code Generators!

链接: https://arxiv.org/abs/2510.08803
作者: Rohit Dwivedula,Divyanshu Saxena,Aditya Akella,Swarat Chaudhuri,Daehyeok Kim
类目: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 10 pages, 2 figures, 2 tables. To be presented at HotNets 2025

点击查看摘要

Abstract:Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies - web caching and congestion control, highlighting the opportunities unraveled by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that integrate directly into the Linux kernel.

[LG-66] Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops ICML

链接: https://arxiv.org/abs/2510.08802
作者: S M Rafiuddin
类目: Machine Learning (cs.LG)
*备注: 6 Pages, 6 Figures, 3 Tables, Accepted as a Regular Research paper at ICMLA 2025

点击查看摘要

Abstract:Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems

[LG-67] APAS: Datasets for Learning the Learning with Errors Problem

链接: https://arxiv.org/abs/2510.08797
作者: Eshika Saxena,Alberto Alfarano,François Charton,Emily Wenger,Kristin Lauter
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform “classical” attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners’ ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.

[LG-68] PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure

链接: https://arxiv.org/abs/2510.08795
作者: Junyi Wu,Guang Lin
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov–Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov–Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch–trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers’ equation with \nu=0.01 , PO-CKAN reduces the mean relative L^2 error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion–reaction benchmarks.

[LG-69] Weights initialization of neural networks for function approximation

链接: https://arxiv.org/abs/2510.08780
作者: Xinwen Hu,Yunqing Huang,Nianyu Yi,Peimeng Yin
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 19 pages, 10 figures

点击查看摘要

Abstract:Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.

[LG-70] Prioritizing Latency with Profit: A DRL-Based Admission Control for 5G Network Slices

链接: https://arxiv.org/abs/2510.08769
作者: Proggya Chakraborty,Aaquib Asrar,Jayasree Sengupta,Sipra Das Bit
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Performance (cs.PF)
*备注: 6 pages, 8 figures, Accepted for publication at IEEE ANTS 2025

点击查看摘要

Abstract:5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, commonly used epsilon-greedy exploration of DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC – a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and improved resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.

[LG-71] Zero-Shot Policy Transfer in Reinforcement Learning using Buckinghams Pi Theorem

链接: https://arxiv.org/abs/2510.08768
作者: Francisco Pascoa,Ian Lalonde,Alexandre Girard
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham’s Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.

[LG-72] Reinforcement Learning-Based Optimization of CT Acquisition and Reconstruction Parameters Through Virtual Imaging Trials

链接: https://arxiv.org/abs/2510.08763
作者: David Fenwick,Navid NaderiAlizadeh,Vahid Tarokh,Nicholas Felice,Darin Clark,Jayasai Rajagopal,Anuj Kapadia,Benjamin Wildman-Tobriner,Ehsan Samei,Ehsan Abadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d’) of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d’ across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.

[LG-73] Spatial Deconfounder: Interference-Aware Deconfounding for Spatial Causal Inference

链接: https://arxiv.org/abs/2510.08762
作者: Ayush Khot,Miruna Oprescu,Maresa Schröder,Ai Kagawa,Xihaier Luo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard no-interference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected: interference reveals structure in the latent confounder. Leveraging this insight, we propose the Spatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions–without requiring multiple treatment types or a known model of the latent field. Empirically, we extend SpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data.

[LG-74] LOTION: Smoothing the Optimization Landscape for Quantized Training

链接: https://arxiv.org/abs/2510.08757
作者: Mujin Kwun,Depen Morwani,Chloe Huangyuan Su,Stephanie Gil,Nikhil Anand,Sham Kakade
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 9 pages of main text + appendices

点击查看摘要

Abstract:Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, \textbfLow-precision \textbfOptimization via s\textbfTochastic-no\textbfIse sm\textbfOothi\textbfNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M- parameter language models.

[LG-75] Conformal Risk Training: End-to-End Optimization of Conformal Risk Control NEURIPS2025

链接: https://arxiv.org/abs/2510.08748
作者: Christopher Yeh,Nicolas Christianson,Adam Wierman,Yisong Yue
类目: Machine Learning (cs.LG)
*备注: accepted to NeurIPS 2025

点击查看摘要

Abstract:While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce “conformal risk training,” an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers’ false negative rate and controlling financial risk in battery storage operation.

[LG-76] RFOD: Random Forest-based Outlier Detection for Tabular Data

链接: https://arxiv.org/abs/2510.08747
作者: Yihao Ang,Peicheng Yao,Yifan Bao,Yushuo Feng,Qiang Huang,Anthony K. H. Tung,Zhiyong Huang
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 13 pages, 13 figures, and 4 tables

点击查看摘要

Abstract:Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce \textsf\textbfRFOD, a novel \textsf\textbfRandom \textsf\textbfForest-based \textsf\textbfOutlier \textsf\textbfDetection framework tailored for tabular data. Rather than modeling a global joint distribution, \textsfRFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, \textsfRFOD combines Adjusted Gower’s Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that \textsfRFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.

[LG-77] Faithful and Interpretable Explanations for Complex Ensemble Time Series Forecasts using Surrogate Models and Forecastability Analysis

链接: https://arxiv.org/abs/2510.08739
作者: Yikai Zhao,Jiekai Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern time series forecasting increasingly relies on complex ensemble models generated by AutoML systems like AutoGluon, delivering superior accuracy but with significant costs to transparency and interpretability. This paper introduces a comprehensive, dual-approach framework that addresses both the explainability and forecastability challenges in complex time series ensembles. First, we develop a surrogate-based explanation methodology that bridges the accuracy-interpretability gap by training a LightGBM model to faithfully mimic AutoGluon’s time series forecasts, enabling stable SHAP-based feature attributions. We rigorously validated this approach through feature injection experiments, demonstrating remarkably high faithfulness between extracted SHAP values and known ground truth effects. Second, we integrated spectral predictability analysis to quantify each series’ inherent forecastability. By comparing each time series’ spectral predictability to its pure noise benchmarks, we established an objective mechanism to gauge confidence in forecasts and their explanations. Our empirical evaluation on the M5 dataset found that higher spectral predictability strongly correlates not only with improved forecast accuracy but also with higher fidelity between the surrogate and the original forecasting model. These forecastability metrics serve as effective filtering mechanisms and confidence scores, enabling users to calibrate their trust in both the forecasts and their explanations. We further demonstrated that per-item normalization is essential for generating meaningful SHAP explanations across heterogeneous time series with varying scales. The resulting framework delivers interpretable, instance-level explanations for state-of-the-art ensemble forecasts, while equipping users with forecastability metrics that serve as reliability indicators for both predictions and their explanations.

[LG-78] SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot

链接: https://arxiv.org/abs/2510.08737
作者: Justin Lin,Julia Fukuyama
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 23 pages, 15 figures, 3 tables

点击查看摘要

Abstract:In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer’s disease using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.

[LG-79] ransmuting prompts into weights

链接: https://arxiv.org/abs/2510.08734
作者: Hanna Mazzawi,Benoit Dherin,Michael Munn,Michael Wunder,Javier Gonzalvo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt’s influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.

[LG-80] Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

链接: https://arxiv.org/abs/2510.08726
作者: Yifan Zhao,Egan Johnson,Prasanth Chatarasi,Vikram Adve,Sasa Misailovic
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. The paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have average speedup of 1.35\times over the next best alternative, demonstrating its effectiveness for deep learning workloads. Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG) Cite as: arXiv:2510.08726 [cs.PL] (or arXiv:2510.08726v1 [cs.PL] for this version) https://doi.org/10.48550/arXiv.2510.08726 Focus to learn more arXiv-issued DOI via DataCite

[LG-81] Counterfactually Fair Conformal Prediction

链接: https://arxiv.org/abs/2510.08724
作者: Ozgur Guldogan,Neeraj Sarna,Yuanyuan Li,Michael Berger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While counterfactual fairness of point predictors is well studied, its extension to prediction sets–central to fair decision-making under uncertainty–remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.

[LG-82] Dont Waste Mistakes: Leverag ing Negative RL-Groups via Confidence Reweighting

链接: https://arxiv.org/abs/2510.08696
作者: Yunzhen Feng,Parag Jain,Anthony Hartshorn,Yaqi Duan,Julia Kempe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as \textbfLikelihood \textbfEstimation with \textbfNegative \textbfSamples (\textbfLENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to “rescue” negative groups, improving efficiency and performance in RLVR.

[LG-83] How Scale Breaks “Normalized Stress” and KL Divergence: Rethinking Quality Metrics

链接: https://arxiv.org/abs/2510.08660
作者: Kiran Smelser,Kaviru Gunaratne,Jacob Miller,Stephen Kobourov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2408.07724

点击查看摘要

Abstract:Complex, high-dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two-dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure the projection’s accuracy and faithfulness to the original data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling (stretching, shrinking) of the projection, despite this act not meaningfully changing anything about the projection. Another quality metric, the Kullback–Leibler (KL) divergence used in the popular t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, is also susceptible to this scale sensitivity. We investigate the effect of scaling on stress and KL divergence analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make both metrics scale-invariant and show that it accurately captures expected behavior on a small benchmark.

[LG-84] Which Is Better For Reducing Outdated and Vulnerable Dependencies: Pinning or Floating?

链接: https://arxiv.org/abs/2510.08609
作者: Imranur Rahman,Jill Marley,William Enck,Laurie Williams
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Accepted to ASE 2025

点击查看摘要

Abstract:Developers consistently use version constraints to specify acceptable versions of the dependencies for their project. \emphPinning dependencies can reduce the likelihood of breaking changes, but comes with a cost of manually managing the replacement of outdated and vulnerable dependencies. On the other hand, \emphfloating can be used to automatically get bug fixes and security fixes, but comes with the risk of breaking changes. Security practitioners advocate \emphpinning dependencies to prevent against software supply chain attacks, e.g., malicious package updates. However, since \emphpinning is the tightest version constraint, \emphpinning is the most likely to result in outdated dependencies. Nevertheless, how the likelihood of becoming outdated or vulnerable dependencies changes across version constraint types is unknown. The goal of this study is to aid developers in making an informed dependency version constraint choice by empirically evaluating the likelihood of dependencies becoming outdated or vulnerable across version constraint types at scale. In this study, we first identify the trends in dependency version constraint usage and the patterns of version constraint type changes made by developers in the npm, PyPI, and Cargo ecosystems. We then modeled the dependency state transitions using survival analysis and estimated how the likelihood of becoming outdated or vulnerable changes when using \emphpinning as opposed to the rest of the version constraint types. We observe that among outdated and vulnerable dependencies, the most commonly used version constraint type is \emphfloating-minor, with \emphpinning being the next most common. We also find that \emphfloating-major is the least likely to result in outdated and \emphfloating-minor is the least likely to result in vulnerable dependencies.

[LG-85] Evolutionary Computation as Natural Generative AI

链接: https://arxiv.org/abs/2510.08590
作者: Yaxin Shi,Abhishek Gupta,Ying Wu,Melvin Wong,Ivor Tsang,Thiago Rios,Stefan Menzel,Bernhard Sendhoff,Yaqing Hou,Yew-Soon Ong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Generative AI (GenAI) has achieved remarkable success across a range of domains, but its capabilities remain constrained to statistical models of finite training sets and learning based on local gradient signals. This often results in artifacts that are more derivative than genuinely generative. In contrast, Evolutionary Computation (EC) offers a search-driven pathway to greater diversity and creativity, expanding generative capabilities by exploring uncharted solution spaces beyond the limits of available data. This work establishes a fundamental connection between EC and GenAI, redefining EC as Natural Generative AI (NatGenAI) – a generative paradigm governed by exploratory search under natural selection. We demonstrate that classical EC with parent-centric operators mirrors conventional GenAI, while disruptive operators enable structured evolutionary leaps, often within just a few generations, to generate out-of-distribution artifacts. Moreover, the methods of evolutionary multitasking provide an unparalleled means of integrating disruptive EC (with cross-domain recombination of evolved features) and moderated selection mechanisms (allowing novel solutions to survive), thereby fostering sustained innovation. By reframing EC as NatGenAI, we emphasize structured disruption and selection pressure moderation as essential drivers of creativity. This perspective extends the generative paradigm beyond conventional boundaries and positions EC as crucial to advancing exploratory design, innovation, scientific discovery, and open-ended generation in the GenAI era.

[LG-86] Soft Graph Transformer for MIMO Detection

链接: https://arxiv.org/abs/2509.12694
作者: Jiadong Hong,Lei Liu,Xinyu Bian,Wenjie Wang,Zhaoyang Zhang
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: 5 pages with 3 figures and 2 tables, submitted to IEEE for a possible publication

点击查看摘要

Abstract:We propose the Soft Graph Transformer (SGT), a soft-input-soft-output neural architecture designed for MIMO detection. While Maximum Likelihood (ML) detection achieves optimal accuracy, its exponential complexity makes it infeasible in large systems, and conventional message-passing algorithms rely on asymptotic assumptions that often fail in finite dimensions. Recent Transformer-based detectors show strong performance but typically overlook the MIMO factor graph structure and cannot exploit prior soft information. SGT addresses these limitations by combining self-attention, which encodes contextual dependencies within symbol and constraint subgraphs, with graph-aware cross-attention, which performs structured message passing across subgraphs. Its soft-input interface allows the integration of auxiliary priors, producing effective soft outputs while maintaining computational efficiency. Experiments demonstrate that SGT achieves near-ML performance and offers a flexible and interpretable framework for receiver systems that leverage soft priors.

[LG-87] hree Birds with One Stone: Improving Performance Convergence and System Throughput with Nest

链接: https://arxiv.org/abs/2510.09578
作者: Yuqian Huo,David Quiroga,Anastasios Kyrillidis,Tirthak Patel
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: This paper will appear in the Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2026

点击查看摘要

Abstract:Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, Nest, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use Nest to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.

[LG-88] Conditional Flow Matching for Bayesian Posterior Inference

链接: https://arxiv.org/abs/2510.09534
作者: So Won Jeong,Percy S. Zhai,Veronika Ročová
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a generative multivariate posterior sampler via flow matching. It offers a simple training objective, and does not require access to likelihood evaluation. The method learns a dynamic, block-triangular velocity field in the joint space of data and parameters, which results in a deterministic transport map from a source distribution to the desired posterior. The inverse map, named vector rank, is accessible by reversibly integrating the velocity over time. It is advantageous to leverage the dynamic design: proper constraints on the velocity yield a monotone map, which leads to a conditional Brenier map, enabling a fast and simultaneous generation of Bayesian credible sets whose contours correspond to level sets of Monge-Kantorovich data depth. Our approach is computationally lighter compared to GAN-based and diffusion-based counterparts, and is capable of capturing complex posterior structures. Finally, frequentist theoretical guarantee on the consistency of the recovered posterior distribution, and of the corresponding Bayesian credible sets, is provided.

[LG-89] Interpretable Generative and Discriminative Learning for Multimodal and Incomplete Clinical Data

链接: https://arxiv.org/abs/2510.09513
作者: Albert Belenguer-Llorens,Carlos Sevilla-Salcedo,Janaina Mourao-Miranda,Vanessa Gómez-Verdejo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world clinical problems are often characterized by multimodal data, usually associated with incomplete views and limited sample sizes in their cohorts, posing significant limitations for machine learning algorithms. In this work, we propose a Bayesian approach designed to efficiently handle these challenges while providing interpretable solutions. Our approach integrates (1) a generative formulation to capture cross-view relationships with a semi-supervised strategy, and (2) a discriminative task-oriented formulation to identify relevant information for specific downstream objectives. This dual generative-discriminative formulation offers both general understanding and task-specific insights; thus, it provides an automatic imputation of the missing views while enabling robust inference across different data sources. The potential of this approach becomes evident when applied to the multimodal clinical data, where our algorithm is able to capture and disentangle the complex interactions among biological, psychological, and sociodemographic modalities.

[LG-90] Unsupervised full-field Bayesian inference of orthotropic hyperelasticity from a single biaxial test: a myocardial case study

链接: https://arxiv.org/abs/2510.09498
作者: Rogier P. Krijnen,Akshay Joshi,Siddhant Kumar,Mathias Peirlinck
类目: Tissues and Organs (q-bio.TO); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fully capturing this behavior in traditional homogenized tissue testing requires the excitation of multiple deformation modes, i.e. combined triaxial shear tests and biaxial stretch tests. Inherently, such multimodal experimental protocols necessitate multiple tissue samples and extensive sample manipulations. Intrinsic inter-sample variability and manipulation-induced tissue damage might have an adverse effect on the inversely identified tissue behavior. In this work, we aim to overcome this gap by focusing our attention to the use of heterogeneous deformation profiles in a parameter estimation problem. More specifically, we adapt EUCLID, an unsupervised method for the automated discovery of constitutive models, towards the purpose of parameter identification for highly nonlinear, orthotropic constitutive models using a Bayesian inference approach and three-dimensional continuum elements. We showcase its strength to quantitatively infer, with varying noise levels, the material model parameters of synthetic myocardial tissue slabs from a single heterogeneous biaxial stretch test. This method shows good agreement with the ground-truth simulations and with corresponding credibility intervals. Our work highlights the potential for characterizing highly nonlinear and orthotropic material models from a single biaxial stretch test with uncertainty quantification.

[LG-91] Efficient Autoregressive Inference for Transformer Probabilistic Models

链接: https://arxiv.org/abs/2510.09477
作者: Conor Hassan,Nasrulloh Loka,Cen-You Li,Daolang Huang,Paul E. Chang,Yang Yang,Francesco Silvestrin,Samuel Kaski,Luigi Acerbi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based models for amortized probabilistic inference, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many real-world applications, from signal interpolation to multi-column tabular predictions, require coherent joint distributions that capture dependencies between predictions. While purely autoregressive architectures efficiently generate such distributions, they sacrifice the flexible set-conditioning that makes these models powerful for meta-learning. Conversely, the standard approach to obtain joint distributions from set-based models requires expensive re-encoding of the entire augmented conditioning set at each autoregressive step. We introduce a causal autoregressive buffer that preserves the advantages of both paradigms. Our approach decouples context encoding from updating the conditioning set. The model processes the context once and caches it. A dynamic buffer then captures target dependencies: as targets are incorporated, they enter the buffer and attend to both the cached context and previously buffered targets. This enables efficient batched autoregressive generation and one-pass joint log-likelihood evaluation. A unified training strategy allows seamless integration of set-based and autoregressive modes at minimal additional cost. Across synthetic functions, EEG signals, cognitive models, and tabular data, our method matches predictive accuracy of strong baselines while delivering up to 20 times faster joint sampling. Our approach combines the efficiency of autoregressive generative models with the representational power of set-based conditioning, making joint prediction practical for transformer-based probabilistic models.

[LG-92] A Multimodal Approach to SME Credit Scoring Integrating Transaction and Ownership Networks

链接: https://arxiv.org/abs/2510.09407
作者: Sahab Zandi,Kamesh Korangi,Juan C. Moreno-Paredes,María Óskarsdóttir,Christophe Mues,Cristián Bravo
类目: General Finance (q-fin.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Small and Medium-sized Enterprises (SMEs) are known to play a vital role in economic growth, employment, and innovation. However, they tend to face significant challenges in accessing credit due to limited financial histories, collateral constraints, and exposure to macroeconomic shocks. These challenges make an accurate credit risk assessment by lenders crucial, particularly since SMEs frequently operate within interconnected firm networks through which default risk can propagate. This paper presents and tests a novel approach for modelling the risk of SME credit, using a unique large data set of SME loans provided by a prominent financial institution. Specifically, our approach employs Graph Neural Networks to predict SME default using multilayer network data derived from common ownership and financial transactions between firms. We show that combining this information with traditional structured data not only improves application scoring performance, but also explicitly models contagion risk between companies. Further analysis shows how the directionality and intensity of these connections influence financial risk contagion, offering a deeper understanding of the underlying processes. Our findings highlight the predictive power of network data, as well as the role of supply chain networks in exposing SMEs to correlated default risk.

[LG-93] Reliability Sensitivity with Response Gradient

链接: https://arxiv.org/abs/2510.09315
作者: Siu-Kui Au,Zi-Jun Cao
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 45 pages, 8 figures. Submitted to Structural Safety (Elsevier) on 5 Oct 2025

点击查看摘要

Abstract:Engineering risk is concerned with the likelihood of failure and the scenarios when it occurs. The sensitivity of failure probability to change in system parameters is relevant to risk-informed decision making. Computing sensitivity is at least one level more difficult than the probability itself, which is already challenged by a large number of input random variables, rare events and implicit nonlinear `black-box’ response. Finite difference with Monte Carlo probability estimates is spurious, requiring the number of samples to grow with the reciprocal of step size to suppress estimation variance. Many existing works gain efficiency by exploiting a specific class of input variables, sensitivity parameters, or response in its exact or surrogate form. For general systems, this work presents a theory and associated Monte Carlo strategy for computing sensitivity using response values and gradients with respect to sensitivity parameters. It is shown that the sensitivity at a given response threshold can be expressed via the expectation of response gradient conditional on the threshold. Determining the expectation requires conditioning on the threshold that is a zero-probability event, but it can be resolved by the concept of kernel smoothing. The proposed method offers sensitivity estimates for all response thresholds generated in a single Monte Carlo run. It is investigated in a number of examples featuring sensitivity parameters of different nature. As response gradient becomes increasingly available, it is hoped that this work can provide the basis for embedding sensitivity calculations with reliability in the same Monte Carlo run.

[LG-94] A unified Bayesian framework for adversarial robustness

链接: https://arxiv.org/abs/2510.09288
作者: Pablo G. Arce,Roi Naveiro,David Ríos Insua
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The vulnerability of machine learning models to adversarial attacks remains a critical security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. However, these deterministic approaches do not account for uncertainty in the adversary’s attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several previous defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.

[LG-95] Smart navigation of a gravity-driven glider with adjustable centre-of-mass

链接: https://arxiv.org/abs/2510.09250
作者: X. Jiang,J. Qiu,K. Gustavsson,B. Mehlig,L. Zhao
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:Artificial gliders are designed to disperse as they settle through a fluid, requiring precise navigation to reach target locations. We show that a compact glider settling in a viscous fluid can navigate by dynamically adjusting its centre-of-mass. Using fully resolved direct numerical simulations (DNS) and reinforcement learning, we find two optimal navigation strategies that allow the glider to reach its target location accurately. These strategies depend sensitively on how the glider interacts with the surrounding fluid. The nature of this interaction changes as the particle Reynolds number Re _p changes. Our results explain how the optimal strategy depends on Re _p . At large Re _p , the glider learns to tumble rapidly by moving its centre-of-mass as its orientation changes. This generates a large horizontal inertial lift force, which allows the glider to travel far. At small Re _p , by contrast, high viscosity hinders tumbling. In this case, the glider learns to adjust its centre-of-mass so that it settles with a steady, inclined orientation that results in a horizontal viscous force. The horizontal range is much smaller than for large Re _p , because this viscous force is much smaller than the inertial lift force at large Re _p . *These authors contributed equally. Comments: 13 pages, 8 figures Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG) Cite as: arXiv:2510.09250 [physics.flu-dyn] (or arXiv:2510.09250v1 [physics.flu-dyn] for this version) https://doi.org/10.48550/arXiv.2510.09250 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-96] Application of Deep Reinforcement Learning to At-the-Money SP 500 Options Hedging

链接: https://arxiv.org/abs/2510.09247
作者: Zofia Bracha,Paweł Sakowski,Jakub Michańków
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Pricing of Securities (q-fin.PR)
*备注: 35 pages

点击查看摘要

Abstract:This paper explores the application of deep Q-learning to hedging at-the-money options on the S\P~500 index. We develop an agent based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, trained to simulate hedging decisions without making explicit model assumptions on price dynamics. The agent was trained on historical intraday prices of S\P~500 call options across years 2004–2024, using a single time series of six predictor variables: option price, underlying asset price, moneyness, time to maturity, realized volatility, and current hedge position. A walk-forward procedure was applied for training, which led to nearly 17~years of out-of-sample evaluation. The performance of the deep reinforcement learning (DRL) agent is benchmarked against the Black–Scholes delta-hedging strategy over the same period. We assess both approaches using metrics such as annualized return, volatility, information ratio, and Sharpe ratio. To test the models’ adaptability, we performed simulations across varying market conditions and added constraints such as transaction costs and risk-awareness penalties. Our results show that the DRL agent can outperform traditional hedging methods, particularly in volatile or high-cost environments, highlighting its robustness and flexibility in practical trading contexts. While the agent consistently outperforms delta-hedging, its performance deteriorates when the risk-awareness parameter is higher. We also observed that the longer the time interval used for volatility estimation, the more stable the results.

[LG-97] Distributionally robust approximation property of neural networks

链接: https://arxiv.org/abs/2510.09177
作者: Mihriban Ceylan,David J. Prömel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond the traditional L^p -setting. The covered classes of neural networks include widely used architectures like feedforward neural networks with non-polynomial activation functions, deep narrow networks with ReLU activation functions and functional input neural networks.

[LG-98] Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection

链接: https://arxiv.org/abs/2510.08946
作者: Siyuan Chen,Minghao Guo,Caoliwen Wang,Anka He Chen,Yikun Zhang,Jingjing Chai,Yin Yang,Wojciech Matusik,Peter Yichen Chen
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a uniffed module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid conffguration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end ffnetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufffcient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity.

[LG-99] Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains

链接: https://arxiv.org/abs/2510.08929
作者: Yunrui Guan,Krishnakumar Balasubramanian,Shiqian Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly when matching heavy-tailed targets. To address these issues, we propose Mirror Flow Matching based on a \emphregularized mirror map that controls dual tail behavior and guarantees finite moments, together with coupling to a Student- t prior that aligns with heavy-tailed targets and stabilizes training. We provide theoretical guarantees, including spatial Lipschitzness and temporal regularity of the velocity field, Wasserstein convergence rates for flow matching with Student- t priors and primal-space guarantees for constrained generation, under \varepsilon -accurate learned velocity fields. Empirically, our method outperforms baselines in synthetic convex-domain simulations and achieves competitive sample quality on real-world constrained generative tasks.

[LG-100] A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization

链接: https://arxiv.org/abs/2510.08916
作者: Hideaki Kim,Tomoharu Iwata
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels–functions that encode the interaction structure between events–for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain the dual coefficients. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.

[LG-101] Gradient-Guided Furthest Point Sampling for Robust Training Set Selection

链接: https://arxiv.org/abs/2510.08906
作者: Morris Trestman,Stefan Gugler,Felix A. Faber,O. A. von Lilienfeld
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 18 pages, 18 figures, journal article

点击查看摘要

Abstract:Smart training set selections procedures enable the reduction of data needs and improves predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy-system (Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, our numerical results indicate superior data efficiency and robustness when using GGFPS. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to two fold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinksi-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.

[LG-102] Understanding Exoplanet Habitability: A Bayesian ML Framework for Predicting Atmospheric Absorption Spectra

链接: https://arxiv.org/abs/2510.08766
作者: Vasuda Trehan,Kevin H. Knuth,M. J. Way
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The evolution of space technology in recent years, fueled by advancements in computing such as Artificial Intelligence (AI) and machine learning (ML), has profoundly transformed our capacity to explore the cosmos. Missions like the James Webb Space Telescope (JWST) have made information about distant objects more easily accessible, resulting in extensive amounts of valuable data. As part of this work-in-progress study, we are working to create an atmospheric absorption spectrum prediction model for exoplanets. The eventual model will be based on both collected observational spectra and synthetic spectral data generated by the ROCKE-3D general circulation model (GCM) developed by the climate modeling program at NASA’s Goddard Institute for Space Studies (GISS). In this initial study, spline curves are used to describe the bin heights of simulated atmospheric absorption spectra as a function of one of the values of the planetary parameters. Bayesian Adaptive Exploration is then employed to identify areas of the planetary parameter space for which more data are needed to improve the model. The resulting system will be used as a forward model so that planetary parameters can be inferred given a planet’s atmospheric absorption spectrum. This work is expected to contribute to a better understanding of exoplanetary properties and general exoplanet climates and habitability.

[LG-103] Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models

链接: https://arxiv.org/abs/2510.08703
作者: Linfeng Wang,Susana Campino,Taane G. Clark,Jody E. Phelan
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified “uncertain” variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.

[LG-104] QuIRK: Quantum-Inspired Re-uploading KAN

链接: https://arxiv.org/abs/2510.08650
作者: Vinayak Sharma,Ashish Padhy,Vijay Jagdish Karanjkar,Sourav Behera,Lord Sen,Shyamapada Mukherjee,Aviral Shrivastava
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks or KANs have shown the ability to outperform classical Deep Neural Networks, while using far fewer trainable parameters for regression problems on scientific domains. Even more powerful has been their interpretability due to their structure being composed of univariate B-Spline functions. This enables us to derive closed-form equations from trained KANs for a wide range of problems. This paper introduces a quantum-inspired variant of the KAN based on Quantum Data Re-uploading~(DR) models. The Quantum-Inspired Re-uploading KAN or QuIRK model replaces B-Splines with single-qubit DR models as the univariate function approximator, allowing them to match or outperform traditional KANs while using even fewer parameters. This is especially apparent in the case of periodic functions. Additionally, since the model utilizes only single-qubit circuits, it remains classically tractable to simulate with straightforward GPU acceleration. Finally, we also demonstrate that QuIRK retains the interpretability advantages and the ability to produce closed-form solutions.

信息检索

[IR-0] Doc2Query: Topic-Coverag e based Document Expansion and its Application to Dense Retrieval via Dual-Index Fusion

链接: https://arxiv.org/abs/2510.09557
作者: Tzu-Lin Kuo,Wei-Ning Chiu,Wei-Yun Ma,Pu-Jen Cheng
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Document expansion (DE) via query generation tackles vocabulary mismatch in sparse retrieval, yet faces limitations: uncontrolled generation producing hallucinated or redundant queries with low diversity; poor generalization from in-domain training (e.g., MS MARCO) to out-of-domain data like BEIR; and noise from concatenation harming dense retrieval. While Large Language Models (LLMs) enable cross-domain query generation, basic prompting lacks control, and taxonomy-based methods rely on domain-specific structures, limiting applicability. To address these challenges, we introduce Doc2Query++, a DE framework that structures query generation by first inferring a document’s latent topics via unsupervised topic modeling for cross-domain applicability, then using hybrid keyword selection to create a diverse and relevant keyword set per document. This guides LLM not only to leverage keywords, which ensure comprehensive topic representation, but also to reduce redundancy through diverse, relevant terms. To prevent noise from query appending in dense retrieval, we propose Dual-Index Fusion strategy that isolates text and query signals, boosting performance in dense settings. Extensive experiments show Doc2Query++ significantly outperforms state-of-the-art baselines, achieving substantial gains in MAP, nDCG@10 and Recall@100 across diverse datasets on both sparse and dense retrieval.

[IR-1] MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning -Intensive Multimodal Retrieval

链接: https://arxiv.org/abs/2510.09510
作者: Siyue Zhang,Yuan Gao,Xiao Zhou,Yilun Zhao,Tingyu Song,Arman Cohan,Anh Tuan Luu,Chen Zhao
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,502 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation such as diagnosing microscopic slides. We further introduce Contradiction Retrieval, a novel task requiring models to identify conflicting concepts. Finally, queries and documents are constructed as image-text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of 4 categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding with LLM-generated image captions achieves the highest performance, highlighting substantial room for improving multimodal retrieval models. Although latest multimodal models such as Ops-MM-Embedding perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.

[IR-2] Hierarchical Semantic RL: Tackling the Problem of Dynamic Action Space for RL-based Recommendations

链接: https://arxiv.org/abs/2510.09167
作者: Minmao Wang,Xingchen Liu,Shijie Yi,Likang Wu,Hongke Zhao,Fei Pan,Qingpeng Cai,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender Systems (RS) are fundamental to modern online services. While most existing approaches optimize for short-term engagement, recent work has begun to explore reinforcement learning (RL) to model long-term user value. However, these efforts face significant challenges due to the vast, dynamic action spaces inherent in recommendation, which hinder stable policy learning. To resolve this bottleneck, we introduce Hierarchical Semantic RL (HSRL), which reframes RL-based recommendation over a fixed Semantic Action Space (SAS). HSRL encodes items as Semantic IDs (SIDs) for policy learning, and maps SIDs back to their original items via a fixed, invertible lookup during execution. To align decision-making with SID generation, the Hierarchical Policy Network (HPN) operates in a coarse-to-fine manner, employing hierarchical residual state modeling to refine each level’s context from the previous level’s residual, thereby stabilizing training and reducing representation-decision mismatch. In parallel, a Multi-level Critic (MLC) provides token-level value estimates, enabling fine-grained credit assignment. Across public benchmarks and a large-scale production dataset from a leading Chinese short-video advertising platform, HSRL consistently surpasses state-of-the-art baselines. In online deployment over a seven-day A/B testing, it delivers an 18.421% CVR lift with only a 1.251% increase in cost, supporting HSRL as a scalable paradigm for RL-based recommendation. Our code is released at this https URL.

[IR-3] Generative Data Augmentation in Graph Contrastive Learning for Recommendation

链接: https://arxiv.org/abs/2510.09129
作者: Yansong Wang,Qihui Lin,Junjie Huang,Tao Jia
类目: Information Retrieval (cs.IR)
*备注: The 34th ACM International Conference on Information and Knowledge Management

点击查看摘要

Abstract:Recommendation systems have become indispensable in various online platforms, from e-commerce to streaming services. A fundamental challenge in this domain is learning effective embeddings from sparse user-item interactions. While contrastive learning has recently emerged as a promising solution to this issue, generating augmented views for contrastive learning through most existing random data augmentation methods often leads to the alteration of original semantic information. In this paper, we propose a novel framework, GDA4Rec (Generative Data Augmentation in graph contrastive learning for Recommendation) to generate high-quality augmented views and provide robust self-supervised signals. Specifically, we employ a noise generation module that leverages deep generative models to approximate the distribution of original data for data augmentation. Additionally, GDA4Rec further extracts an item complement matrix to characterize the latent correlations between items and provide additional self-supervised signals. Lastly, a joint objective that integrates recommendation, data augmentation and contrastive learning is used to enforce the model to learn more effective and informative embeddings. Extensive experiments are conducted on three public datasets to demonstrate the superiority of the model. The code is available at: this https URL.

[IR-4] Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short

链接: https://arxiv.org/abs/2510.08985
作者: Xuan Lu,Haohang Huang,Rui Meng,Yaohui Jin,Wenjun Zeng,Xiaoyu Shen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Document reranking is a key component in information retrieval (IR), aimed at refining initial retrieval results to improve ranking quality for downstream tasks. Recent studies–motivated by large reasoning models (LRMs)–have begun incorporating explicit chain-of-thought (CoT) reasoning into LLM-based rerankers. However, the effectiveness of such reasoning for ranking tasks remains underexplored. In this work, we present the first systematic study of reasoning in reranking across both pointwise and listwise settings, under both supervised fine-tuning and reinforcement learning. Using diverse benchmarks, including reasoning-intensive datasets (BRIGHT) and standard IR benchmarks (BEIR), we find that reasoning-augmented rerankers consistently underperform their direct counterparts that predict rankings without CoT, despite substantially higher inference costs. Our analysis reveals three core limitations: (i) in pointwise rerankers, reasoning breaks calibration and biases models toward the positive class, raising TPR but lowering TNR, which inflates false positives and degrades ranking in negative-dominant pools; (ii) in listwise rerankers, reasoning improves in-domain fit but increases variance and fails to generalize out-of-domain, even when reinforcement learning shortens rationales; and (iii) overall, directly fine-tuned rerankers remain more stable, effective, and robust. These findings challenge the assumption that explicit reasoning is universally beneficial for reranking. We conclude by highlighting future directions, including calibration-aware scoring for pointwise rerankers and the design of concise, targeted reasoning strategies to mitigate overfitting and overthinking in listwise rerankers.

[IR-5] Personalize Before Retrieve: LLM -based Personalized Query Expansion for User-Centric Retrieval

链接: https://arxiv.org/abs/2510.08935
作者: Yingyi Zhang,Pengyue Jia,Derong Xu,Yi Wen,Xianneng Li,Yichao Wang,Wenlin Zhang,Xiaopeng Li,Weinan Gan,Huifeng Guo,Yong Liu,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) critically depends on effective query expansion to retrieve relevant information. However, existing expansion methods adopt uniform strategies that overlook user-specific semantics, ignoring individual expression styles, preferences, and historical context. In practice, identical queries in text can express vastly different intentions across users. This representational rigidity limits the ability of current RAG systems to generalize effectively in personalized settings. Specifically, we identify two core challenges for personalization: 1) user expression styles are inherently diverse, making it difficult for standard expansions to preserve personalized intent. 2) user corpora induce heterogeneous semantic structures-varying in topical focus and lexical organization-which hinders the effective anchoring of expanded queries within the user’s corpora space. To address these challenges, we propose Personalize Before Retrieve (PBR), a framework that incorporates user-specific signals into query expansion prior to retrieval. PBR consists of two components: P-PRF, which generates stylistically aligned pseudo feedback using user history for simulating user expression style, and P-Anchor, which performs graph-based structure alignment over user corpora to capture its structure. Together, they produce personalized query representations tailored for retrieval. Experiments on two personalized benchmarks show that PBR consistently outperforms strong baselines, with up to 10% gains on PersonaBench across retrievers. Our findings demonstrate the value of modeling personalization before retrieval to close the semantic gap in user-adaptive RAG systems. Our code is available at this https URL.

[IR-6] Observation Matrix Design for Densifying MIMO Channel Estimation via 2D Ice Filling

链接: https://arxiv.org/abs/2510.08887
作者: Zijian Zhang,Mingyao Cui
类目: Information Theory (cs.IT); Information Retrieval (cs.IR); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:In recent years, densifying multiple-input multiple-output (MIMO) has attracted much attention from the communication community. Thanks to the subwavelength antenna spacing, the strong correlations among densifying antennas provide sufficient prior knowledge about channel state information (CSI). This inspires the careful design of observation matrices (e.g., transmit precoders and receive combiners), that exploits the CSI prior knowledge, to boost channel estimation performance. Aligned with this vision, this work proposes to jointly design the combiners and precoders by maximizing the mutual information between the received pilots and densifying MIMO channels. A two-dimensional ice-filling (2DIF) algorithm is proposed to efficiently accomplish this objective. The algorithm is motivated by the fact that the eigenspace of MIMO channel covariance can be decoupled into two sub-eigenspaces, which are associated with the correlations of transmitter antennas and receiver antennas, respectively. By properly setting the precoder and the combiner as the eigenvectors from these two sub-eigenspaces, the 2DIF promises to generate near-optimal observation matrices. Moreover, we further extend the 2DIF method to the popular hybrid combining systems, where a two-stage 2DIF (TS-2DIF) algorithm is developed to handle the analog combining circuits realized by phase shifters. Simulation results demonstrate that, compared to the state-of-the-art schemes, the proposed 2DIF and TS-2DIF methods can achieve superior channel estimation accuracy.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-10-13

目录

概览 (2025-10-13)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载